Beyond Monitoring: Proactive Server Preservation in an HPC Environment
Authors
Feller, Chad
Issue Date
2012
Type
Thesis
Language
Keywords
Environmental Monitoring, HPC, IPMI, System Monitoring
Alternative Title
Abstract
Monitoring has long been the challenge of a server administrator. Monitoring disk health, system load, network congestion, and environmental conditions like temperature are all things that can be tied into monitoring systems. Monitoring systems vary in scope and capabilities, and many can fire off alerts for just about any configuration. The sysadmin then has the responsibility of weighing the alert and deciding if and when to act. In a High Performance Computing (HPC) environment, some of these failures can have a ripple effect, affecting a larger area than the physical problem. Furthermore, some temperature and load swings can be more drastic in an HPC environment than they would be otherwise. Because of this, a timely, measured response is critical. When a timely response is not possible, conditions can escalate rapidly in an HPC environment, leading to component failure. In this situation, an intelligent, automatic, measured response is critical. Here we present such a system, a novel approach to server monitoring using integrated server hardware operating independently of the operating system, and capable not only of monitoring temperatures, but also automatically responding to temperature events. Our proactive response system leverages standard HPC software and integrated server hardware. It is designed to intelligently respond to temperature events from a High Performance Computing perspective, looking at both compute jobs and server hardware.
Description
Citation
Publisher
License
In Copyright(All Rights Reserved)
