Beyond Monitoring: Proactive Server Preservation in an HPC Environment

Authors

Feller, Chad

Issue Date

2012

Type

Thesis

Keywords

Environmental Monitoring, HPC, IPMI, System Monitoring

Abstract

Monitoring has long been a challenge for the server administrator. Disk health, system load, network congestion, and environmental conditions such as temperature can all be tied into monitoring systems. Monitoring systems vary in scope and capabilities, and many can fire off alerts for just about any configuration. The sysadmin then has the responsibility of weighing the alert and deciding if and when to act. In a High Performance Computing (HPC) environment, some of these failures can have a ripple effect, affecting a larger area than the physical problem. Furthermore, some temperature and load swings can be more drastic in an HPC environment than they would be otherwise. Because of this, a timely, measured response is critical. When a timely response is not possible, conditions can escalate rapidly in an HPC environment, leading to component failure. In this situation, an intelligent, automatic, measured response is critical. Here we present such a system: a novel approach to server monitoring that uses integrated server hardware operating independently of the operating system and is capable not only of monitoring temperatures but also of automatically responding to temperature events. Our proactive response system leverages standard HPC software and integrated server hardware. It is designed to respond intelligently to temperature events from a High Performance Computing perspective, looking at both compute jobs and server hardware.

License

In Copyright (All Rights Reserved)
