Department of Energy (DOE) leadership computing facilities are in the process of deploying extreme-scale high-performance computing (HPC) systems with the long-range goal of building exascale systems that perform more than a quintillion (a billion billion) operations per second. More powerful computers mean researchers can simulate biological, chemical, and other physical interactions with unprecedented realism. However, as HPC systems become more complex, system integrators, component manufacturers, and computing facilities must prepare for unique computing challenges. Of particular concern are unfamiliar or more frequent faults in both hardware technologies and software applications that can lead to computational errors or system failures.
This project will help DOE computing facilities protect extreme-scale systems by characterizing potential faults and creating models that predict their propagation and impact. The Collaboration of Oak Ridge, Argonne and Lawrence Livermore National Laboratories (CORAL) is a private/public partnership that will stand up three extreme-scale systems in 2017/2018, each operating at about 150 to 200 petaflops, or roughly 5.5 to 7.5 times the computing power of the 27-petaflop Titan at Oak Ridge National Laboratory (currently the fastest system in the United States) and about 15 to 20 percent of exascale performance.
By monitoring hardware and software performance on current DOE systems, such as Titan, and applying the data to fault analysis and vulnerability studies, this effort will capture observed and inferred fault conditions and extrapolate this knowledge to CORAL and other extreme-scale systems. Using these analyses, the project team will create assessment tools, including a fault taxonomy and catalog as well as fault models, to provide computing facilities with a clear picture of the fault characteristics in DOE computing environments and to inform technical and operational decisions that improve resilience. The catalog, models, and software resulting from this project will be made publicly available.
3 Bullet Points
- This project identifies, categorizes and models the fault, error and failure properties of DOE systems and of applications. For systems, it uses in-breadth offline and online data gathering and analysis techniques; for applications, it uses realistic in-depth fault vulnerability and error propagation studies.
- This effort will develop a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and extrapolate this knowledge to exascale systems.
- The results of this project will provide a clear picture of the fault characteristics in the DOE computing environments and improve resilience through reliable fault detection at an early stage and actionable information for efficient fault mitigation during system design, application and system software development, and runtime.
The creation of the catalog and the models is an iterative process. We started by analyzing system data offline to produce an initial version of the catalog. That version allows us to create realistic scenarios for studying fault vulnerabilities and error propagation in application codes. We will then identify gaps in the machine-centric data that need to be covered by online analysis. Finally, we will use the information gathered to refine our methods and to define new instrumentation data sets and points, which can then feed back into the cycle to create a more complete fault catalog.
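The iterative cycle described above can be illustrated with a minimal sketch. It is not the project's software: the `FaultCatalog` class, the fault category names, and the event sources are all hypothetical and were chosen only to show how an offline pass can seed a catalog, how gaps can be identified, and how a later online pass can fill some of them.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FaultCatalog:
    """Hypothetical catalog: counts observed events per fault category."""
    counts: Counter = field(default_factory=Counter)

    def ingest(self, events):
        """Add (category, source) event records to the catalog."""
        for category, _source in events:
            self.counts[category] += 1

    def gaps(self, expected_categories):
        """Return the expected categories that have no observations yet."""
        return sorted(c for c in expected_categories if self.counts[c] == 0)


# Hypothetical taxonomy of categories the catalog is expected to cover.
EXPECTED = {"dram_ecc", "gpu_dbe", "node_heartbeat", "silent_data_corruption"}

catalog = FaultCatalog()

# Step 1: the offline analysis of existing system logs seeds the catalog.
catalog.ingest([
    ("dram_ecc", "offline_log"),
    ("gpu_dbe", "offline_log"),
    ("dram_ecc", "offline_log"),
])

# Step 2: identify categories the machine-centric data does not yet cover.
missing_after_offline = catalog.gaps(EXPECTED)

# Step 3: online monitoring fills some gaps; the rest guide new instrumentation.
catalog.ingest([("node_heartbeat", "online_monitor")])
missing_after_online = catalog.gaps(EXPECTED)
```

In this sketch, any category still missing after the online pass stands in for a place where new instrumentation points would need to be defined before the next iteration.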