Resilience for Extreme Scale Supercomputing Systems

The next generation of scientific discovery will be enabled by research developments that can effectively harness significant or disruptive advances in computing technology. Applications running on extreme-scale computing systems will generate results with orders of magnitude higher resolution and fidelity, achieving a time-to-solution significantly shorter than is possible with today's high-performance computing platforms. However, indications are that these new systems will experience hard and soft errors with increasing frequency, necessitating research into new approaches to resilience that enable applications to run efficiently to completion in a timely manner and achieve correct results. Challenges to be addressed include the following topics:

  1. Fault Detection and Categorization
  2. Fault Mitigation
  3. Anomaly Detection and Fault Avoidance

Core Team

Program Manager: Lucy Nowell


Whole-program Adaptive Error Detection and Mitigation
Characterizing Faults, Errors, and Failures in Extreme-scale Systems
Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection and Impact
Lazy Shadowing - An Adaptive, Power-Aware Resiliency Framework for Exascale Computing
Validating Extreme-scale Resilience with Veracity



The Office of Advanced Scientific Computing Research (ASCR) in the Office of Science (SC), U.S. Department of Energy (DOE), hereby invites proposals...