Exploiting Global View for Resilience

Application Partnerships

  • Advanced Nuclear Reactor Simulation (Andrew Siegel, CESAR)
  • Computational Chemistry (Jeff Hammond, ALCF)
  • Rich Computational Frameworks (Trilinos, Mike Heroux, Sandia)
  • Particle codes (ddcMD) (David Richards, Ignacio Laguna, LLNL)
  • Adaptive Mesh Refinement (Chombo) (Brian van Straalen, Anshu Dubey, LBNL)
  • Combustion (S3D) (Jackie Chen, Sandia)

Global View Resilience (GVR) is a new programming approach that exploits a global view data model (global naming of data, consistency, and distributed layout), adding reliability to globally visible distributed arrays. The globally visible distributed array abstraction is “multi-version”, providing redundancy in time and a convenient location for application annotations of reliability needs. Because the distributed array abstraction is portable, GVR enables application programmers to manage reliability (and its overhead) in a flexible, portable fashion, tapping their deep insight into the science and the application code. Further, GVR will provide a flexible, efficient, cross-layer error management architecture called “open reliability” that allows applications to describe error detection (checking) and recovery routines and inject them into the GVR stack for efficient implementation. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected.

Resilience Challenges

  • Can we achieve a smooth transition to system resilience? (a la Flash memory, Internet)
  • What’s an application to do?



Resilience Co-design

Co‑design without co‑dependence

  • Software: Information and Algorithms to enhance resilience (REQ: Portable, flexible)
  • Runtime, OS, and Architecture Mechanisms to enhance resilience (REQ: leverage beyond HPC, cheap)


Project Impact


  • Enable an application to incorporate resilience incrementally, expressing resilience proportionally to the application need
  • “Outside in”, as needed, incremental, ...


GVR Approach 1



  • Application-System Partnership
    • Expose and exploit algorithm and application domain knowledge
    • Enable “End to end” resilience model
  • Foundation in Data-oriented resilience
    • Internet services, map-reduce, internet, ...
    • Achieve with high performance and massive parallelism...
    • Global-view data foundation (PGAS..., GA, SWARM, ParalleX, CnC, ...)

Data-oriented Resilience



  • Parallel applications and global-view data
  • Natural parallel structure version-to-version
    • Example: shock hydro simulation at t=10ms to 100ms
    • Example: iterative solver at iteration 1 to 20
    • Example: Monte Carlo at 10M to 20M points
  • Temporal redundancy enables rollback and resume
    • User-controlled, convenient
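
The rollback-and-resume idea in the list above can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration: the class `VersionedArray`, its methods `version_inc` and `rollback`, and the averaging "solver" are hypothetical and are not the GVR API. The application creates one version per iteration, applies its own check, and on failure resumes from the last good version.

```python
class VersionedArray:
    """Toy multi-version array that keeps a bounded history of snapshots.

    Hypothetical illustration only; this is not the GVR API.
    """

    def __init__(self, data, max_versions=4):
        self.data = list(data)
        self.max_versions = max_versions
        self._versions = []  # (label, snapshot) pairs, oldest first

    def version_inc(self, label):
        """Snapshot the current contents under an application-chosen label."""
        self._versions.append((label, list(self.data)))
        if len(self._versions) > self.max_versions:
            self._versions.pop(0)  # discard the oldest version

    def rollback(self):
        """Restore the most recent snapshot and return its label."""
        label, snapshot = self._versions[-1]
        self.data = list(snapshot)
        return label


def sweep(x):
    """One averaging sweep with fixed boundary values (stand-in for a solver)."""
    inner = [(x[i - 1] + x[i + 1]) / 2 for i in range(1, len(x) - 1)]
    return [x[0]] + inner + [x[-1]]


def run(iterations, corrupt_at=None):
    arr = VersionedArray([0.0, 0.0, 0.0, 0.0, 1.0])
    it, rollbacks = 0, 0
    arr.version_inc(it)
    while it < iterations:
        arr.data = sweep(arr.data)
        if it == corrupt_at:
            arr.data[2] = float("nan")  # simulated silent data corruption
            corrupt_at = None           # inject the fault only once
        # Application-level check: values must stay within the boundary range.
        if not all(0.0 <= v <= 1.0 for v in arr.data):
            it = arr.rollback()         # resume from the last good version
            rollbacks += 1
            continue
        it += 1
        arr.version_inc(it)
    return arr.data, rollbacks


clean, clean_rollbacks = run(20)
faulty, faulty_rollbacks = run(20, corrupt_at=7)
```

In this sketch the faulty run repeats one iteration and ends with the same data as the clean run. How many versions to keep and how often to create them are the knobs the application controls.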

Resilience Partnership

  • Proportional Resilience
    • Application specifies “Resilience priorities”
    • Mapped into data-redundancy in space
    • Mapped into redundancy in time (multi-version)
    • Complements computation/task redundancy efforts
  • Deep error detection: invariants, assertions, checks ... and recovery
  • Applications add further checks based on algorithm and domain semantics
    • Applications add flexible, adaptive recovery mechanisms (and exploit multi-version)
  • “End-to-end” resilience
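
As a rough sketch of proportional resilience, an application could tag each array with a priority that is mapped to redundancy in space (replicas) and in time (versions kept). The priority names, the `Policy` fields, and all numeric values below are assumptions chosen for illustration; they do not come from GVR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    replicas: int        # redundancy in space: copies of the current data
    versions_kept: int   # redundancy in time: snapshots retained
    snapshot_every: int  # iterations between snapshots


# Hypothetical priority levels; the values are illustrative only.
POLICIES = {
    "low": Policy(replicas=1, versions_kept=1, snapshot_every=100),
    "medium": Policy(replicas=1, versions_kept=4, snapshot_every=10),
    "high": Policy(replicas=2, versions_kept=8, snapshot_every=1),
}


def storage_cost(nbytes, priority):
    """Bytes held for one array: live replicas plus retained snapshots."""
    policy = POLICIES[priority]
    return nbytes * (policy.replicas + policy.versions_kept)


# The application states its priorities per array (names are made up).
arrays = {"mesh_state": (8_000_000, "high"), "scratch": (8_000_000, "low")}
total_bytes = sum(storage_cost(n, p) for n, p in arrays.values())
```

The point of the sketch is that overhead follows the stated need: the scratch array costs a fraction of what the critical array costs, and changing a priority changes the cost without touching the solver code.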


GVR Approach 2



  • Cross-layer (x-layer) approach for efficient execution (and better resilience)
    • Spatial redundancy – coding at multiple levels, system level checking
    • Temporal redundancy – multi-version memory, integrated memory and NVRAM management
  • Push checks to most efficient level (find early, contain, reduce overhead)
  • Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead)
  • Efficient implementation support in runtime, OS, architecture ... increase efficiency and containment

Multi-version Memory



  • Common parallel paradigm, basis for programmer engagement
  • Frames invariant checks, more complex checks based on high-level semantics
  • Frames sophisticated recovery


Research Challenges

  • Understand application resilience needs and opportunities for proportional resilience and deep error detection/end-to-end resilience
  • Explore multi-version memory as opportunity for framing richer resilience and parallelism
  • Design an API that embodies these ideas and offers a gentle slope of incremental application effort
  • Create efficient x-layer implementations (many open questions)
  • Explore architecture opportunities to increase resilience and reduce overhead


Global‑view Data Program



GVR Resilience Program



Global View & Consistent Snapshots



  • How to safely, efficiently identify consistent snapshots?
    • Application control: Global Synch; Array-level synch; explicit snapshot
    • Application flagged (optional)
    • Implicit (runtime decides)
  • Snapshots = natural points to express and implement assertions, checks, recovery
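
A minimal sketch of using snapshot points to run application checks follows, assuming a hypothetical `CheckedArray` class (not part of GVR). The application registers invariant checks. Each snapshot request runs them, commits the data as the new good version if they all pass, and otherwise restores the previous good version.

```python
import math


class CheckedArray:
    """Toy array whose snapshot points run application-supplied checks.

    Hypothetical illustration only; this is not the GVR API.
    """

    def __init__(self, data):
        self.data = list(data)
        self._good = list(data)  # last snapshot that passed all checks
        self._checks = []

    def add_check(self, check):
        """Register a function that takes the data and returns True if valid."""
        self._checks.append(check)

    def snapshot(self):
        """Commit if every check passes; otherwise restore the last good data."""
        if all(check(self.data) for check in self._checks):
            self._good = list(self.data)
            return True
        self.data = list(self._good)
        return False


# Example invariant: the total is conserved (a stand-in for mass or energy).
arr = CheckedArray([1.0, 2.0, 3.0])
arr.add_check(lambda d: math.isclose(sum(d), 6.0))

arr.data[0], arr.data[1] = 2.0, 1.0  # legal update: the total is unchanged
first_ok = arr.snapshot()

arr.data[2] = 30.0                   # simulated corruption breaks the invariant
second_ok = arr.snapshot()
```

Here recovery is a plain restore. The same hook could instead call an application-specific repair routine, which is closer to the semantics-based recovery described elsewhere in this page.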


Implementing Multi-version



  • How to implement multi-version efficiently?
    • Time, Space, Label => representation, protocol
  • Which versions to capture as snapshots?
    • Versions are logical, snapshots require resources
  • Intelligent storage:
    • Representation, compression, architecture support
    • Older versions recede into storage [SILT]


Intelligent Memory and Storage



  • How to exploit intelligence at memory and storage? (at controller)
  • Intelligent stacked DRAM and storage-class Memory [HMC,PIM]
  • Fine-grained state tracking; compression, intelligent copying, etc.
  • Efficient version capture; differenced checkpoints (Plank95, Svard11)
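
To illustrate why differenced version capture can be cheap, here is a simple block-level sketch: each new version stores only the blocks that changed since the previous one. This is a generic scheme written for illustration, not the method of the cited works and not GVR's implementation. The block size and all names are assumptions, and versions are assumed to have the same length.

```python
BLOCK = 4  # elements per block; hypothetical granularity


def diff_blocks(old, new, block=BLOCK):
    """Return {block_index: new_block} for every block that changed."""
    delta = {}
    for start in range(0, len(new), block):
        if old[start:start + block] != new[start:start + block]:
            delta[start // block] = list(new[start:start + block])
    return delta


def apply_delta(base, delta, block=BLOCK):
    """Return a copy of base with the changed blocks overwritten."""
    out = list(base)
    for index, values in delta.items():
        out[index * block:index * block + len(values)] = values
    return out


class DeltaStore:
    """One full base version plus a chain of per-version deltas."""

    def __init__(self, base):
        self.base = list(base)
        self.deltas = []
        self._latest = list(base)

    def commit(self, data):
        """Record a new version and return its version number (1-based)."""
        self.deltas.append(diff_blocks(self._latest, data))
        self._latest = list(data)
        return len(self.deltas)

    def restore(self, version):
        """Rebuild the given version; version 0 is the base."""
        data = list(self.base)
        for delta in self.deltas[:version]:
            data = apply_delta(data, delta)
        return data


store = DeltaStore([0] * 12)
v1 = [0] * 12
v1[5] = 7
v2 = list(v1)
v2[11] = 9
n1 = store.commit(v1)
n2 = store.commit(v2)
```

Each commit in this example stores one 4-element block instead of all 12 elements. The trade-off is that restoring an old version replays the delta chain, which is one reason representation and protocol choices matter.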



  • Multi-version and increased concurrency
  • Multi-version and debugging
  • Architecture support and fine-grained synchronization, application checks, compressed memory, etc.
  • ...more?


Expected Outcomes

  • Use cases – Application skeleton designs and classifications that form the foundation of the design
  • Design of GVR API for flexible resilience and multi-version global data
  • Research prototype software developed as a library; target for programmers, compiler backends
  • Experiments with mini-apps and application partners (w/ co-design postdocs)
  • Assessment of architecture support opportunities and quantitative benefits


GVR X-Stack Synergies



  • Direct Application Programming Interface
  • Co-existence with, or even a target for, other runtimes
  • Rich Solver Library Building Block
  • Programming System Target


Research Products

Full report: gvr-research-products.pdf

  • Demonstrated easy application integration, with <2% of lines of code changed in large (10K-100K line) applications
  • Demonstrated controllable and low performance overhead (application scaling to 16,384 nodes and <2% overhead)
  • Released on multiple platforms, including Cray (Edison, Cori), IBM BG/Q (Mira, JuQueen), Linux clusters
  • Demonstrated flexible, portable application-semantics based forward-error correction in multiple applications (OpenMC, ddcMD, etc.)
  • Software release available from http://gvr.cs.uchicago.edu/ and deployed at multiple supercomputing centers, including NERSC.



The energy, parallelism, and error rate projections (≈ 2 × 10^9 FITS/billion hours or 30 minutes MTTI) for exascale systems represent extraordinary...
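
The two error-rate figures quoted above are consistent with each other if the rate is read as 2 × 10^9 failures per billion hours of operation. A small sketch of the conversion follows; the function name is made up for illustration.

```python
HOURS_PER_FIT_UNIT = 1e9  # one FIT is one failure per billion hours


def mtti_minutes(fit):
    """Mean time to interrupt, in minutes, for a system-wide FIT rate."""
    failures_per_hour = fit / HOURS_PER_FIT_UNIT
    return 60.0 / failures_per_hour


projected = mtti_minutes(2e9)  # 2 failures per hour, i.e. one every 30 minutes
```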


Goals:

  • Create portable application-system partnership for resilience
  • Explore efficient implementation of resilient and multi-version data
  • ...

Quad Charts

GVR Quad Chart Oct 2013