X-Stack

X-Stack Software Research

The X-Stack Program conducts basic research that represents significant advances in programming models, languages, compilers, runtime systems and tools that address fundamental challenges related to the system software stack for Exascale computing platforms.
 
Programming models, languages, and related technologies that have sustained High Performance Computing (HPC) application software development for the past decade are inadequate for Exascale era computers. The significant increase in complexity of Exascale platforms due to energy-constrained, billion-way parallelism, with major changes to processor and memory architecture, requires new energy-efficient and resilient programming techniques that are portable across multiple future machine generations.
 
To address these challenges, this program invests in the following topics:

  • Scalability: enable applications to strongly scale to Exascale levels of parallelism
  • Programmability: significantly reduce the burden placed on high-performance programmers
  • Performance Portability: eliminate or significantly reduce the effort required to port applications to future platforms
  • Resilience: properly manage fault detection and recovery across all components of the software stack
  • Energy Efficiency: maximally exploit dynamic energy saving opportunities, leveraging the tradeoffs between energy efficiency, resilience, and performance.

Communications

PI
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

 

What are the communication "primitives" that you expect to emphasize within your project? (e.g. two-sided vs one-sided, collectives, topologies, groups) Do we need to define extensions to the traditional application level interfaces which now emphasize only data transfers and collective operations? Do we need atomics, remote invocation interfaces, or should these be provided ad-hoc by clients?
XPRESS The communication primitive is based on the “parcel” protocol, an expanded form of active messages that operates within a global address space distributed across “localities” (approximately nodes). Logical destinations are hierarchical global names; actions include instantiations of threads and ParalleX processes (spanning multiple localities), data movement, compound atomic operations, and OS calls. Continuations determine follow-on actions. The payload conveys data operands and block data for moves. Parcels are an integral component of the semantics of the ParalleX execution model, providing semantics for asynchronous distributed processing that are symmetric to local synchronous processing on localities (nodes).
TG  
DEGAS The GASNet-EX communications library provides Active Messages (AM), one-sided data-movement and collectives as its primary communications primitives. Secondary primitives, such as atomics, may be emulated via AM or implemented through native hardware when available. The programming models in the DEGAS project use these primitives for many purposes. The Active Message primitives support the asynchronous remote invocation operations present in the Habanero and UPC++ efforts, while atomics will provide efficient point-to-point synchronization.
D-TEC The primary focus is computation via asynchronous tasks. The primary communication primitive is (reliably delivered) fire-and-forget active messages. Higher-level behavior (finish, at) is synthesized by the APGAS runtime on top of the active message primitive. However, for performance at scale we recognize the importance of additional primitives: both non-blocking collectives and one-sided asynchronous RDMAs for point-to-point bulk data transfer.
DynAX SWARM uses split-phase asynchronous communications operations with separate setup and callback phases, with a common co-/subroutine-call protocol on top of this. Method calls can easily be relocated by SWARM, and within this infrastructure we additionally provide for split-phase transfers and collective operations. Non-blocking one-sided communication is the only real need, though secondary features (such as remote atomics) might be beneficial.
X-TUNE X-TUNE is primarily focused on introducing thread-level parallelism and synchronization, and is relying on programmers or other tools to manage communication. Support in X-TUNE could be used to perform autotuning of communication primitives, but this is beyond the scope of our activities.
GVR  
CORVETTE  
SLEEC N/A
PIPER Communication will be out of band and needs to be isolated; the emphasis is on streaming communication.
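
As an illustration of the active-message style of remote invocation that several answers above emphasize (XPRESS parcels, GASNet-EX Active Messages, APGAS fire-and-forget messages), the sketch below shows the bare pattern of registering a handler and invoking it on a remote node. The am_register/am_send calls and the handler-id scheme are invented placeholders for this sketch, not the API of any of these runtimes, which layer continuations, global addressing, and flow control on top of this idea.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical active-message layer: handlers are registered under an
     * integer id and invoked asynchronously on a destination node with a
     * small payload. */
    typedef void (*am_handler_t)(const void *payload, size_t len);

    int am_register(int handler_id, am_handler_t fn);                        /* assumed */
    int am_send(int dest_node, int handler_id, const void *p, size_t len);   /* assumed */

    #define H_INCREMENT 1

    static void increment_handler(const void *payload, size_t len)
    {
        uint64_t gaddr;                        /* global name of the target counter */
        memcpy(&gaddr, payload, sizeof gaddr);
        (void)len;
        /* ...resolve gaddr to a local object and increment it... */
    }

    void example(void)
    {
        am_register(H_INCREMENT, increment_handler);
        uint64_t counter = 0x1000;             /* illustrative global name */
        /* Fire-and-forget remote invocation on node 3; completion, if needed,
         * would be signalled by a reply message or a continuation. */
        am_send(3, H_INCREMENT, &counter, sizeof counter);
    }
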
Traditional communication libraries (e.g. MPI or GASNet) have been developed without tight integration with the computation "model". What is your strategy for integrating communication and computation to address the needs of non-SPMD execution?
XPRESS It is important for performance portability that most communication-related code comprise invariants that will always be true. Aggregation, routing, order sensitivity, time to arrival, and error management should be transparent to the user. Destinations, in most cases, should be relative to the placement of first-class objects to enable adaptive placement and routing. Scheduling should tolerate the asynchrony and uncertainty of message delivery without forfeiting performance, assuming sufficient parallelism.
TG Our runtime can rely on existing threading frameworks (we use pthreads, for example), but we do not strictly need them; we use the existing threading framework only to emulate a computing resource.
DEGAS DEGAS is extending the GASNet APIs to produce GASNet-EX. GASNet-EX is designed to support the computation model rather than dictate it. Unlike the current GASNet, GASNet-EX allows (but does not require) treatment of threads as first-class entities (as in the MPI endpoints proposal), allowing efficient mapping of non-SPMD execution models, e.g. Habanero, that are impractical or inefficient today.
D-TEC It is not clear to us that tight integration of communication libraries and the computation model is needed to support non-SPMD execution. The X10/APGAS runtime supports non-SPMD execution at scale while maintaining a fairly strict separation between the communication layer (X10RT) and the computational model. X10RT provides basic active message facilities, but all higher-level computational model concerns are handled above the X10RT layer of the runtime. However, there are certainly opportunities to optimize some of the small "control" messages sent over the X10RT transport by the APGAS runtime layer by off-loading pieces of runtime logic into message handlers that could run directly within the network software/hardware. Pushing this function down simply requires the network layer to allow execution of user-provided handlers, not a true integration of the computation model into the communication library.
DynAX SWARM's codelet model can be used to effect split-phase co-/subroutine calls, whether or not there are networking features present. Applications can control the routing of these calls and their associated data explicitly, but this is not necessary unless higher-level partitioning schemes are being used. This scheme allows any computation encompassed by one or more codelets to be relocated, allowing both data and computation to be relocated transparently and enabling scaling from a single hardware thread to thousands of nodes without rewriting the application.
X-TUNE  
GVR  
CORVETTE  
SLEEC Computation methods should express their dependences so that SLEEC's runtime(s) can manage communication within a heterogeneous node.
PIPER N/A
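
The split-phase, continuation-driven style that the SWARM and DEGAS answers describe can be reduced to a very small pattern: an operation is issued in a setup phase and a codelet (callback) runs when it completes, so no thread ever blocks. The sketch below is a minimal C rendering of that idea; async_get and the codelet type are invented for the example and are not SWARM's actual interface.

    #include <stddef.h>

    typedef void (*codelet_t)(void *arg);

    /* Assumed primitive: start an asynchronous get and run 'done(done_arg)'
     * once the data has arrived.  The caller never blocks. */
    void async_get(int src_node, void *remote_addr, void *local_buf,
                   size_t len, codelet_t done, void *done_arg);              /* assumed */

    struct stencil_task { double *halo; /* ... */ };

    static void compute_codelet(void *arg)
    {
        struct stencil_task *t = arg;
        /* Halo data is now local; run the compute phase of the task. */
        (void)t;
    }

    void start_iteration(struct stencil_task *t, int neighbor,
                         void *remote_halo, size_t bytes)
    {
        /* Setup phase: issue the transfer and name the continuation.  The
         * runtime is free to relocate either the data or the computation. */
        async_get(neighbor, remote_halo, t->halo, bytes, compute_codelet, t);
    }
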
What type of optimizations should be transparently provided by a communication layer and what should be delegated to compilers or application developers?
What is the primary performance metric for your runtime?
XPRESS Time to solution of application workload, with minimum energy cost within that scope.
TG  
DEGAS Communications libraries and neighboring runtime layers should be responsible only for dynamic optimization of communication. Examples of such optimizations include: aggregation of messages with the same destination, scheduling multiple links, and injection control for congestion avoidance. Compilers or application developers should be responsible for static optimizations such as communication avoidance, hot-spot elimination, etc.

Primary metrics for communications runtime include latency of short messages, bandwidth of large messages, and communication/computation overlap opportunity during long-latency operations. Reduction of energy is a metric for the stack as a whole and may be more dependent on avoiding communication than on optimizing it (see also energy-related question).

D-TEC Under the assumption that the communication layer is not tightly integrated with the computation model, the scope of transparent optimization seems limited to optimizing the flow of traffic within the network. Perhaps also providing performance diagnostics and control points to higher-levels of the runtime to enable them to optimize communication behavior. Optimizations need to be planned/managed at a level of the stack that has sufficient scope to make good decisions.
DynAX A communications layer should ideally be able to load-balance both work and data without application involvement, using optional application-provided placement hints to assist in the process. Compilers should deal more with transforming higher-level language features like data types and method calls into SWARM constructs, and although compilers may generate hints, this will likely have to be the responsibility of the developer and tuner. The primary (external) metrics used are time to application completion and energy cost.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER unclear
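
One of the transparent, dynamic optimizations mentioned above (DEGAS) is aggregation of small messages with the same destination. A minimal version of that idea is sketched below in C: small messages are packed into a per-destination buffer and sent as one transfer. The raw_send call and the fixed-size bookkeeping are assumptions made for the sketch, not part of any of the projects' libraries.

    #include <string.h>

    void raw_send(int dest, const void *buf, size_t len);   /* assumed transport call */

    #define COALESCE_BYTES 4096
    #define MAX_NODES      1024

    struct dest_buf { char data[COALESCE_BYTES]; size_t used; };
    static struct dest_buf bufs[MAX_NODES];                 /* one buffer per destination */

    static void flush_dest(int dest)
    {
        if (bufs[dest].used) {
            raw_send(dest, bufs[dest].data, bufs[dest].used);
            bufs[dest].used = 0;
        }
    }

    /* Small messages to the same destination are coalesced; large messages
     * bypass the buffer entirely. */
    void coalescing_send(int dest, const void *msg, size_t len)
    {
        if (len >= COALESCE_BYTES) { flush_dest(dest); raw_send(dest, msg, len); return; }
        if (bufs[dest].used + len > COALESCE_BYTES)
            flush_dest(dest);
        memcpy(bufs[dest].data + bufs[dest].used, msg, len);
        bufs[dest].used += len;
    }
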
What is your strategy towards resilient communication libraries?
XPRESS To first order, the runtime system assumes correct operation of the communication libraries being pursued, Portals-4 and the experimental Photon communication fabric. Under NNSA PSAAP-2, the Micro-checkpoint Compute-Validate-Commit cycle will detect errors, including those due to communication failures.
TG  
DEGAS DEGAS is pursuing a hybrid approach to resilience which consists of both backward recovery (rollback-recovery of state via checkpoint/restart) and forward recovery (recompute or re-communicate faulty state via Containment Domains), working together in the same run. The ability of Containment Domains to isolate faults and to perform most recovery locally is ideal for most "soft errors", while the use of rollback-recovery is appropriate to hard node crashes. The combination of the two not only reduces the frequency of checkpoints required to provide effective protection, but also limits the type of errors that an application programmer must tolerate. Further, our approach allows the scope of rollback-recovery to be limited to subsets of the nodes and, in some cases, only the faulty nodes need to perform recovery.

The communications library supports each resilience mechanism in appropriate ways. For rollback-recovery, GASNet-EX must include a mechanism to capture a consistent state, a significantly more challenging problem with one-sided communication than in a message-passing system, especially if one does not wish to quiesce all application communications for a consistent checkpoint. For Containment Domains, GASNet-EX must run through communications failures by reacting (not aborting), by notifying other runtime components, by enabling these components to take appropriate actions, and by preventing resource leaks associated with (for instance) now-unreachable peers.

D-TEC This is not an area of research for D-TEC. We are assuming the low-level communication libraries will (at least as visible to our layer) operate correctly or report faults when they do not. Any faults reported by the underlying communication library will be reflected up to higher levels of the runtime stack.
DynAX The DynAX project will focus on Resilience in year three. As such, it's not yet clear what SWARM's reliability requirements from the communication layer will be. We expect error recovery to occur at the level of the application kernel, by using containment domains to restart a failed tile operation, or algorithmic iteration, or application kernel when the error is detected. The success of this will hinge on whether errors can be reliably detected, and reliably acted upon.
X-TUNE N/A
GVR  
CORVETTE  
SLEEC N/A
PIPER ability to drop and reroute around failed processes
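
Several answers above (the XPRESS Compute-Validate-Commit cycle, the DEGAS and DynAX use of containment domains) share the same control structure: preserve inputs, compute, validate the result, and either commit or recover locally and retry. The C sketch below shows that generic pattern only; the preserve/restore/validate hooks are application-supplied placeholders and nothing here is the actual API of any of the projects.

    #include <stdbool.h>

    struct domain { void *inputs; void *outputs; };

    void preserve_inputs(struct domain *d);   /* snapshot state needed for re-execution (assumed) */
    void restore_inputs(struct domain *d);    /* roll back to the snapshot (assumed)              */
    void compute(struct domain *d);           /* the guarded work (assumed)                       */
    bool validate(struct domain *d);          /* detect faults in the outputs (assumed)           */

    bool run_contained(struct domain *d, int max_retries)
    {
        preserve_inputs(d);
        for (int i = 0; i < max_retries; i++) {
            compute(d);
            if (validate(d))
                return true;                  /* commit: results may escape the domain */
            restore_inputs(d);                /* recover locally and retry             */
        }
        return false;   /* escalate to an enclosing domain or to a checkpoint/restart layer */
    }
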
What and how can a communication layer help in power and energy optimizations?
XPRESS Energy waste on unused channels needs to be prevented. Delays due to contention for hotspots need to be mitigated through dynamic routing. Information on message traffic, granularity, and power needs to be provided to OSR.
TG  
DEGAS If/when applications become blocked waiting for communications to complete, one should consider energy-aware mechanisms for blocking. Other than that, most mechanisms for energy reduction with respect to communication are also effective for reducing time-to-solution and are likely to be studied in that context (where the metric is easier to measure).
D-TEC  
DynAX Data transfers will occupy a large portion of the energy budget, so minimizing the need for data movement will greatly improve energy consumption. This can be done by ensuring that, whenever possible, work is forwarded to the data and not vice versa. SWARM's codelet model ensures that this is quasi-transparent to the programmer, although the runtime must obviously perform the work-forwarding and whatever data relocations are needed itself. Hints from the compiler or application programmer/tuner can assist the runtime in this and further decrease energy consumption.

As part of PEDAL (Power Efficient Data Abstraction Layer), we are also developing an additional SW layer that encapsulates data composites. This process assigns specific layouts, transformations and operators to the composites that can be used advantageously to reduce power and energy costs. A similar process will be applicable to resiliency as well.

X-TUNE N/A
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
Congestion management and flow control mechanisms are of particular concern at very large scale. How much can we rely on "vendor" mechanisms and how much do we need to address in higher level layers?
XPRESS Vendor systems can help with redundant paths and dynamic routing. Runtime system data and task placement can attempt to maximize locality for reduced message traffic contention.
TG  
DEGAS As others have observed, the first line of defense against congestion is intelligent placement of tasks and their data. This is the domain of the tasking runtime and the application author.

Ideally, the vendor systems would provide some degree of congestion management. This would use information not necessarily available to the communications runtime, e.g. static information about the network, dynamic information about application traffic, and traffic from other jobs. However, compilers and runtime components with "macro" communications behaviors, i.e. collectives or other structured communications, could potentially inform the communications layer about near-future communications, and this information can be used to build congestion-avoiding communications schedules. These scheduling approaches can be greatly enhanced if the vendors expose information about current network conditions, particularly for networks where multiple jobs share the same links.

D-TEC Higher levels should focus on task & data placement to increase locality and reduce redundant communication. Placement should also be aware of network topology and optimize towards keeping frequently communicating tasks in the same "neighborhood" when possible. Micro-optimization of routing, congestion management, and flow control are probably most effectively handled by system/vendor mechanisms since it may require detailed understanding of the internals of network software/hardware and the available dynamic information.
DynAX Vendor mechanisms are helpful, but not necessary, for ensuring the correctness and timeliness of low-level data transfers. SWARM itself uses a higher-level flow-control mechanism based on codelet completion (i.e., using callbacks as a basis for issuing ACKs). SWARM also performs load-balancing on work and data to help minimize congestion and contention.
X-TUNE N/A
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A

Correctness Tools

Questions:

What kind of runtime overhead can you accept when running your application with a dynamic analysis tool such as a data race detector? 1.1X, 1.5X, 2X, 5X, 10X, 100X.
XPRESS Runtime overhead determines task granularity and therefore available parallelism (strong scaling) determining time to solution. Only fractional overheads can be tolerated at runtime.
TG That depends upon how much of the application - which parts - can be excluded from analysis during the run, because that portion is trusted. If the problem is already isolated to a particular section, then a very significant hit to performance is acceptable to pinpoint it. If a single loop nest has already been identified and that nest normally runs for a few seconds, then even 100x would be acceptable if it did not delay turnaround by more than a few minutes.
DEGAS  
D-TEC In debug mode 10X or even 100X for selected small input data sets. We consider running the app with dynamic analysis tools as part of testing. In release mode we do not expect to run with dynamic analysis tools.
DynAX This should really be a question for the DoE app writers. I imagine that if run in debug mode, performance degradations of up to a few X are acceptable. But most likely, if a dynamic tool modifies a program's execution time significantly, it also probably changes the way the program is executed (dynamic task schedules, non-determinism) and hence may analyze irrelevant things.
X-TUNE Presumably, autotuning would only be applied when ready to run software in production mode. I suspect correctness software would only be used if the tuning process had some error, in which case some overhead would be tolerable.
GVR  
CORVETTE  
SLEEC  
PIPER N/A
What types of bugs do you want correctness tools to detect? Data races, deadlocks, non-determinism.
XPRESS Hardware errors, runtime errors, and discrepancies with respect to expectations (e.g., task time to completion, satisfying a condition) are among the classes of bugs that detection mechanisms would address. Probably these have to be built into the runtime/compile-time system and be consistent with the overall execution model.
TG Look for cycles in EDT graphs; congestion hotspots and livelocks.
DEGAS  
D-TEC It is important to distinguish when an error is detected: at compile time by static analysis; on the fly at run time, which requires less storage space than post-mortem detection since much information can be discarded as execution progresses; or after execution by analyzing traces. Non-determinism can be the result of data races; hence, detecting data races is most important, as it may help identify sources of deadlocks and non-determinism. One might also combine the aforementioned points in time of analysis and combine the results. We consider the different approaches to complement each other, since they have different trade-offs in performance impact and memory consumption.
DynAX SWARM proposes mechanisms (such as tags) to help avoid data races. However, SWARM is compatible with traditional synchronization and data access mechanisms, and hence the programmer can create data races.

Deadlocks can appear with any programming model based on task graphs, including the one supported by the SWARM runtime. They happen as soon as a cycle is created among tasks. ETI's tracer can help detect deadlocks by showing which tasks were running when the deadlock happened. Non-determinism is often a feature. It may be desired in parts of a program (for instance, when running a parallel reduction) but not in others. So a tool for detecting non-determinism per se wouldn't be sufficient; it would require an API to specify when non-determinism is unexpected.

X-TUNE We would want to identify errors introduced by the tuning process, so this could be any kind of error.
GVR  
CORVETTE  
SLEEC  
PIPER N/A
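
Two of the answers above (TG: "look for cycles in EDT graphs"; DynAX: deadlocks "happen as soon as a cycle is created among tasks") reduce deadlock detection to cycle detection in the task-dependence graph. The sketch below shows the standard three-color depth-first search for that check; the task/graph representation is invented for the example and is not any runtime's real data structure.

    /* Tasks start WHITE (zero-initialized); GREY marks tasks on the current
     * DFS path, BLACK marks tasks already fully explored. */
    enum color { WHITE, GREY, BLACK };

    struct task {
        int           nsucc;
        struct task **succ;      /* tasks that depend on this one */
        enum color    mark;
    };

    /* Returns 1 if a dependence cycle (potential deadlock) is reachable from t. */
    int has_cycle(struct task *t)
    {
        if (t->mark == GREY)  return 1;    /* back edge: cycle found          */
        if (t->mark == BLACK) return 0;    /* already explored, no cycle here */
        t->mark = GREY;
        for (int i = 0; i < t->nsucc; i++)
            if (has_cycle(t->succ[i]))
                return 1;
        t->mark = BLACK;
        return 0;
    }
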
What kind of correctness tools can help you with interactive debugging in a debugger? Do you want those tools to discover bugs for you?
XPRESS Beyond a certain level of scaling, debugging is almost indistinguishable from fault tolerance methods. Detection of errors (not found at compile time) requires equivalent mechanisms, diagnosis needs to add code as a possible source of error, but an additional kind of user interface is required, perhaps to provide a patch at runtime permitting continued execution from point of error. This is supported under DOE OS/R Hobbes project.
TG Yes that would be ideal, but this is a tall request even in current debuggers on current systems.
DEGAS  
D-TEC A tool that can identify which assertions it cannot statically verify, applied in combination with a slicing tool that can compute a backward slice from a given set of concrete values of variables provided by the debugger.
DynAX Some bugs can be discovered automatically, such as deadlocks and livelocks. For the others, tools need to reduce the time required to find the source of the bug.
X-TUNE The most interesting tool would be one that could compare two different versions of the code to see where changes to variable values are observed.
GVR  
CORVETTE  
SLEEC  
PIPER N/A
Current auto-tuning and performance/precision debugging tools treat the program as a black-box. What kind of white-box program analysis techniques can help in auto-tuning your application with floating-point precision and alternative algorithms?
XPRESS XPRESS does not address floating-point precision issues and will benefit from other programs in possible solutions. XPRESS does incorporate the APEX and RCR components for introspective performance optimization at runtime system controlling load balancing and task scheduling. It measures progress towards goals to adjust ordering, especially for critical path tasks.
TG I would rephrase the question a bit. Intel's current production tools and the open-source tools in the community are actually capable of tracing possible problems back to source code lines and making suggestions. However, they give few hints about performance problems that emanate from runtime or system issues. This would be desired.
DEGAS  
D-TEC Precision analysis that can verify assertions at different points in a program, where the assertions specify the expected precision and ranges of values of variables, subsets of an array, or all values of an array at a given point in the execution of the program.
DynAX Compilers can instrument the code to let auto-tuners focus on particular parts of the code and on particular execution characteristics (e.g., by selecting hardware counters).

Please define "alternative algorithms".

X-TUNE The key issue will be understanding when differences in output are acceptable, and when they represent an error.
GVR  
CORVETTE  
SLEEC  
PIPER N/A
What kind of testing strategy do you normally apply while developing your software component? Do you write tests before or after the development process? What kind of coverage do you try to achieve? Do you write small unit tests or large systems tests to achieve coverage? Are your tests non-deterministic?
XPRESS Tests will be written after the code and incorporated as part of the Microcheckpointing Compute-Validate-Commit cycle for fault tolerance and debugging. Tests will be hierarchical for better, though incomplete, coverage. Phased checkpointing will provide a very coarse-grained fallback in case of unrecoverable errors.
TG Ad hoc blend of unit, group, application tests.
DEGAS  
D-TEC Writing tests is part of the development process. System and unit tests are run as part of an incremental integration process. Any found bugs are distilled into test cases that become part of the integration tests.
DynAX Unit tests are written for each piece of software, and massive system tests are run every day. Tests tolerate some amount of non-determinism, which results in bounded numerical variations in the results. Coverage of the tests reflects expected use: often-used components get tested more intensively. System tests are often available before the development process, while unit tests are usually written during and after each code contribution.
X-TUNE Comparing output between a version that is believed correct and an optimized version is the standard approach.
GVR  
CORVETTE  
SLEEC  
PIPER tbd.
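
The comparison-based testing that X-TUNE and DynAX describe above (checking an optimized version against a version believed correct, while tolerating bounded numerical variation from reordered floating-point operations) amounts to a tolerance check over the outputs. The C sketch below illustrates one such check; the relative-tolerance scheme and the tiny denominator guard are choices made for the example, not a prescription from any of the projects.

    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Accept the optimized output if it matches the reference output to within
     * a relative tolerance, so benign reordering of floating-point operations
     * is not reported as a bug. */
    int outputs_match(const double *ref, const double *opt, size_t n, double rel_tol)
    {
        for (size_t i = 0; i < n; i++) {
            double denom = fmax(fabs(ref[i]), 1e-300);   /* avoid division by zero */
            if (fabs(ref[i] - opt[i]) / denom > rel_tol) {
                fprintf(stderr, "mismatch at %zu: ref=%g opt=%g\n", i, ref[i], opt[i]);
                return 0;
            }
        }
        return 1;
    }
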
After a bug is discovered, do you want automated support to create a simplified test that will reproduce the bug?
XPRESS Yes
TG Yes
DEGAS  
D-TEC Yes. We consider SMT solvers in this respect most promising.
DynAX This sounds useful. The more simplified, the better.
X-TUNE Yes
GVR  
CORVETTE  
SLEEC  
PIPER N/A
What is your strategy to debug a DSL and its runtime? What kind of multi-level debugging support do you want for DSLs?
XPRESS XPRESS provides the HPX runtime that will incorporate its own set of test and correctness mechanisms under the Microcheckpointing methodology, using the Compute-Validate-Commit cycle and built-in testing. It is expected that DSLs will support the generation of reverse test cases to support this.
TG Very similar strategy to debugging other codes. There would be nothing fundamentally different.
DEGAS  
D-TEC For DSLs that are translated and/or lowered to a general-purpose language, we expect assertions present in the DSL to also be generated in the lower-level code, allowing any failing assertion to be related back to a point in the original DSL. We expect DSLs to allow for domain-specific assertions in combination with user-provided domain-specific knowledge in the form of properties that can then be checked in the generated lower-level code.
DynAX  
X-TUNE Do we trust the DSL to translate code correctly? That seems like the fundamental question. The DSL developer should be able to debug the translation, while the DSL user should just be debugging the code that they added.
GVR  
CORVETTE  
SLEEC  
PIPER N/A
What kind of visualization and presentation support for bugs do you want from correctness tools? What kind of IDE integration do you want for the correctness tools?
XPRESS Visualization and presentation support to correlate a detected error in terms of physical location, point in code, virtual user thread instantiation, and exact instruction causing the error will be of great value, especially if combined with a control framework for manipulating the execution stream at the offending point for diagnosis and correction.
TG Visualization can be quite simple. A useful example is the display used in Intel's performance tuning tools like VTune. IDEs are personal choices, as is the choice not to use one, but I would at a minimum select Eclipse for integration.
DEGAS  
D-TEC We want a query language that allows mining any collected data, together with scalable visualization of that data showing dependencies and attached properties. This could support debugging at multiple levels.
DynAX Any tool that allows big-picture and detailed views, as in ETI's tracer for SWARM. Tools that don't integrate with a particular IDE end up being compatible with all IDEs, including vi and emacs, which is desirable.
X-TUNE Pinpointing code or data that are involved in a bug/bottleneck is helpful in debugging/performance tools.
GVR  
CORVETTE  
SLEEC  
PIPER Integration with performance analysis tools (similar domains to represent, don't want to burden users with multiple different tools)
When combining various languages/runtimes, can your language/runtime be debugged in isolation from the rest of the system?
XPRESS Yes.
TG In practice, but this is a goal. It could be called an aspect of separation of concerns.
DEGAS  
D-TEC This depends on whether the DSL is integrated into a general-purpose language or is a separate language that interfaces with the run-time of the host language. In the latter case 'yes', otherwise 'no'.
DynAX The entire execution stack is visible to the programmer at debug time, including the SWARM parts. We do not think that opacity across system parts would bring clarity in the debugging process.
X-TUNE N/A -- We use standard languages and run-time support.
GVR  
CORVETTE  
SLEEC  
PIPER N/A
List the testing and debugging challenges that you think the next generation programming languages and models would face?
XPRESS The principal challenges are detection and isolation.
TG The challenges will necessarily be in providing information about data placement and data movement. This exists to a certain extent already in performance analysis, but additional information about energy usage will be very valuable.
DEGAS  
D-TEC For the next generation of programming languages (and the resiliency issues of next-gen exascale hardware), we expect that avoiding errors by verifying properties at compile time will become increasingly important, because debugging at run time will become even more challenging as resiliency issues complicate the debugging process.
DynAX We expect the main challenges to be related to tractability of debugging and testing applications that run on millions of cores, and reproducibility of non-deterministic applications on a non-reliable hardware platform.
X-TUNE Scalability and determining what is an error seem like the biggest challenges.
GVR  
CORVETTE  
SLEEC  
PIPER N/A
How can correctness tools help with reasoning about energy? How can correctness tools help with resilience?
XPRESS With respect to energy, tools that can determine the critical path of execution would provide the basis for energy/power scheduling. As above for reliability, see 10) and other answers.
TG Please see the note above. Related to resilience: simulations will not be deterministic. Tools that allow the developer to distinguish between differing results that stem from different operation orderings and results that indicate an actual failure would be useful.
DEGAS  
D-TEC Correctness tools can identify erroneous paths and can therefore help to identify and remove paths that would otherwise have a severe impact on reasoning about power. Any run-time violation of a property that a correctness tool had established (earlier, at compile time) to hold for a given execution path can be identified as a consequence of a resilience fault and can trigger appropriate recovery actions at run time.
DynAX Correctness tools may be able to help the programmer make design choices which impact energy consumption, such as enabling non-determinism in parts of their application.

Lightweight tools that detect faults are presumably directly applicable to resiliency, for instance by coupling them with checkpointing.

X-TUNE Correctness tools can certainly help with resilience, if they have a concept of what is tolerable vs. an error. I don't see a connection to energy optimization.
GVR  
CORVETTE  
SLEEC  
PIPER Energy/power monitoring is a performance problem, in particular in power-constrained scenarios.

Compilers

Sonia requested that Dan Quinlan initiate this page. For comments, please contact Dan Quinlan. This page is still in development.

PI
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

Describe how you expect to require compiler support within your X-Stack project.
XPRESS Currently, the HPX/XPI approach is based on libraries being developed with either C or C++ compilers, of which several are currently available. A future project, PXC, will develop an advanced compiler capability to fully represent the ParalleX execution model, but this is out of scope for the X-Stack program.
TG The compiler will be used at various levels in our project. From most basic to most complex: (a) support for the different ISAs we have for our accelerator units (XEs), in particular intrinsic support for all the specialized instructions available (DMA, QMA, advanced math, ...); (b) keyword extensions to make our programming model concepts first-class citizens for the compiler (data-blocks, EDTs, etc.); we currently have limited keywords hinting at memory placement; (c) advanced code generation using keywords, which would enable the compiler to transform straight C (with keywords) into OCR code; (d) code refactoring that would allow the compiler to take very fine-grained tasks/DBs and emit code that better balances runtime overheads against the amount of parallelism expressed, to better suit the target machine (this part is still TBD).
DEGAS  
D-TEC Compiler support plays an essential role in the D-TEC project.

We use compilers to accept programs written in DSLs and to analyze, transform, and optimize those programs. A backend compiler is also needed to generate the final executable for a target platform. To support flexible definition of DSLs, a compiler infrastructure is needed to facilitate adding extensions to base languages or defining totally new languages. A DSL may be transformed from a high-level form to a low-level form. The compiler should be able to maintain the mapping between the different levels so we can relate high-level semantics to low-level performance metrics.

DynAX Compilers have two purposes in the DynAX project, which are both about increasing productivity and programmability.

1. The first role is to automatically parallelize domain-specific applications from a sequential specification. This is what R-Stream does, as it takes sequential C and produces parallel, scalable SWARM code. 2. The second role is to expose high-level parallel programming abstractions. This goal is addressed at two levels: the HTA compiler, which relies on an explicitly parallel intermediate representation (PIL), generates SCALE code; the SCALE compiler, in turn, offers object-oriented programming and simplifies programming for SWARM.

X-TUNE We are developing source-to-source compiler transformations in our project. ROSE provides the abstract syntax tree used by our compiler and modeling software (CHiLL and PBound, respectively). We rely on native backend compilers to perform architecture-specific optimizations and generate SIMD code (SSE, AVX, etc.). Our transformations generate code that we anticipate will be easily vectorized.
GVR  
CORVETTE  
SLEEC Compilers are used for two purposes in SLEEC: 1) to translate annotations on libraries into directives understood by SLEEC runtime systems (e.g., translating directives regarding inputs/outputs of kernels into SemCache API calls); 2) to perform high-level optimizations of programs written using SLEEC-enabled libraries (e.g., performing linear algebra optimizations on applications written with BLAS).
PIPER PIPER does not have/need its own compiler, but we need access to compiler-generated information that captures the high-level semantics of the language implemented (basically a form of DWARF for DSLs). Additionally, advanced instrumentation or instrumentation points/markers could be useful. Information provided by different DSLs should be interoperable, ideally standardized and compatible with the debug information provided by existing compilers.
Program analysis can be both challenging and require specialized expertise. What requirements do you have for program analysis, and what level of expertise do you expect to require? This could be posed in terms of what APIs for program-analysis results you expect.
XPRESS The XPRESS project is exploring the strengths and opportunities enabled through runtime control for dynamic adaptive resource management and task scheduling. Through this investigation, some compile-time information will be identified as important to convey to the compiler. It is expected that some compiler dataflow analysis of fine-grained parallelism will be required, but this is typical now and so does not require new capabilities.
TG Inter-procedural optimizations.
DEGAS  
D-TEC We would like to have access to a range of baseline compiler analyses through simple API functions.

Typical examples are control flow analysis, data flow analysis, and dependence analysis. In addition, we have to have extensible versions of these baseline analyses so they can be applied to new DSLs. We also need domain-specific analyses which can take advantage of the domain knowledge in each DSL. Ideally, program analysis support should be able to leverage user input, through annotations or semantics-specification files.

DynAX The PIL and SCALE compilers support parallel languages as their input (PIL supports Hierarchically Tiled Arrays for data-parallel programs, and SCALE accepts structured, object-oriented codelet programs).

The R-Stream compiler supports sequential C loops to which a set of writing rules (a "style") is applied by the programmer. The rules, which entail exposing enough static information to the compiler, are defined in the R-Stream user guide. Some pragmas are defined that allow the user to provide additional hints to the compiler. R-Stream relies on extensions of the polyhedral model to represent, analyze and transform programs.

X-TUNE At present, we have implemented the program analysis we need for our optimizations.
GVR  
CORVETTE  
SLEEC We expect to use analysis results from frameworks like Fuse to build our IR for compiler transformations. We will be working on extending Fuse to perform analysis over "locations" that are matrices, or disjoint sets of memory locations.
PIPER PIPER mainly focuses on runtime performance analysis. This could benefit greatly from access to static information (loop structures, static call trees, data structures, ...), which should be provided by compilers in a standardized manner (through APIs or debug information encoded into binaries). HPCToolkit - one of PIPER's tools - parses machine code, reconstructs control flow graphs, performs interval analysis to identify loops, and then combines the information about loops with information about inlining from DWARF to attribute performance metrics to optimized code.
What types of hardware do you expect to address/target within optimizations and at what level of granularity of the program (e.g. coarse-grain, over multiple functions, or fine-grain within statements)?
XPRESS The ParalleX based approach exposes very coarse-grain, medium-grain (e.g., threads), and fine-grain dataflow (e.g., instruction level) parallelism in support of heterogeneous functional unit and memory hierarchy hardware structures. But it is also intended to inform future hardware design for greater efficiency mechanisms in support of system-wide operation for communication, global addresses, and dynamic execution.
TG While our programming model is applicable to today's machines, we are specifically considering machines that will have a global address space with strong NUMA characteristics. For a first pass, we expect the compiler to optimize EDTs (i.e., small chunks of code, potentially spanning multiple functions). The granularity of the EDTs would initially be up to the user, but during a later stage the compiler may merge/split tasks to better match the task's granularity to the target machine.
DEGAS  
D-TEC We expect that future extreme-scale computers will have heterogeneous node architectures connected by a network.

An example heterogeneous node architecture is a multicore shared-memory machine with an NVIDIA GPU accelerator with a separate memory space. For shared memory, non-uniform memory access (NUMA) is the expected approach to achieving scalability. We want to exploit the coarse-grain parallelism of a program first, then incrementally take advantage of finer-grain parallelism and map it to the proper levels of hardware features.

DynAX We currently target clusters of x86 multicore nodes for our experiments, but we are considering other targets as well, such as Intel's straw man architecture. So far we have worked at function-level granularity, but optimizing across multiple functions is possible to some degree.
X-TUNE We are currently focused on two classes of processor: (1) cache coherent multi- and manycore CPU architectures including the Xeon Phi / MIC architecture; and, (2) NVIDIA GPU architectures which are coherent only within a thread block. The types of optimizations we perform include fusion across operators, wavefront parallelism, introducing ghost zones, and fine-grain rewriting of computations to improve SIMD and instruction-level parallelism. The fusion across operators could potentially be applied across functions.
GVR  
CORVETTE  
SLEEC In addition to targeting general systems with our compiler transformations, we specifically target heterogeneous hardware (e.g., CPU/GPU nodes) with SemCache, and perform optimizations across method (library) calls.
PIPER all of the above
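
The SLEEC answer above mentions targeting CPU/GPU nodes with SemCache and optimizing across library calls; the core idea is to remember where the current copy of an operand lives so that back-to-back accelerated library calls do not re-transfer unchanged data. The C sketch below is a greatly simplified illustration of that idea, not SemCache itself: the tracking structure and function names are invented, and only cudaMemcpy is a real API call.

    #include <cuda_runtime.h>

    enum location { ON_HOST, ON_DEVICE, ON_BOTH };

    struct tracked_buf {
        void   *host, *dev;     /* host and device copies of the same operand */
        size_t  bytes;
        enum location where;    /* which copy is current */
    };

    /* Called before a GPU library kernel that reads the buffer: transfer only
     * if the device copy is stale. */
    void ensure_on_device(struct tracked_buf *b)
    {
        if (b->where == ON_HOST) {
            cudaMemcpy(b->dev, b->host, b->bytes, cudaMemcpyHostToDevice);
            b->where = ON_BOTH;
        }
    }

    /* Called after a GPU kernel that writes the buffer: the host copy is now stale. */
    void mark_device_dirty(struct tracked_buf *b) { b->where = ON_DEVICE; }
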
What general purpose languages do you expect to use and or extend to support your research work?
XPRESS XPRESS contends that other than for purposes of support of legacy codes, there is no correct language for the future of exascale in spite of the ardent claims of many of the supporters for Fortran, C++, OpenMP, MPI, Chapel, X10, and a long list of others. While it is true that the ultimate language is LISP, only a few truly enlightened individuals are qualified to recognize this.
TG C
DEGAS  
D-TEC We are interested in a wide range of generic transformations, such as loop transformations, instrumentation, GPU code generation, and data structure transformation.

We also want to have a compiler infrastructure which provides easy code transformation APIs so we can add customized, domain-specific transformations.

DynAX Both R-Stream and SWARM are based on C as their input language
X-TUNE We currently support C, C++ and Fortran because these are supported by the ROSE frontend.
GVR  
CORVETTE  
SLEEC C/C++
PIPER PIPER components will be written in C/C++ plus scripting languages (mostly python). All components/tools will be applicable to a wide range of source languages or even binaries. The exact list of supported languages is tbd. and will depend on demand and progress on the overall exascale software stack.
Do you expect to use, require, or develop an Embedded DSL (defined by compiler support that would leverage semantics of abstractions defined completely within a general purpose base language) or an Extended DSL (defined by compiler support that would leverage semantics of abstractions defined by new syntax)?
XPRESS DSLs hold the promise of relieving the burden of coding for key applications or functionality classes, and XPRESS expects to support these. However, for such DSLs to exhibit exascale capability, their back-ends will have to target the HPX runtime system, either directly or through offered interfaces like XPI or PXC.
TG Apart from limited keywords to better support the OCR programming model, we would support other languages (including DSLs) through the use of higher level source-to-source translators.
DEGAS  
D-TEC We want to support all flavours of DSLs.
DynAX The R-Stream approach to domain-specificity is to define a programming style and enable domain-specific annotations. The advantage of this approach is that the user still programs in C, and doesn't need to learn a new syntax.

We have developed SCALE as a general-purpose programming language within the domain of HPC. We have not conceived of any of the current SCALE features as being specific to a narrower domain than that.

X-TUNE We are developing domain-specific optimizations for geometric multigrid and stencil computations. These optimizations could be incorporated into a DSL for such applications, but we are applying the optimizations directly to C and Fortran code. We are also developing a tool for expressing and optimizing tensor computations that could be considered a DSL.

GVR  
CORVETTE  
SLEEC N/A
PIPER Possibly as interface to query and analyze performance data, but unclear as of now
What generic and customized code transformations do you require to support your project?
XPRESS  
TG Nothing specific is required, but we will add keywords to C to support OCR's programming model.
DEGAS  
D-TEC We are interested in a wide range of generic transformations, such as loop transformations, instrumentation, GPU code generation, and data structure transformation.

We also want to have a compiler infrastructure which provides easy code transformation APIs so we can add customized, domain-specific transformations.

DynAX R-Stream supports a wide range of loop and data layout transformations, some of which are specific to stencil operations. These transformations mainly create data locality and parallelism at various levels of the target architecture.
X-TUNE We are developing the transformations ourselves.
GVR  
CORVETTE  
SLEEC We leverage general code motion transformations to expose opportunities for high-level optimization.
PIPER Instrumentation
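
Loop transformations for locality, mentioned by D-TEC, DynAX (R-Stream), and X-TUNE above, are easiest to see on a small example. The tiled stencil below is hand-written to illustrate the kind of restructuring such source-to-source tools perform; it is not actual output of any of them, and the sizes are arbitrary.

    #define N    1024
    #define TILE   64

    /* Tiling keeps each tile's working set cache-resident; an autotuner such
     * as the X-TUNE/CHiLL tool chain would search over TILE sizes and related
     * parameters. */
    void smooth_tiled(double out[N][N], const double in[N][N])
    {
        for (int ii = 1; ii < N - 1; ii += TILE)
            for (int jj = 1; jj < N - 1; jj += TILE)
                for (int i = ii; i < ii + TILE && i < N - 1; i++)
                    for (int j = jj; j < jj + TILE && j < N - 1; j++)
                        out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                            in[i][j-1] + in[i][j+1]);
    }
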
Which level of Intermediate Representation do you prefer to work with: source level, normalized middle level, or low level (close to binary code)?
XPRESS Currently the selected intermediate representation is the source-level XPI interface, although lower-level HPX library calls through C or C++ are also enabled.
TG LLVM has an IR that it uses all the way through. It is very flexible and we expect to work at that level.
DEGAS  
D-TEC Within D-TEC, we want to support all levels of intermediate representations to fully support the analysis, optimization and code generation of DSLs. For example, a high level representation is best to preserve high level code structures.

A normalized middle-level representation is most important for many static analyses. A low-level IR is very suitable for machine-specific optimizations.

DynAX We do not have a preference, but in our experience, users want to program at the highest possible level, while still being able to access low-level code and understand the parallelization process. Our source-to-source tools enable this.
X-TUNE As we are applying source-to-source transformations, we prefer an intermediate representation that is close to the source level.
GVR  
CORVETTE  
SLEEC Source level (ish). Our IR captures more information (dependences, semantic type information, etc) than is available at source level, but we want to be able to translate back to source.
PIPER Mostly does not apply to PIPER, but some elements of autotuning could use code transformation, e.g., to create multiple variants of the same code
Which parallel programming models (MPI, OpenMP, UPC, etc.) do you want to have better compiler support for?
XPRESS XPRESS supports MPI and OpenMP.
TG The OCR programming model (fine grained event driven tasks).
DEGAS  
D-TEC We are interested in MPI and OpenMP. Ideally, the compiler should be aware of MPI function calls and have an OpenMP implementation.
DynAX None, although I'm not sure I fully understand the question (as of 5/5/14).
X-TUNE N/A
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
What OS configuration and hardware platforms do you want to run the compiler on?
XPRESS Initially it is anticipated that a cross-compiler will be used, running on a standard Linux platform and targeting the HPX environment.
TG Custom.
DEGAS  
D-TEC Linux is our main focus OS. Target platforms may use Intel/AMD x86 multicore machines with NVIDIA GPUs.
DynAX PIL, R-Stream, SCALE and SWARM support Linux (a complete list of tested distributions is available).

Successful uses of R-Stream on Mac OS (Darwin) have also been reported, although it is not officially supported. R-Stream can also cross-compile to any platform as long as the native low-level compiler supports this. For SCALE and SWARM, cross-compilation is done for the Xeon Phi.

X-TUNE We are currently supporting Linux-based systems, and also Nvidia GPUs.
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
How do you expect compilers to incorporate domain-specific information, through DSL, separated semantics-specification files, or other methods?
XPRESS  
TG We expect domain specific knowledge to be expressed through keywords and also specific API calls (that the compiler could identify if needed).
DEGAS  
D-TEC We want to support DSL, annotations, and separated semantics-specification files.
DynAX As explained above, in R-Stream, a domain-specific formulation of the program is performed by combining style rules and annotations (#pragma). Semantics and syntax remain as well-defined as the underlying C language.

PIL is a framework to facilitate any-to-any compilation of parallel languages. There is no domain-specific information, but it will work for any language.

X-TUNE
GVR  
CORVETTE  
SLEEC We expect domain specific knowledge to be encoded as annotations on library methods (potentially in a separate specification file).
PIPER N/A
How do you expect compilers to interact with your libraries or runtime systems, if any?
XPRESS  
TG The compiler should ideally refactor the code to better match the hardware and restrict the number of choices the runtime has to make, in a way that is non-constraining (i.e., in cases where any other choice would be detrimental in the vast majority of cases). The compiler should also potentially present multiple alternatives to the runtime to allow directed choices. In short, the compiler should not make any decisions that it is not sure about, but should try to reduce the decision space for the runtime (to reduce overhead).
DEGAS  
D-TEC The compiler will generate tasks (kernels) to be executed at runtime. It will also transform and partition data in programs so a NUMA-aware runtime library can be used.
DynAX R-Stream generates code that includes calls to the target machine's runtimes/libraries for the purpose of parallelization and locality optimization. R-Stream also supports library calls within the input code.
X-TUNE Our compiler generates calls to run-time libraries to create threads. Currently, we support OpenMP and CUDA.
GVR  
CORVETTE  
SLEEC N/A
PIPER Yes: by providing additional debug/code-to-abstraction mapping information.
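
Several answers above describe compilers that hand parallel work to a runtime (X-TUNE generates calls to run-time libraries to create threads, currently OpenMP and CUDA; D-TEC generates tasks for a NUMA-aware runtime). The before/after pair below is a hand-written illustration of that flow for the simplest case, a loop handed to the OpenMP runtime; it is not actual output of any of these tools.

    /* Input accepted by a source-to-source tool: a sequential loop. */
    void axpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Generated form: the pragma expands into calls to the OpenMP runtime,
     * which creates and schedules the threads that execute the iterations. */
    void axpy_omp(int n, double a, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }
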

DSLs

Sonia requested that Saman Amarasinghe and Dan Quinlan initiate this page. For comments, please contact them. This page is still in development.

X-Stack Project | Name of the DSL | URL | Target domain | Miniapps supported | Front-end technology used | Internal representation used | Key optimizations performed | Code generation technology used | Processors/computing models targeted | Current status | Summary of the best results | Interface for perf. & dbg. tools
D-TEC | Halide | http://halide-lang.org | Image processing algorithms | Cloverleaf, miniGMG, boxlib | Uses C++ | Custom IR | Stencil optimizations (fusion, blocking, parallelization, vectorization); schedules can produce all levels of locality, parallelism, and redundant computation; OpenTuner for automatic schedule generation | LLVM | X86 multicores, ARM, and GPU | Working system. Used by Google and Adobe. | Local Laplacian filter: a top Adobe engineer took 3 months and 1500 lines of code to get 10x over the original; Halide took 1 day and 60 lines to get 20x, plus 90x-faster GPU code the same day (Adobe did not even try GPUs). Also, all the pictures taken by Google Glass are processed using a Halide pipeline. | Interfaces with OpenTuner (http://opentuner.org) to automatically generate schedules. Working on a visualization/debugging tool.
D-TEC | Shared Memory DSL | http://rosecompiler.org | MPI HPC applications on many-core nodes | Internal LLNL app | Uses C (maybe C++ and Fortran in future) | ROSE IR | Shared-memory optimization for MPI processes on many-core architectures permits sharing large data structures between processes to reduce memory requirements per core | ROSE + any vendor compiler | Many-core architectures with local shared memory | Implementation released (4/28/2014) | Being evaluated for use |
D-TEC | X-GEN for heterogeneous computing | http://rosecompiler.org/ | HPC applications running on NVIDIA GPUs | boxlib, internal kernels | Uses C and C++ | ROSE IR (AST) | Loop collapse to expose more parallelism, hardware-aware thread/block configuration, data reuse to reduce data transfer, round-robin loop scheduling to reduce memory footprint | ROSE source-to-source + NVIDIA CUDA compiler | NVIDIA GPUs | Implementation released with ROSE (4/29/2014) | Matches or outperforms comparable compilers targeting GPUs | Generates event traces for gpuplot to identify serial bottlenecks
D-TEC | NUMA DSL | http://rosecompiler.org | HPC applications on many-core CPUs with NUMA support | Internal LLNL app | Uses C++ | ROSE IR | NUMA-aware data distribution to enhance data locality and avoid long memory latency; multiple halo-exchange schemes for stencil codes using structured grids | ROSE + libnuma support | Many-core architectures with a NUMA hierarchy | Implementation in progress | 1.7x performance improvement compared to an OpenMP implementation for a 2D 2nd-order stencil computation | PAPI is used for performance profiling; libnuma and an internal debugging scheme are used to verify memory distribution among NUMA nodes
D-TEC | OpenACC | https://github.com/tristanvdb/OpenACC-to-OpenCL-Compiler | Accelerated computing | Not yet | C (possibly C++ and Fortran); pragma parser for ROSE | ROSE IR | Uses tiling to map parallel loops to OpenCL | ROSE (with OpenCL kernel generation backend), OpenCL C compiler (LLVM) | Any accelerator with OpenCL support (CPUs, GPUs, Xeon Phi, ...) | Basic kernel generation; directive parsing; runtime tested on NVIDIA GPUs, Intel CPUs, and Intel Xeon Phi | Reaches ~50 Gflops on a Tesla M2070 on matrix multiply (M2070: ~1 Tflops peak, ~200 to ~400 Gflops effective on linear algebra; all floating point) | A profiling interface collects OpenCL profiling information in a database
D-TEC | Rely | http://groups.csail.mit.edu/pac/rely/ | Reliability-aware computing and approximate computing | Internal kernels | Subset of C with additional reliability annotations | Custom IR | A language and a static analysis framework for verifying reliability of programs given function-level reliability specifications; Chisel, a code transformation tool built on top of Rely, automatically selects operations that can execute unreliably with minimum resource consumption while satisfying the reliability specification | Generates C source code; binary code generator implementation is in progress |  | Implementation in progress | Analysis of computational kernels from multimedia and scientific applications |
D-TEC | Simit |  | Computations on domains expressible as a graph | Internal physics simulations, Lulesh, MiniFE, phdMesh, MiniGhost | Uses C++ | Custom IR | Fusion, blocking, vectorization, parallelization, distribution, graph index sets | LLVM | X86 multicores, GPU, and later distributed systems | Design and implementation in progress |  | Has a visual backend
TG X-Stack | CnC | https://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc | Medical Imaging, Media software | Lulesh, Rician Denoising, Registration, Segmentation | Graph builder | Graphs, Tags, Item Collections, Step Collections | Dependence graph generation for data flow computation | CnC compiler | Data flow computation | v0.9 | Out-of-the-box speedup for most apps; automatically discovers parallelism | TBB
TG X-Stack | HTA | http://polaris.cs.uiuc.edu/hta/ | Scientific applications targeting Matlab | Multigrid, AMR, LU, NAS parallel, SPIKE | HTAlib | Hierarchical tiled arrays | Map-reduce operator framework, overlapped tiling, data layering | C++ compilers | Multicore, clusters | 0.1 | Matched hand-coded MPI | HPC Toolkit
TG X-Stack | HC | https://wiki.rice.edu/confluence/display/HABANERO/Habanero-C | Medical Imaging, Oil and Gas Research | Lulesh, Rician Denoising, Graph 500, UTS, SmithWaterman | EDG | Sage | Continuable task generation to support finish | Rose | Distributed data flow computation, structured parallel computation | v0.5 | Performs better than OpenMP for most apps | HPC Toolkit
DEGAS | Asp | http://sejits.org | Infrastructure for building embedded Python DSLs |  | Python syntax | Custom IR based on Python's AST | Loop transformations, vectorization, template-based code generation, caching | LLVM (in progress), C, C++, Scala, CUDA, OpenCL | x86, NVIDIA GPUs, cloud, MPI | Numerous DSLs including structured grids, recursive communication-avoiding matmult, machine learning algorithms, communication-avoiding solvers | Structured grid DSL achieves 90%+ of peak for many kernels on multiple platforms | In progress; tech report on multi-tiered strategy

https://docs.google.com/spreadsheets/d/1gvuruGudgDn1Bheoe81jXcRNOOsZLOvK...

https://docs.google.com/spreadsheets/d/1WXRkzwZMfvpNJFz7Jaue5W3PzA30AtVU...

Operating Systems

Sonia requested that Pete Beckman initiate this page. For comments, please contact Pete Beckman. This page is still in development.

PI
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

What are the key system calls / features that you need OS/R to support? Examples: exit, read, write, open, close, link, unlink, chdir, time, chmod, clone, uname, execv, etc.
XPRESS LXK is developed within XPRESS as an independent lightweight kernel to fully support the HPX runtime system and XPI programming interfaces through the RIOS (runtime interface to OS) protocol.
TG We currently require some method to get memory from the system (equivalent of sbrk) and some method of input/output. The requirements will be extended as we see the need but we want to limit the dependence on system calls to make our runtime as general and applicable to as wide a range of targets as possible.
DEGAS  
D-TEC  
DynAX The SWARM runtime requires access to hardware threads, memory, and network interconnect(s), whether by system call or direct access. On commodity clusters, SWARM additionally needs access to I/O facilities, such as the POSIX select, read, and write calls.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
PIPER sockets, ptrace access, dynamic linking, timer interrupts/signals, access to hardware counters, file I/O
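As a hedged illustration of the "equivalent of sbrk" that the TG answer asks for (not code from any X-Stack project), a runtime on a POSIX-like OS/R could obtain its memory in a single large anonymous mapping and manage it with its own internal allocator thereafter, keeping the system-call surface minimal:

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    // Hypothetical helper: grab one large, page-aligned arena from the OS, then
    // let the runtime's own allocator carve it up with no further syscalls.
    void* acquire_arena(std::size_t bytes) {
        void* p = mmap(nullptr, bytes,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? nullptr : p;
    }

    int main() {
        const std::size_t bytes = std::size_t(1) << 30;   // 1 GiB backing store
        void* arena = acquire_arena(bytes);
        if (!arena) { std::perror("mmap"); return 1; }
        // ... runtime-internal allocator manages 'arena' from here on ...
        munmap(arena, bytes);
        return 0;
    }

Limiting the dependence to a pair of calls like this is one way to stay portable across the wide range of targets mentioned above.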
Does your project implement its own lightweight thread package, or does it rely on the threads provided by the OS/R? If you implement your own threads, what are the key features that required developing a new implementation? If you don't implement your own thread package, what are the key performance characteristics and APIs needed to support your project?
XPRESS HPX provides its own lightweight thread package that relies on heavyweight thread execution by the LXK OS. Features required include threads as first class objects, efficient context switch, application dynamic scheduling policies, message-driven remote thread creation.
TG Our runtime can rely on existing thread frameworks (we use pthreads, for example), but we do not strictly need them, as we only use the existing threading framework to emulate a computing resource.
DEGAS  
D-TEC  
DynAX SWARM uses codelets to intermediate between hardware cores and function/method calls. The general requirement is for direct allocation of hardware resources. On Linux-like platforms, threads are created and bound to cores at runtime startup; codelets are bound to particular threads only when they're dispatched, unless some more specific binding is arranged for before readying the codelet.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER relying on native threads, typically pthreads, which is appropriate
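To make the thread/codelet distinction in the DynAX answer concrete, the following is a minimal sketch, assuming a Linux-like platform with pthreads: worker threads are created and pinned to cores at startup, and "codelets" (plain function objects here) are bound to a particular worker only when dispatched. This is illustrative only and is not the SWARM implementation.

    #include <pthread.h>
    #include <sched.h>
    #include <unistd.h>
    #include <functional>
    #include <queue>
    #include <mutex>
    #include <vector>

    using Codelet = std::function<void()>;

    struct Worker {
        std::queue<Codelet> q;   // codelets bound to this worker at dispatch time
        std::mutex m;
        pthread_t tid;
    };

    static void* worker_loop(void* arg) {
        Worker* w = static_cast<Worker*>(arg);
        for (;;) {                              // a real runtime parks idle workers
            Codelet c;
            {
                std::lock_guard<std::mutex> g(w->m);
                if (w->q.empty()) continue;
                c = std::move(w->q.front());
                w->q.pop();
            }
            c();                                // codelet runs to completion, non-blocking
        }
        return nullptr;
    }

    int main() {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        std::vector<Worker> workers(ncores);
        for (long i = 0; i < ncores; ++i) {     // threads created and bound at startup
            pthread_create(&workers[i].tid, nullptr, worker_loop, &workers[i]);
            cpu_set_t set; CPU_ZERO(&set); CPU_SET(i, &set);
            pthread_setaffinity_np(workers[i].tid, sizeof(set), &set);  // pin to core i
        }
        {   // dispatch: only now is a codelet associated with a specific worker/core
            std::lock_guard<std::mutex> g(workers[0].m);
            workers[0].q.push([]{ /* user work */ });
        }
        pause();                                // toy example never shuts down cleanly
    }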
What energy/power and reliability APIs do you expect from the OS/R?
XPRESS While XPRESS is not explicitly addressing energy and reliability issues, under related support it does incorporate the micro-checkpointing Compute-Validate-Commit cycle for reliability and the side-path energy suppression strategy for power reduction. Access to fault detection from hardware and the OS is required, as are measurements of power and thermal conditions on a per-socket basis. Control of processor core operation and clocks is required.
TG We hope that this will all be managed inside the runtime layer and that our EDT and DB programming model will enable the user to be unaware of it. We may, however, allow for some hints from higher level tools (for example having a compiler expose multiple versions of an EDT with differing energy envelopes). The runtime will rely on the underlying system-software/OS to provide introspection capabilities as well as "knobs" to turn (mostly voltage and frequency regulation as well as fine-grained on/off).
DEGAS  
D-TEC Fault detection notifications from hardware and the OS; fine-grain power measurements (per socket); thermal information (per socket); access to DVFS settings.
DynAX The ability to dynamically modify power state and clock frequency for subsets of the system, and a mechanism to detect failures, are both very important for meeting power and reliability goals.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER Power: power measurements at varying granularity using processor-internal counters and external sensors, and, for autotuning, access to DVFS settings or power capping. Resilience: access to fault notifications (for corrected and uncorrected errors).
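A hedged example of the kind of power "knobs" and sensors mentioned in the TG and PIPER answers, using the Linux cpufreq and intel-rapl sysfs files (paths, availability, and required privileges vary by kernel and platform; no X-Stack runtime is implied):

    #include <fstream>
    #include <string>
    #include <iostream>

    // Read or write a single integer value from a sysfs file.
    long read_long(const std::string& path) {
        std::ifstream f(path);
        long v = -1; f >> v; return v;
    }
    void write_long(const std::string& path, long v) {
        std::ofstream f(path);
        f << v;                                   // usually requires root privileges
    }

    int main() {
        // Energy sensor: cumulative package energy in microjoules (RAPL).
        long e0 = read_long("/sys/class/powercap/intel-rapl:0/energy_uj");

        // DVFS knob: cap core 0 at 1.8 GHz (cpufreq values are in kHz).
        write_long("/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq", 1800000);

        long e1 = read_long("/sys/class/powercap/intel-rapl:0/energy_uj");
        std::cout << "energy since first read: " << (e1 - e0) << " uJ\n";
    }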
Please describe how parallel programs are "composed" within your project, and what APIs and support is required from the OS/R?
XPRESS Parallel programs are composed through ParalleX Processes interfaces comprising message-driven method instantiation for value/synchronization passing and data object access. OS support for memory address and global name-space is required.
TG See the OCR Spec.
DEGAS  
D-TEC  
DynAX Parallel programs comprise a directed acyclic graph of non-blocking tasks, called "codelets," which form the nodes of the graph, and which produce and consume data, which form the edges of the graph. The SWARM runtime keeps track of when each task's input dependencies are met and when it can run, and where it can run in order to maximize locality to minimize the amount of required data movement. Fundamentally, this only requires the OS or hardware to permit SWARM to allocate hardware cores and memory, and to send data around the system. To reduce runtime overhead, hardware features such as DMA would be beneficial. If such hardware features are protected from direct access by the SWARM runtime, then the OS should expose an API to access them.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
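A minimal sketch of the dependence-counting idea behind the codelet/EDT style of composition described in the DynAX answer above: a task becomes runnable when its count of unmet input dependences reaches zero. The names are illustrative and are not the SWARM or OCR APIs.

    #include <atomic>
    #include <functional>
    #include <vector>
    #include <deque>
    #include <iostream>

    struct Task {
        std::function<void()> body;
        std::atomic<int> unmet{0};           // unsatisfied input dependences
        std::vector<Task*> successors;        // edges of the DAG
    };

    std::deque<Task*> ready;                  // a real runtime uses per-locale queues

    void satisfy(Task* t) {                   // called when one input of t is produced
        if (t->unmet.fetch_sub(1) == 1) ready.push_back(t);
    }

    void run_all() {
        while (!ready.empty()) {
            Task* t = ready.front(); ready.pop_front();
            t->body();
            for (Task* s : t->successors) satisfy(s);
        }
    }

    int main() {
        Task a, b, c;                         // DAG: a -> c, b -> c
        a.body = []{ std::cout << "a\n"; };
        b.body = []{ std::cout << "b\n"; };
        c.body = []{ std::cout << "c\n"; };
        c.unmet = 2;
        a.successors = {&c};
        b.successors = {&c};
        ready.push_back(&a); ready.push_back(&b);
        run_all();                            // prints a, b, then c
    }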
What is your model for extreme-scale I/O, and how do you expect the OS/R to support your plans?
XPRESS OpenX supports a multiplicity of I/O support interfaces from conventional file systems to a unified namespace and asynchronous control (under separate funding).
TG Proxy call to external hosts.
DEGAS  
D-TEC  
DynAX At Exascale, it is expected that network latencies will be very high, and that hardware failure is common. Therefore, at the OS/runtime level, what's needed is an interface for doing asynchronous I/O which is resilient in the face of failures. Further details will be dictated largely by the needs of applications.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER For PIPER this refers to performance information: hierarchical aggregation through MRNet to reduce data and perform online analysis. OS support required for bootstrapping / optionally, could be integrated with OS backplanes
Does your project include support for multiple, hierarchical memory regions? If so, how will they be allocated and managed, by your X-Stack project or by the OS/R? What APIs do you expect from the OS/R to support complex and deep memory?
XPRESS Physical memory is managed by LXK and allocated to HPX ParalleX processes, which aggregate physical blocks to comprise a hierarchy of logical memory resources. Memory is globally accessible through capabilities based addressing and exported process methods.
TG Yes, OCR supports multiple, hierarchical memory regions. We need support from the underlying system-software/OS to inform the runtime of the regions that it can use and of their characteristics.
DEGAS  
D-TEC Yes. We consider SMT solvers in this respect most promising.
DynAX The SWARM runtime does support distributed memory and deep NUMA machines, through the use of a hierarchy of locale structures which roughly parallel the hardware topology. SWARM requires the ability to allocate the relevant memory regions from the OS.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
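As a hedged illustration of the OS support requested above for deep or NUMA memory, stock libnuma already exposes a minimal allocate-on-node primitive on Linux; this is not a project-specific locale API.

    #include <numa.h>        // link with -lnuma
    #include <cstddef>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

        int nodes = numa_max_node() + 1;
        std::printf("NUMA nodes visible to this process: %d\n", nodes);

        const std::size_t bytes = 64 << 20;                // 64 MiB
        double* a = static_cast<double*>(numa_alloc_onnode(bytes, 0));  // place on node 0
        if (!a) return 1;
        for (std::size_t i = 0; i < bytes / sizeof(double); ++i) a[i] = 0.0;  // first touch
        numa_free(a, bytes);
        return 0;
    }

A locale-aware runtime would wrap calls like these behind its own hierarchy of allocators rather than exposing them to applications directly.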

Performance Tools

Questions:

What abstractions does your runtime stack use for parallelism?
XPRESS The XPRESS project identifies three levels of parallelism associated with the ParalleX execution model:

a. The coarse-grain parallelism is the ParalleX Process, which provides context for child processes and other forms of computation and may span multiple localities (i.e., nodes);
b. the medium-grain ParalleX compute complex (e.g., thread instantiation), which runs on a single locality and is a partially ordered set of operations; and
c. the fine-grain operations that are coordinated by a static dataflow graph (DAG).

TG EDTs and data blocks.
DEGAS  
D-TEC  
DynAX The basic unit of computation is the SWARM codelet. Codelets are grouped into SCALE procedures that allow the sharing of data across the codelets. On top of codelets and procedures, two higher-level notations have been implemented: the Hierarchically Tiled Array object, which includes data-parallel operations and is implemented as a library, and sequential C code auto-parallelized and translated by R-Stream into codelets.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What performance information would application developers need to know to tune codes that use your X-Stack project's software?
XPRESS The critical questions concern the granularity of tasks, based on the overhead costs of managing threads and the relative locality of execution and data objects, although both can be addressed in part by compiler and runtime functions.
TG With the goal being separation of concerns, no platform-specific information needs to be known by the application developer. Developers need to provide hints that describe the software, using appropriate runtime APIs, which the runtime uses to aid in appropriate resource management.
DEGAS  
D-TEC  
DynAX Cost of memory accesses across the different levels of the hierarchy, overhead associated with codelet initiation and coordination, scheduling strategies.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What would a systems software developer need to know to tune the performance of your software stack?
XPRESS The critical questions concern the granularity of tasks, based on the overhead costs of managing threads and the relative locality of execution and data objects, although both can be addressed in part by compiler and runtime functions.
TG The runtime exposes resource management modules (introspection, allocator, scheduler) using well-defined internal interfaces that can be replaced or tweaked by the systems software developer to target the underlying platform.
DEGAS  
D-TEC  
DynAX Cost of communication across different levels of the hierarchy, cost of context switching, size of different memory levels, classes of processors, parameters of the scheduling strategy.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What information should a performance tool gather from each level in your software stack?
XPRESS Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
TG At the application layer: application profiling; at the runtime layer: resource management decisions and runtime overheads; at the simulation layer: detailed resource usage, including monitoring exposed by hardware.
DEGAS  
D-TEC  
DynAX Within the codelet: (a) sequential performance of each codelet (e.g., in Gigaflops), and (b) cost associated with memory accesses by instructions in a codelet (e.g., how many accesses are cache hits, how many go to scratch-pad memory, how many go to remote locations). Across codelets: (a) overhead caused by both local and remote codelet initiation, cost of argument boxing, and data communication costs (moving data); (b) processor utilization and the impact of the scheduling choices of the runtime stack; and (c) other overhead of the runtime system.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What performance information can/does each level of your software stack maintain for inspection by performance tools?
XPRESS Performance information is gathered by the APEX runtime introspection data gathering and analysis tool and the RCR low-level system operation data gathering tool. With HPX runtime system policies this information is used to dynamically and adaptively guide resource allocation and task scheduling.
TG Please see above.
DEGAS  
D-TEC  
DynAX Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What information would your software stack need to maintain in order to measure per-thread or per-task performance? Can this information be accessed safely from a signal handler? Could a performance tool register its own tasks to monitor the performance of the runtime?
XPRESS The APEX component of the HPX runtime system maintains the necessary information to measure per-thread performance including start, stop, suspend/pending event times, and ops counts. A performance tool can register its own tasks to monitor the performance of the runtime. Threads are first class objects and can be directly accessed by other threads.
TG Currently, the runtime maintains this information at varying degrees of granularity, depending on the developer's choice (from instruction and byte counts all the way to task statistics), and this information is available for offline analysis. Future work will allow a portion of this analysis to be performed online, so that custom performance-tool tasks are accommodated.
DEGAS  
D-TEC  
DynAX Each codelet could maintain: initiation time, termination time, cache miss rate, number of codelets initiated, total size of parameters, number of reinitiations (due to failures).
X-TUNE  
GVR  
CORVETTE  
SLEEC  
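A small sketch of the per-task bookkeeping discussed above, arranged so a sampling signal handler can read it safely: the counters are lock-free atomics updated at task start/finish and only loaded (never locked or printed) inside the handler. Illustrative only; no project's instrumentation API is implied.

    #include <atomic>
    #include <csignal>
    #include <cstdio>

    struct TaskStats {
        std::atomic<unsigned long> started{0};
        std::atomic<unsigned long> finished{0};
        std::atomic<unsigned long> busy_ns{0};
    };

    static TaskStats stats;                        // per-thread in a real runtime

    void on_task_start()  { stats.started.fetch_add(1, std::memory_order_relaxed); }
    void on_task_finish(unsigned long ns) {
        stats.finished.fetch_add(1, std::memory_order_relaxed);
        stats.busy_ns.fetch_add(ns, std::memory_order_relaxed);
    }

    extern "C" void sampler(int) {                 // SIGPROF handler: reads only atomics
        unsigned long s = stats.started.load(std::memory_order_relaxed);
        unsigned long f = stats.finished.load(std::memory_order_relaxed);
        // A real tool would append (s, f) to a preallocated ring buffer here;
        // printf is not async-signal-safe and is deliberately omitted.
        (void)s; (void)f;
    }

    int main() {
        std::signal(SIGPROF, sampler);
        on_task_start();
        on_task_finish(1200);
        std::raise(SIGPROF);
        std::printf("tasks finished: %lu\n", stats.finished.load());
    }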
What types of performance problems do you want tools to measure and diagnose? CPU resource consumption? CPU utilization? Network bandwidth? Network latency? Contention for shared resources? Waste? Inefficiency? Insufficient parallelism? Load-imbalance? Task dependences? Idleness? Data movement costs? Power or energy consumption? Failures and failure handling costs? The overhead of resilience mechanisms? I/O bandwidth consumed? I/O latency?
XPRESS All of the above and more.
TG All of the above, with the exception of I/O. In addition to these: runtime overheads at module-level granularity, memory use at different levels of the hierarchy, temperature and reaction time, and DVFS and its effects.
DEGAS  
D-TEC  
DynAX Yes. All of the above.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What kinds of performance problems do you foresee analyzing using post-mortem analysis?
XPRESS Post-mortem information would be useful to analyze non-causal behavioral data that cannot be predicted prior to execution. It must also differentiate this information from that which is entirely data dependent and therefore likely to change from that which is an intrinsic property of the program. A determination of the critical path and side path tasks combined with energy and time consumption requirements for each task would be very useful.
TG The primary problems diagnosed this way will be resource management decisions, and whether hints supplied by the program/compiler are internalized in decision making correctly. Additionally, runtime overheads will also be tracked closely.
DEGAS  
D-TEC  
DynAX Finding a balance between performance and power consumption, and assessing system resiliency and the impact of faults on performance.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What kinds of performance problems do you foresee analyzing using runtime analysis? What interfaces will be needed to gather the necessary information?
XPRESS The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid jamming the system, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
TG DVFS decisions by the runtime, and its impacts will be analyzed using runtime analysis.
DEGAS  
D-TEC  
DynAX To identify hardware/system/runtime bottlenecks, and to identify poor prioritization of the application's critical path. It's unclear what interfaces would provide sufficient introspection into the necessary systems without paying an unacceptable performance penalty; further discussion on this topic is welcome.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What control interfaces will be necessary to enable runtime adaptation based on runtime performance measurements?
XPRESS The challenge is to prioritize the critical and sub-critical tasks for execution, filling in with side-path threads as resources become available. Governing parallelism through throttling is important to avoid jamming the system, so usage monitoring is crucial. The XPRESS APEX runtime subsystem performs these and other services with additional support from the RCR RIOS subsystem.
TG This is ongoing work, with scalability being the emphasis (since the metrics will provide huge volumes of data that will be hard to manage). Currently, statistical properties of metrics are being considered for use as proxies for various underlying causes. The interfaces are predominantly those exposed by hardware to the runtime (via counters), and minimal interfaces provided to resource management modules by the runtime.
DEGAS  
D-TEC  
DynAX Execution time, energy consumption, cache miss ratio, fraction of accesses to each class of remote memory, latency of memory accesses, network collisions.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
There is a gap between the application-level and implementation-level views of programming languages and DSLs. What information should your software layers (compiler and runtime system) provide to attribute implementation-level performance measurement data to an application-level view?
XPRESS For purposes of performance portability, the principal information required from the programmer level is parallelism and some relative locality information. It is possible that some higher-level idiomatic patterns of control and access may be useful, but these have yet to be determined.
TG The runtime provides implementation-level performance at the runtime API level. Source transformations from the high-level application to the runtime API would also need to provide mechanisms to reverse-map the runtime-provided information at the implementation level back to the high-level application. Currently, the implementation-level details can still be mapped back to the application design with a basic level of familiarity with the transformation tools.
DEGAS  
D-TEC  
DynAX In the case of HTA, the cost of each HTA operation decomposed into computation (sequential and parallel), communication cost, locality (cache misses or equivalent if scratch pad memories are used), network congestion, energy consumption.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
What kind of visualization and presentation support do you want from performance tools? Do you envision any IDE integration for performance tools?
XPRESS Visualization of resource usage and pending (bottlenecked) work will help to inform about intrinsic code parallelism and precedent constraints.
TG Some high level transformation tools already have a graphical representation of the program abstractions. Additionally, we also have a graphical representation of data movement and energy consumption at the simulator level. These will be enhanced to accommodate other performance metrics currently being tracked. IDE integration has not been a focus so far, but will be considered once the toolchain attains maturity.
DEGAS  
D-TEC  
DynAX A system like Intel's Parallel Studio and thread profiler.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
List the performance challenges that you think next generation programming languages and models will face.
XPRESS Overhead and its impact on granularity, diversity of forms and scales of parallelism, parallelism discovery from meta data, energy suppression.
TG Based on our choice of the EDT (Event Driven Tasks) model, a primary challenge will be to ensure that the resource management overheads do not nullify the gains obtained from the extra parallelism the model enables. We plan to address this by settling on the right granularity of task length and data block sizes so that the overheads are kept low and the right balance between parallelism and management overheads is struck.
DEGAS  
D-TEC  
DynAX Tuning for a complex target environment where power, small memory sizes, and redundancy for reliability are issues. Exposing the important factors for parallelism, locality, communication, power, and reliability in a machine independent manner.
X-TUNE  
GVR  
CORVETTE  
SLEEC  

Note, PIPER is not listed as a column above, since it is intended as a recipient of this information.

Resilience

Sonia requested that Andrew Chien initiate this page. For comments, please contact Andrew Chien.

PI
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

Describe your approach to resilience and its dependence on other programming, runtime, or resilience technologies. (i.e. uses lower-level mechanisms from hardware or lower level software, depends on higher level management, creates new mechanisms)
XPRESS XPRESS will employ micro-checkpointing, which uses a Compute-Validate-Commit cycle bounded by Error-Zones for localized error detection and isolation, as well as diagnosis, correction, and recovery.
TG Not in TG scope.
DEGAS Our approach to resilience comprises three principal technologies. First, Containment Domains (CDs) are an application-facing resilience technology. Second, we introduce error handling and recovery routines into the communications runtime (GASNet-EX) and clients, e.g. UPC or CAF, to handle errors, and return the system to a consistent global state after an error. Third, we provide low-level state preservation, via node-level checkpoints for the application and runtime to complement their native error-handling routines.

Effective use of CDs requires application and library writers to write application-level error detection, state preservation and fault recovery schemes. For GASNet-EX, we would like mechanisms to identify failed nodes. Although we expect to implement timeout-based failure detection, we would like system-specific RAS functions to provide explicit notification of failures and consensus on such failures, as we expect vendor-provided RAS mechanisms to be more efficient, more accurate, and more responsive than a generic timeout-based mechanism. Finally, our resilience schemes depend on fast durable storage for lightweight state preservation. We are designing schemes that use local or peer storage (disk or memory) for high-volume (size) state preservation and in-memory storage for high-traffic (IOs) state preservation; we need the corresponding hardware to be present. Since our I/O bottleneck is largely sequential writes with little or no reuse, almost any form of persistent storage technology is suitable.

D-TEC Dealing with errors in Exascale systems requires a deviation from the traditional approach to fault tolerance models that take an egalitarian view of the importance of errors (all errors are equal) to a utilitarian approach where errors are ranked by their impact on the accuracy of computation.

To deal with this problem, we propose a novel fault tolerance approach in which we view computations as composed of tasks or choices that are accompanied by specifications that define acceptable computational error margins. Tasks critical to the computation indicate low error margins (or potentially no margin for error) and less critical tasks have a wider error margin. Using uncertainty quantification (UQ) techniques, we can propagate these error margins throughout the program, enabling programmers to reason about the effects of failures in terms of an expected error in the resulting computation and selectively apply fault tolerance techniques to components critical to the computation. We are the first to develop a sensitivity analysis framework that can identify critical software components through targeted fault injection. We have demonstrated this functionality by developing a tool that finds critical program regions, from which developers can produce selective detection and recovery techniques. We are also the first to explore the use of a sensitivity analysis framework to quantify computational uncertainty (i.e., the relationship between input and output) by modeling errors as input uncertainty. Specifically, we have explored how to apply uncertainty quantification techniques to critical program regions so as to enable selective recovery techniques (e.g., result interpolation). We have demonstrated this approach by developing a new programming language, Rely, that enables developers to reason about the quantitative reliability of an application.

DynAX The DynAX project will focus on resilience in year three. The general approach will be to integrate Containment Domains into the runtime and adapt proxy applications to use them. This will depend on the application developer (or a smart compiler) identifying key points in the control flow where resumption can occur after failure, and identifying the set of data which must be saved and restored in order to do so. It also relies on some mechanism to determine when a failure has occurred and which tasks were affected. This will vary from one system and type of failure to the next; examples include a hardware ECC failure notification, a software data verification step that detects invalid output, or even a simple timeout. Finally, it will require a way for the user to provide some criteria for how often to save resilience information; the runtime may choose to snapshot at every opportunity it is given, or it might only do a subset of those, to meet some power/performance goal.
X-TUNE  
GVR GVR depends on MPI3 and lower level storage (memory, nvram, filesystem) services. It is intended as a flexible portable library, so these dependences are intentionally minimal, and likely well below the requirements of any programming system or library or application that it might be embedded into. So in short, it effectively adds no dependences. GVR provides a portable, versioned distributed array abstraction... which is reliable. An application can use one or many of these, and version them at different cadences. This can be used by libraries (e.g. demonstrated with Trilinos), programming systems, or applications to create resilient applications. Because versioning can be controlled, these systems can manage both the overheads and the resilience coverage as needed. Because error checking and recovery can be controlled by applications, GVR allows applications to become increasingly reliable based on application semantics (and minimal code change), and portably (the investment is preserved over different platforms).
CORVETTE  
SLEEC We do not create any new mechanisms. However, we might expect some information from lower level mechanisms (e.g., whether a particular library method might be re-executed for resilience purposes) to inform our cost models and optimizations.
PIPER Resilience is important for the data gathering and aggregation mechanism. When layered on top of new RTS we require proper notification and controllable fault isolation (e.g., for process failures in aggregation trees). Infrastructures like MRNet already provide basic support for this.
Charm++ The Charm++ resilience support in production use includes in-remote-memory checkpoints and buddy-based automatic failure detection for fail-stop failures. Proactive schemes evacuate a processor when notified of impending failures. The protocols depend on the runtime for monitoring and object migration. A scalable message-logging scheme is also supported where appropriate. Soft error recovery is a work in progress: experimental versions of replication-based strategies are available, while other methods are planned. These depend on over-decomposition provided by the programming models and supported by the RTS.
Early Career-SriramK We focus on selective localized recovery from faults. This involves tracking ongoing execution progress and identifying the state to be recovered and tasks re-executed to tolerate faults. Dynamic load balancing adapts the execution around faults. We assume the presence of a failure detector, either in hardware or in software. While the recovery techniques do not rely on frequent collective checkpoints, checkpoints act as backstop when the information tracked is insufficient to effect a localized recovery.
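A minimal sketch of the preserve/compute/detect/recover cycle that containment domains and the XPRESS micro-checkpointing scheme revolve around, assuming application-supplied detection as the answers above describe. The helper below is illustrative and is not the CD or XPRESS API.

    #include <functional>
    #include <stdexcept>
    #include <vector>
    #include <iostream>

    template <typename State>
    void run_in_domain(State& s,
                       const std::function<void(State&)>& body,
                       const std::function<bool(const State&)>& detect,
                       int max_retries = 3) {
        for (int attempt = 0; attempt <= max_retries; ++attempt) {
            State preserved = s;        // preserve inputs on domain entry
            body(s);                    // compute
            if (detect(s)) return;      // detect: output accepted, commit
            s = preserved;              // recover: restore and re-execute locally
        }
        throw std::runtime_error("escalate to enclosing domain");
    }

    int main() {
        std::vector<double> x(4, 1.0);
        run_in_domain<std::vector<double>>(
            x,
            [](std::vector<double>& v) { for (double& e : v) e *= 2.0; },
            [](const std::vector<double>& v) { return v[0] == 2.0; });  // cheap application check
        std::cout << x[0] << "\n";      // prints 2 after a successful commit
    }

Nesting corresponds to calling run_in_domain from inside a body; an exception thrown by an inner domain is the escalation path to its parent.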
One challenging problem for Exascale systems is that projections of soft error rates and hardware lifetimes (wearout) span a wide range from a modest increase over current systems to as much as a 100-fold increase. How does your system scale in resilience to ensure effective exascale capabilities on both the varied systems that are likely to exist and varied operating points (power, error rate)?
XPRESS Both hardware and software errors are treated in the same manner for error detection, isolation, and diagnosis. Recovery differs if the error is transient, a software bug, or a hardware error. Because the method is localized as in (1), it is believed to be scalable into the billion-way parallelism era. However, this has to be demonstrated through experience. The methods assume a Poisson error distribution.
TG Not in TG scope.
DEGAS Our resilience technologies provide tremendous flexibility in handling faults. In our hybrid user-level and system-level resilience scheme, CDs provide lightweight error recovery that enables the application to isolate the effects of faults to specific portions of the code, thus localizing error recovery. With the use of nested CDs, an inner CD can decide how to handle an error or propagate it to a parent CD. If no CD handles the error locally, we use a global rollback to hide the fault. With this approach, the use of local CDs for isolated recovery limits the global restart rate.
D-TEC Our ability to reason about the importance of bugs enables us to develop selective detection and recovery techniques, which allows our system to focus only on critical faults. For example, if a task fails, then depending on where this task occurs within the overall computation, it may be possible to simply tolerate the fault. For example, if the computation is an input to a large reduction node and the node can, for instance, tolerate losing 10% of its input tasks and yet still provide acceptable accuracy, then the task does not need recovery.

If an assembly of tasks leads to accuracy degradation, then the system can provide alternative implementations that use fault tolerance techniques, such as replication, to the optimization management framework. This, in combination with the runtime system, will ensure the overall system goals are met despite failures.

DynAX In a recursive tree hierarchy of containment domains, each level will have successively finer-grained resilience, with fewer affected tasks and less cost of redundant computation if the domain restarts due to failure. Containment domains are isolated from each other, so there is no extra communication/synchronization necessary between them. The three dependencies mentioned above (identification of containment domains, identification of failures, and the cost function) should be equally applicable to any exascale system. The only hardware dependence we anticipate is the hardware's ability to identify failures as they occur, and the details of how that failure is reported. The system can be tuned to act more conservatively, or less conservatively, by ignoring some containment domains and enforcing others. The most conservative approach is to back up the necessary data for, and check results of, every containment domain in the application. The least conservative case is to only act on the outer-most containment domain, i.e. simply retain the program image and the parameters it was launched with, and if a failure occurs anywhere in the application, the entire thing is thrown out and starts again from the beginning. Reality is usually somewhere between those two extremes: the runtime can pick and choose which containment domains to act on, based on cost/benefit analysis (which I will discuss further in the question below on application semantics information).
X-TUNE  
GVR GVR uses versioning as the primary basis for error checking and recovery. Applications and programming systems can control the frequency of versioning, error checking, and the recovery approach to adapt to the underlying error rate. As this is under application control, the application programmer is armed both with the ability to control/manage overhead as well as error coverage, and to do so portably. The ideal outcome is a portable application that runs effectively over a 100x or larger dynamic range of errors with no more than a single or a few parameter changes.
CORVETTE  
SLEEC N/A
PIPER N/A
Charm++ The control system observes the error rates and tunes its behavior (parameters for now, and in future, choice of strategies) accordingly. The simplest example is checkpoint periodicity, controlled by observations of fail-stop failures. Message-logging can be used if failures are frequent, because it avoids rolling back all the processors. Object based checksums are used to contain corruption while the object is passive.
Early Career-SriramK The schemes can be tuned to meet the resilience needs. In particular, the overhead during normal execution can be decreased at the expense of increased penalty incurred on a failure. The approach can also reduce to checkpointing when the task-parallel phases are too small compared to node/system MTBF.
What opportunities are there to improve resilience or efficiency of resilience by exporting/exploiting runtime or application semantics information in your system?
XPRESS Application semantics can provide myriad simple tests of correctness that will detect errors early and limit their propagation throughout the computation through better isolation.
TG Not in TG scope.
DEGAS In DEGAS, applications express semantic information required for resilience through the CD API; that's the point of CDs. The CD hierarchy, i.e. boundary and nesting structure, describes the communication dependencies of the application in a way that is not visible to the underlying runtime or communication systems. In contrast, a transparent checkpoint scheme must discover the dependency structure of an application by tracking (or preventing) receive events to construct a valid recovery line. The CD structure makes such schemes unnecessary, as the recovery line is described by the CD hierarchy.

For applications that do not use CDs, we fall back to transparent checkpoints. Here, we rely on the runtime to discover communication dependencies, advance the recovery line, and perform rollback propagation as required. One clear opportunity to improve efficiency for resilience is by exploiting the one-sided semantics of UPC to track rollback dependencies. We are starting by calculating message (receive) dependencies inside the runtime, but we are interested in tracking memory (read/write) dependencies as a possible optimization. A second area for improved efficiency is in dynamically adjusting the checkpoint interval (recovery line advance) according to job sizes at runtime. An application running on 100K nodes must checkpoint half as often as one running on 400K nodes; for error rates, by tolerating 3/4 of the errors, checkpoint overhead is halved. Such considerations suggest that static (compile-time) resilience policies should be supplemented with runtime policies that take into account the number of nodes, failure (restart) rates, checkpoint times, memory size, etc. to ensure good efficiency.

D-TEC Application semantic information can be used to implement efficient, targeted ABFT. When combined with our reliability analysis, these techniques can be used to selectively detect and recover from critical faults. Information from the runtime system can enable our system to dynamically adapt its resiliency posture in order to meet accuracy/performance requirements.
DynAX The runtime decides whether or not to act on each individual containment domain. It chooses which containment domains to act upon, based on tuning parameters from the user and based on the data size and the estimated amount of parallel computation involved in completing the work associated with a containment domain. The data size is easy to determine at runtime, and the execution information can be provided by runtime analysis of previous tasks, or by hints from the application developer or smart compiler. Using that information, and resilience tuning parameters provided by the user, the runtime can do simple cost/benefit analysis to determine which containment domains are the best ones to act upon to achieve the user's performance/resilience goal.
X-TUNE  
GVR GVR allows applications and systems software to control error checking and handling (recovery), thus a full range of application semantics (algorithmic, data structure, physics, etc.) and system correctness semantics can be exploited. We have implemented full ABFT checkers and recovery using GVR.
CORVETTE  
SLEEC We could integrate resilience or accuracy information into our cost models to drive transformations (e.g., to tune library parameters to hit an overall accuracy target)
PIPER The aggregation runtime will likely be highly structured (e.g., use tree overlay networks), which could be exploited for resilience
Charm++ Our system relies on over-decomposition provided by the application, and runtime instrumentation provided by the runtime, to optimize its resiliency protocols. In addition, work is under way to separate application object memory into different compartments requiring different degrees of resilience. This requires some support from the application and/or compiler.
Early Career-SriramK The resilience techniques are closely related to the runtime and application semantics exposed by the user. In particular, the task-parallel program specification and properties of the task-parallel abstraction allow us to constrain the inter-task and task-data relationships to be tracked and recovered.
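To make the DEGAS point above about runtime-chosen checkpoint intervals concrete, here is a back-of-the-envelope using Young's first-order approximation for the optimal interval; all numbers (checkpoint cost, per-node MTBF) are assumptions for illustration only.

    #include <cmath>
    #include <cstdio>

    // Young's approximation: interval ~ sqrt(2 * checkpoint_cost * system_MTBF),
    // where system MTBF shrinks roughly linearly with the number of nodes.
    double young_interval(double checkpoint_cost_s, double node_mtbf_s, double nodes) {
        double system_mtbf = node_mtbf_s / nodes;     // failures arrive ~N times faster
        return std::sqrt(2.0 * checkpoint_cost_s * system_mtbf);
    }

    int main() {
        const double cost = 60.0;                           // 60 s per checkpoint (assumed)
        const double node_mtbf = 25.0 * 365 * 24 * 3600;    // 25-year node MTBF (assumed)
        double t100k = young_interval(cost, node_mtbf, 100e3);
        double t400k = young_interval(cost, node_mtbf, 400e3);
        std::printf("interval at 100K nodes: %.0f s\n", t100k);   // ~973 s
        std::printf("interval at 400K nodes: %.0f s\n", t400k);   // ~486 s
        std::printf("ratio: %.2f (the 2x factor mentioned above)\n", t100k / t400k);
    }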
What capabilities provided by resilience researchers (software or hardware) could have a significant impact on the capabilities or efficiency of resilience? Where does resilience fit into the X-stack runtime abstract architecture?
XPRESS Numerical correctness of calculations in a non-deterministic scheduling context would benefit from knowledge about floating-point corruption due to order of actions.
TG Not in TG scope.
DEGAS Fast, durable storage is a key technology for increasing the efficiency of resilience. In DEGAS we are interested in both bulk storage and logging storage. The requirements for these differ slightly, in that we use logging for small high-frequency updates, possibly as much as one log entry per message, but we use the bulk storage for large infrequent updates, as these are used for checkpoints.

We are exploring non-inhibitory consistency algorithms, i.e. algorithms that allow messages to remain outstanding while a recovery line is advanced. We face a challenge right now in that it is difficult to order RDMA operations with respect to other message operations on a given channel. In the worst case, we are required to completely drain network channels, and globally terminate communications in order to establish global consistency, i.e. agreement on which messages have been sent and received. A hardware capability that may prove valuable here is a network-level point-to-point fence operation. A similar issue on Infiniband networks is that the current Infiniband driver APIs require us to fully tear down network connections in order to put the hardware into a known physical state with respect to the running application process, i.e. we need to shut down the NIC to ensure that it doesn't modify process memory while we determine local process state. A lightweight method of disabling NIC transfers (mainly RDMA) would eliminate this teardown requirement.

D-TEC Our programming language, Rely, defines a static quantitative reliability analysis that verifies quantitative requirements on the reliability of Rely programs, enabling a developer to perform sound and verified reliability engineering. The analysis takes a Rely program with a reliability specification and a hardware specification (which characterizes the reliability of the underlying hardware components) and verifies that the program satisfies its reliability specification when executed on the underlying unreliable hardware platform. Naturally, having better models of the hardware, and hardware support for soft-error detection (e.g., TMR), can help improve the accuracy of our system.
DynAX If the hardware can reliably detect and report soft failures (such as errors in floating-point instructions), this would avoid running extra application code to check the results, which is sometimes rather costly. Reliable, slower memory for storing snapshot data would also be beneficial. This memory would be written with snapshot data every time a containment domain was entered, but only read while recovering from a failure, so the cost of the write is much more important than the cost of the read.
X-TUNE  
GVR Cheap non-volatile storage, inexpensive error checking, exposure of "partially correct" states to higher levels, ... would all increase the scope of errors that GVR and applications could handle.
CORVETTE  
SLEEC N/A
PIPER N/A
Charm++ The first question is unclear, for software. For hardware, some useful capabilities are: quick detection and flexible notification of errors, more robust memory regions for control variables, sensors for early notification of impending component failures.

The need for (and the complexity of) resilience will depend on the direction taken by the exascale hardware (viz. how much reliability gets sacrificed for power-performance). It is desirable, but challenging, to separate the resilience protocols modularly from the rest of the runtime.

Early Career-SriramK Efficient low-overhead error detection can enable recovery to be performed at various layers of the software stack. Runtime support for resilience is a non-trivial challenge and, if desired, needs to be planned from the beginning.

Runtimes (application-facing)

Sonia requested the Traleika Glacier X-Stack team to initiate this page. For comments, please contact Shekhar Borkar.

Snapshot as of 5/10/14: Runtimes_(application_facing).pdf

Questions:

What policies and/or mechanisms will your runtime use to schedule code and place data for 100M objects (executing code, data elements, etc.) in a scalable fashion?
XPRESS A hierarchical representation of logical contexts and tasks (processes and compute complexes) provides semantic representations of relative locality for placement of data objects and the tasks that are performed on them. Where data is widely distributed, they can be organized on separate processes distributed across multiple nodes with methods that allow actual work to be performed near the data. Research is exploring the allocation of resources by the LXK OS to the HPX runtime system and the policies to be implemented including programming interface semantics.
TG The Open Community Runtime (OCR) will optimize for data-movement scalability. Our programming model divides an application into event-driven tasks with explicit data dependences. Our runtime uses this to schedule code close to its data or move the data close to the code. Scalability will be achieved through hierarchical task-stealing favoring locality.
DEGAS The DEGAS runtime uses one-sided communication (put, get, active messages, atomics, and remote enqueue of tasks) to place data and work across a large-scale machine. Within a node there are currently two scheduling approaches being pursued. One (under HCLib/Habanero-C) is built on OCR and uses a dynamic task scheduler; it is being evaluated to determine the need for locality control within the node. The second is derived from the UPC runtime and has a fixed set of locality-aware threads tied to cores (or hardware threads or NUMA domains -- it's an abstraction that can be used at various machine levels), augmented with voluntary task scheduling for both locality and remotely generated dynamic tasks. A global task-stealing scheduler is also part of the DEGAS plan and exists in prototype form; as with dynamic tasking, it is to be used on demand for applications that are not naturally load balanced (e.g., divide-and-conquer problems with irregular trees).
D-TEC The APGAS (Asynchronous Partitioned Global Address Space) runtime uses a work-stealing scheduler to dynamically schedule tasks within a node. We are introducing areas to enable finer-grained locality and scheduling control within a node (Place). By design the runtime does not directly address automatic cross-node data placement. The APGAS runtime/programming model does provide primitive mechanisms (Places and Areas; at/async/finish) that allow application frameworks to productively implement data placement and cross-node scheduling frameworks on top of the runtime.
DynAX The SWift Adaptive Runtime Machine (SWARM) has a "locale" hierarchy, which roughly mirrors the hardware architecture hierarchy. Each locale has a set of local scheduler queues, allowing distributed and scalable scheduling. Data allocation and task/data migration are expressed to ensure proper parallelism around the conjunction. SWARM will rely on a single-assignment policy to prevent the need for globally coordinated checkout or write-back operations.
X-TUNE The compiler for X-TUNE must generate code with hierarchical threading, and will rely on the run-time to manage that threading efficiently. Point-to-point synchronization between threads may be more efficient than barriers to allow more dynamic behavior of the threads.
GVR GVR will use performance information for varied memory and storage types (DRAM, NVRAM, SSD, Disk), resource failure rate and prediction, redundancy in data encoding, existing version data copies and their location, as well as communication costs to place data. GVR does not include code scheduling mechanisms.
CORVETTE N/A
SLEEC SLEEC does not have a true runtime component, except insofar as we are developing single-node runtimes to, e.g., manage data movement between cores and accelerators. We also perform small-scale inspector/executor-style scheduling for applications. However, we expect to rely on other systems for our large-scale runtime needs.
PIPER N/A
Charm++ Overdecomposition: A Charm++ program consists of a large number of objects assigned to the processors by the RTS. Initial placement of objects is controlled by map functions that are either system defined or user defined static functions. The RTS dynamically migrates objects across processors as needed. Message driven scheduling is used on individual processors: a message, containing a method invocation for an object, is selected by the scheduler and the corresponding object's execution is triggered.
Early Career-SriramK Work is assumed to be decomposed into finer-grained tasks. The specification of inter-task dependencies and task-data relationships is used to automate aspects of locality management, load balance, and resilience. We investigate algorithms based on dynamic load balancing for various classes of inter-task and task-data relationships — strict computations, data-flow graphs, etc. — for intra-node and inter-node scheduling.
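A hedged, single-threaded skeleton of the locality-favoring task stealing described in the TG and DEGAS answers above: a thief first tries victims in its own locale before stealing across locales. A real runtime adds per-deque synchronization and back-off; this is not OCR or UPC++ code.

    #include <deque>
    #include <functional>
    #include <vector>
    #include <cstddef>

    using Task = std::function<void()>;

    struct WorkerDeque {
        std::deque<Task> tasks;
        int locale = 0;                       // which node/socket this worker lives on
    };

    bool try_steal(std::vector<WorkerDeque>& ws, int thief, Task& out) {
        int my_locale = ws[thief].locale;
        // Pass 1: victims in the same locale (cheap, locality-preserving steals).
        for (std::size_t v = 0; v < ws.size(); ++v)
            if ((int)v != thief && ws[v].locale == my_locale && !ws[v].tasks.empty()) {
                out = std::move(ws[v].tasks.back()); ws[v].tasks.pop_back(); return true;
            }
        // Pass 2: anyone, anywhere (expensive cross-locale steals, attempted last).
        for (std::size_t v = 0; v < ws.size(); ++v)
            if ((int)v != thief && !ws[v].tasks.empty()) {
                out = std::move(ws[v].tasks.back()); ws[v].tasks.pop_back(); return true;
            }
        return false;
    }

    int main() {
        std::vector<WorkerDeque> ws(4);
        ws[0].locale = ws[1].locale = 0; ws[2].locale = ws[3].locale = 1;
        ws[1].tasks.push_back([]{});          // work only on worker 1
        Task t;
        return try_steal(ws, 0, t) ? 0 : 1;   // worker 0 steals from its locale-mate
    }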
What features will allow your runtime to dynamically adapt the schedule and placement for 100K sockets to improve the metrics of code-data affinity, power consumption, migration cost and resiliency?
XPRESS The HPX/LXK system software architecture (also known as the “OpenX architecture”) integrates a closed-loop introspection component comprising the APEX and RCR components within the runtime and OS respectively. Code-data affinity is supported by multiple mechanisms. Intra-compute-complex (thread) functions keep all private or local data in the same locality. Parcels move work to the data when preferred, although they support data access and gathers as appropriate. Processes keep shared data organized within a single logical context that can be spread across multiple localities. The effective reduction of latency effects also reduces data-movement energy. For resiliency, reconfiguration and recovery data migration is enabled by the logical active global address space. Research is being performed to address these issues, some under other funding.
TG If the hardware supports it, OCR will monitor performance and power counters to adapt its scheduling and data-placement to better utilize the machine.
DEGAS For resilience, the DEGAS runtime uses customizable application and system-level policies to trade-off the storage costs associated with resilient processing against the expected failure rate, with an objective of optimizing the expected forward progress in an application against expected recovery and preservation times. The Containment Domain hierarchy allows the storage hierarchy of the system to be mapped to a hierarchical resilience structure. Process migration for GAS applications is planned, and we are investigating live migration techniques to move work around the system without stopping the application, or individual processes, from running during migration. The UPC language makes memory affinity explicit for programmers, and UPC supports teams as a construct to manage communications locality.
D-TEC Automatic cross-socket migration and placement is not a topic we are actively exploring at the APGAS runtime level.
DynAX The locale hierarchy, runtime awareness of high-level data types, and support for task affinities to certain hardware will allow the runtime to make good placement decisions and move tasks and data around the system as needed to minimize the overall energy costs and improve efficiency. The use of a single-assignment data model and hints associated with particular tasks or data allows the runtime to establish good code-data affinity and energy efficiency.
X-TUNE Autotuning is the main mechanism that allows our project to adapt to execution context. In the long term, this autotuning must be performed during program execution to support dynamically-varying execution contexts.
GVR GVR creates multiple versions (snapshots) of globally-accessible data arrays as the primary basis of resilience. GVR creates an independent stream for each resilient data array, allowing it to be independently versioned, recovered, and managed - different from checkpointing -- and enabling a wealth of efficiency optimizations and flexible control by the application. Beneath that, GVR will optimize location, encoding, and version creation and deletion, to maximize compute performance, resilience coverage, energy efficiency, and even the wear-out lifetime of non-volatile storage devices (NVRAM).
CORVETTE N/A
SLEEC N/A
PIPER N/A
Charm++ Overdecomposition: A Charm++ program consists of a large number of objects assigned to the processors by the RTS. Initial placement of objects is controlled by map functions that are either system defined or user defined static functions. The RTS dynamically migrates objects across processors as needed. Message driven scheduling is used on individual processors: a message, containing a method invocation for an object, is selected by the scheduler and the corresponding object's execution is triggered.
Early Career-SriramK The specification naturally supports task and data migration. We support locality-constrained dynamic load balancing and selective localized recovery. We consider fully-automated and user-supported techniques. These are guided by introspection of the running tasks and feedback from the execution environment.
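A minimal sketch of the versioned-array idea in the GVR answer above, with hypothetical class and method names (this is not the actual GVR interface): the application decides when to create a version and which version to restore, so overhead and coverage stay under application control.

    #include <map>
    #include <vector>
    #include <cstddef>
    #include <cstdio>

    class VersionedArray {
        std::vector<double> current_;
        std::map<int, std::vector<double>> versions_;   // a real system places these
        int next_version_ = 0;                          // across the memory hierarchy
    public:
        explicit VersionedArray(std::size_t n) : current_(n, 0.0) {}
        double& operator[](std::size_t i) { return current_[i]; }
        int version() { versions_[next_version_] = current_; return next_version_++; }
        void restore(int v) { current_ = versions_.at(v); }
    };

    int main() {
        VersionedArray a(1000);
        a[0] = 42.0;
        int v = a.version();            // application-chosen cadence, e.g. per timestep
        a[0] = -1.0;                    // ... later, an ABFT check flags this as corrupt
        a.restore(v);                   // roll this one array back; others are untouched
        std::printf("%f\n", a[0]);      // prints 42.000000
    }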
How will the runtime manage resources (compute, memory, power, bandwidth) for 100K sockets to meet a power, energy and performance objective?
XPRESS The HPX runtime system maintains an abstraction of global data and compute complexes (threads) within the context of the ParalleX process hierarchy and engages in a bi-directional protocol with the LXK lightweight kernel to acquire and employ memory blocks and OS thread executables. As the OS manages resource conflicts among multiple job programs and the HPX runtime manages the intra-job task requirements and priorities, the two work together in dialog to balance the complex tradeoffs. Power imposes upper constraints at the node (locality) and socket level, limited by the OS. Energy usage is governed by the ParalleX Side-Path Energy Suppression methodology, which attempts to determine the critical path of execution, to which the highest power is applied, and reduces energy usage for the non-critical (side-path) work to the degree that the critical path does not change, thus minimizing total energy with the shortest time to completion. This strategy addresses scaling of both energy and performance objectives.
TG OCR will manage resources based on the application's needs and the power budget and turn off or scale back unneeded resources.
DEGAS The DEGAS energy goals are primarily met by avoiding data movement both within and between nodes. “Communication avoidance” is a primary goal of the project in language, compilers and runtimes and has proven ties to energy use and performance. Dynamic energy management will be handled by the dynamic tasking on node and global task stealing between nodes, which as noted above is a voluntary and therefore “tunable” part of the runtime.
D-TEC Using techniques developed in the SEEC runtime, the runtime could adaptively monitor application progress and increase/decrease resource utilization to minimize power consumption under the constraints of meeting application performance targets. This requires the application to be modified to report an abstract notion of progress to the runtime, the system software and hardware to provide the necessary monitoring APIs, and for the system software and hardware to provide the ability to dynamically adjust power consumption at the cost of reduced performance/reliability.
DynAX The runtime software will allocate only as many processors and as much memory as an application needs for efficient execution. It should be possible to adjust these parameters according to the real time or energy efficiency requirements indicated by the system user. Past power consumption can either be read out from supporting hardware or modeled based on software characteristics. This data, in conjunction with a system- or user-designated power budgets and hints associated with particular tasks and objects, will help the runtime decide when to focus work and data in a smaller area, allowing it to clock- or power-gate the remainder of the hardware, or when to spread work out across more of the system, requiring a higher power usage to induce a higher throughput. If hardware supports frequency scaling, this can be used to more finely tune power usage in runtime-managed components.
X-TUNE Autotuning can be used to support multiple objectives, as long as the tradeoff space and the goals of the developers are well understood. The question is really how much performance may be sacrificed to meet other optimization objectives.
GVR The optimization for resilience embodied in GVR - and its application partnership - can be constrained by power, energy and performance limits. The philosophy of GVR as a library is to adapt to these as external constraints, and is therefore compatible with a variety of runtime and programming system tools.
CORVETTE  
SLEEC  
PIPER N/A
Charm++ The extensive introspection capabilities are used to create a local database of individual object performance and communication behavior. Runtime strategies, such as load balancers, make use of the database (in hierarchical and/or distributed fashion) to make resource management decisions in a scalable manner. Continuous monitoring of temperature, power, energy behavior is used to trigger runtime optimizations. Some of the main mechanisms for resource management include: DVFS/RAPL for chip level power management, object migrations, processor evacuations, automatic failure detection (via buddy-based heartbeat detection, for example) etc.
Early Career-SriramK Power/energy optimization is not directly addressed in this project.
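A hedged sketch of the SEEC-style feedback loop the D-TEC answer above mentions: the application reports an abstract progress rate ("heartbeats"), and the runtime grows or shrinks its allocation to track a target while spending as little as it can. All names, thresholds, and numbers are illustrative assumptions, not any project's controller.

    #include <cstdio>
    #include <algorithm>

    struct Allocation { int cores; };                    // stand-in for richer knobs (DVFS, etc.)

    Allocation adjust(Allocation a, double heartbeats_per_s, double target, int max_cores) {
        if (heartbeats_per_s < 0.95 * target)            // behind target: add resources
            a.cores = std::min(a.cores + 1, max_cores);
        else if (heartbeats_per_s > 1.10 * target)       // comfortably ahead: save power
            a.cores = std::max(a.cores - 1, 1);
        return a;                                        // otherwise hold steady
    }

    int main() {
        Allocation a{4};
        double observed[] = {8.0, 9.2, 10.8, 11.5};      // heartbeat rates per interval
        for (double hb : observed) {
            a = adjust(a, hb, /*target=*/10.0, /*max_cores=*/16);
            std::printf("rate %.1f -> %d cores\n", hb, a.cores);
        }
    }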
How does the runtime software itself scale to 100K sockets? Specifically, how does it distribute, monitor and balance itself and how is it resilient to failures?
XPRESS Individual instances of runtime system functions and responsibilities are created on a per node basis and per user program basis to spread the work uniformly as a system scales in workload (number of user jobs) and scales to larger number of hardware localities (ensembles of sockets). Introspection at the hardware support layer and software application layer detects and manages load balance through the RIOS control interface, the APEX runtime instrumentation and control layer, and the RCR black-boarding at the OS layer. Resiliency will be supported through the ParalleX execution model micro-checkpointing cross-cutting Compute-Validate-Commit cycle that employs hierarchical fault zones. This dynamic methodology engages all component layers of the hardware-software system for fault detection, isolation, diagnosis, reconfiguration, recovery, and restart.
TG OCR functionality is hierarchically distributed along the hardware's natural computation hierarchy (if it has one) or, if not, by imposing an arbitrary one. OCR divides cores into "runtime" and "user" cores. For efficiency, "user" cores run a small layer of the runtime that manages that specific core. The "runtime" cores manage the user cores in a hierarchical fashion: the "runtime" cores "closest" to the "user" cores perform low-latency, simple scheduling decisions, whereas higher-level cores perform longer-term optimization operations.
DEGAS DEGAS is already highly scalable on the largest machines available today and while some scaling issues in hierarchical synchronization (phasers), collective communication, and job startup require constant attention within the runtime, we do not see any major barriers to arbitrary scale. Note that the runtime is parallel by default (a job starts with a task per core/numa-domain/hardware thread) which greatly aids in scalability. Balancing due to resilience or load problems is done with the dynamic tasking and work stealing across nodes, both envisioned as voluntary within UPC++ and “by default” within a node in Habanero. We see this question of the default policy as key for the remainder of the project, but the same runtime mechanisms are needed in any case. Resilience is also in some sense tunable by the application using the general model of containment domains. GASNet-EX is designed to allow processes to fail and later be replaced. Distributing work is largely left to the applications programmer, but self-monitoring features and error reporting are being added to the interface to allow client runtimes to handle changes. We are investigating the semantic changes required to the GASNet-EX interfaces to enable client programs to continue through failures. We have performed some investigation into automatically rebalancing work past a failure. Process monitoring is done primarily by checking for progress on communications channels, and reporting errors to clients. An early goal for the GASNet API specification is to ensure that GASNet itself remains in a well-defined state after a process failure, providing knowledge of which operations have completed, which operations have terminated, and which processes have terminated.
D-TEC The APGAS runtime has already been demonstrated to run non-resiliently and achieve scalable performance on a 55k core system. Most runtime operations are localized to a single APGAS place and thus naturally scale as the number of nodes increase. We have prototyped a resilient version of the APGAS runtime at a small scale (<500 cores) and are actively working on scaling the resilient version of the runtime to larger scale systems.
DynAX The runtime software will operate in all processor cores of the system, and will divide the system into executive and worker cores, with a hierarchy of executive cores associated with each non-leaf locale, managing each other and the workers. This helps localize work and data, but allows load to spill out into wider regions if a narrower region is flooded at any point. If the locale hierarchy is aligned with the hardware memory and communications hierarchy, it also helps localize the effects of any hardware failure. Detection of a failure that impacts correctness of runtime operation will result in the processor core, memory unit, or subsystem being taken out of service and the same operation being retried elsewhere. We will use containment domains to establish strict task boundaries, and can use earlier versions of data for resumption of failed tasks. Because we use a single-assignment model, the chance of overwriting old data from which recovery would otherwise be possible is eliminated and the extent of the effects of any failure can be limited.
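A minimal sketch of this retry-from-intact-inputs pattern (hypothetical C++, not the SWARM or containment-domain API; run_in_containment_domain and TaskResult are illustrative names): because inputs are single-assignment and never overwritten, a failed task can simply be re-executed from the same inputs, possibly on a different core.

    #include <cstdio>
    #include <functional>
    #include <stdexcept>

    // Hypothetical sketch: single-assignment inputs are never overwritten, so a
    // task that fails inside its containment domain can be retried from the same
    // inputs; only after max_retries does the error escalate to the enclosing domain.
    struct TaskResult { double value; bool ok; };

    TaskResult run_in_containment_domain(const std::function<double(const double*)>& task,
                                         const double* inputs, int max_retries) {
        for (int attempt = 0; attempt <= max_retries; ++attempt) {
            try {
                return {task(inputs), true};          // normal completion
            } catch (const std::exception& e) {       // stand-in for a fault notification
                std::fprintf(stderr, "task failed (attempt %d): %s\n", attempt, e.what());
                // inputs are still intact; retry, possibly on another core
            }
        }
        return {0.0, false};                          // escalate to the enclosing domain
    }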
X-TUNE This is not applicable to X-TUNE as we rely on a run-time system provided by others.
GVR GVR is based on a decentralized architecture that replicates metadata across the machine and creates redundant data versions for application resiliency. The GVR architecture will exploit replicated metadata storage and a stateless recovery architecture to enable resilience to scale from the application through the GVR implementation itself, and from petascale to exascale systems.
CORVETTE  
SLEEC SLEEC's runtimes are intended to operate within the scope of a single node, or at a small scale. We rely on other runtimes to provide higher levels of the hierarchy.
PIPER N/A
Charm++ The RTS is extremely lightweight. Aside from a tiny instrumentation system around the core scheduler, all aspects of the runtime itself are libraries and therefore monitored as usual. The fully distributed nature of the runtime makes it easy to balance it. The production version of Charm++ currently supports recovery only from fail-stop failures. For this, the RTS components on buddy nodes are used.
Early Career-SriramK The runtime software is lightweight and designed to introduce minimal overhead. Depending on the application and execution scenario, one/few hardware thread(s) might be dedicated to handle incoming communication. The actions of the runtime are fully distributed and for the most part asynchronous and non-collective. Faults are handled through selective localized recovery coupled with dynamic load balancing.
What is the efficiency of the runtime? Specifically, how much impact does the runtime have on a) the total execution time of the application and b) resources taken from algorithmic computations? What are your plans to maximize efficiency? How will runtime overhead scale to 100K sockets?
XPRESS The HPX runtime is event driven and, for efficiency, stays out of the way of user code executing within a thread. Between threads, however, a number of overhead actions impact efficiency and impose a lower bound on thread granularity, which limits scalability for fixed-size workloads. OS overhead (LXK) is fixed on a per-node basis and therefore scalable. HPX process calls across nodes (conceptually millions) employ symmetric semantics (synchronous versus asynchronous) for portability, parcels for message-driven computing in combination with local control objects to manage asynchrony (including mitigation of latency effects), and an active global address space to handle remote data loads and stores. Research will determine the scaling factors for these as well as the time and energy efficiencies that may be achieved.
TG OCR code runs on cores that are physically separate from those for user code. Our goal is to have enough “runtime” cores that runtime overhead is completely masked by the application code. As machine size increases, more runtime cores will be needed to handle higher-level functions and global optimizations but this will increase very slowly.
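As a rough illustration (hypothetical, not the OCR implementation), the division into "runtime" and "user" cores amounts to reserving a small fraction of hardware threads for scheduling and optimization work so that it is masked behind application execution:

    #include <thread>
    #include <vector>

    // Hypothetical sketch: reserve a small fraction of hardware threads for the
    // runtime's own work (scheduling, longer-term optimization) and dedicate the
    // rest to application tasks, so runtime activity overlaps with user code.
    int main() {
        unsigned total = std::thread::hardware_concurrency();
        if (total == 0) total = 2;                             // fallback if unknown
        unsigned runtime_cores = total >= 8 ? total / 8 : 1;   // e.g., 1 runtime core per 8
        unsigned user_cores = total - runtime_cores;

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < runtime_cores; ++i)
            pool.emplace_back([] { /* scheduling and global optimization decisions */ });
        for (unsigned i = 0; i < user_cores; ++i)
            pool.emplace_back([] { /* execute application tasks */ });
        for (auto& t : pool) t.join();
    }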
DEGAS As noted above, we see no major barriers to scaling to arbitrary machine sizes, but we expect resource management at this scale to require additional research and engineering. The large number of cores on a node accessing a shared communication resource is one such problem. The dynamic tasking runtimes have more overhead, but we are working to minimize the difference in performance between the static and dynamic cases. In our experience the major problem is loss of locality in the dynamic case, which we are addressing in various ways, including an "inspector-executor" style scheduler.
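The inspector-executor idea can be sketched as follows (hypothetical code, not the DEGAS scheduler; Task, home_domain, and run_on_domain are illustrative names): a first pass inspects each task's data footprint and bins tasks by the domain that owns the data, and a second pass executes each bin on its home domain.

    #include <functional>
    #include <map>
    #include <vector>

    // Hypothetical inspector-executor sketch: group tasks by the locality domain
    // that owns their data (inspect), then run each group on its home domain
    // (execute), recovering locality that a purely dynamic scheduler would lose.
    struct Task { int home_domain; std::function<void()> body; };

    void inspector_executor(const std::vector<Task>& tasks,
                            const std::function<void(int, const std::vector<const Task*>&)>& run_on_domain) {
        std::map<int, std::vector<const Task*>> bins;   // inspect: bin by locality
        for (const Task& t : tasks) bins[t.home_domain].push_back(&t);
        for (const auto& [domain, bin] : bins)          // execute: one batch per domain
            run_on_domain(domain, bin);
    }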
D-TEC Running at 55k cores on typical kernel benchmarks, the APGAS runtime has been demonstrated to have very low overheads. As a general design principle, the runtime overhead should be proportional to the frequency with which the application requests services from the runtime.
DynAX We have focused very heavily on minimizing the amount of inline work, allowing the application to run unhindered, and minimizing the hardware resources required for the runtime-internal threads. Overall, we expect the processing overhead per core to be essentially constant regardless of system size, with the exception of global operations such as barriers and reductions, which may require additional time scaling with the logarithm of the system size. Memory usage for thread descriptors and stacks will be linear in the number of cores, although temporary linearithmic (i.e., O(n lg n) for n coordinating agents) memory blocks may be needed to manage global operations. Very little static- or runtime-bound data is required beyond that, aside from what is required for basic interfacing with the underlying platform.
X-TUNE This is not applicable to X-TUNE as we rely on a run-time system provided by others.
GVR GVR seeks to minimize resilience overhead. We have performed experiments with numerous applications (ddcMD, OpenMC, PCG, GMRES, and mini-apps such as miniFE and miniMD) that demonstrate overheads of less than 1% of runtime without any special hardware support. With novel emerging features such as storage-class memory (integrated NVRAM), we expect this overhead to be even smaller.
CORVETTE  
SLEEC Because SLEEC focuses on small-scale runtimes that are directly integrated with application code, we expect our runtime overheads to be negligible and essentially independent of scale (because scaling will be provided by other runtime systems).
PIPER N/A
Charm++ The RTS maintains a lightweight presence on every core of the system, recording small bits of information at every scheduling event. Overhead for each scheduling event is less than a microsecond on today's machines, thus creating a very low impact on application performance. Strategies such as load balancing are executed only when needed; their costs vary from very light refinement strategies to extensive global strategies, and are deployed based on the need identified by recognizing runtime patterns.
Early Career-SriramK The resources taken by the runtime involve (a) the few hardware threads potentially dedicated to process messages and (b) runtime actions to balance load, manage locality, and recover from faults. The degree of overheads incurred will depend on the degree of automation desired. Fully user-controlled approaches can achieve very low to no overheads. We have demonstrated elements of the automation on 100K+ cores and expect these to scale further. Ongoing research is investigating additional aspects that can be automated.
Do you support isolation of the runtime code from the user code to avoid violations and contamination?
XPRESS The ParalleX process construct and hierarchy, with capabilities-based addressing, separates runtime functions from user functions. The global addressing permits runtime system instances to manipulate user "compute complexes" (e.g., threads) as first-class objects. Independent runtime instances isolate multiple user applications sharing any particular localities (nodes). Research is exploring the costs and completeness of these protection mechanisms.
TG The majority of the runtime code runs on cores that are physically separate from the ones on which user code is running. Although we are currently considering a model where all cores can touch data everywhere else, our model will support possible hardware restriction (user cores cannot touch data in runtime cores).
DEGAS The runtime code is separate from user code, but there is no enforced isolation.
D-TEC We are not directly addressing this issue. The design point we are pursuing is that there will be different instances of the APGAS runtime for different programs and isolation will be provided by other layers of the stack.
DynAX Complete isolation depends heavily on hardware support. For architectures that support it, such as Traleika Glacier, SWARM isolates resources rather well, with only a thin shim layer of SWARM residing on the application cores. When possible, runtime decisions happen on executive cores, which have visibility and control over worker cores but not vice versa. When using hardware that does not support this kind of work division, it may not be possible to prevent violation or contamination of application cores. (Where features such as segmentation or virtual memory are present, they can potentially be used to enforce separation between the runtime and application code, but doing so may impose an overhead comparable to placing the runtime in the OS kernel, which will likely not be worth the performance hit.)
X-TUNE We rely on run-time systems provided by others, and would simply invoke the run-time.
GVR GVR supports use of operating system or other runtime mechanisms for this isolation, but provides no mechanisms of its own. GVR's design and implementation supports flexible recovery from detected violations or contamination.
CORVETTE  
SLEEC SLEEC's runtimes are application/domain-specific and hence intended to closely couple with the application code.
PIPER N/A
Charm++ Yes, by normal memory allocation mechanisms. But no OS level protection is used (as that’d be too heavy weight).
Early Career-SriramK We do not address isolation/contamination issues.
What specific hardware features do you require for proper or efficient runtime operation (atomics, DMA, F/E bits, etc.)?
XPRESS There are no absolute requirements for proper operation of the HPX runtime system beyond those found on conventional parallel and distributed systems. These include compound atomic operations, message exchange between nodes, scheduling of threads and their precise interrupts, and local virtual address translation. However, additional features could be incorporated in the future that would dramatically reduce overheads, mitigate latencies, increase parallelism, and circumvent hotspots. Among such mechanisms for efficient runtime operation are hardware support for 1) lightweight user-thread creation, termination, and context switching (including preemption), 2) global virtual address translation, 3) 'struct' processing for simultaneous multi-word operations (for local control objects among others), 4) message-driven computation, and 5) combined DMA plus synchronization. Research will ascertain, evaluate, and analyze the degree of operational improvement that may be derived from such hardware support.
TG OCR requires hardware to support some form of atomic locking. Additional hardware features identified for increased efficiency: 1) remote atomics for cheaper manipulation of far-away memory; 2) heterogeneity, to tailor "user" cores for user code and "runtime" cores for runtime code (no floating point, for example); 3) fast runtime core-to-core communication, to allow the runtime to communicate efficiently without impacting user code; 4) asynchronous data movement (DMA engines); 5) hardware monitoring, to allow introspection and adaptation; and 6) knowledge of hardware structure (memory costs, network links available, etc.), enabling more efficient scheduling and placement.
DEGAS For efficient processing, remote DMA operations and low computational overhead queue pair access are essential, as is interrupt-free message processing. Loose ordering restrictions on message and RDMA processing are important, as is the ability to issue relatively large numbers (hundreds) of outstanding remote memory operations. Registration and pinning of RDMA memory often remain issues for us, and we would like both low-overhead memory registration techniques and the ability to have large numbers of registered memory areas, as we have encountered difficulties due to limitations on the number of (distinct) memory regions that can be accessed by an RDMA-capable device.
D-TEC Nothing beyond those already found on conventional parallel and distributed systems. The APGAS runtime has been extended to exploit unique hardware capabilities of particular machines (e.g., the Torrent interconnect in the Power 775), and if unique hardware capabilities are available on future systems the APGAS runtime could be extended to exploit them.
DynAX Our fundamental requirements are atomic operations (preferably at least compare-and-swap), memory fences, RDMA, power/clock/frequency management, and hardware failure event notification. Optionally, F/E bits and explicit associative memories would yield additional efficiency improvements. If a transparent data cache is available on each core, then features like hardware transactional memory can greatly speed up operations on shared data structures.
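For example, a runtime's shared ready-list can be maintained lock-free with nothing more than compare-and-swap and the associated fences (a generic C++ sketch, not SWARM code):

    #include <atomic>

    // Generic sketch of why compare-and-swap is the baseline requirement:
    // a lock-free push onto a shared ready-list of task nodes.
    struct Node { Node* next; /* task payload */ };

    void push(std::atomic<Node*>& head, Node* n) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;                  // link onto the current head
        } while (!head.compare_exchange_weak(old, n,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }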
X-TUNE Autotuning relies on accurate hardware monitoring to provide measurements used to calculate optimization objectives.
GVR GVR is designed for portability, and should be able to run on systems ranging from current-day petascale to CORAL to Exascale systems. However, hardware features such as integrated NVRAM, efficient change tracking, data compression, efficient and reliable RMA/RDMA, collectives, etc. will further increase the efficiency of GVR.
CORVETTE  
SLEEC  
PIPER N/A
Charm++ All the features mentioned are useful when available, though the RTS can function with what is available on each machine. DMA is almost essential to derive the benefits of latency tolerance implicit in Charm++ over-decomposition model.
Early Career-SriramK (1) Remote atomic operations, (2) Asynchronous data transfer (DMA), (3) Support for efficient remote method invocation, (4) Hardware locality and topology information.
What is your model for execution of multiple different programs (i.e., a single machine doing more than one thing) in terms of division, isolation, containment and protection?
XPRESS The HPX runtime system supports ParalleX processes, which serve as logical contexts and are referenced through a hierarchical namespace. The global root process of the entire system provides global naming. Each program has a program root process that contains the instances of the dedicated runtime kernel and the "main" process of the user application. The process boundaries incorporate a form of capabilities-based addressing for protection. Programs are logically separate and isolated, although they can interact through the upper hierarchy of the process stack. Nonetheless, programs may share physical resources (localities). The underlying OS manages the protection of the virtual address space.
TG Our programming model splits user code into small event-driven tasks (EDTs). Multiple non-related EDT sub-graphs can coexist at the same time with the same runtime. While not isolating applications, it does automatically globally balance all the applications at once. The locality aware scheduling will also naturally migrate related data and code closer together thereby physically partitioning the different applications. If a more secure model is required, different runtimes can run on a subset of the machine thereby statically partitioning the machine for the various applications; it is more secure but less flexible.
DEGAS Our model supports hierarchical applications, in which the top level of the hierarchy may be logically separate programs (or physics models, or...). We also have a strong emphasis on interoperability, including with current MPI and MPI+X applications. Part of our interoperability between tasking layers is supported by the work on Lithe. We are interested in the iPython model for combining applications into workflows, but this is not part of the DEGAS project itself.
D-TEC If this is another way of asking question 6), then this is not an issue being addressed by our runtime work.
DynAX The SWARM runtime depends on isolation features in the hardware, which vary from one hardware platform to the next. Where possible, SWARM will make use of hardware isolation features to protect multiple programs from each other. If a single instance of the SWARM runtime software is in control of all of the running programs that must be isolated from each other, it is both well suited and in a perfect position to ensure that no program starves others of resources. If there is a higher-level OS or executive managing multiple SWARM instances, higher-level signaling may be needed to prevent resource starvation. When distinct applications must be run within the same runtime, hardware features will be required to prevent the applications from reading from or writing into each other's state; SWARM cannot provide this guarantee on its own. However, instituting a single-assignment policy helps prevent most applications from running up against the hardware protection mechanisms accidentally; doing so constitutes a programmer error in the application. If distinct applications have distinct runtime instances, then SWARM has little to no control over its client applications' attempts to read or write things they shouldn't, and so must rely entirely on hardware mechanisms and lower software layers for protection.
X-TUNE  
GVR Each program would have a unique instance of the GVR library, and as such create a version store that includes several independent streams of global-view structures (tuned for efficiency and resilience). However, the version stores do not interact with each other, so there is no interference. Access control to version stores is enforced by the operating system.
CORVETTE  
SLEEC  
PIPER N/A
Charm++ Multiple modules are well supported by the message-driven scheduling paradigm: modules interleave their execution based on availability of data, thus overlapping idle time in one with useful computation in the other. The same idea has been extended to different applications in cloud settings. No strong isolation (as in OS protection domains) is supported.
Early Career-SriramK We do not address isolation/containment across different applications running on a machine. We support phase-based switching between programming/execution paradigms. Different partitions within an application can employ distinct programming paradigms as long as basic progress guarantees are met.

Runtimes (os/hardware-facing)

PI
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

What system calls does your RTS currently use?
XPRESS HPX requires basic calls for memory allocation and deallocation, virtual address translation and management, thread execution resource allocation and deallocation, parcel communication transmit and receive, error detection, and others.
TG Our RTS is platform independent and we have been building a hardware and system abstraction layer that wraps all "system calls" that we may need. On x86, we rely on calls to print, exit, and do memory and thread management. These same functionalities are provided differently on other platforms.
DEGAS  
D-TEC Typical POSIX calls for memory allocation/deallocation, threads and synchronization operations, support needed for core libc operations.
DynAX SWARM requires access to hardware threads, memory, and network interconnect(s), whether by system call or direct access. On today's x86 clusters, SWARM additionally needs access to I/O facilities, such as the Linux select, read, and write calls.
X-TUNE Presumably, autotuning would only be applied when ready to run software in production mode. I suspect correctness software would only be used if the tuning process had some error, in which case some overhead would be tolerable.
GVR  
CORVETTE  
SLEEC  
PIPER The PIPER runtime will be used to collect performance information - it will be out of band, potentially running on external (non-compute node) resources. As such, we require additional communication mechanisms, which is currently mostly done through sockets. Additionally, tools typically use ptrace, signals, and shared memory segments, as well as the dynamic linker for their implementation.
Does your RTS span the system? If so, what network interface capability does your RTS need?
XPRESS The HPX RTS spans the system. It requires a global address space and the parcel message-driven interface.
TG Yes, it can span the entire system depending on the platform. We have defined very simple communication interfaces (which we will almost certainly extend) that currently allow the RTS to send and receive one-way messages between nodes.
DEGAS  
D-TEC We run different instances of the X10/APGAS runtime across different OS instances on the system. They coordinate via active messages. We developed an active-message-based transport, which we implemented on top of TCP/IP and MPI.
DynAX Yes, SWARM operates on all available/configured threads of all available/configured nodes. SWARM can operate over stream-, message-, or DMA-based interconnects.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
PIPER Tools will have a global "runtime" to collect and aggregate data - this network will be out of band. This will span the whole job, in some cases the whole machine. A high performance communication mechanism would be preferable - currently mostly sockets are used.
How does your RTS map user-level and OS-level scheduling?
XPRESS The LXK OS allocates a share of its execution resources (e.g., Pthreads) to each relative-root ParalleX process allocated to the locality. The HPX runtime system uses lightweight scheduling policies to assign user threads to the allocated OS threads.
TG Our RTS is built on the assumption that there is almost nothing below it. In other words, we try to rely as little as possible on the operating system. For scheduling for example, on a traditional x86 Linux system, we create a certain number of pinned worker threads and we then manage work on these workers ourselves.
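On Linux, pinning workers so the runtime (rather than the OS) owns placement can be done with pthread_setaffinity_np; the sketch below is illustrative only and is not the OCR code (pin_to_core is a hypothetical helper).

    #include <pthread.h>   // requires _GNU_SOURCE (defined by default with g++ on Linux)
    #include <sched.h>
    #include <thread>
    #include <vector>

    // Illustrative sketch: create one worker per core and pin it, so the runtime
    // manages all work placement itself rather than relying on the OS scheduler.
    void pin_to_core(std::thread& t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    int main() {
        int ncores = static_cast<int>(std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (int c = 0; c < ncores; ++c) {
            workers.emplace_back([] { /* pop and run tasks from runtime-managed queues */ });
            pin_to_core(workers.back(), c);
        }
        for (auto& w : workers) w.join();
    }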
DEGAS  
D-TEC We allocate a pool of OS-level execution resources (e.g., pthreads). Our scheduler then uses these resources as workers on which to schedule the APGAS-level tasks using a work-stealing scheduler.
DynAX SWARM uses codelets to intermediate between threads and function/method calls. Threads are set up and bound at runtime startup; codelets are bound to particular threads only when they're dispatched, unless some more specific binding is arranged for before readying the codelet. The runtime can dynamically balance load by shifting readied codelets and/or context data from one location to another. When the OS is in charge of power management, blocking is used to relinquish a hardware thread to the OS so that it can be used for other work, or its core powered down.
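A codelet in this style can be sketched as a body plus an outstanding-dependence count; the last producer to satisfy a dependence makes it ready, and only then is it handed to whichever worker dispatches it (hypothetical types, not the SWARM API).

    #include <atomic>
    #include <functional>

    // Hypothetical codelet sketch: a codelet becomes runnable only when its
    // dependence count drops to zero; until dispatch it is bound to no thread.
    struct Codelet {
        std::atomic<int> deps;          // outstanding dependences
        std::function<void()> body;     // work to run once all inputs are ready
    };

    // Called by a producer when one input of 'c' becomes available.
    // Returns true if the caller should hand 'c' to the scheduler's ready queue.
    bool satisfy_dependence(Codelet& c) {
        return c.deps.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }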
X-TUNE The most interesting tool would be one that could compare two different versions of the code to see where changes to variable values are observed.
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
What does your RTS use for locality information?
XPRESS The “locality” is defined as a synchronous domain that guarantees bounded response time and compound atomic sequences of operations. Compute complexes (thread instances) are to be performed on a single locality at a time and can assume its properties. ParalleX Processes are contexts that define relative logical locality although this may span multiple localities. Parcels permit asynchronous non-blocking operation and move work to data to minimize latency effects.
TG We expect this information to come from: (a) user (or higher-level tools/compilers) hints, (b) introspection of the physical layout based on configuration files, and (c) (potentially) introspection into machine behavior.
DEGAS  
D-TEC The X10/APGAS runtime system spans over multiple shared-memory domains called places. An application specifies the place of each data object and computational task.
DynAX It uses a tree of locale descriptors to associate threads, cores, nodes, etc. with each other, typically in a fashion correlating with the hardware memory hierarchy.
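A locale descriptor tree of this kind might look like the following sketch (hypothetical structure, not SWARM's actual descriptors); finding the nearest common ancestor of two locales identifies the smallest hardware domain that contains both.

    #include <string>
    #include <vector>

    // Hypothetical locale tree: each node describes one level of the hardware
    // hierarchy (machine -> node -> socket -> core), letting the runtime place
    // work near its data by walking the tree.
    struct Locale {
        std::string level;                 // e.g., "machine", "node", "socket", "core"
        Locale* parent = nullptr;
        std::vector<Locale*> children;
    };

    // Nearest common ancestor of two locales: the smallest domain containing both
    // (useful, e.g., when deciding how far away to steal or spill work).
    const Locale* common_ancestor(const Locale* a, const Locale* b) {
        for (const Locale* x = a; x != nullptr; x = x->parent)
            for (const Locale* y = b; y != nullptr; y = y->parent)
                if (x == y) return x;
        return nullptr;
    }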
X-TUNE The key issue will be understanding when differences in output are acceptable, and when they represent an error.
GVR  
CORVETTE  
SLEEC  
PIPER Locality/topology information should be exposed by the application facing runtime and will be used for proper attribution of performance data.
What OS or hardware information does your RTS need to monitor and adapt?
XPRESS Availability of execution resources, energy consumption, detected errors, delays due to contention.
TG Performance monitoring units and fault detection.
DEGAS  
D-TEC The X10/APGAS RTS monitors the connections between nodes (hosts) to detect node and network failures.
DynAX Reliable notification of hardware failures, and a local or global cycle counter or real-time clock. Performance counters would help with load modeling and balancing.
X-TUNE  
GVR  
CORVETTE  
SLEEC  
PIPER in short: anything and everything - in particular hardware counters (in profiling and sampling) and any kind of system adaptation information (where does system configuration change) is required
Does your RTS require support for global namespace or global address space?
XPRESS Yes.
TG No, will use if available.
DEGAS  
D-TEC Currently the APGAS runtime provides a global address space entirely in software. If the lower-level system software provided full or partial support for a global address space, the APGAS runtime could exploit it. However, we do not require global address support from the underlying system.
DynAX SWARM can take advantage of a global name/address space, but provides a global namespace entirely in software. OS or hardware involvement is only needed for data storage and communication.
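One common way to provide such a software global namespace is to pack a place (node) identifier and a local offset into a single 64-bit value; the split shown below is purely illustrative and is not SWARM's actual encoding.

    #include <cstdint>

    // Illustrative software global address: place id in the high 16 bits, local
    // offset in the low 48 bits. Remote accesses become messages to the owner;
    // no OS or hardware address-translation support is required.
    using GlobalAddr = std::uint64_t;

    constexpr GlobalAddr make_global(std::uint32_t place, std::uint64_t offset) {
        return (static_cast<GlobalAddr>(place) << 48) | (offset & ((1ull << 48) - 1));
    }
    constexpr std::uint32_t place_of(GlobalAddr g)  { return static_cast<std::uint32_t>(g >> 48); }
    constexpr std::uint64_t offset_of(GlobalAddr g) { return g & ((1ull << 48) - 1); }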
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
What local memory management capability does your RTS require?
XPRESS It must have support for allocation and deallocation of physical memory blocks. It must have support for protected virtual memory addresses at the local level. It must receive error information during memory accesses.
TG Our RTS self-manages fine-grained allocations. It simply needs to acquire range(s) of addresses it can use.
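A self-managing allocator of this kind typically just reserves one large anonymous mapping from the OS and carves it up internally; the sketch below (hypothetical, not the OCR allocator) uses mmap and a simple bump pointer.

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical sketch: acquire one large address range from the OS, then
    // satisfy all fine-grained allocations internally with a bump pointer,
    // so no further system calls are needed on the allocation fast path.
    struct Arena {
        std::uint8_t* base = nullptr;
        std::size_t   size = 0;
        std::size_t   used = 0;
    };

    bool arena_init(Arena& a, std::size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return false;
        a = {static_cast<std::uint8_t*>(p), bytes, 0};
        return true;
    }

    void* arena_alloc(Arena& a, std::size_t bytes) {
        std::size_t aligned = (bytes + 15) & ~static_cast<std::size_t>(15);   // 16-byte alignment
        if (a.used + aligned > a.size) return nullptr;
        void* p = a.base + a.used;
        a.used += aligned;
        return p;
    }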
DEGAS  
D-TEC Garbage collection.
DynAX SWARM requires the ability to allocate physical memory.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER Individual parts of the runtime will require dynamic memory management - additionally, shared memory communication with a target process would be highly beneficial
Does your RTS address external I/O capability?
XPRESS Yes.
TG Yes (partial).
DEGAS  
D-TEC No
DynAX Yes.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
What interface and/or mechanism is used for the OS to request RTS services?
XPRESS The OS (e.g., LXK) may make requests of the runtime to coordinate actions, resources, and services across multiple localities or the entire system and to provide high-level functionality such as POSIX calls.
TG n/a
DEGAS  
D-TEC The X10/APGAS RTS is linked with the application binary.
DynAX Current versions of SWARM do not require the OS to request services from the runtime. In the event this is necessary, it's expected that either a signal-/interrupt-based or polling-based interface will be provided, and either of these can be integrated easily.
X-TUNE N/A -- We use standard languages and run-time support.
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
How does your RTS support legacy application or legacy RTS capability?
XPRESS Both MPI and OpenMP software interfaces are being provided to XPI as a target interface to HPX. LXK can also support both in native form.
TG Not in TG scope.
DEGAS  
D-TEC N/A
DynAX Legacy applications can be converted piecewise or in their entirety. While the application may block normally during single-threaded regions, parallelized regions require blocking calls to use stack switching or (equivalently) extra software threads, or else to break apart blocking operations into separate initiation and callback sections. Where possible, SWARM provides predefined exports that allow asynchronous use of common legacy runtime/API functionality.
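The initiation-plus-callback split can be illustrated roughly as follows (hypothetical helper, not a SWARM export; a real runtime would use an I/O service rather than spawning a thread per call):

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <unistd.h>

    // Hypothetical illustration of splitting a blocking legacy call into an
    // asynchronous initiation plus a completion callback (continuation codelet),
    // so the worker that initiated it never blocks.
    void async_legacy_read(int fd, void* buf, std::size_t len,
                           std::function<void(ssize_t)> on_complete) {
        std::thread([=] {                      // stand-in for a runtime I/O helper
            ssize_t n = read(fd, buf, len);    // the blocking legacy operation
            on_complete(n);                    // schedule the continuation
        }).detach();
    }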

X-TUNE Scalability and determining what is an error seem like the biggest challenges.
GVR  
CORVETTE  
SLEEC  
PIPER Yes: PIPER components intend to support tools for MPI+X codes as well as new RTS and DSL approaches
Does your RTS depend on any specific hardware-specific capability?
XPRESS HPX at a minimum requires standard hardware functionality of conventional systems but would benefit from new capabilities for efficiency and scalability.
TG No but it can take advantage of some if available.
DEGAS  
D-TEC No. But the X10/APGAS RTS can take advantage of hardware-specific networking capabilities and CUDA GPUs.
DynAX SWARM can operate perfectly well on commodity systems, but benefits from access to performance-counting and power-monitoring/-control facilities.
X-TUNE  
GVR  
CORVETTE  
SLEEC N/A
PIPER Full (and well documented) access to performance counters, profiling and sampling

Scientific Libraries

Sonia requested that Milind Kulkarni initiate this page. For comments, please contact Milind.

PI
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

Describe how you expect to target (optimize/analyze) applications written using existing computational libraries
XPRESS Libraries written in MPI with C will run on XPRESS systems using UH libraries combined with ParalleX XPI/HPX interoperability interfaces. It is expected that future or important libraries will be developed employing new execution methods/interfaces.
TG OCR scheduler will optimize execution of code generated by R-Stream.
DEGAS  
D-TEC Where appropriate, library abstractions will be provided with compiler support (typically for finer-granularity abstractions at an expression or statement level). Source-to-source transformations will rewrite the code to leverage abstraction semantics, and program analysis will be used to identify the restricted contexts that support the generation of the most efficient code. Fundamentally, libraries can't see how their abstractions are used within an application, whereas the compiler can do so readily and can use such information to generate tailored code.
DynAX We are focusing on ways to identify scalable and resilient data access and movement patterns, and express them efficiently in task-based runtimes. For computational libraries which do not already provide such semantics, alternative means must be found. (For instance, a LAPACK SVD call can be replaced with a distributed, more scalable equivalent.)
X-TUNE Work on autotuning to select among code variants could be applied to libraries that provide multiple implementations of the same computation. The key idea is to build a model for variant selection based on features of input data, and use this to make run-time selection decisions.
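A run-time variant selector of this kind can be sketched as a list of (model, implementation) pairs over input features; the names below (Features, Variant, dispatch) are illustrative, not X-TUNE's interface.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Hypothetical sketch of run-time variant selection: each tuned variant is
    // paired with a predicate over input features (size, sparsity, ...), and the
    // first variant whose model accepts the input is executed.
    struct Features { std::size_t n; double sparsity; };

    struct Variant {
        std::function<bool(const Features&)> model;   // learned or hand-built selector
        std::function<void(const Features&)> run;     // one implementation of the computation
    };

    void dispatch(const std::vector<Variant>& variants, const Features& f) {
        for (const auto& v : variants)
            if (v.model(f)) { v.run(f); return; }
        if (!variants.empty()) variants.back().run(f);   // fallback: most general variant
    }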
GVR  
CORVETTE  
SLEEC  
PIPER Support optimization efforts with tools that can capture some of the internal semantics of a given library (e.g., levels of multigrid V cycle or patches in an AMR library)
Many computational libraries (e.g., Kokkos in Trilinos) provide support for managing data distribution and communication. Describe how your project targets applications that use such libraries.
XPRESS This issue is unresolved.
TG The OCR tuning hints framework can be used for user directed management of data and communication.
DEGAS  
D-TEC We expect to leverage existing libraries and runtime systems (most commonly implemented as libraries) as needed. The X10 runtime system will be used, for example, to abstract communication between distributed-memory processors. Other communication libraries (e.g., MPI) are being used both to simplify the generation of code by the compiler and to leverage specific semantics that can, with program analysis, be used to rewrite application code to make it more efficient and/or leverage specific Exascale hardware features.
DynAX Such libraries often have their own system/runtime requirements. If those requirements line up with the requirements of the application, no further adaptation is necessary. Otherwise, such a library could possibly be used through some form of adaptation layer, or the algorithm could simply be ported to run on the necessary software stack, directly. This demonstrates a need for interoperability, which we feel is an area that needs to be explored further.
X-TUNE There is an opportunity to apply autotuning to such decisions.
GVR  
CORVETTE  
SLEEC N/A
PIPER PIPER will provide stack wide instrumentation to facilitate optimization - access to internal information only known to the library should be exported to tools through appropriate APIs (preferably through similar and interoperable APIs)
If your project aims to develop new programming models, describe any plans to integrate existing computational libraries into the model, or how you will transition applications written using such libraries to your model.
XPRESS Low-level system-oriented libraries such as STDIO will be employed by the LXK and HPX systems, among others. No scientific libraries per se will be built into the systems as intrinsics below the compiler level. Over time many libraries will be ported to the ParalleX model for dramatic improvements in efficiency and scalability.
TG R-Stream compiler
DEGAS  
D-TEC Our research work supports more of how to build the compiler support for programming models than a focus on a specific DSL or programming model. However, specific work relative to MPI leverages MPI semantics to rewrite application code (via compiler source-to-source transformations) to better overlap communication and computation. This is done as one of many building blocks from which to construct DSLs that would implement numerous programming models. Other work on leveraging semantics in existing HPC code targets rewriting that code for both single and multiple GPUs per node; this work leverages several OpenMP runtime libraries.
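The communication/computation overlap that such a transformation produces looks roughly like the following plain MPI fragment (illustrative only; it is not code generated by D-TEC):

    #include <mpi.h>

    // Illustrative overlap of communication and computation in plain MPI: post
    // nonblocking transfers, perform independent work, then wait before using
    // the received data. A source-to-source tool can rewrite blocking calls
    // into this form automatically.
    void exchange_and_compute(double* sendbuf, double* recvbuf, int count,
                              int left, int right, MPI_Comm comm) {
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        // ... computation that does not touch sendbuf or recvbuf runs here ...

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // recvbuf is now safe to read
    }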
DynAX The HTA programming model may result in a new generation of computational libraries.
X-TUNE N/A
GVR  
CORVETTE  
SLEEC N/A
PIPER N/A
What sorts of properties (semantics of computation, information about data usage, etc.) would you find useful to your project if captured by computational libraries?
XPRESS Libraries crafted in a form that eliminated global barriers, worked on globally addressed objects, and exploited message driven computation would greatly facilitate the porting of conventional rigid models to future dynamic adaptive and scalable models such as the ParalleX based methods.
TG Affinities, priorities, accuracy expectations, critical/non-critical tasks and data.
DEGAS  
D-TEC Libraries should present simple user-level APIs with clear semantics. A relatively coarse level of granularity of semantics is required to keep library use from contributing to abstraction penalties. Appropriate properties for libraries are data handling specific to Adaptive Mesh Refinement, data management associated with many-core optimizations, etc. Actual use or fine-grained access of data abstractions via libraries can be a problem for general-purpose compilers to optimize.
DynAX Wide availability / compatibility with multiple runtimes would help reduce effort. The ability to tune performance not only for a single library call, but across the application as a whole, would be beneficial.

Algorithms vary widely in their data access patterns, and this means that, for a particular algorithm, some data distributions are much more suitable than others. An application developer may have full control of the data's distribution before the application calls the computational library, but has no idea what data access pattern the library uses internally, and therefore, performance is lost by rearranging data unnecessarily. Some feedback from the library would be helpful for preventing that kind of performance loss.

X-TUNE Affinity information would be helpful.
GVR  
CORVETTE  
SLEEC  
PIPER Attribution information about internal data structures (e.g., data distributions, patch information for AMR) as well as phase/time slice information

 

Core Team

Program Manager: Sonia Sachs
E-mail: sonia.sachs@science.doe.gov


Resources


Presentations

XARC
Abstract Representations for the Extreme-Scale Stack
ARGO
Hobbes
X-Stack and OS/R Overview
DTEC-x-stack-PI-meeting
TRALEIKA Overview
XPRESS
Progress and Challenges
SLEEC
GVR
X-TUNE
Vancouver: Designing a Next-Generation Software Infrastructure for Productive Heterogeneous Exascale Computing
Runtime System Report
Report of the 2014
SLEEC: Semantics-rich Libraries for Effective Exascale Computation
DEGAS: Dynamic Exascale Global Address Space
CORVETTE: Program Correctness, Verification, and Testing for Exascale
ARGO, HOBBES, and X-ARCC, Marc Snir, ANL
Why DSLs are Important to the DoE Exascale Mission, Saman Amarasinghe, MIT
Why DSLs Are Desirable, Anshu
Apps on OCR, Roger Golliver, UIUC
XPRESS Project Update, Ron Brightwell, Sandia
Apps on Charm++, Sanjay Kale, UIUC
CORVETTE: Program Correctness, Verification, and Testing for Exascale, Koushik Sen, UC Berkeley
SST, Arun Rodrigues, Sandia
CoDEx: CoDesign for Exascale, John Shalf, LBNL

Quad Charts

XPRESS Quad Chart Oct 2013
Traleika Quad Chart Oct 2013
SLEEC Quad Chart Oct 2013
PIPER Quad Chart Oct 2013
GVR Quad Chart Oct 2013
