Lazy Shadowing

The inherent instability of extreme-scale computing systems calls into question the viability of traditional fault-tolerance methods and demands a reconsideration of fault tolerance and power awareness at scale. This project explores a novel paradigm, Lazy Shadowing, to achieve high resiliency in future extreme-scale systems. The basic tenet of Lazy Shadowing is to associate with each main process a shadow that runs in parallel with it, but on a different node and at a reduced execution rate. If the main process completes successfully, its shadow is terminated immediately, resulting in significant energy savings. If the main process fails, however, the shadow’s execution rate is increased so that it completes by the expected completion time without significantly impacting the progress of the remaining tasks. Execution rate can be controlled in different ways, including co-locating multiple shadows on a single computing node, using Dynamic Voltage and Frequency Scaling (DVFS), or both.
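
The rate adjustment described above can be illustrated with a little arithmetic. The following is a minimal sketch, not the project's implementation: it assumes work is measured in units the main process completes at rate 1, that the shadow runs at a constant reduced rate until the failure, and that the shadow must finish by a fixed target time. The function names, the parameters, and the power model (dynamic power growing as the cube of the execution rate, a common simplification for Dynamic Voltage and Frequency Scaling) are all assumptions made for illustration.

```python
def required_shadow_rate(work, shadow_rate, failure_time, deadline):
    """Rate the shadow needs after the main process fails.

    All names are hypothetical. `work` is the total work, `shadow_rate`
    the reduced rate used before the failure, `failure_time` the moment
    the main process fails, and `deadline` the target completion time.
    """
    if not 0 <= failure_time < deadline:
        raise ValueError("failure must occur before the deadline")
    done = shadow_rate * failure_time        # work the shadow has finished
    remaining = max(work - done, 0.0)        # work still left for the shadow
    return remaining / (deadline - failure_time)


def shadow_energy(rate, duration, exponent=3.0):
    """Energy under an ASSUMED power model P(rate) = rate ** exponent."""
    return (rate ** exponent) * duration


# Main process fails at t = 20; the shadow ran at half speed until then.
new_rate = required_shadow_rate(work=100, shadow_rate=0.5,
                                failure_time=20, deadline=120)
print(new_rate)

# Failure-free case: the shadow runs at half speed for the main's 100 time
# units and is then terminated; compare with a full-rate replica.
print(shadow_energy(0.5, 100), shadow_energy(1.0, 100))
```

Under these assumptions, a shadow that has completed 10 of 100 work units when the main process fails must run at 0.9 of full rate to finish within the remaining 100 time units. In the failure-free case, the assumed cubic model puts the half-rate shadow's energy at one eighth of a full-rate replica's, which is the kind of saving the paradigm targets; real savings depend on the actual power characteristics of the hardware.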