SOURCE: International Science Grid This Week
As more powerful systems encompass ever-increasing numbers of components, even a small fault rate on individual processors adds up to frequent faults across the machine as a whole, stopping long-running applications in their tracks.
At a recent workshop, U.S. experts met to discuss issues relating to the fault-tolerance of today’s and tomorrow’s petascale and exascale computing systems. The group explored past practices and common pitfalls, and discussed strategies to ensure that these systems and the applications they run can tolerate the inevitable faults.
“It is invaluable for the systems specialists, middleware designers, and applications scientists to share their experiences and to talk about their expectations for other parts of the HPC ecosystem. This is the only way we will know what works, what doesn’t work, and what we still need to do,” said Daniel S. Katz, TeraGrid Grid Infrastructure Group Director of Science and lead organizer of the workshop.
Sharing her experiences with Kraken, TeraGrid's largest supercomputer, Patricia Kovatch of the National Institute for Computational Sciences noted that application size and system complexity are growing geometrically (the larger the systems get, the faster they grow), while the rate of improvement in fault-mitigation techniques remains constant.
“To stave off this Malthusian Catastrophe,” she said, “we are leveraging economies of scale with shared infrastructure, better machine design and checkpointing.”
Don Lamb, a University of Chicago professor and Director of the ASC/Alliance Flash Center, presented experiences from three production runs of simulation software, called FLASH, used by scientists in fields such as cosmology and plasma physics.
“FLASH handles astronomically large ranges of values of physical quantities, and operates at the upper level of available memory,” said Lamb. “Consequently, it has walked into almost every hardware or software limitation in the high-end systems.”
A checkpoint/rollback capability is in place, he said. “But it is controlled by the application, which has no way of detecting imminent component failures. If a failure happens just before checkpointing, rollback can be expensive.” He suggested a solution, the Fault Tolerance Backplane, that could keep the application informed about the state of the machine and use this knowledge to write a checkpoint before an imminent failure, thereby avoiding the expensive recovery scenario.
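The control flow Lamb describes can be sketched in a few lines. The snippet below is not the actual Fault Tolerance Backplane API or FLASH's checkpoint code; it assumes a hypothetical warning channel (here, a plain queue) that a monitoring layer would fill, and shows how an application loop could checkpoint early when a warning arrives instead of waiting for its next scheduled checkpoint.

```python
import queue
import time

# Stand-in for a backplane subscription: a queue that a (hypothetical)
# monitoring layer would fill with warnings about components predicted to
# fail soon. The real Fault Tolerance Backplane interface differs; this
# only sketches the idea described in the article.
failure_warnings = queue.Queue()

CHECKPOINT_INTERVAL = 600.0  # seconds between scheduled checkpoints (assumed)


def write_checkpoint(state, step):
    """Placeholder for the application's own checkpoint writer."""
    print(f"checkpoint written at step {step}")


def run(total_steps):
    state = {}
    last_checkpoint = time.monotonic()
    for step in range(total_steps):
        state["step"] = step  # placeholder for one unit of real work

        warned = not failure_warnings.empty()
        overdue = time.monotonic() - last_checkpoint >= CHECKPOINT_INTERVAL
        # Checkpoint early when warned of an imminent failure instead of
        # waiting for the next scheduled checkpoint (and risking a rollback).
        if warned or overdue:
            write_checkpoint(state, step)
            last_checkpoint = time.monotonic()
            if warned:
                failure_warnings.get_nowait()  # consume the warning


if __name__ == "__main__":
    run(total_steps=1000)
```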
Several tool and application developers and other systems specialists shared their experiences regarding faults and resiliency, methodologies for acceptance testing, and performance metrics that recognize inevitable events such as chassis failure, boot failure, silent corruption, and more.
John Daly of the Research Directorate at the National Security Agency currently leads an effort on resilience for the Advanced Computing Systems research program. He advocates a focus shift from fault-tolerance in systems to resilience in applications.
Daly outlined three problems he sees in fault-tolerance approaches. First, as the number and density of components increase, so does the rate of system faults, and recovery-based fault tolerance is approaching a theoretical limit. Second, redundancy-based schemes increase the share of resources dedicated to fault recovery. Third, silent failure modes (intolerable for many application users) reduce the effectiveness of monitoring, and hence both application progress and certainty of correctness.
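Daly's first point can be made concrete with the standard first-order estimate of the optimal checkpoint interval (due to Young and to Daly's own earlier analysis), t_opt = sqrt(2 * d * M), where d is the time to write a checkpoint and M is the system's mean time between failures. The sketch below is illustrative only; the 300-second checkpoint time is an assumption, not a figure from the workshop. It shows how the fraction of machine time lost to checkpoint writes and rework grows as MTBF shrinks with component count.

```python
import math


def overhead_fraction(delta, mtbf):
    """First-order fraction of wall-clock time lost to checkpoint writes plus
    rework after failures, at the Young/Daly interval sqrt(2 * delta * mtbf)."""
    t_opt = math.sqrt(2 * delta * mtbf)
    # one checkpoint of cost `delta` per interval, plus on average half an
    # interval of lost work per failure (failures arrive at rate 1/mtbf)
    return delta / t_opt + t_opt / (2 * mtbf)


delta = 300.0  # assumed seconds to write one checkpoint (illustrative)
for mtbf_hours in (100, 24, 6, 1):
    mtbf = mtbf_hours * 3600.0
    print(f"MTBF {mtbf_hours:>3} h -> ~{overhead_fraction(delta, mtbf):.0%} of time lost")
```

With these assumed numbers, the lost fraction climbs from roughly 4% at a 100-hour MTBF to about 40% at one hour, which is the kind of ceiling on recovery-based approaches that Daly describes.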
Resilience, on the other hand, is an application-centric paradigm that aims to protect applications from data corruption and Byzantine faults, Daly said. It aims to do so in a timely and efficient manner (weighing tradeoffs in power, productivity, and performance) and in the presence of hardware or software degradations and failures.
"Fault tolerance uses redundancy and replication to recover from failure,” he said. “Resilience offers a more integrated approach in which the system works with applications to keep them running in spite of component failure."
—Elizabeth Leake, TeraGrid, and Anne Heavey, iSGTW
