Source: TeraGrid
The National Science Foundation Office of Cyberinfrastructure's Blue Waters and TeraGrid projects co-sponsored the Fault Tolerance for Extreme Scalability Workshop, held recently in Albuquerque, New Mexico. Fifty national experts met to discuss issues relating to fault tolerance on today's and tomorrow's petascale and exascale computing systems: supercomputers capable of performing quadrillions and quintillions of calculations every second, respectively. A petascale system has computational power roughly comparable to that of one million personal computers.
As ever more powerful systems are built from an increasing number of components, even a small failure rate for an individual component can halt a long-running application spread across tens of thousands of processors, forcing the application to be restarted. The Blue Waters system, for example, will use more than 200,000 processors to achieve sustained performance of 1 petaflop (1 quadrillion calculations per second) for a range of science and engineering applications. While the fault rate for individual processors and other components may be small, even a small rate will produce frequent faults across such a large number of components, and strategies must be developed to ensure that the system and the applications it runs can tolerate these faults.
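To make the scaling argument concrete, the short sketch below shows how system-level mean time between failures (MTBF) shrinks as the component count grows, assuming independent, exponentially distributed component failures. The component reliability figure is purely illustrative, not a measurement from Blue Waters or any system discussed at the workshop:

```python
# Illustrative sketch: system MTBF under independent component failures.
# The 50-year component MTBF is a hypothetical number, not a real datum.

def system_mtbf(component_mtbf_hours: float, num_components: int) -> float:
    """With independent exponential failures, failure rates add, so the
    system MTBF is the component MTBF divided by the component count."""
    return component_mtbf_hours / num_components

# Suppose each processor fails, on average, once every 50 years.
component_mtbf = 50 * 365 * 24  # hours

for n in (1_000, 10_000, 200_000):
    hours = system_mtbf(component_mtbf, n)
    print(f"{n:>7} components -> system MTBF ~ {hours:.1f} hours")
```

Under these assumptions, a 200,000-processor machine built from parts that each fail once in 50 years would still see a failure roughly every two hours, so an application running for days cannot assume a fault-free run.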
Over two days, application scientists, high-performance computing (HPC) center staff, systems analysts, middleware specialists, and fault tolerance experts explored past practices and common pitfalls. Attendees included researchers from universities and government centers sponsored by the National Science Foundation, the Department of Energy, and the Department of Defense, as well as representatives from major petascale hardware and software vendors. Eighteen presentations fostered engaging discussions about challenges, successes, and opportunities for addressing fault tolerance, and speakers shared ways to push the limits of capability-class computational and storage systems. Topics included fault tolerance and resiliency, observed fault types and rates on today's largest computer systems, storage systems, middleware designed for fault tolerance (including software that helps applications and systems checkpoint and restart), and the applications themselves. Several systems specialists shared their experiences with faults and resiliency, methodologies for acceptance testing, and performance metrics that account for inevitable events such as chassis failures, boot failures, silent data corruption, and more. The sponsors intended these shared experiences to help the high-performance computing community, including those who manage and use HPC resources, work toward more generally accepted standards for fault tolerance.
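As one illustration of the checkpoint/restart pattern that such middleware supports, here is a minimal application-level sketch. The file name, interval, and workload are hypothetical choices for illustration only, not details of any middleware presented at the workshop:

```python
# Minimal checkpoint/restart sketch (hypothetical example, not any
# specific middleware discussed at the workshop).
import os
import pickle

CHECKPOINT_FILE = "state.ckpt"   # hypothetical file name
CHECKPOINT_INTERVAL = 100        # steps between checkpoints (arbitrary)
TOTAL_STEPS = 1_000

def load_checkpoint():
    """Resume from the last saved state, or start fresh if none exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0.0}

def save_checkpoint(state):
    """Write state to a temp file, then rename atomically so a crash
    mid-write never corrupts the previous good checkpoint."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

state = load_checkpoint()
for step in range(state["step"], TOTAL_STEPS):
    state["total"] += step * 0.001   # stand-in for one unit of real work
    state["step"] = step + 1
    if state["step"] % CHECKPOINT_INTERVAL == 0:
        save_checkpoint(state)       # after a failure, restart resumes here

print("done:", state["total"])
```

The essential trade-off is checkpoint frequency: checkpointing more often costs I/O time, while checkpointing less often means more lost work must be redone after a failure.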
"It is invaluable for the systems specialists, middleware designers, and applications scientists to share their experiences and to talk about their expectations for other parts of the HPC ecosystem. This is the only way we will know what works, what doesn't work, and what we still need to do," said Daniel S. Katz, TeraGrid Grid Infrastructure Group (GIG) Director of Science and lead organizer of the workshop. "Although the issues vary from platform to platform, there are many common experiences, tools, and techniques that, when shared, can lead to the development of best practices"" he added.
A report of the proceedings will be available on the Blue Waters and TeraGrid websites. For more information, please visit www.ncsa.uiuc.edu/BlueWaters/ or www.teragrid.org.
About Blue Waters and TeraGrid:
Blue Waters is a National Science Foundation-funded project to deliver high-capability computing power to the nation's scientists and engineers, enabling them to achieve breakthrough results. Blue Waters will provide sustained performance of 1 petaflop (1 quadrillion calculations per second) on a range of science and engineering applications when it comes online in 2011. The project is a joint effort of the National Center for Supercomputing Applications, the University of Illinois at Urbana-Champaign, IBM, and the Great Lakes Consortium for Petascale Computation.
The TeraGrid, sponsored by the National Science Foundation Office of Cyberinfrastructure, is a partnership of people, resources and services that enables discovery in U.S. science and engineering. Through coordinated policy, grid software, and high-performance network connections, the TeraGrid integrates a distributed set of high-capability computational, data-management and visualization resources to make research more productive. With Science Gateway collaborations and education programs, the TeraGrid also connects and broadens scientific communities.
