ALPACA: Cactus Tools for Application Level Performance and Correctness Analysis

Schnetter, Erik; Allen, Gabrielle; Goodale, Tom; Tyagi, Mayank


Although the speed and performance of high-end computers have increased dramatically over the last decade, the ease of programming such parallel computers has not progressed. The time and effort required to develop and debug scientific software has become the bottleneck in many areas of science and engineering. The difficulty of developing high-performance software is recognised as one of the most significant challenges today in the effective use of large-scale computers.

Cactus is a framework for science applications which is used to simulate physical systems in many fields of science, such as black holes and neutron stars in general relativity. As in other software frameworks, applications are built from separately developed and tested components. Below we outline Alpaca, a concept and a project to develop high-level tools that allow developers and end-users to examine and validate the correctness of an application, and that aid them in measuring and improving its performance in production environments. These tools are components themselves, built into the application and interacting with it. Alpaca's approach also helps to make applications tolerant of partial system failures, which is becoming a pressing need as tomorrow's architectures will consist of tens of thousands of nodes.

In contrast to existing debuggers and profilers, Alpaca's approach works at a much higher level: at the level of the physical equations and their discretisations which are implemented by the application, not at the level of individual lines of code or variables. It is not enough for only the main kernels to be correct and show good scalability; the overall application, which may contain many smaller modules, must perform well. We expect that Alpaca's integrative ansatz will lead to well-tested and highly efficient applications which are developed on a shorter time scale and execute more reliably.

Download Article: CCT-TR-2008-2.pdf