CCT-TR-2008-2
Title:
ALPACA: Cactus Tools for Application Level Performance and Correctness Analysis
Authors:
Schnetter, Erik; Allen, Gabrielle; Goodale, Tom; Tyagi, Mayank
Summary:
Although the speed and performance of high-end computers have
increased dramatically over the last decade, the ease of programming
such parallel computers has not progressed. The time and effort
required to develop and debug scientific software has become the
bottleneck in many areas of science and engineering. The difficulty
of developing high-performance software is recognised as one of the
most significant challenges today in the effective use of large-scale
computers.
Cactus is a framework for science applications which is used to
simulate physical systems in many fields of science, such as black
holes and neutron stars in general relativity. As in other software
frameworks, applications are built from separately developed and
tested components. Below we outline Alpaca, a concept and a project
to develop high-level tools to allow developers and end-users to
examine and validate the correctness of an application, and aid them
in measuring and improving its performance in production
environments. These tools are components themselves, built into the
application and interacting with it. Alpaca's approach also helps
make applications tolerant of partial system failures, a pressing
need as tomorrow's architectures will consist of tens of thousands
of nodes.
In contrast to existing debuggers and profilers, Alpaca's approach
works at a much higher level, at the level of the physical equations
and their discretisations which are implemented by the application,
not at the level of individual lines of code or variables. It is not
enough for only the main kernels to be correct and show good
scalability -- the overall application, which may contain many
smaller modules, must perform. We assume that Alpaca's integrative
ansatz will lead to well-tested and highly efficient applications
which are developed on a shorter time scale and execute more reliably.
Download Article:
CCT-TR-2008-2.pdf