LSU Homepage

Computational Biology On The Grid: Decoupling Computation And I/O With ParaMEDIC

Jun 16, 2006 11:00 am to 12:00 pm

Johnston Hall 338
Pavan Balaji, Argonne National Laboratory
Postdoctoral Researcher

Bio

Pavan Balaji holds a joint appointment as a postdoctoral researcher at Argonne National Laboratory and as a fellow of the Computation Institute at the University of Chicago. He received his Ph.D. from the Computer Science and Engineering department at the Ohio State University. His research interests include high-speed interconnects, efficient protocol stacks, parallel programming models and middleware for communication and I/O, and job scheduling and resource management. He has nearly 40 publications in these areas. Dr. Balaji has also served as a Program Chair of the Parallel Programming Models and Systems Software workshop and as a Program Committee Member and Technical Referee for numerous international conferences and journals. He has delivered multiple talks and tutorials at research institutes and conferences. He is a member of the IEEE and ACM. More details about Dr. Balaji are available at http://www.mcs.anl.gov/~balaji.

Abstract

Many large-scale computational biology applications rely on several kinds of resources at once for efficient execution. For example, such applications may require both large compute and large storage resources; however, very few supercomputing centers can provide large quantities of both. Thus, data generated at the compute site often has to be moved to a remote storage site for storage or for visualization and analysis. Clearly, this is not an efficient model, especially when the two sites are distributed over a Grid. In this talk, I'll present a framework called "ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing," which uses application-specific semantic information to convert the generated data into metadata that is orders of magnitude smaller at the compute site, transfer the metadata to the storage site, and re-process the metadata at the storage site to regenerate the output. Specifically, ParaMEDIC trades a small amount of additional computation (in the form of data post-processing) for a potentially significant reduction in the data that needs to be transferred in distributed environments. The ParaMEDIC framework allowed us to use nine different supercomputers distributed within the U.S. to sequence-search the entire microbial genome database against itself and store the one petabyte of generated data in Tokyo, Japan.
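The convert, transfer, and regenerate flow described in the abstract can be illustrated with a toy sketch. This is not the ParaMEDIC implementation: the substring "search", the choice of matching sequence IDs as the metadata, and every name below (`DATABASE`, `find_hits`, `render_report`, `compute_site`, `storage_site`) are assumptions made for illustration only. The sketch also assumes the database is available at both sites.

```python
import json

# Hypothetical toy database, assumed to be available at both sites.
DATABASE = {
    "seq1": "ACGTACGTGA",
    "seq2": "TTGACCGTAA",
    "seq3": "GGGACGTTTT",
}


def find_hits(query, database):
    """Toy stand-in for a sequence search: first exact substring match per sequence."""
    hits = []
    for seq_id in sorted(database):
        offset = database[seq_id].find(query)
        if offset != -1:
            hits.append((seq_id, offset))
    return hits


def render_report(query, database, hits):
    """Produce the verbose output that would otherwise be shipped between sites."""
    blocks = []
    for seq_id, offset in hits:
        marker = " " * offset + "^" * len(query)
        blocks.append(
            f"{seq_id}: match at offset {offset}\n  {database[seq_id]}\n  {marker}"
        )
    return "\n".join(blocks)


def compute_site(query, database):
    """Run the full search, but emit only compact metadata (IDs of matching sequences)."""
    hits = find_hits(query, database)
    return json.dumps({"query": query, "ids": [seq_id for seq_id, _ in hits]})


def storage_site(metadata, database):
    """Regenerate the full output by re-searching only the sequences named in the metadata."""
    meta = json.loads(metadata)
    subset = {seq_id: database[seq_id] for seq_id in meta["ids"]}
    hits = find_hits(meta["query"], subset)
    return render_report(meta["query"], subset, hits)


if __name__ == "__main__":
    metadata = compute_site("ACGT", DATABASE)      # this is what crosses the network
    regenerated = storage_site(metadata, DATABASE)  # rebuilt at the storage site
    print(len(metadata), "bytes of metadata instead of", len(regenerated), "bytes of output")
    print(regenerated)
```

The storage site repeats a small part of the search, restricted to the sequences named in the metadata, which corresponds to the trade of extra post-processing for reduced data transfer that the abstract describes.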
