CS7700 Data Intensive Distributed Computing
Fall 2006 - Projects
1. Staging vs. Remote I/O: One of the fundamental decisions that data intensive applications need to make is which data placement technique to use in order to access remotely available data: staging or remote I/O. The choice between these approaches heavily affects the overall performance of data intensive distributed applications. Although there are intuitive rules of thumb for deciding when to prefer one over the other, no comprehensive study has yet produced a model (whether application specific or generic) for this decision. In this project, you will take the first step toward formalizing such a model and provide the foundations of a system for intelligent selection between these approaches.
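As a starting point, such a model might be framed as a simple cost comparison. The sketch below is a minimal illustration under assumed parameters (network bandwidth, disk bandwidth, round-trip time, fraction of the file actually accessed); it is not a validated model.

    # Minimal sketch of a cost model for choosing between staging and
    # remote I/O. The model form and all parameter values are
    # illustrative assumptions, not validated measurements.

    def staging_time(file_size, net_bw, disk_bw):
        """Transfer the whole file first, then read it from local disk."""
        return file_size / net_bw + file_size / disk_bw

    def remote_io_time(file_size, access_fraction, net_bw, rtt, num_requests):
        """Read only the needed portion in place, paying per-request latency."""
        bytes_read = file_size * access_fraction
        return num_requests * rtt + bytes_read / net_bw

    def choose(file_size, access_fraction, net_bw, disk_bw, rtt, num_requests):
        s = staging_time(file_size, net_bw, disk_bw)
        r = remote_io_time(file_size, access_fraction, net_bw, rtt, num_requests)
        return ("staging" if s <= r else "remote I/O"), s, r

    # Example: 10 GB file, 5% of it touched in 500 requests, 1 Gb/s
    # network (~125 MB/s), 50 MB/s local disk, 30 ms round trip.
    print(choose(10e9, 0.05, 125e6, 50e6, 0.030, 500))

The crossover such a model predicts (staging wins when most of the file is used, remote I/O wins for sparse access) is exactly the kind of intuition the project should test and refine.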
2. Distributed Supercomputing: Distributed supercomputing applications use grids to aggregate substantial computational resources in order to tackle problems that cannot be solved on a single system [Foster98]. Choose a closely coupled real-life application (e.g. coastal modeling, protein folding, CFD) that requires vast amounts of computational power and a high rate of inter-process communication. Instead of running the application on a single cluster, make use of multiple clusters, connected to each other via very high speed (possibly optical) networks, to run this application. Study the performance improvement due to the increased number of processors versus the communication overhead introduced by the network connections between the clusters.
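One way to reason about this trade-off before running experiments is an Amdahl-style runtime model extended with an inter-cluster communication term. The following sketch is purely illustrative; the model form and every parameter value are assumptions.

    # Illustrative runtime model: compute time shrinks with processor
    # count, while messages that cross cluster boundaries pay a slower,
    # higher-latency link. All numbers are assumptions for exposition.

    def runtime(p, clusters, t_serial, t_parallel, steps, msg_bytes,
                intra_bw, inter_bw, inter_lat):
        compute = t_serial + t_parallel / p
        cross = 1.0 - 1.0 / clusters     # fraction of traffic crossing clusters
        comm = steps * (inter_lat * cross
                        + msg_bytes * cross / inter_bw
                        + msg_bytes * (1.0 - cross) / intra_bw)
        return compute + comm

    # 1000 timesteps, 8 MB exchanged per step, 10 Gb/s within a cluster,
    # 2 Gb/s and 5 ms between clusters (all assumed values).
    single = runtime(64, 1, 10.0, 6000.0, 1000, 8e6, 1.25e9, 2.5e8, 0.005)
    multi = runtime(256, 4, 10.0, 6000.0, 1000, 8e6, 1.25e9, 2.5e8, 0.005)
    print(f"speedup of 4x64 over 1x64: {single / multi:.2f}")

Varying the inter-cluster bandwidth and latency in such a model shows where adding clusters stops paying off, which is the question the experiments should answer for a real application.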
3. Distributed Visualization: The use of virtual distributed RAM disks for high speed visualization is a state-of-the-art technique and is preferred whenever possible, since random disk access is at least an order of magnitude slower than RAM access, even when the disk is local and the RAM is at a remote location. Choose an application that requires rendering and visualization of very large data sets. Instead of reading the data from the local disk and visualizing it, retrieve the data via a high speed (possibly optical) network directly from the memory of a remote cluster into the memory of the visualization machine.
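To make the data path concrete, the toy sketch below keeps the data set in a server's RAM and streams fixed-size blocks over TCP to the visualization client. The protocol, port, and block size are made-up assumptions, not a real RAM-disk implementation (it also assumes the data length is a multiple of the block size).

    # Toy remote-RAM block server and client. Protocol, port, and
    # block size are illustrative assumptions.
    import socket, struct

    BLOCK = 4 * 1024 * 1024  # 4 MB blocks (assumed)

    def serve(data: bytes, port: int = 9999):
        """Hold the data set in RAM; serve one block per 8-byte offset request."""
        with socket.create_server(("", port)) as srv:
            while True:
                conn, _ = srv.accept()
                with conn:
                    while True:
                        hdr = conn.recv(8)
                        if len(hdr) < 8:
                            break
                        (offset,) = struct.unpack("!Q", hdr)
                        conn.sendall(data[offset:offset + BLOCK])

    class RemoteRAM:
        """Client side: read blocks from the server's memory, not a local disk."""
        def __init__(self, host, port=9999):
            self.conn = socket.create_connection((host, port))

        def read_block(self, offset):
            self.conn.sendall(struct.pack("!Q", offset))
            buf = bytearray()
            while len(buf) < BLOCK:
                chunk = self.conn.recv(BLOCK - len(buf))
                if not chunk:
                    break
                buf.extend(chunk)
            return bytes(buf)

A real project would replace this with RDMA or an actual distributed RAM-disk layer, but the client/server split and block-by-offset access pattern stay the same.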
4. Comparison of Grid File Systems: There is an increasing number of distributed parallel file systems available for use by the Grid community (e.g. GPFS, PVFS, Lustre, OceanStore, GFarm). Some of these systems are commercial and some are open source. Study these Grid file systems in detail and compare them, experimentally and/or analytically, in terms of performance, reliability, security, and the functionality they provide. (This may result in a taxonomy paper.)
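For the experimental side, a common baseline is a small throughput harness run against each file system's mount point. The sketch below is bare-bones; the mount paths and sizes are placeholders, and a real study would also vary block size, concurrency, and access pattern.

    # Bare-bones sequential write/read throughput harness. Mount paths
    # and sizes are placeholder assumptions.
    import os, time

    def throughput(path, size=256 * 1024 * 1024, block=1024 * 1024):
        fname = os.path.join(path, "bench.tmp")
        buf = os.urandom(block)
        t0 = time.time()
        with open(fname, "wb") as f:
            for _ in range(size // block):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        t_write = time.time() - t0
        t0 = time.time()
        with open(fname, "rb") as f:
            while f.read(block):
                pass
        t_read = time.time() - t0       # note: may be served from the page cache
        os.remove(fname)
        return size / t_write / 1e6, size / t_read / 1e6   # MB/s

    for mount in ["/mnt/pvfs", "/mnt/lustre", "/mnt/gfarm"]:   # placeholders
        w, r = throughput(mount)
        print(f"{mount}: write {w:.1f} MB/s, read {r:.1f} MB/s")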
5. Study of Caching & Pre-fetching Techniques: The Data Grid infrastructure can consist of three layers of storage distributed across multiple sites: (1) primary very high speed RAM storage for data visualization; (2) secondary disk storage for data analysis and processing; and (3) tertiary tape storage for data archival and long term studies. Data stored on the slow tape storage needs to be pre-fetched and cached on the faster disk or RAM storage to allow optimal access, depending on the needs of the applications. Study different caching and pre-fetching techniques (from tape to disk, and from disk to RAM) developed especially for distributed computing environments and compare them. The comparisons can be performed experimentally and/or analytically. (This may result in a taxonomy paper.)
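A natural baseline for such a comparison is a plain LRU cache combined with one-block-ahead sequential prefetching, sketched below. The tier interfaces and the policy choice are illustrative assumptions.

    # Baseline: LRU cache (faster tier) in front of a slow store
    # (slower tier) with one-block-ahead sequential prefetch.
    from collections import OrderedDict

    class LRUPrefetchCache:
        def __init__(self, capacity, fetch_fn):
            self.capacity = capacity
            self.fetch_fn = fetch_fn     # pulls a block from the slower tier
            self.cache = OrderedDict()   # block_id -> data, in LRU order
            self.hits = self.misses = 0

        def _load(self, block_id):
            if block_id not in self.cache:
                self.cache[block_id] = self.fetch_fn(block_id)
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)   # evict least recently used

        def read(self, block_id):
            if block_id in self.cache:
                self.hits += 1
                self.cache.move_to_end(block_id)
            else:
                self.misses += 1
                self._load(block_id)
            self._load(block_id + 1)     # sequential prefetch, one block ahead
            return self.cache[block_id]

    cache = LRUPrefetchCache(64, fetch_fn=lambda b: f"block-{b}")
    for b in range(100):                 # sequential scan: prefetching pays off
        cache.read(b)
    print(cache.hits, cache.misses)      # 99 hits, 1 miss

Replaying real or synthetic access traces against variants of this class (different eviction policies, deeper prefetch windows) is one way to carry out the experimental comparison.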
6. Data-aware Scheduling: The insufficiency of traditional systems and existing CPU-oriented schedulers in dealing with the complex data handling problem has led to the emergence of a new class of schedulers: data-aware schedulers. One of the first examples of such schedulers is the Stork data placement scheduler [Kosar04]. You will study the limitations of traditional schedulers in handling the challenging data scheduling problems of large scale distributed applications, then design and implement a data-aware scheduler (or enhance an existing one, e.g. Stork) to overcome some of these limitations.
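To make the contrast with CPU-oriented scheduling concrete, the toy placement function below ranks candidate sites by how many bytes of a job's input already reside there, using free CPUs only as a tie-breaker. The scoring rule and data structures are assumptions for illustration, not Stork's actual algorithm.

    # Toy data-aware placement. A CPU-oriented scheduler would use only
    # the free_cpus criterion; here data locality dominates.

    def schedule(job, sites, replica_catalog):
        def score(site):
            local_bytes = sum(size for f, size in job["inputs"].items()
                              if site in replica_catalog.get(f, ()))
            return (local_bytes, sites[site]["free_cpus"])
        return max(sites, key=score)

    sites = {"siteA": {"free_cpus": 10}, "siteB": {"free_cpus": 200}}
    replica_catalog = {"sim.dat": {"siteA"}}          # sim.dat is stored at siteA
    job = {"inputs": {"sim.dat": 50e9, "params.cfg": 1e3}}
    print(schedule(job, sites, replica_catalog))      # siteA, despite fewer CPUs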
7. Grid-enabling a Sequential Application: Work together with field scientists from different application areas such as coastal modeling, bioinformatics, physics, or chemistry (these may be the professors you are already working with) to develop a Grid solution for an application normally developed and/or run in a sequential, single-CPU environment. Study the computational and data requirements of this application; develop a distributed application architecture, an end-to-end workflow for distributed processing and analysis, and performance models & scaling characteristics for this application.
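For the performance-model part, Amdahl's law is the usual first-order starting point for estimating how far the grid-enabled version can scale; the serial fraction below is a made-up example value that profiling would have to supply.

    # First-order scaling estimate via Amdahl's law: speedup is capped
    # by the serial fraction f of the original application.

    def amdahl_speedup(p, f):
        return 1.0 / (f + (1.0 - f) / p)

    f = 0.05   # assumed: profiling shows 5% of runtime is inherently serial
    for p in (1, 8, 64, 512):
        print(f"{p:4d} CPUs -> speedup {amdahl_speedup(p, f):6.2f}")
    # With unlimited CPUs the speedup approaches 1/f = 20x, so reducing
    # the serial fraction matters as much as adding processors.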
8. Workflow Optimization: Take the workflow of an existing sequential/parallel/distributed application from a preferred field of science. Study the limitations of this workflow and define the requirements for further improving/optimizing it. Develop new techniques (or use existing state-of-the-art techniques) to further improve/optimize this workflow in terms of reliability and/or efficiency.
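One concrete way to start is to model the workflow as a task DAG and then layer optimizations on top, e.g. automatic retries for reliability or concurrent execution of independent branches for efficiency. The sketch below uses an assumed minimal representation and is not tied to any particular workflow system.

    # Minimal workflow-as-DAG sketch: run tasks in dependency order,
    # retrying failures (a reliability optimization). Independent tasks
    # could also be dispatched concurrently (an efficiency one).
    from graphlib import TopologicalSorter

    def run_workflow(dag, actions, retries=3):
        for task in TopologicalSorter(dag).static_order():
            for attempt in range(1, retries + 1):
                try:
                    actions[task]()
                    break
                except Exception as e:
                    print(f"{task} failed (attempt {attempt}): {e}")
            else:
                raise RuntimeError(f"{task} exhausted its retries")

    dag = {"analyze": {"transform"}, "transform": {"fetch"}, "fetch": set()}
    actions = {t: (lambda t=t: print(f"running {t}")) for t in dag}
    run_workflow(dag, actions)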