CS7700 Data Intensive Distributed Computing
Fall 2006 - Projects
1. Staging vs. Remote I/O: One of the fundamental decisions that data intensive applications need to make is which data placement technique to use in order to access remotely available data: staging or remote I/O. The choice between these approaches heavily affects the overall performance of data intensive distributed applications. Although there are intuitive rules of thumb for deciding when to prefer one over the other, no comprehensive study has yet produced a model (whether application specific or generic) for this decision. In this project, you will take the first step toward formalizing such a model and provide the foundations of a system for intelligent selection between these approaches.
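As a starting point, such a model might be framed as a simple cost comparison. The sketch below is a minimal illustration under assumed parameters (network bandwidth, disk bandwidth, round-trip time, fraction of the file actually accessed); it is not a validated model.

    # Minimal sketch of a cost model for choosing between staging and
    # remote I/O. The model form and all parameter values are
    # illustrative assumptions, not validated measurements.

    def staging_time(file_size, net_bw, disk_bw):
        """Transfer the whole file first, then read it from local disk."""
        return file_size / net_bw + file_size / disk_bw

    def remote_io_time(file_size, access_fraction, net_bw, rtt, num_requests):
        """Read only the needed portion in place, paying per-request latency."""
        bytes_read = file_size * access_fraction
        return num_requests * rtt + bytes_read / net_bw

    def choose(file_size, access_fraction, net_bw, disk_bw, rtt, num_requests):
        s = staging_time(file_size, net_bw, disk_bw)
        r = remote_io_time(file_size, access_fraction, net_bw, rtt, num_requests)
        return ("staging" if s <= r else "remote I/O"), s, r

    # Example: 10 GB file, 5% of it touched in 500 requests, 1 Gb/s
    # network (~125 MB/s), 50 MB/s local disk, 30 ms round trip.
    print(choose(10e9, 0.05, 125e6, 50e6, 0.030, 500))

The crossover such a model predicts (staging wins when most of the file is used, remote I/O wins for sparse access) is exactly the kind of intuition the project should test and refine.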
2. Distributed Supercomputing: Distributed supercomputing applications use grids to aggregate substantial computational resources in order to tackle problems that cannot be solved on a single system [Foster98]. Choose a closely coupled real-life application (e.g. coastal modeling, protein folding, CFD) that requires vast amounts of computational power and a high rate of inter-process communication. Instead of running the application on a single cluster, make use of multiple clusters, connected to each other via very high speed (possibly optical) networks, to run this application. Study the performance improvement due to the increased number of processors versus the communication overhead introduced by the network connections between the clusters.
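One way to reason about this trade-off before running experiments is an Amdahl-style runtime model extended with an inter-cluster communication term. The following sketch is purely illustrative; the model form and every parameter value are assumptions.

    # Illustrative runtime model: compute time shrinks with processor
    # count, while messages that cross cluster boundaries pay a slower,
    # higher-latency link. All numbers are assumptions for exposition.

    def runtime(p, clusters, t_serial, t_parallel, steps, msg_bytes,
                intra_bw, inter_bw, inter_lat):
        compute = t_serial + t_parallel / p
        cross = 1.0 - 1.0 / clusters     # fraction of traffic crossing clusters
        comm = steps * (inter_lat * cross
                        + msg_bytes * cross / inter_bw
                        + msg_bytes * (1.0 - cross) / intra_bw)
        return compute + comm

    # 1000 timesteps, 8 MB exchanged per step, 10 Gb/s within a cluster,
    # 2 Gb/s and 5 ms between clusters (all assumed values).
    single = runtime(64, 1, 10.0, 6000.0, 1000, 8e6, 1.25e9, 2.5e8, 0.005)
    multi = runtime(256, 4, 10.0, 6000.0, 1000, 8e6, 1.25e9, 2.5e8, 0.005)
    print(f"speedup of 4x64 over 1x64: {single / multi:.2f}")

Varying the inter-cluster bandwidth and latency in such a model shows where adding clusters stops paying off, which is the question the experiments should answer for a real application.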
3. Distributed Visualization: The use of virtual distributed RAM disks for high speed visualization is a state-of-the-art technique and is preferred whenever possible, since random disk access is at least an order of magnitude slower than RAM access, even when the disk is local and the RAM is at a remote location. Choose an application that requires rendering and visualization of very large data sets. Instead of reading the data from the local disk and visualizing it, retrieve the data via a high speed (possibly optical) network directly from the memory of a remote cluster into the memory of the visualization machine.
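To make the data path concrete, the toy sketch below keeps the data set in a server's RAM and streams fixed-size blocks over TCP to the visualization client. The protocol, port, and block size are made-up assumptions, not a real RAM-disk implementation (it also assumes the data length is a multiple of the block size).

    # Toy remote-RAM block server and client. Protocol, port, and
    # block size are illustrative assumptions.
    import socket, struct

    BLOCK = 4 * 1024 * 1024  # 4 MB blocks (assumed)

    def serve(data: bytes, port: int = 9999):
        """Hold the data set in RAM; serve one block per 8-byte offset request."""
        with socket.create_server(("", port)) as srv:
            while True:
                conn, _ = srv.accept()
                with conn:
                    while True:
                        hdr = conn.recv(8)
                        if len(hdr) < 8:
                            break
                        (offset,) = struct.unpack("!Q", hdr)
                        conn.sendall(data[offset:offset + BLOCK])

    class RemoteRAM:
        """Client side: read blocks from the server's memory, not a local disk."""
        def __init__(self, host, port=9999):
            self.conn = socket.create_connection((host, port))

        def read_block(self, offset):
            self.conn.sendall(struct.pack("!Q", offset))
            buf = bytearray()
            while len(buf) < BLOCK:
                chunk = self.conn.recv(BLOCK - len(buf))
                if not chunk:
                    break
                buf.extend(chunk)
            return bytes(buf)

A real project would replace this with RDMA or an actual distributed RAM-disk layer, but the client/server split and block-by-offset access pattern stay the same.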
4. Comparison of Grid File Systems: There is an increasing number of distributed parallel file systems available for use by the Grid community (e.g. GPFS, PVFS, Lustre, OceanStore, GFarm). Some of these systems are commercial and some are open source. Study these Grid file systems in detail and compare them, experimentally and/or analytically, in terms of performance, reliability, security, and the functionality they provide. (This may result in a taxonomy paper.)
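For the experimental side, a common baseline is a small throughput harness run against each file system's mount point. The sketch below is bare-bones; the mount paths and sizes are placeholders, and a real study would also vary block size, concurrency, and access pattern.

    # Bare-bones sequential write/read throughput harness. Mount paths
    # and sizes are placeholder assumptions.
    import os, time

    def throughput(path, size=256 * 1024 * 1024, block=1024 * 1024):
        fname = os.path.join(path, "bench.tmp")
        buf = os.urandom(block)
        t0 = time.time()
        with open(fname, "wb") as f:
            for _ in range(size // block):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        t_write = time.time() - t0
        t0 = time.time()
        with open(fname, "rb") as f:
            while f.read(block):
                pass
        t_read = time.time() - t0       # note: may be served from the page cache
        os.remove(fname)
        return size / t_write / 1e6, size / t_read / 1e6   # MB/s

    for mount in ["/mnt/pvfs", "/mnt/lustre", "/mnt/gfarm"]:   # placeholders
        w, r = throughput(mount)
        print(f"{mount}: write {w:.1f} MB/s, read {r:.1f} MB/s")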
5. Study of Caching & Pre-fetching Techniques: The Data Grid infrastructure can consist of three layers of storage distributed across multiple sites: (1) primary very high speed RAM storage for data visualization; (2) secondary disk storage for data analysis and processing; and (3) tertiary tape storage for data archival and long term studies. Data stored on the slow tape storage needs to be pre-fetched and cached on the faster disk or RAM storage to allow optimal access, depending on the needs of the applications. Study different caching and pre-fetching techniques (from tape to disk, and from disk to RAM) developed especially for distributed computing environments and compare them. The comparisons can be performed experimentally and/or analytically. (This may result in a taxonomy paper.)
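A natural baseline for such a comparison is a plain LRU cache combined with one-block-ahead sequential prefetching, sketched below. The tier interfaces and the policy choice are illustrative assumptions.

    # Baseline: LRU cache (faster tier) in front of a slow store
    # (slower tier) with one-block-ahead sequential prefetch.
    from collections import OrderedDict

    class LRUPrefetchCache:
        def __init__(self, capacity, fetch_fn):
            self.capacity = capacity
            self.fetch_fn = fetch_fn     # pulls a block from the slower tier
            self.cache = OrderedDict()   # block_id -> data, in LRU order
            self.hits = self.misses = 0

        def _load(self, block_id):
            if block_id not in self.cache:
                self.cache[block_id] = self.fetch_fn(block_id)
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)   # evict least recently used

        def read(self, block_id):
            if block_id in self.cache:
                self.hits += 1
                self.cache.move_to_end(block_id)
            else:
                self.misses += 1
                self._load(block_id)
            self._load(block_id + 1)     # sequential prefetch, one block ahead
            return self.cache[block_id]

    cache = LRUPrefetchCache(64, fetch_fn=lambda b: f"block-{b}")
    for b in range(100):                 # sequential scan: prefetching pays off
        cache.read(b)
    print(cache.hits, cache.misses)      # 99 hits, 1 miss

Replaying real or synthetic access traces against variants of this class (different eviction policies, deeper prefetch windows) is one way to carry out the experimental comparison.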
6. Data-aware Scheduling: The insufficiency of traditional systems and existing CPU-oriented schedulers in dealing with the complex data handling problem has led to the emergence of a new class of schedulers: data-aware schedulers. One of the first examples of such schedulers is the Stork data placement scheduler [Kosar04]. You will study the limitations of traditional schedulers in handling the challenging data scheduling problems of large scale distributed applications, then design and implement a data-aware scheduler (or enhance an existing one, e.g. Stork) to overcome some of these limitations.
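To make the contrast with CPU-oriented scheduling concrete, the toy placement function below ranks candidate sites by how many bytes of a job's input already reside there, using free CPUs only as a tie-breaker. The scoring rule and data structures are assumptions for illustration, not Stork's actual algorithm.

    # Toy data-aware placement. A CPU-oriented scheduler would use only
    # the free_cpus criterion; here data locality dominates.

    def schedule(job, sites, replica_catalog):
        def score(site):
            local_bytes = sum(size for f, size in job["inputs"].items()
                              if site in replica_catalog.get(f, ()))
            return (local_bytes, sites[site]["free_cpus"])
        return max(sites, key=score)

    sites = {"siteA": {"free_cpus": 10}, "siteB": {"free_cpus": 200}}
    replica_catalog = {"sim.dat": {"siteA"}}          # sim.dat is stored at siteA
    job = {"inputs": {"sim.dat": 50e9, "params.cfg": 1e3}}
    print(schedule(job, sites, replica_catalog))      # siteA, despite fewer CPUs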
7. Grid-enabling a Sequential Application: Work together with field scientists from different application areas such as coastal modeling, bioinformatics, physics, or chemistry (these may be the professors you are already working with) to develop a Grid solution for an application normally developed and/or run in a sequential, single-CPU environment. Study the computational and data requirements of this application; develop a distributed application architecture, an end-to-end workflow for distributed processing and analysis, and performance models & scaling characteristics for this application.
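For the performance-model part, Amdahl's law is the usual first-order starting point for estimating how far the grid-enabled version can scale; the serial fraction below is a made-up example value that profiling would have to supply.

    # First-order scaling estimate via Amdahl's law: speedup is capped
    # by the serial fraction f of the original application.

    def amdahl_speedup(p, f):
        return 1.0 / (f + (1.0 - f) / p)

    f = 0.05   # assumed: profiling shows 5% of runtime is inherently serial
    for p in (1, 8, 64, 512):
        print(f"{p:4d} CPUs -> speedup {amdahl_speedup(p, f):6.2f}")
    # With unlimited CPUs the speedup approaches 1/f = 20x, so reducing
    # the serial fraction matters as much as adding processors.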
8. Workflow Optimization: Take the workflow of an existing sequential/parallel/distributed application from a preferred field of science. Study the limitations of this workflow and define the requirements for further improving/optimizing it. Develop new techniques (or use existing state-of-the-art techniques) to further improve/optimize this workflow in terms of reliability and/or efficiency.
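One concrete way to start is to model the workflow as a task DAG and then layer optimizations on top, e.g. automatic retries for reliability or concurrent execution of independent branches for efficiency. The sketch below uses an assumed minimal representation and is not tied to any particular workflow system.

    # Minimal workflow-as-DAG sketch: run tasks in dependency order,
    # retrying failures (a reliability optimization). Independent tasks
    # could also be dispatched concurrently (an efficiency one).
    from graphlib import TopologicalSorter

    def run_workflow(dag, actions, retries=3):
        for task in TopologicalSorter(dag).static_order():
            for attempt in range(1, retries + 1):
                try:
                    actions[task]()
                    break
                except Exception as e:
                    print(f"{task} failed (attempt {attempt}): {e}")
            else:
                raise RuntimeError(f"{task} exhausted its retries")

    dag = {"analyze": {"transform"}, "transform": {"fetch"}, "fetch": set()}
    actions = {t: (lambda t=t: print(f"running {t}")) for t in dag}
    run_workflow(dag, actions)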