High Performance Parallel Computing with Cloud and Cloud Technologies
We present our experiences in applying, developing, and evaluating cloud and cloud technologies. First, we describe our experience in applying Hadoop and DryadLINQ to a series of data/compute intensive applications, and then compare them with CGL-MapReduce, a novel MapReduce runtime developed by us, and with MPI. Preliminary applications are developed for particle physics, bioinformatics, clustering, and matrix multiplication. We identify the basic execution units of the MapReduce programming model and categorize the runtimes according to their characteristics. MPI versions of the applications are used where the contrast in performance needs to be highlighted. We discuss the structure of the applications and their mapping to parallel architectures of different types, and examine the performance of these applications. Next, we present a performance analysis of MPI parallel applications on virtualized resources.
Cloud and cloud technologies are two broad categories of technologies related to the general notion of Cloud Computing. By “Cloud,” we refer to a collection of infrastructure services such as Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) provided by various organizations, in which virtualization plays a key role. By “Cloud Technologies,” we refer to various cloud runtimes such as Hadoop (ASF, core, 2009), Dryad (Isard, Budiu et al. 2007), and other MapReduce (Dean and Ghemawat 2008) frameworks, and also to storage and communication frameworks such as the Hadoop Distributed File System (HDFS) and Amazon S3. The introduction of commercial cloud infrastructure services such as Amazon EC2, GoGrid (ServePath 2009), and ElasticHosts (ElasticHosts 2009) has allowed users to provision compute clusters fairly easily and quickly, paying only for the duration of their use of the resources. The provisioning of resources happens in minutes, as opposed to the hours and days required by traditional queue-based job scheduling systems. In addition, the use of such virtualized resources allows the user to fully customize Virtual Machine (VM) images and use them with root/administrative privileges, another feature that is hard to achieve with traditional infrastructures. The availability of open source cloud infrastructure software such as Nimbus (Keahey, Foster et al. 2005) and Eucalyptus (Nurmi, Wolski et al. 2009), and of open source virtualization software stacks such as the Xen hypervisor, allows organizations to build private clouds to improve the utilization of their available computation facilities. The possibility of dynamically provisioning additional resources by leasing from commercial cloud infrastructures makes the use of private clouds even more promising. Among the many applications that benefit from cloud and cloud technologies, data/compute intensive applications are the most important.
The deluge of data and the highly compute intensive applications found in many domains such as particle physics, biology, chemistry, finance, and information retrieval mandate the use of large computing infrastructures and parallel processing to achieve considerable performance gains in analyzing data. The addition of cloud technologies creates new trends in performing parallel computing. An employee in a publishing company who needs to convert a document collection, terabytes in size, to a different format can do so by implementing a MapReduce computation using Hadoop, and running it on leased resources from Amazon EC2, in just a few hours. A scientist who needs to process a collection of gene sequences using the CAP3 (Huang and Madan 1999) software can use virtualized resources leased from the university’s private cloud infrastructure and Hadoop. In these use cases, the amount of coding that the publishing agent and the scientist need to perform is minimal (each simply needs to implement a map function), and the MapReduce infrastructure handles many aspects of the parallelism. Although the above examples are successful use cases for applying cloud and cloud technologies to parallel applications, through our research we have found that current cloud technologies are limited for parallel applications that require complex communication patterns or faster communication mechanisms. For example, Hadoop and Dryad implementations of Kmeans clustering, which performs an iteratively refining clustering operation, show higher overheads than implementations in MPI or CGL-MapReduce, a streaming-based MapReduce runtime developed by us. These observations raise several questions: What applications are best handled by cloud technologies? What overheads do they introduce? Are there any alternative approaches? Can we use traditional parallel runtimes such as MPI in the cloud? If so, what overheads do they incur?
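To make concrete how small the user’s share of the code can be in such map-only jobs, the following is a hedged sketch in the style of a Hadoop Streaming mapper, where the runtime supplies input splitting, scheduling, and fault tolerance, and the user supplies only the per-record function. The `convert` function is a hypothetical stand-in for the actual format converter (or for invoking a tool such as CAP3 on each input file).

```python
def convert(doc: str) -> str:
    # Hypothetical stand-in for the real format-conversion step.
    return doc.upper()

def mapper(lines):
    # A map task: transform each input record independently. The runtime
    # (e.g., Hadoop) splits the input, schedules one such task per split,
    # and handles data movement and fault tolerance; no reduce is needed.
    for line in lines:
        key, doc = line.rstrip("\n").split("\t", 1)
        yield f"{key}\t{convert(doc)}"

# In Hadoop Streaming, a script like this would read records from stdin
# and print one "key<TAB>value" line per output record.
for record in mapper(["doc1\thello world\n", "doc2\tcloud computing\n"]):
    print(record)
```

Because each record is processed independently, the runtime is free to run thousands of such tasks in parallel and to re-execute any task that fails.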
These are some of the questions we try to answer through our research. In section 1, we give a brief introduction to the cloud technologies, and in section 2, we discuss with examples the basic functionality supported by these cloud runtimes. Section 3 discusses how these technologies map into programming models. We describe the applications used to evaluate and test these technologies in section 4. The performance results are in section 5. In section 6, we present details of an analysis we performed to understand the performance implications of virtualized resources for parallel MPI applications. Note that in section 5 we use MPI running on non-virtualized machines for comparison with cloud technologies. We present our conclusions in section 7.
2. Cloud Technologies
The cloud technologies such as MapReduce and Dryad have created new trends in parallel
programming. The support for handling large data sets, the concept of moving computation to
data, and the better quality of service provided by cloud technologies make them a favorable
choice for solving large-scale data/compute intensive problems.
The granularity of the parallel tasks in these programming models lies in between the fine
grained parallel tasks that are used in message passing infrastructures such as PVM(Dongarra,
Geist et al. 1993) and MPI (Forum n.d.), and the coarse-grained jobs in workflow frameworks
such as Kepler(Ludscher, Altintas et al. 2006) and Taverna(Hull, Wolstencroft et al. 2006), in
which the individual tasks could themselves be parallel applications written in MPI. Unlike the
various communication constructs available in MPI, which can be used to create a wide variety of
communication topologies for parallel programs, MapReduce offers “map->reduce” as its only
communication construct. However, our experience shows that most composable
applications can easily be implemented using the MapReduce programming model. Dryad
supports parallel applications that resemble Directed Acyclic Graphs (DAGs) in which the
vertices represent computation units, and the edges represent communication channels between
different computation units.
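Stripped of all distribution concerns, the single “map->reduce” construct described above can be sketched as a toy in-memory runtime. This is a pedagogical sketch of the programming model, not the API of Hadoop, Dryad, or CGL-MapReduce:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each record yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)   # shuffle: group values by key
    # Reduce phase: one reduce call per distinct intermediate key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example:
counts = map_reduce(
    ["the cat", "the dog"],
    map_fn=lambda line: ((w, 1) for w in line.split()),
    reduce_fn=lambda key, values: sum(values),
)
# counts == {"the": 2, "cat": 1, "dog": 1}
```

A real runtime distributes the map and reduce calls across nodes and performs the group-by-key step as a network shuffle; the point of the model is that this is the only inter-task communication the programmer ever expresses.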
In traditional approaches, once parallel applications are developed, they are executed on compute clusters, supercomputers, or Grid infrastructures (Foster 2001), where the focus on allocating resources is heavily biased by the availability of computational power. Both the application and the data need to be moved to the available computational resource in order to execute. These infrastructures are highly efficient at compute intensive parallel applications. However, when the volume of data accessed by an application increases, the overall efficiency decreases due to the inevitable data movement. Cloud technologies such as Google MapReduce, the Google File System (GFS) (Ghemawat, Gobioff et al. 2003), Hadoop and the Hadoop Distributed File System (HDFS), Microsoft Dryad, and CGL-MapReduce adopt a more data-centered approach to parallel runtimes. In these frameworks, the data is staged on the data/compute nodes of clusters or large-scale data centers, as in the case of Google. The computations move to the data in order to perform the data processing. Distributed file systems such as GFS and HDFS allow Google MapReduce and Hadoop to access data via distributed storage systems built on heterogeneous compute nodes, while Dryad and CGL-MapReduce support reading data from local disks. The simplicity of the programming model enables better support for quality of services such as fault tolerance and monitoring.
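At its core, “moving the computation to the data” reduces to a locality-aware placement decision at scheduling time. The following minimal sketch (all names hypothetical) mirrors how an HDFS-style runtime prefers a node that already holds a replica of a task’s input block:

```python
def pick_node(block, replica_nodes, free_nodes):
    # Prefer a free node that already stores a replica of the input
    # block, so the computation moves to the data rather than the
    # data to the computation.
    local = [n for n in replica_nodes.get(block, []) if n in free_nodes]
    if local:
        return local[0]
    # Otherwise fall back to any free node; the block must then be
    # fetched over the network, which is exactly the data movement
    # cost the data-centered runtimes try to avoid.
    return next(iter(free_nodes))

replicas = {"block-1": ["nodeA", "nodeC"]}
node = pick_node("block-1", replicas, {"nodeB", "nodeC"})  # -> "nodeC"
```

With blocks replicated across several nodes, the scheduler usually finds an idle node with a local copy, so most tasks read their input from local disk.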