Data Mining Using High Performance Data Clouds-Experimental Studies Using Sector and Sphere
We describe the design and implementation of a high per- formance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an in- frastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters con- nected with wide area high performance networks (for ex- ample, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.
Historically, high performance data mining systems have been designed to take advantage of powerful, but shared pools of processors. Generally, data is scattered to the pro- cessors, the computation is performed using a message pass- ing or grid services library, the results are gathered, and the process is repeated by moving new data to the processors. This paper describes a distributed high performance data mining system that we have developed called Sector/Sphere that is based on an entirely different paradigm. Sector is designed to provide long term persistent storage to large datasets that are managed as distributed indexed les. Dif- ferent segments of the file are scattered throughout the dis- tributed storage managed by Sector. Sector generally repli- cates the data to ensure its longevity, to decrease the latency when retrieving it, and to provide opportunities for paral- lelism. Sector is designed to take advantage of wide area high performance networks when available. Sphere is designed to execute user de ned functions in par- allel using a stream processing pattern for data managed by Sector. We mean by this that the same user de ned function is applied to every data record in a data set man- aged by Sector. This is done to each segment of the data set independently (assuming that sucient processors are available), providing a natural parallelism. The design of Sector/Sphere results in data frequently being processed in place without moving it. To summarize, Sector manages data using distributed, in- dexed files; Sphere processes data with user-de ned func- tions that operate in a uniform manner on streams of data managed by Sector; Sector/Sphere scale to wide area high performance networks using specialized network protocols designed for this purpose. In this paper, we describe the design of Sector/Sphere. We also describe a data mining application developed using Sec- tor/Sphere that searches for emergent behavior in distributed network data. We also describe various experimental studies that we have done using Sector/Sphere. Finally, we describe several experimental studies comparing Sector/Sphere to Hadoop using the Terasort Benchmark , as well as a companion benchmark we have developed called Terasplit that com- putes a split for a regression tree. This paper is organized as follows: Section 2 describes back- ground and related work. Section 3 describes the design of Sphere. Section 4 describes the design of Sector. Section 5 describes the design of the networking and routing layer. Section 6 contains some experimental studies. Section 7 de- scribes a Sector/Sphere application that we have developed. Section 8 is the summary and conclusion