CUBA: An Architecture for Efficient CPU/Co-processor Data Communication

Data-parallel co-processors have the potential to improve performance in highly parallel regions of code when coupled to a general-purpose CPU. However, applications often have to be modified in non-intuitive and complicated ways to mitigate the cost of data marshalling between the CPU and the co-processor. In some applications the overheads cannot be amortized, and co-processors are unable to provide any benefit. The additional effort and complexity of incorporating co-processors makes it difficult, if not impossible, to use them effectively in large applications. This paper presents CUBA, an architecture model in which co-processors encapsulated as function calls can efficiently access their input and output data structures through pointer parameters. The key idea is to map the data structures required by the co-processor to the co-processor's local memory rather than to the CPU's main memory. This mapping preserves the original layout of the shared data structures hosted in the co-processor local memory, rendering the data marshalling process unnecessary and reducing the code changes needed to use the co-processor. CUBA allows the CPU to cache hosted data structures with a selective write-through cache policy, so the CPU can access hosted data structures while still communicating efficiently with the co-processors. Benchmark simulation results show that a CUBA-based system can approach optimal transfer rates while requiring few changes to the code that executes on the CPU.

CUBA (Champaign-Urbana/BArcelona) is an architecture model for coupling data-parallel co-processors to general-purpose CPUs. It aims at reducing the communication latency and data marshalling overheads incurred when moving data between CPUs and co-processors. By lowering the cost of accessing highly data-parallel co-processors, both in terms of prolonged execution time and programming effort, CUBA allows a wider range of applications to benefit from data-parallel co-processors.

Data parallelism (DP) refers to the property of an application to have a large number of independent arithmetic operations that can be executed concurrently on different parts of the data set. Data parallelism exists in many important applications that model reality, such as physics simulation, weather prediction, financial analysis, medical imaging, and media processing. We refer to the data-parallel phases of an application as kernels. Contemporary high-performance CPUs employ instruction-level parallelism (ILP), Single-Instruction-Multiple-Data (SIMD), memory-level parallelism (MLP), and thread-level parallelism (TLP) techniques that all exploit data parallelism to some degree. However, due to cost and performance constraints imposed by sequential applications, these CPUs can dedicate only small portions of their resources to the exploitation of data parallelism, which motivates the design of co-processors for exploiting massive amounts of data parallelism.

We define a co-processor as a programmable set of functional units, possibly with its own instruction memory, that is under the control of a general-purpose processor. Some co-processors implement fine-grained computations, such as floating-point arithmetic, vector operations, or SIMD instructions. These fine-grained co-processors are integrated into CPUs as functional units that support new processor instructions (e.g., SIMD instructions).
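As a concrete illustration (not taken from the paper), a SAXPY loop is about the smallest possible data-parallel kernel: every iteration is independent, so all n multiply-add operations could in principle execute concurrently on a co-processor.

```c
#include <stddef.h>

/* Minimal data-parallel kernel (illustrative example): each iteration
 * reads and writes disjoint elements, so there is no cross-iteration
 * dependence and all iterations can run in parallel. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

A phase of an application dominated by loops of this shape is what the text calls a kernel.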
Other co-processors, such as the Synergistic Processing Elements (SPEs) in the Cell BE processor [5], the NVIDIA GeForce 8 Series graphics processing units, physics engines, or the reconfigurable logic in the Cray XD1 [4], execute medium- and coarse-grained operations. The software abstraction used by many coarse-grained co-processors today is a threaded execution model, such as SPU threads in the Cell SDK.

In this paper we focus on coarse-grained co-processors that can potentially exploit a large amount of data parallelism. We argue that a programming model where coarse-grained co-processors are encapsulated as function calls is a useful and powerful model that presents two main benefits. First, function call encapsulation provides co-processors with a simple model similar to libraries and API (Application Programming Interface) functions, which is familiar to developers. As a result, co-processors can be easily incorporated into existing applications.
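The contrast between conventional offload and the function-call model can be sketched as follows. All `copro_*` names here are hypothetical stubs (simulated on the host) invented for illustration; they are not an API defined by the paper.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical co-processor runtime, simulated on the host for
 * illustration only. */
static float *copro_alloc(size_t n)   { return malloc(n * sizeof(float)); }
static void copro_free(float *p)      { free(p); }
static void copro_copy_in(float *dst, const float *src, size_t n)
                                      { memcpy(dst, src, n * sizeof(float)); }
static void copro_copy_out(float *dst, const float *src, size_t n)
                                      { memcpy(dst, src, n * sizeof(float)); }
static void copro_scale(float *buf, size_t n)   /* the "kernel" */
                                      { for (size_t i = 0; i < n; i++) buf[i] *= 2.0f; }

/* Conventional offload: inputs must be marshalled into co-processor
 * memory before the launch and results copied back afterwards. */
void scale_offload(float *buf, size_t n)
{
    float *dev = copro_alloc(n);
    copro_copy_in(dev, buf, n);    /* marshal input  */
    copro_scale(dev, n);           /* kernel launch  */
    copro_copy_out(buf, dev, n);   /* marshal output */
    copro_free(dev);
}

/* CUBA-style call: buf is assumed to be hosted in co-processor local
 * memory already, so the encapsulating function degenerates to a
 * launch with pointer parameters and no copies. */
void scale_cuba(float *buf, size_t n)
{
    copro_scale(buf, n);
}
```

From the caller's perspective both versions are just a function call on a pointer; CUBA's contribution is making the copy-free second form possible by hosting the shared data structures in co-processor memory.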
