Designing Distributed Systems for Heterogeneity-thesis
Modern distributed and networked systems are highly heterogeneous in many dimensions, including available bandwidth, processor speed, disk capacity, security, failure rate, and pattern of failures. The theme of this dissertation is that this heterogeneity can not only be handled, but rather should generally be viewed as an asset. We begin by introducing a framework, the price of heterogeneity, to model the effect of heterogeneity in parallel and distributed systems. Our results in this framework show broad classes of systems in which heterogeneity cannot be a disadvantage. We then develop practical methods for distributed systems to adapt to and take advantage of heterogeneity. The Y0 distributed hash table achieves improved load balance, route length, and congestion with low overhead in environments with heterogeneous node capacities, such as bandwidth or processing speed. Addressing heterogeneity in reliability, we show that randomization in node selection strategies typically reduces failure rates—a property that permits better understanding of subtle properties of existing systems, as well as the design of new systems. Finally, we study how to improve stability in the Internet’s interdomain routing protocol, while carefully managing tradeoffs with network operators’ perferred routes. These results show how both performance and reliability can be improved in heterogeneous environments.
Distributed and networked systems have become highly heterogeneous. Rather than running on clusters or supercomputers composed of identical nodes, today’s distributed systems have wide variation in participants’ failure rates, bandwidth, processing speed, security, and other dimensions. This heterogeneity can result from many factors, ultimately driven by explosive growth of the Internet. Modern Internet applications have giant scale, so that even within a single data center there are various generations of equipment. They may involve specialized nodes, from well-provisioned servers down to nodes whose specialty is ﬁtting in pockets. Participating nodes are often owned and administered by many different entities. And they are deployed globally on top of the largest distributed system of them all, the Internet, which exempliﬁes all of these characteristics. One can get a quantitative sense of this heterogeneity through measurements of particular systems (Figure 1.1). In the following systems, we will compare the 95th and 5th percentiles of distributions to remove outliers. Emulab, a network testbed housed in a data center at the University of Utah, consists of nodes which differ by 8× in total CPU speed (summing the cores) . BitTorrent, the massively popular peer-to-peer ﬁle distribution system, has peers whose uplink bandwidths differ by 154×. The voiceover-IP application Skype has a network of superpeers with widely varying reliability: the lengths of their sessions—that is, periods of continuous uptime—differ by . And the session lengths of routes on the Internet differ by about 1, 150, 000×, based on one month of data from Route Views some sessions lasted the entire period, others were separated by only the one-second measurement granularity, and the entire spectrum in between was well represented.