Online clustering of parallel data streams
In recent years, the management and processing of so-called data streams has become a topic of active research in several fields of computer science, such as distributed systems, database systems, and data mining. A data stream can roughly be thought of as a transient, continuously growing sequence of timestamped data. In this paper, we consider the problem of clustering parallel streams of real-valued data, that is to say, continuously evolving time series. In other words, we are interested in grouping data streams whose evolution over time is similar in a specific sense. In order to maintain an up-to-date clustering structure, it is necessary to analyze the incoming data in an online manner, tolerating no more than a constant time delay. For this purpose, we develop an efficient online version of the classical K-means clustering algorithm. Our method's efficiency is mainly due to a scalable online transformation of the original data, which allows for a fast computation of approximate distances between streams.
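The general idea of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the online transformation, so the sketch substitutes block averaging over a sliding window as a stand-in. The names `StreamSummary`, `kmeans_step`, and `demo`, as well as the window sizes and the seeding of the centers, are hypothetical and chosen only for illustration.

```python
from collections import deque


class StreamSummary:
    """Sliding-window summary of one stream (assumed transformation: block means)."""

    def __init__(self, blocks, block_len):
        self.block_len = block_len
        # Holds the means of the most recent `blocks` blocks; the oldest drops out.
        self.means = deque(maxlen=blocks)
        self._sum = 0.0
        self._count = 0

    def push(self, x):
        """Consume one new value in constant time."""
        self._sum += x
        self._count += 1
        if self._count == self.block_len:
            self.means.append(self._sum / self.block_len)
            self._sum, self._count = 0.0, 0

    def ready(self):
        return len(self.means) == self.means.maxlen

    def features(self):
        return list(self.means)


def sq_dist(a, b):
    """Squared Euclidean distance between two summaries."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_step(points, centers, iterations=5):
    """Run a few K-means iterations, warm-started from the previous centers."""
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for each summarized stream.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def demo():
    # Four synthetic streams: two near level 0, two near level 10.
    offsets = [0.0, 0.1, 10.0, 10.1]
    summaries = [StreamSummary(blocks=4, block_len=5) for _ in offsets]
    centers, labels = None, None
    for t in range(60):
        for s, off in zip(summaries, offsets):
            s.push(off + (t % 3))
        # Re-cluster whenever a block completes and all windows are full.
        if t % 5 == 4 and all(s.ready() for s in summaries):
            points = [s.features() for s in summaries]
            if centers is None:
                centers = [list(points[0]), list(points[2])]  # arbitrary seed
            labels, centers = kmeans_step(points, centers)
    return labels


if __name__ == "__main__":
    print(demo())
```

Each stream is reduced to a short vector that is updated in constant time per incoming value, and each clustering pass starts from the previous centers, so the cluster structure follows the streams with a bounded delay. Distances computed on the summaries only approximate distances between the raw windows.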