Online Structural Graph Clustering Using Frequent Subgraph Mining
The goal of graph clustering is to partition objects in a graph database into different clusters based on various criteria such as vertex connectivity, neighborhood similarity or the size of the maximum common subgraph. This can serve to structure the graph space and to improve the understanding of the data. In this paper, we present a novel method for structural graph clustering, i.e. graph clustering without generating features or decomposing graphs into parts. In contrast to many related approaches, the method does not rely on computationally expensive maximum common subgraph (MCS) operations or variants thereof, but on frequent subgraph mining. More speciﬁcally, our problem formulation takes advantage of the frequent subgraph miner gSpan (that performs well on many practical problems) without eﬀectively generating thousands of subgraphs in the process. In the proposed clustering approach, clusters encompass all graphs that share a suﬃciently large common subgraph. The size of the common subgraph of a graph in a cluster has to take at least a user-speciﬁed fraction of its overall size. The new algorithm works in an online mode (processing one structure after the other) and produces overlapping (non-disjoint) and non-exhaustive clusters. In a series of experiments, we evaluated the eﬀectiveness and ef- ﬁciency of the structural clustering algorithm on various real world data sets of molecular graphs.
Mining graph data has attracted a lot of attention in the past ten years [1–3]. One family of methods is concerned with mining subgraph patterns in graph databases [1, 2]. The criteria for interestingness are often based on the support of a pattern in the graph database, e.g., requiring a minimum and / or maximum frequency, closedness, freeness or class-correlation. However, in all of these cases, the structural diversity of graph databases, i.e. the existence of groups of similar or dissimilar graphs, is not explicitly taken into account or revealed by the algorithm. Vice versa, the structural composition and existence of groups of similar graphs has a serious impact on the output and runtime performance