On the Comparability of Software Clustering Algorithms

Evaluation of software clustering algorithms is typically done by comparing the clustering results to an authoritative decomposition prepared manually by a system expert. A well-known drawback of this approach is that there are many, equally valid ways to decompose a software system, since different clustering objectives create different decompositions. Evaluating all clustering algorithms against a single authoritative decomposition can therefore lead to biased results. In this paper, we introduce and quantify the notion of clustering algorithm comparability, based on the principle that algorithms with different objectives should not be directly compared. Not surprisingly, we find that several published algorithms are not comparable to each other.

Perhaps the most well-known tenet of clustering research is that there is no single correct answer to a clustering problem [6]. There are many, equally valid ways to decompose a data set into clusters. Similarly, in the context of software clustering, there are many, equally valid ways to decompose a software system into meaningful subsystems. Different software clustering algorithms employ different clustering objectives, producing different results. All such clustering results could be useful to someone attempting to understand the software system at hand.

It is, therefore, rather strange that the typical way to evaluate a software clustering algorithm is to compare its output to a single authoritative decomposition, i.e., a decomposition created manually by a system expert. If the authoritative decomposition was constructed with a different point of view in mind than the algorithm's, the evaluation results will probably be biased against the algorithm. However, no better way to evaluate software clustering algorithms exists at the moment.

In this paper, we introduce and quantify the notion of software clustering algorithm comparability. Comparable algorithms use the same or similar clustering objectives, produce similar results, and can therefore be evaluated against the same authoritative decomposition. Noncomparable algorithms use different clustering objectives, produce significantly different results, and cannot be evaluated by direct comparison to an authoritative decomposition. Being able to distinguish between comparable and noncomparable algorithms allows us to define families of algorithms that share common clustering objectives. Conventional evaluation techniques apply only within the context of a given algorithm family. As expected, our experimental results clearly show that several existing software clustering algorithms belong to different families, and therefore should not be directly compared against each other.
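To make the idea of "producing similar results" concrete, quantifying comparability requires a way to measure how much two decompositions of the same system agree. As a minimal illustration (using the generic Rand index over entity pairs, not the MoJoFM or KE measures discussed later; the file names and decompositions are hypothetical):

```python
from itertools import combinations

def rand_index(decomp_a, decomp_b):
    """Fraction of entity pairs on which two decompositions agree:
    both place the pair in the same cluster, or both separate it."""
    # Map each entity to the index of its cluster in each decomposition.
    label_a = {e: i for i, cluster in enumerate(decomp_a) for e in cluster}
    label_b = {e: i for i, cluster in enumerate(decomp_b) for e in cluster}
    entities = sorted(label_a)
    agree = total = 0
    for x, y in combinations(entities, 2):
        total += 1
        same_a = label_a[x] == label_a[y]
        same_b = label_b[x] == label_b[y]
        agree += same_a == same_b  # pair treated consistently by both
    return agree / total

# Two hypothetical decompositions of the same six source files:
a = [{"f1", "f2", "f3"}, {"f4", "f5", "f6"}]
b = [{"f1", "f2"}, {"f3", "f4"}, {"f5", "f6"}]
print(rand_index(a, b))  # ~0.667: moderate agreement
```

A high score for the outputs of two algorithms across many systems would suggest they pursue similar clustering objectives and belong to the same family; a low score would suggest they should not be evaluated against the same authoritative decomposition.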
We also utilize the concept of algorithm families to investigate whether conventional evaluation measures, such as MoJoFM [15] and KE [8], behave in a congruent fashion when the comparison is limited to comparable algorithms. The structure of the rest of the paper is as follows. The notion of comparability, as well as the way we quantify it, is presented in Section 2. Experiments that verify whether several published algorithms are comparable are presented in Section 3. The comparison of MoJoFM and KE in the context of comparability is shown in Section 4. Finally, Section 5 concludes the paper.
