2. Clustering with KMedoids, CLARA and Common-nearest-neighbors

2.1. K-Medoids

KMedoids is related to the KMeans algorithm. While KMeans tries to minimize the within cluster sum-of-squares, KMedoids tries to minimize the sum of distances between each point and the medoid of its cluster. The medoid is a data point (unlike the centroid) which has the least total distance to the other members of its cluster. The use of a data point to represent each cluster’s center allows the use of any distance metric for clustering. It may also be a practical advantage, for instance K-Medoids algorithms have been used for facial recognition for which the medoid is a typical photo of the person to recognize while K-Means would have obtained a blurry image that mixed several pictures of the person to recognize.

KMedoids can be more robust to noise and outliers than KMeans as it will choose one of the cluster members as the medoid while KMeans will move the center of the cluster towards the outlier which might in turn move other points away from the cluster centre.

KMedoids is also different from K-Medians, which is analogous to KMeans except that the Manhattan Median is used for each cluster center instead of the centroid. K-Medians is robust to outliers, but it is limited to the Manhattan Distance metric and, similar to KMeans, it does not guarantee that the center of each cluster will be a member of the original dataset.

The complexity of K-Medoids is \(O(N^2 K T)\) where \(N\) is the number of samples, \(T\) is the number of iterations and \(K\) is the number of clusters. This makes it more suitable for smaller datasets in comparison to KMeans which is \(O(N K T)\).

Examples:

  • sphx_glr_auto_examples_plot_kmedoids_digits.py: Applying K-Medoids on digits with various distance metrics.

Algorithm description: There are several algorithms to compute K-Medoids, though KMedoids currently only supports K-Medoids solver analogous to K-Means called alternate and the algorithm PAM (partitioning around medoids). Alternate algorithm is used when speed is an issue.

  • Alternate method works as follows:

    • Initialize: Select n_clusters from the dataset as the medoids using a heuristic, random, or k-medoids++ approach (configurable using the init parameter).

    • Assignment step: assign each element from the dataset to the closest medoid.

    • Update step: Identify the new medoid of each cluster.

    • Repeat the assignment and update step while the medoids keep changing or maximum number of iterations max_iter is reached.

  • PAM method works as follows:

    • Initialize: Greedy initialization of n_clusters. First select the point in the dataset that minimizes the sum of distances to a point. Then, add one point that minimizes the cost and loop until n_clusters points are selected. This is the init parameter called build.

    • Swap Step: for all medoids already selected, compute the cost of swapping this medoid with any non-medoid point. Then, make the swap that decreases the cost the most. Loop and stop when there is no change anymore.

References:

  • Maranzana, F.E., 1963. On the location of supply points to minimize transportation costs. IBM Systems Journal, 2(2), pp.129-135.

  • Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids clustering. Expert systems with applications, 36(2), pp.3336-3341.

  • Kaufman, L. and Rousseeuw, P.J. (2008). Partitioning Around Medoids (Program PAM). In Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw). doi:10.1002/9780470316801.ch2

  • Bhat, Aruna (2014).K-medoids clustering using partitioning around medoids for performing face recognition. International Journal of Soft Computing, Mathematics and Control, 3(3), pp 1-12.

2.2. CLARA

CLARA is related to the KMedoids algorithm. CLARA (Clustering for Large Applications) extends k-medoids to be more scalable, uses a sampling approach.

Algorithm description: CLARA uses random samples of the dataset, each of size sampling_size The algorith is iterative, first we select one sub-sample, then CLARA applies KMedoids on this sub-sample to obtain n_clusters medoids. At the next step, CLARA sample sampling_size-n_clusters from the dataset and the next sub-sample is composed of the best medoids found until now (with respect to inertia in the whole dataset, not the inertia only on the sub-sample) to which we add the new samples just drawn. Then, K-Medoids is applied to this new sub-sample, and loop back until sample sub-samples have been used.

2.3. Common-nearest-neighbors clustering

CommonNNClustering provides an interface to density-based common-nearest-neighbors clustering. Density-based clustering identifies clusters as dense regions of high point density, separated by sparse regions of lower density. Common-nearest-neighbors clustering approximates local density as the number of shared (common) neighbors between two points with respect to a neighbor search radius. A density threshold (density criterion) is used – defined by the cluster parameters min_samples (number of common neighbors) and eps (search radius) – to distinguish high from low density. A high value of min_samples and a low value of eps corresponds to high density.

As such the method is related to other density-based cluster algorithms like DBSCAN or Jarvis-Patrick. DBSCAN approximates local density as the number of points in the neighborhood of a single point. The Jarvis-Patrick algorithm uses the number of common neighbors shared by two points among the \(k\) nearest neighbors. As these approaches each provide a different notion of how density is estimated from point samples, they can be used complementarily. Their relative suitability for a classification problem depends on the nature of the clustered data. Common-nearest-neighbors clustering (as density-based clustering in general) has the following advantages over other clustering techniques:

  • The cluster result is deterministic. The same set of cluster parameters always leads to the same classification for a data set. A different ordering of the data set leads to a different ordering of the cluster assignment, but does not change the assignment qualitatively.

  • Little prior knowledge about the data is required, e.g. the number of resulting clusters does not need to be known beforehand (although cluster parameters need to be tuned to obtain a desired result).

  • Identified clusters are not restricted in their shape or size.

  • Points can be considered noise (outliers) if they do not fullfil the density criterion.

The common-nearest-neighbors algorithm tests the density criterion for pairs of neighbors (do they have at least min_samples points in the intersection of their neighborhoods at a radius eps). Two points that fullfil this criterion are directly part of the same dense data region, i.e. they are density reachable. A density connected network of density reachable points (a connected component if density reachability is viewed as a graph structure) constitutes a separated dense region and therefore a cluster. Note, that for example in contrast to DBSCAN there is no differentiation in core (dense points) and edge points (points that are not dense themselves but neighbors of dense points). The assignment of points on the cluster rims to a cluster is possible, but can be ambiguous. The cluster result is returned as a 1D container of labels, i.e. a sequence of integers (zero-based) of length \(n\) for a data set of \(n\) points, denoting the assignment of points to a specific cluster. Noise is labeled with -1. Valid clusters have at least two members. The clusters are not sorted by cluster member count. In same cases the algorithm tends to identify small clusters that can be filtered out manually.

Examples:

Implementation:

The present implementation of the common-nearest-neighbors algorithm in CommonNNClustering shares some commonalities with the current scikit-learn implementation of DBSCAN. It computes neighborhoods from points in bulk with NearestNeighbors before the actual clustering. Consequently, to store the neighborhoods it requires memory on the order of \(O(n ⋅ n_n)\) for \(n\) points in the data set where \(n_n\) is the average number of neighbors (which is proportional to eps), that is at worst \(O(n^2)\). Depending on the input structure (dense or sparse points or similarity matrix) the additional memory demand varies. The clustering itself follows a breadth-first-search scheme, checking the density criterion at every node expansion. The linear time complexity is roughly proportional to the number of data points \(n\), the total number of neighbors \(N\) and the value of min_samples. For density-based clustering schemes with lower memory demand, also consider:

  • OPTICS – Density-based clustering related to DBSCAN using a eps value range.

  • cnnclustering – A different implementation of common-nearest-neighbors clustering.

Notes:

  • DBSCAN provides an option to specify data point weights with sample_weights. This feature is experimentally at the moment for CommonNNClustering as weights are not well defined for checking the common-nearest-neighbor density criterion. It should not be used in production, yet.

References:

  • B. Keller, X. Daura, W. F. van Gunsteren “Comparing Geometric and Kinetic Cluster Algorithms for Molecular Simulation Data” J. Chem. Phys., 2010, 132, 074110.

  • O. Lemke, B.G. Keller “Density-based Cluster Algorithms for the Identification of Core Sets” J. Chem. Phys., 2016, 145, 164104.

  • O. Lemke, B.G. Keller “Common nearest neighbor clustering - a benchmark” Algorithms, 2018, 11, 19.