Which algorithm should I pick for my clustering model?

This article discusses which algorithm you should pick when building a clustering model in the G2M platform.

If you're not sure which algorithm to use for your clustering model, start with the simple PCA K-Means method, then consider the following options (illustrative code sketches for each appear after the list):

  • PCA K-Means (simple method). When in doubt, use this general-purpose algorithm. Its incremental implementation scales reasonably well. The PCA step will generally identify natural groupings in the data driven by the same external process. Unlike the MLE method, this method lets you pick the number of clusters yourself.

  • PCA K-Means (MLE method). This algorithm is similar to the PCA K-Means simple method, but it automatically picks the number of clusters based on the number of groupings (components) identified by the PCA step. Models with many variables can therefore return a large number of clusters. This method does not scale as well to larger datasets.

  • K-Means. The traditional and perhaps best-known clustering algorithm. It scales well but usually does not identify more nuanced groupings in the data. Use it only for straightforward cases where you expect clearly separated, convex clusters.

  • BIRCH. The BIRCH (balanced iterative reducing and clustering using hierarchies) algorithm is designed to handle larger datasets as well as noisy data, outliers, etc. In practice it tends to be fairly sensitive to its parameters. You may want to try it and fall back on the PCA K-Means simple method if it fails to produce meaningful clusters.

  • DBSCAN. The DBSCAN (density-based spatial clustering of applications with noise) algorithm was designed with the same objectives as BIRCH, i.e. handling large datasets with noise and outliers. It handles non-convex and uneven clusters more effectively than most algorithms, and it automatically determines the number of clusters. In practice it can be fairly sensitive, especially with unevenly distributed datasets.

  • Gaussian Mixture. The Gaussian Mixture algorithm assumes the dataset is a superposition of independent Gaussian (and therefore convex) clusters. It runs reasonably fast on smaller datasets but does not scale well. Use it to validate clusters you have generated with another algorithm.

  • Agglomerative Hierarchical. The Agglomerative Hierarchical Clustering algorithm is designed to work well on large datasets with large numbers of clusters. In practice this hierarchical approach can be sensitive to the order of the data and does not always scale well. Use it to validate clusters you have generated with another algorithm such as PCA K-Means.

  • Mean Shift. The Mean Shift clustering algorithm is designed for smaller datasets with complex, non-convex clusters and outliers. It automatically determines the number of clusters but does not scale well to larger datasets.

  • OPTICS. The OPTICS (ordering points to identify the clustering structure) algorithm is similar to BIRCH and DBSCAN but is specifically designed to handle unevenly distributed datasets. In practice it can be as sensitive as those two algorithms. Try it first, then fall back on the PCA K-Means simple method if needed.

  • Spectral Clustering. The Spectral Clustering method is designed to handle a wide variety of cluster shapes but does not scale well. Use it to validate clustering results on small datasets, and fall back on the PCA K-Means simple method if needed.

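To make the options above concrete, the sketches below use scikit-learn, which offers comparable open-source implementations of these algorithms. The G2M platform's own implementations may differ, and every dataset, variable name, and parameter value shown (X, X_scaled, the cluster counts, etc.) is an illustrative assumption, not platform code. First, the PCA K-Means simple method: standardize the variables, reduce with PCA, then run K-Means with a cluster count you choose.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in data; in practice these would be your model's variables.
X, _ = make_blobs(n_samples=1000, centers=4, n_features=10, random_state=0)

X_scaled = StandardScaler().fit_transform(X)                # common scale for all variables
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)  # keep 95% of the variance
kmeans_labels = KMeans(n_clusters=4, random_state=0).fit_predict(X_reduced)
```

Running KMeans directly on X_scaled, without the PCA step, gives the traditional K-Means option. The later sketches reuse X_scaled and kmeans_labels from this setup.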
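
For the MLE method, a reasonable stand-in is scikit-learn's PCA with n_components="mle" (Minka's maximum-likelihood dimensionality estimate); whether the platform uses this exact estimator is an assumption. The cluster count is then taken from the number of retained components.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pca = PCA(n_components="mle")        # MLE picks the number of components to keep
X_mle = pca.fit_transform(X_scaled)  # X_scaled from the first sketch
mle_labels = KMeans(n_clusters=pca.n_components_,  # clusters = retained components
                    random_state=0).fit_predict(X_mle)
```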
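
BIRCH's sensitivity in practice mostly comes down to its merge threshold; the value below is illustrative only.

```python
from sklearn.cluster import Birch

# threshold controls how aggressively subclusters are merged and usually needs tuning.
birch_labels = Birch(threshold=0.5, n_clusters=4).fit_predict(X_scaled)
```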
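
DBSCAN infers the number of clusters from its density parameters; the eps and min_samples values below are illustrative.

```python
from sklearn.cluster import DBSCAN

# eps is the neighborhood radius, min_samples the density threshold;
# points that fit no cluster are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
```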
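
A Gaussian Mixture sketch, including the validation use suggested above: the adjusted Rand index measures agreement with the PCA K-Means labels from the first sketch.

```python
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X_scaled)

# Agreement with the PCA K-Means partition; values near 1.0 mean the two
# algorithms found essentially the same clusters.
print(adjusted_rand_score(kmeans_labels, gmm_labels))
```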
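
An Agglomerative Hierarchical sketch; Ward linkage is one common choice, not necessarily the platform's.

```python
from sklearn.cluster import AgglomerativeClustering

# Ward linkage repeatedly merges the two clusters whose union least
# increases total within-cluster variance.
agg_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_scaled)
```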
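
Mean Shift needs a kernel bandwidth; estimate_bandwidth is a common way to pick one, and the quantile value here is illustrative.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

bandwidth = estimate_bandwidth(X_scaled, quantile=0.2)  # kernel width estimate
ms_labels = MeanShift(bandwidth=bandwidth).fit_predict(X_scaled)
print(len(np.unique(ms_labels)))  # number of clusters, found automatically
```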
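
OPTICS takes DBSCAN-style parameters but builds a reachability ordering that copes better with uneven densities.

```python
from sklearn.cluster import OPTICS

# min_samples plays the same density role as in DBSCAN; -1 again marks noise.
optics_labels = OPTICS(min_samples=5).fit_predict(X_scaled)
```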
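
Finally, a Spectral Clustering sketch; the RBF affinity is one common choice among several.

```python
from sklearn.cluster import SpectralClustering

# Clusters an affinity graph rather than raw coordinates, which is why it
# handles varied cluster shapes but scales poorly with dataset size.
spectral_labels = SpectralClustering(n_clusters=4, affinity="rbf",
                                     random_state=0).fit_predict(X_scaled)
```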
