skcyto.ConsensusCluster

class skcyto.ConsensusCluster(k_min: int | None = None, k_max: int = 20, n_iter: int = 100, subsample_fraction: float = 0.9, random_state: int | None = None)[source]

Consensus clustering

Choosing the number of clusters is a common problem. Consensus clustering addresses it by repeatedly subsampling the data, clustering each subsample for a range of candidate cluster counts, and measuring how consistently pairs of instances end up in the same cluster.

This implementation currently supports only hierarchical clustering. It was chosen because the R FlowSOM implementation uses it, and because only hierarchical clustering guarantees that the area under the curve (AUC) of the consensus cumulative distribution function (CDF) increases as clusters are added.
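The subsampling procedure described above can be sketched as follows. This is a minimal illustration in the spirit of Monti et al., not the library's actual code: the helper name `consensus_matrix` is hypothetical, and the default (Ward) linkage of scikit-learn's `AgglomerativeClustering` is an assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def consensus_matrix(X, k, n_iter=100, subsample_fraction=0.9, random_state=None):
    """Fraction of subsamples in which each pair of rows shares a cluster.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    # Draw at least k rows so that k clusters can always be formed.
    n_sub = max(k, int(round(subsample_fraction * n)))
    together = np.zeros((n, n))  # times a pair landed in the same cluster
    sampled = np.zeros((n, n))   # times a pair was drawn in the same subsample
    for _ in range(n_iter):
        idx = rng.choice(n, size=n_sub, replace=False)
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    # Pairs that were never drawn together get a consensus of 0.
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
M = consensus_matrix(X, k=2, n_iter=50, random_state=0)
```

An entry of `M` close to 1 means the two instances were almost always clustered together, and an entry close to 0 means they almost never were. A stable choice of k pushes most entries towards these two extremes.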

Read more in Monti et al., Machine Learning 52, 91–118 (2003)

Limitations are described in Șenbabaoğlu et al., Sci Rep. 4:6207 (2014)

Parameters:
k_min: int, optional

Lower bound of the number of clusters to try, by default None. If None, only k_max is evaluated.

k_max: int

Upper bound of the number of clusters to try, by default 20.

n_iter: int

Number of subsampling iterations, by default 100.

subsample_fraction: float

Fraction of instances to sample from the original data in each iteration, by default 0.9.

random_state: int, RandomState instance or None

Determines random number generation for subsampling, by default None.

Attributes:
X_: NDArray

Input data

k_best_: int

Optimal number of clusters

cluster_: AgglomerativeClustering

Fitted cluster algorithm for k_best_

labels_: NDArray

Labels for each instance of X with optimal number of clusters

AUC_: dict

Dictionary with CDF AUC for each evaluated k

AUC_delta_: dict

Dictionary with change in AUC compared to k-1 for each k

consensus_matrix_allk_: dict

Dictionary with consensus matrix for each evaluated k

cluster_allk_: dict

Dictionary with fitted cluster algorithm for each evaluated k

labels_allk_: dict

Dictionary with labels for each instance for each evaluated k
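To make the AUC attributes more concrete, the sketch below shows one simple way to turn a consensus matrix into a CDF AUC and to compute the change between successive k. The helper names `cdf_auc` and `auc_deltas` are hypothetical. The grid-based trapezoidal integration and the use of an absolute (rather than relative) difference are assumptions, since the exact formulas used by the library are not stated here.

```python
import numpy as np


def cdf_auc(consensus, n_grid=101):
    """Area under the empirical CDF of the off-diagonal consensus values.

    Hypothetical helper: the CDF is evaluated on a uniform grid over [0, 1]
    and integrated with the trapezoidal rule.
    """
    n = consensus.shape[0]
    # Each pair of instances is counted once (upper triangle, no diagonal).
    values = np.sort(consensus[np.triu_indices(n, k=1)])
    grid = np.linspace(0.0, 1.0, n_grid)
    cdf = np.searchsorted(values, grid, side="right") / values.size
    return float(np.sum((cdf[1:] + cdf[:-1]) / 2.0 * np.diff(grid)))


def auc_deltas(auc_by_k):
    """Absolute change in AUC relative to the previous evaluated k (assumption)."""
    ks = sorted(auc_by_k)
    return {k: auc_by_k[k] - auc_by_k[k_prev] for k_prev, k in zip(ks, ks[1:])}


# A perfectly stable consensus matrix for two clusters of three instances each.
perfect = np.kron(np.eye(2), np.ones((3, 3)))
auc_perfect = cdf_auc(perfect)

# Illustrative AUC values only, not results from real data.
deltas = auc_deltas({2: 0.60, 3: 0.75, 4: 0.78})
```

A common heuristic is to pick the k after which the AUC gain becomes small. How `k_best_` is actually selected is not specified in this section.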

Raises:
ValueError

If k_max is less than 2.

ValueError

If k_max is less than k_min.
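The interaction between k_min, k_max and the two error conditions can be sketched as below. The function `candidate_ks` is hypothetical and only mirrors the documented behaviour. Treating k_max as an inclusive upper bound is an assumption, as are the error messages.

```python
def candidate_ks(k_min=None, k_max=20):
    """Return the cluster counts that would be evaluated (illustrative only)."""
    if k_max < 2:
        raise ValueError("k_max must be at least 2")
    if k_min is None:
        # Documented behaviour: only k_max is evaluated.
        return [k_max]
    if k_max < k_min:
        raise ValueError("k_max must not be smaller than k_min")
    # Assumption: both bounds are inclusive.
    return list(range(k_min, k_max + 1))


only_max = candidate_ks(k_max=5)
full_range = candidate_ks(k_min=2, k_max=4)
```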

Examples

>>> from skcyto.consensuscluster import ConsensusCluster
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> CClust = ConsensusCluster(k_max=2).fit(X)
>>> CClust.labels_
array([1, 1, 1, 0, 0, 0])
__init__(k_min: int | None = None, k_max: int = 20, n_iter: int = 100, subsample_fraction: float = 0.9, random_state: int | None = None)[source]
fit(X: ndarray[Any, dtype[ScalarType]], y=None)[source]

Fit multiple hierarchical clustering instances, one for each candidate k.

Parameters:
X: NDArray

Training data to cluster

y: Ignored

Not used, present here for API consistency by convention.

Returns:
self: object

Returns the fitted instance

fit_predict(X: ndarray[Any, dtype[ScalarType]], y=None) → ndarray[Any, dtype[ScalarType]][source]

Fit and return each sample's best cluster assignment.

In addition to fitting, this method returns the cluster label of each sample in the training set, using the optimal number of clusters.

Parameters:
X: NDArray

Training instances to cluster

y: Ignored

Not used, present here for API consistency by convention.

Returns:
NDArray

Cluster labels