skcyto.ConsensusCluster
- class skcyto.ConsensusCluster(k_min: int | None = None, k_max: int = 20, n_iter: int = 100, subsample_fraction: float = 0.9, random_state: int | None = None)
Consensus clustering
Finding the optimal number of clusters is a common problem. Consensus clustering is one technique for this: the data are repeatedly subsampled and clustered over a range of cluster counts, and the stability of the resulting assignments across subsamples is used to compare the candidates.
This implementation currently supports only hierarchical clustering. It was chosen because the R FlowSOM implementation uses it, and because only hierarchical clustering guarantees that the area under the curve (AUC) of the consensus CDF increases as clusters are added.
Read more in Monti et al., Machine Learning 52, 91–118 (2003)
Limitations are described in Șenbabaoğlu et al., Sci Rep. 4:6207 (2014)
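The subsampling procedure described above can be sketched as follows. This is a minimal illustration, not skcyto's actual code: the helper name `consensus_matrix` is hypothetical, and it assumes scikit-learn's `AgglomerativeClustering` with its default (Ward) linkage as the hierarchical clusterer. Each entry of the result is the fraction of subsamples containing both instances in which they were assigned to the same cluster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def consensus_matrix(X, k, n_iter=100, subsample_fraction=0.9, random_state=None):
    """Hypothetical helper: share of co-sampled runs in which two samples co-cluster."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    # Subsample size, kept between k and n so that clustering is always possible.
    n_sub = min(n, max(k, int(round(subsample_fraction * n))))
    together = np.zeros((n, n))  # times i and j landed in the same cluster
    sampled = np.zeros((n, n))   # times i and j were drawn in the same subsample
    for _ in range(n_iter):
        idx = rng.choice(n, size=n_sub, replace=False)
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X[idx])
        same = labels[:, None] == labels[None, :]
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    # Pairs that were never drawn together keep a consensus of 0.
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
M = consensus_matrix(X, k=2, n_iter=50, random_state=0)
```

For well-separated data such as this toy array, pairs within a group should reach a consensus close to 1 and pairs across groups close to 0.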
- Parameters:
- k_min: int, optional
Lower bound of the number of clusters to try, by default None. If None, only k_max is evaluated.
- k_max: int
Upper bound of number of clusters to try, by default 20
- n_iter: int
Number of iterations, by default 100
- subsample_fraction: float
Fraction of instances to sample from the original data in each iteration, by default 0.9.
- random_state: int, RandomState instance or None
Determines random number generation for subsampling, by default None.
- Attributes:
- X_: NDArray
Input data
- k_best_: int
Optimal number of clusters
- cluster_: AgglomerativeClustering
Fitted cluster algorithm for k_best_
- labels_: NDArray
Labels for each instance of X with optimal number of clusters
- AUC_: dict
Dictionary with CDF AUC for each evaluated k
- AUC_delta_: dict
Dictionary with change in AUC compared to k-1 for each k
- consensus_matrix_allk_: dict
Dictionary with consensus matrix for each evaluated k
- cluster_allk_: dict
Dictionary with fitted cluster algorithm for each evaluated k
- labels_allk_: dict
Dictionary with labels for each instance for each evaluated k
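How AUC values relate to a consensus matrix can be illustrated with a short sketch. It assumes a histogram-based approximation of the CDF over the pairwise consensus values and a plain difference for the change between consecutive k. The names `cdf_auc`, `consensus_by_k`, `auc`, and `auc_delta` are illustrative, the matrices are toy values, and skcyto's own computation may differ (for example in binning or normalisation).

```python
import numpy as np


def cdf_auc(consensus, n_bins=100):
    """Area under the empirical CDF of the pairwise (off-diagonal) consensus values."""
    values = consensus[np.triu_indices_from(consensus, k=1)]
    hist, edges = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
    cdf = np.cumsum(hist) / values.size
    # Riemann sum of the step-wise CDF over the [0, 1] consensus range.
    return float(np.sum(cdf * np.diff(edges)))


# Toy consensus matrices for six samples (illustrative, not real output).
consensus_by_k = {
    2: np.kron(np.eye(2), np.ones((3, 3))),  # two perfectly stable clusters of three
    3: np.kron(np.eye(3), np.ones((2, 2))),  # three perfectly stable clusters of two
}

auc = {k: cdf_auc(m) for k, m in consensus_by_k.items()}
# Assumption: the change is a plain difference to the previous k, when that k exists.
auc_delta = {k: auc[k] - auc[k - 1] for k in auc if k - 1 in auc}
```

In this toy case the AUC rises from k=2 to k=3, in line with the monotonic behaviour described for hierarchical clustering.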
- Raises:
- ValueError
When k_max < 2
- ValueError
When k_max < k_min
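The documented bounds and errors can be mirrored in a small sketch. The helper `candidate_ks` is hypothetical and not part of skcyto; it assumes both bounds are inclusive and only reproduces the behaviour stated above (a `ValueError` for `k_max < 2` or `k_max < k_min`, and a single candidate when `k_min` is None).

```python
def candidate_ks(k_min=None, k_max=20):
    """Hypothetical helper: validate the bounds and list the k values to evaluate."""
    if k_max < 2:
        raise ValueError("k_max must be at least 2")
    if k_min is None:
        # Documented behaviour: without a lower bound, only k_max is evaluated.
        return [k_max]
    if k_max < k_min:
        raise ValueError("k_max must not be smaller than k_min")
    # Assumption: both bounds are inclusive.
    return list(range(k_min, k_max + 1))
```

For instance, `candidate_ks(2, 4)` yields `[2, 3, 4]` under these assumptions, while `candidate_ks(k_max=5)` yields `[5]`.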
Examples
>>> from skcyto.consensuscluster import ConsensusCluster
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> CClust = ConsensusCluster(k_max=2).fit(X)
>>> CClust.labels_
array([1, 1, 1, 0, 0, 0])
- __init__(k_min: int | None = None, k_max: int = 20, n_iter: int = 100, subsample_fraction: float = 0.9, random_state: int | None = None)
- fit(X: ndarray[Any, dtype[ScalarType]], y=None)
Fit multiple hierarchical clustering instances, one for each candidate k.
- Parameters:
- X: NDArray
Training data to cluster
- y: Ignored
Not used, present here for API consistency by convention.
- Returns:
- self: object
Returns the fitted instance
- fit_predict(X: ndarray[Any, dtype[ScalarType]], y=None) -> ndarray[Any, dtype[ScalarType]]
Fit and return the best cluster assignment for each sample.
In addition to fitting, this method returns the cluster label assigned to each sample in the training set under the optimal number of clusters.
- Parameters:
- X: NDArray
Training instances to cluster
- y: Ignored
Not used, present here for API consistency by convention.
- Returns:
- NDArray
Cluster labels