This function performs Cluster-Of-Clusters Analysis (COCA) on a binary matrix in which each row corresponds to a data point, each column corresponds to a cluster from one of the initial clusterings of the data, and the element in position (i,j) is equal to 1 if data point i belongs to cluster j and 0 otherwise.
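To illustrate the expected input format, the following base R sketch builds a binary matrix of this kind from two label vectors. The vectors `labels1` and `labels2` and the helper `toIndicator` are hypothetical and not part of the package; in practice `buildMOC` can produce the matrix, as in the example at the end of this page.

```r
# Two hypothetical clusterings of the same five data points
labels1 <- c(1, 1, 2, 2, 3)
labels2 <- c(1, 2, 2, 1, 1)

# Hypothetical helper: one indicator column per cluster of a clustering
toIndicator <- function(labels) {
  clusters <- sort(unique(labels))
  # Entry (i, j) is 1 if point i belongs to the j-th cluster, 0 otherwise
  outer(labels, clusters, "==") * 1
}

# Bind the indicator columns of all clusterings side by side
moc <- cbind(toIndicator(labels1), toIndicator(labels2))
print(moc)
```

Here the matrix has 5 rows (data points) and 3 + 2 = 5 columns (clusters), and each row sums to the number of clusterings.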
coca(
  moc,
  K = NULL,
  maxK = 6,
  B = 1000,
  pItem = 0.8,
  hclustMethod = "average",
  choiceKmethod = "silhouette",
  ccClMethod = "kmeans",
  ccDistHC = "euclidean",
  maxIterKM = 1000,
  savePNG = FALSE,
  fileName = "coca",
  verbose = FALSE,
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE,
  returnAllMatrices = FALSE
)
Argument | Description |
---|---|
moc | N x C data matrix, where C is the total number of clusters considered. |
K | Number of clusters. |
maxK | Maximum number of clusters considered for the final clustering if K is not known. Default is 6. |
B | Number of iterations of the Consensus Clustering step. Default is 1000. |
pItem | Proportion of items sampled at each iteration of the Consensus Clustering step. Default is 0.8. |
hclustMethod | Agglomeration method to be used by the hclust function to perform hierarchical clustering on the consensus matrix. Can be "single", "complete", "average", etc. For more details please see ?stats::hclust. Default is "average". |
choiceKmethod | Method used to choose the number of clusters if K is NULL, can be either "AUC" (area under the curve, work in progress) or "silhouette". Default is "silhouette". |
ccClMethod | Clustering method to be used by the Consensus Clustering algorithm (CC). Can be either "kmeans" for k-means clustering or "hclust" for hierarchical clustering. Default is "kmeans". |
ccDistHC | Distance to be used by the hierarchical clustering algorithm inside CC. Can be "pearson" (for 1 - Pearson correlation), "spearman" (for 1 - Spearman correlation), or any of the distances provided in stats::dist() (i.e. "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"). Default is "euclidean". |
maxIterKM | Maximum number of iterations for the k-means clustering algorithm. Default is 1000. |
savePNG | Boolean. Save plots as PNG files. Default is FALSE. |
fileName | If |
verbose | Boolean. |
widestGap | Boolean. If TRUE, also compute the widest gap index to choose the best number of clusters. Default is FALSE. |
dunns | Boolean. If TRUE, also compute Dunn's index to choose the best number of clusters. Default is FALSE. |
dunn2s | Boolean. If TRUE, also compute the alternative Dunn's index to choose the best number of clusters. Default is FALSE. |
returnAllMatrices | Boolean. If TRUE, return consensus matrices for all considered values of K. Default is FALSE. |
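The roles of B and pItem can be illustrated with a minimal base R sketch of a consensus clustering step. This is not the package's implementation: the function `consensusSketch` and the toy matrix are hypothetical, and k-means is assumed as the inner clustering method (as with ccClMethod = "kmeans").

```r
set.seed(1)

# Toy binary matrix of clusters: 20 items in two groups with distinct patterns
moc <- rbind(
  matrix(rep(c(1, 0, 1, 0), each = 10), nrow = 10),
  matrix(rep(c(0, 1, 0, 1), each = 10), nrow = 10)
)

# Hypothetical sketch: subsample items B times, cluster each subsample,
# and record how often each pair of items ends up in the same cluster
consensusSketch <- function(moc, K, B = 100, pItem = 0.8) {
  N <- nrow(moc)
  together <- matrix(0, N, N)  # times i and j fell in the same cluster
  sampled <- matrix(0, N, N)   # times i and j were sampled together
  for (b in seq_len(B)) {
    idx <- sort(sample(N, size = max(K, round(pItem * N))))
    cl <- kmeans(moc[idx, , drop = FALSE], centers = K)$cluster
    sampled[idx, idx] <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  # Proportion of co-clustering among the iterations where both were sampled
  together / pmax(sampled, 1)
}

cm <- consensusSketch(moc, K = 2)
```

Larger values of B give more stable proportions at a higher computational cost, while pItem controls how much each subsample differs from the full data set.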
This function returns a list containing:
a symmetric matrix where the element in position (i,j) corresponds to the proportion of times that items i and j have been clustered together.
the final cluster labels.
the final number of clusters. If provided by the user, this is the same as the input. Otherwise, this is the number of clusters selected via the requested method (see argument choiceKmethod).
if returnAllMatrices = TRUE, this array is also returned, containing the consensus matrices obtained for each of the numbers of clusters considered by the algorithm.
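As a rough illustration of how a consensus matrix relates to the final cluster labels, the sketch below applies hierarchical clustering to a toy consensus matrix. It assumes that 1 minus the consensus matrix is used as the dissimilarity passed to hclust, which is a common choice but is not confirmed by this page; the matrix `cm` is made up for the example.

```r
# Toy consensus matrix for 6 items forming two groups of three
cm <- matrix(0.1, 6, 6)
cm[1:3, 1:3] <- 0.9
cm[4:6, 4:6] <- 0.9
diag(cm) <- 1

# Assumption: items that are often clustered together are "close",
# so 1 - consensus is treated as a dissimilarity
hc <- hclust(as.dist(1 - cm), method = "average")

# Cut the dendrogram into K = 2 clusters to obtain the labels
labels <- cutree(hc, k = 2)
print(labels)
```

The `method` argument plays the role of hclustMethod, and `k` the role of the final number of clusters.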
The Cancer Genome Atlas, 2012. Comprehensive molecular portraits of human breast tumours. Nature, 487(7407), pp.61–70.
Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of 'omic datasets. arXiv preprint. arXiv:1904.07701.
# Load data
data <- list()
data[[1]] <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
                                            package = "coca"), row.names = 1))
data[[2]] <- as.matrix(read.csv(system.file("extdata", "dataset2.csv",
                                            package = "coca"), row.names = 1))
data[[3]] <- as.matrix(read.csv(system.file("extdata", "dataset3.csv",
                                            package = "coca"), row.names = 1))

# Build matrix of clusters
outputBuildMOC <- buildMOC(data, M = 3, K = 5, distances = "cor")

# Extract matrix of clusters
moc <- outputBuildMOC$moc

# Do Cluster-Of-Clusters Analysis
outputCOCA <- coca(moc, K = 5)

# Extract cluster labels
clusterLabels <- outputCOCA$clusterLabels