This function performs Cluster-Of-Clusters Analysis on a binary matrix in which each row corresponds to a data point, each column corresponds to a cluster, and the element in position (i,j) is equal to 1 if data point i belongs to cluster j, and 0 otherwise.
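As a rough illustration of the input format described above, the following base R sketch builds a small binary matrix of clusters by hand. The label vectors and the helper `indicator()` are hypothetical and exist only for this example; in practice the matrix would normally come from buildMOC, as in the Examples section.

```r
# Two hypothetical clusterings of the same six data points
labels1 <- c(1, 1, 2, 2, 3, 3)  # first clustering, three clusters
labels2 <- c(1, 2, 1, 2, 1, 2)  # second clustering, two clusters

# Hypothetical helper: one indicator column per cluster, so that
# entry (i, j) is 1 if data point i belongs to cluster j, 0 otherwise
indicator <- function(labels) {
  clusters <- sort(unique(labels))
  outer(labels, clusters, FUN = "==") * 1
}

# Bind the indicator columns of all clusterings side by side
mocExample <- cbind(indicator(labels1), indicator(labels2))
dim(mocExample)  # 6 rows (data points), 5 columns (clusters)
```

Each row sums to the number of clusterings (here 2), because every data point belongs to exactly one cluster per clustering.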

coca(
  moc,
  K = NULL,
  maxK = 6,
  B = 1000,
  pItem = 0.8,
  hclustMethod = "average",
  choiceKmethod = "silhouette",
  ccClMethod = "kmeans",
  ccDistHC = "euclidean",
  maxIterKM = 1000,
  savePNG = FALSE,
  fileName = "coca",
  verbose = FALSE,
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE,
  returnAllMatrices = FALSE
)

Arguments

moc

N x C binary matrix of clusters, where N is the number of data points and C is the total number of clusters considered.

K

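Number of clusters for the final clustering. Default is NULL, in which case the number of clusters is chosen with the method given in choiceKmethod, considering values up to maxK.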

maxK

Maximum number of clusters considered for the final clustering if K is not known. Default is 6.

B

Number of iterations of the Consensus Clustering step. Default is 1000.

pItem

Proportion of items sampled at each iteration of the Consensus Clustering step. Default is 0.8.

hclustMethod

Agglomeration method to be used by the hclust function to perform hierarchical clustering on the consensus matrix. Can be "single", "complete", "average", etc. Default is "average". For more details, see ?stats::hclust.

choiceKmethod

Method used to choose the number of clusters if K is NULL. Can be either "AUC" (area under the curve, work in progress) or "silhouette". Default is "silhouette".

ccClMethod

Clustering method to be used by the Consensus Clustering algorithm (CC). Can be either "kmeans" for k-means clustering or "hclust" for hierarchical clustering. Default is "kmeans".

ccDistHC

Distance to be used by the hierarchical clustering algorithm inside CC. Can be "pearson" (for 1 - Pearson correlation), "spearman" (for 1 - Spearman correlation), or any of the distances provided in stats::dist() (i.e. "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"). Default is "euclidean".

maxIterKM

Maximum number of iterations for the k-means clustering algorithm. Default is 1000.

savePNG

Boolean. Save plots as PNG files. Default is FALSE.

fileName

If savePNG is TRUE, this is the string containing (the first part of) the name of the output files. Can be used to specify the folder path too. Default is "coca". The ".png" extension is automatically added to this string.

verbose

Boolean. Default is FALSE.

widestGap

Boolean. If TRUE, also compute the widest gap index to choose the best number of clusters. Default is FALSE.

dunns

Boolean. If TRUE, also compute Dunn's index to choose the best number of clusters. Default is FALSE.

dunn2s

Boolean. If TRUE, also compute the alternative Dunn's index to choose the best number of clusters. Default is FALSE.

returnAllMatrices

Boolean. If TRUE, return consensus matrices for all considered values of K. Default is FALSE.
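To make the "pearson" option of ccDistHC and the hclustMethod argument more concrete, here is a minimal base R sketch of hierarchical clustering with a 1 - Pearson correlation distance. It is an illustration under stated assumptions, not the package's internal code: the data matrix `x` is simulated, and the number of clusters (3) is chosen arbitrarily. Note that the correlation is undefined for rows with zero variance, which this toy data avoids.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)  # 10 hypothetical items, 4 features

# 1 - Pearson correlation between rows, converted to a "dist" object
pearsonDist <- as.dist(1 - cor(t(x), method = "pearson"))

# Hierarchical clustering with average linkage, then cut into 3 clusters
tree <- hclust(pearsonDist, method = "average")
labels <- cutree(tree, k = 3)
table(labels)
```

Replacing `method = "pearson"` with `method = "spearman"` in the call to `cor()` gives the corresponding 1 - Spearman correlation distance.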

Value

This function returns a list containing:

consensusMatrix

a symmetric matrix where the element in position (i,j) corresponds to the proportion of times that items i and j have been clustered together.

clusterLabels

the final cluster labels.

K

the final number of clusters. If provided by the user, this is the same as the input. Otherwise, this is the number of clusters selected via the requested method (see argument choiceKmethod).

consensusMatrices

if returnAllMatrices = TRUE, this array is also returned, containing the consensus matrices obtained for each of the numbers of clusters considered by the algorithm.
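The way a consensus matrix of this kind can arise from B resampling iterations with proportion pItem may be easier to see in code. The following base R sketch is a simplified illustration, not necessarily how the package computes it: the data `x`, the small value of B, and the variable names `together` and `sampled` are all assumptions made for the example.

```r
set.seed(42)
# Hypothetical data: 20 items in two well-separated groups
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
N <- nrow(x)
B <- 50       # number of resampling iterations (kept small here)
pItem <- 0.8  # proportion of items sampled at each iteration
K <- 2        # number of clusters

together <- matrix(0, N, N)  # times items i and j fell in the same cluster
sampled <- matrix(0, N, N)   # times items i and j were sampled together

for (b in seq_len(B)) {
  idx <- sample(N, size = floor(pItem * N))
  cl <- kmeans(x[idx, , drop = FALSE], centers = K, iter.max = 1000)$cluster
  sampled[idx, idx] <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, FUN = "==")
}

# Proportion of co-clustering among the iterations where both were sampled
consensus <- together / sampled
consensus[sampled == 0] <- 0  # pairs never sampled together
diag(consensus) <- 1

# Final labels: hierarchical clustering on 1 - consensus, average linkage
finalLabels <- cutree(hclust(as.dist(1 - consensus), method = "average"), k = K)
```

Dividing by the number of times a pair was sampled together, rather than by B, keeps the entries comparable even though each item appears in only a subset of the iterations.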

References

The Cancer Genome Atlas, 2012. Comprehensive molecular portraits of human breast tumours. Nature, 487(7407), pp.61–70.

Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of 'omic datasets. arXiv preprint. arXiv:1904.07701.

Examples

# Load data
data <- list()
data[[1]] <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
  package = "coca"), row.names = 1))
data[[2]] <- as.matrix(read.csv(system.file("extdata", "dataset2.csv",
  package = "coca"), row.names = 1))
data[[3]] <- as.matrix(read.csv(system.file("extdata", "dataset3.csv",
  package = "coca"), row.names = 1))

# Build matrix of clusters
outputBuildMOC <- buildMOC(data, M = 3, K = 5, distances = "cor")

# Extract matrix of clusters
moc <- outputBuildMOC$moc

# Do Cluster-Of-Clusters Analysis
outputCOCA <- coca(moc, K = 5)

# Extract cluster labels
clusterLabels <- outputCOCA$clusterLabels