This function creates a matrix of clusters starting from a list of heterogeneous datasets.
buildMOC(
  data,
  M,
  K = NULL,
  maxK = 10,
  methods = "hclust",
  distances = "euclidean",
  fill = FALSE,
  computeAccuracy = FALSE,
  fullData = FALSE,
  savePNG = FALSE,
  fileName = "buildMOC",
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE
)
Argument | Description |
---|---|
data | List of M datasets, each of size N x P_m, where m = 1, ..., M. |
M | Number of datasets. |
K | Vector containing the number of clusters in each dataset. If given an integer instead of a vector it is assumed that each dataset has the same number of clusters. If NULL, it is assumed that the true cluster numbers are not known, therefore they will be estimated using the silhouette method. |
maxK | Vector of maximum cluster numbers to be considered for each dataset if K is NULL. If given an integer instead of a vector it is assumed that for each dataset the same maximum number of clusters must be considered. Default is 10. |
methods | Vector of strings containing the names of the clustering methods to be used to cluster the observations in each dataset. Each can be "kmeans" (k-means clustering), "hclust" (hierarchical clustering), or "pam" (partitioning around medoids). If the vector is of length one, the same clustering method is applied to all the datasets. Default is "hclust". |
distances | Distances to be used in the clustering step for each dataset. If only one string is provided, then the same distance is used for all datasets. If the number of strings provided is the same as the number of datasets, then each distance will be used for the corresponding dataset. Default is "euclidean". Please note that not all distances are compatible with all clustering methods. "euclidean" and "manhattan" work with all available clustering algorithms. "gower" distance is only available for partitioning around medoids. In addition, "maximum", "canberra", "binary" or "minkowski" are available for k-means and hierarchical clustering. |
fill | Boolean. If TRUE and there are missing observations in one or more datasets, the corresponding cluster labels are estimated through generalised linear models on the basis of the available labels. |
computeAccuracy | Boolean. If TRUE, for each missing element, the performance of the predictive model used to estimate the corresponding missing label is computed. |
fullData | Boolean. If TRUE, the full data matrices are used to estimate the missing cluster labels (instead of just using the cluster labels of the corresponding datasets). |
savePNG | Boolean. If TRUE, plots of the silhouette for each dataset are saved as png files. Default is FALSE. |
fileName | If |
widestGap | Boolean. If TRUE, also compute the widest gap index to choose the best number of clusters. Default is FALSE. |
dunns | Boolean. If TRUE, also compute Dunn's index to choose the best number of clusters. Default is FALSE. |
dunn2s | Boolean. If TRUE, also compute the alternative Dunn's index to choose the best number of clusters. Default is FALSE. |
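
When K is NULL, the number of clusters in each dataset is estimated with the silhouette method. The following is a minimal sketch of that idea for a single dataset, assuming hierarchical clustering and the `silhouette()` function from the `cluster` package. The helper name `chooseKBySilhouette` and the toy data are hypothetical, and this is not necessarily how `buildMOC` implements the selection internally.

```r
# Illustrative only: choose the number of clusters for one dataset by
# maximising the average silhouette width. Hypothetical helper, not part of coca.
library(cluster)  # provides silhouette()

chooseKBySilhouette <- function(x, maxK = 10, distance = "euclidean") {
  d <- dist(x, method = distance)
  tree <- hclust(d)        # complete linkage, the hclust() default
  candidateK <- 2:maxK     # the silhouette is undefined for a single cluster
  avgWidth <- sapply(candidateK, function(k) {
    labels <- cutree(tree, k = k)
    mean(silhouette(labels, d)[, "sil_width"])
  })
  list(K = candidateK[which.max(avgWidth)], avgWidth = avgWidth)
}

# Toy data (an assumption for this sketch): three well-separated groups
# of 20 observations each, in two dimensions.
set.seed(1)
centres <- rep(c(0, 10, 20), each = 20)
x <- cbind(centres + rnorm(60, sd = 0.5), centres + rnorm(60, sd = 0.5))

result <- chooseKBySilhouette(x, maxK = 6)
result$K
```

The candidate values start at 2 because the silhouette width compares each observation's own cluster with the nearest other cluster, so at least two clusters are needed.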
This function returns a list containing:
the Matrix-Of-Clusters, a binary matrix of size N x sum(K) where element (n,k) contains a 1 if observation n belongs to the corresponding cluster, 0 otherwise.
a vector of length sum(K) in which each element is the number of the dataset to which the cluster belongs.
the total number of NAs in the matrix of clusters. (If the MOC has been filled with imputed values, number_nas indicates the number of NAs in the original MOC.)
a matrix that is equivalent to the matrix of clusters, but is in compact form, i.e. each column corresponds to a dataset, each row represents an observation, and its values indicate the cluster labels.
a vector of cluster numbers in each dataset. If these are provided as input, this is the same as the input (expanded to a vector if the input is an integer). If the cluster numbers are not provided as input, this vector contains the cluster numbers chosen via silhouette for each dataset.
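
The relationship between the compact label matrix and the binary matrix of clusters described above can be sketched in base R as follows. The helper `labelsToMOC`, the names `clLabels` and `datasetIndicator`, and the toy labels are assumptions made for illustration. Counting NAs per entry of the binary matrix is also an assumption, and the package may count them differently.

```r
# Illustrative only: expand a compact label matrix (N x M) into a binary
# matrix of clusters (N x sum(K)). Helper and variable names are hypothetical.
labelsToMOC <- function(clLabels, K) {
  N <- nrow(clLabels)
  M <- ncol(clLabels)
  moc <- matrix(0, nrow = N, ncol = sum(K))
  # Dataset index of each column of the binary matrix
  datasetIndicator <- rep(seq_len(M), times = K)
  # Number of columns that precede the block belonging to each dataset
  offset <- c(0, cumsum(K))[seq_len(M)]
  for (m in seq_len(M)) {
    for (k in seq_len(K[m])) {
      # 1 if the observation is in cluster k of dataset m, 0 otherwise;
      # a missing label propagates as NA
      moc[, offset[m] + k] <- as.numeric(clLabels[, m] == k)
    }
  }
  list(moc = moc,
       datasetIndicator = datasetIndicator,
       number_nas = sum(is.na(moc)))
}

# Four observations, two datasets with 2 and 3 clusters respectively;
# observation 3 has no label in the second dataset.
clLabels <- cbind(c(1, 2, 2, 1), c(3, 1, NA, 2))
out <- labelsToMOC(clLabels, K = c(2, 3))
out$moc
```

In this toy case the result has 4 rows and 5 columns: the first two columns belong to the first dataset and the remaining three to the second.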
The Cancer Genome Atlas, 2012. Comprehensive molecular portraits of human breast tumours. Nature, 487(7407), pp.61–70.
Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, pp.53–65.
# Load data
data <- list()
data[[1]] <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
                                            package = "coca"), row.names = 1))
data[[2]] <- as.matrix(read.csv(system.file("extdata", "dataset2.csv",
                                            package = "coca"), row.names = 1))
data[[3]] <- as.matrix(read.csv(system.file("extdata", "dataset3.csv",
                                            package = "coca"), row.names = 1))

# Build matrix of clusters
outputBuildMOC <- buildMOC(data, M = 3, K = 6, distances = "cor")

# Extract matrix of clusters
matrixOfClusters <- outputBuildMOC$moc