This function creates a matrix of clusters starting from a list of heterogeneous datasets.

buildMOC(
  data,
  M,
  K = NULL,
  maxK = 10,
  methods = "hclust",
  distances = "euclidean",
  fill = FALSE,
  computeAccuracy = FALSE,
  fullData = FALSE,
  savePNG = FALSE,
  fileName = "buildMOC",
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE
)

Arguments

data

List of M datasets, each of size N x P_m, where m = 1, ..., M.

M

Number of datasets.

K

Vector containing the number of clusters in each dataset. If given an integer instead of a vector, each dataset is assumed to have the same number of clusters. If NULL, the true cluster numbers are assumed to be unknown and are estimated using the silhouette method.

maxK

Vector of maximum cluster numbers to be considered for each dataset if K is NULL. If given an integer instead of a vector, the same maximum number of clusters is considered for every dataset. Default is 10.

methods

Vector of strings containing the names of the clustering methods to be used to cluster the observations in each dataset. Each can be "kmeans" (k-means clustering), "hclust" (hierarchical clustering), or "pam" (partitioning around medoids). If the vector is of length one, the same clustering method is applied to all the datasets. Default is "hclust".

distances

Distances to be used in the clustering step for each dataset. If only one string is provided, then the same distance is used for all datasets. If the number of strings provided is the same as the number of datasets, then each distance will be used for the corresponding dataset. Default is "euclidean". Please note that not all distances are compatible with all clustering methods. "euclidean" and "manhattan" work with all available clustering algorithms. "gower" distance is only available for partitioning around medoids. In addition, "maximum", "canberra", "binary" or "minkowski" are available for k-means and hierarchical clustering.

fill

Boolean. If TRUE and there are missing observations in one or more datasets, the corresponding cluster labels are estimated through generalised linear models on the basis of the available labels. Default is FALSE.

computeAccuracy

Boolean. If TRUE, for each missing element, the performance of the predictive model used to estimate the corresponding missing label is computed. Default is FALSE.

fullData

Boolean. If TRUE, the full data matrices are used to estimate the missing cluster labels (instead of just using the cluster labels of the corresponding datasets). Default is FALSE.

savePNG

Boolean. If TRUE, a silhouette plot for each dataset is saved as a PNG file. Default is FALSE.

fileName

If savePNG is TRUE, this is the string containing the name of the output files. Can be used to specify the folder path too. Default is "buildMOC". The ".png" extension is automatically added to this string.

widestGap

Boolean. If TRUE, also compute the widest gap index to choose the best number of clusters. Default is FALSE.

dunns

Boolean. If TRUE, also compute Dunn's index to choose the best number of clusters. Default is FALSE.

dunn2s

Boolean. If TRUE, also compute the alternative Dunn's index to choose the best number of clusters. Default is FALSE.
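
When K is NULL, the number of clusters in each dataset is estimated with the silhouette method, considering values up to maxK. The following is a minimal sketch of that idea for a single dataset, not the package's internal code. It assumes hierarchical clustering with the default linkage of stats::hclust and uses silhouette() from the cluster package; the helper name chooseKBySilhouette and the toy data are hypothetical.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper (not part of coca): choose the number of clusters for
# one dataset by maximising the average silhouette width over k = 2, ..., maxK.
chooseKBySilhouette <- function(x, maxK = 10, distance = "euclidean") {
  d <- stats::dist(x, method = distance)   # pairwise distances between rows
  hc <- stats::hclust(d)                   # assumption: default linkage
  candidates <- 2:maxK
  avgWidth <- vapply(candidates, function(k) {
    labels <- stats::cutree(hc, k = k)     # cluster labels for this k
    mean(cluster::silhouette(labels, d)[, "sil_width"])
  }, numeric(1))
  candidates[which.max(avgWidth)]
}

# Toy data: two well-separated groups of 20 observations each
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
bestK <- chooseKBySilhouette(x, maxK = 5)
```

Applying such a helper to every element of data, with the corresponding entry of maxK, would give one estimated cluster number per dataset, which is the shape of the K vector described above.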

Value

This function returns a list containing:

moc

the Matrix-Of-Clusters, a binary matrix of size N x sum(K) where element (n,k) contains a 1 if observation n belongs to the corresponding cluster, 0 otherwise.

datasetIndicator

a vector of length sum(K) in which each element is the number of the dataset to which the cluster belongs.

number_nas

the total number of NAs in the matrix of clusters. (If the MOC has been filled with imputed values, number_nas indicates the number of NAs in the original MOC.)

clLabels

a matrix that is equivalent to the matrix of clusters, but is in compact form, i.e. each column corresponds to a dataset, each row represents an observation, and its values indicate the cluster labels.

K

vector of cluster numbers in each dataset. If these are provided as input, this is the same as the input (expanded to a vector if the input is an integer). If the cluster numbers are not provided as input, this vector contains the cluster numbers chosen via silhouette for each dataset.
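
The relationship between clLabels, K, moc and datasetIndicator can be illustrated with a small base R sketch. The label matrix below is made up, and the column ordering (all clusters of dataset 1 first, then those of dataset 2, and so on) is an assumption about the layout rather than something stated above.

```r
# Hypothetical compact label matrix: 4 observations (rows), 2 datasets (columns)
clLabels <- cbind(c(1, 1, 2, 2),
                  c(1, 2, 3, 1))
K <- c(2, 3)  # number of clusters in each dataset

# One binary indicator column per (dataset, cluster) pair: N x sum(K) in total
moc <- do.call(cbind, lapply(seq_along(K), function(m) {
  sapply(seq_len(K[m]), function(k) as.integer(clLabels[, m] == k))
}))

# Dataset of origin for each column of moc (length sum(K))
datasetIndicator <- rep(seq_along(K), times = K)
```

With no missing labels, every row of moc sums to the number of datasets, because each observation belongs to exactly one cluster per dataset. A missing label in clLabels would propagate as NA to the corresponding columns in this sketch.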

References

The Cancer Genome Atlas, 2012. Comprehensive molecular portraits of human breast tumours. Nature, 487(7407), pp.61–70.

Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, pp.53–65.

Examples

# Load data
data <- list()
data[[1]] <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
  package = "coca"), row.names = 1))
data[[2]] <- as.matrix(read.csv(system.file("extdata", "dataset2.csv",
  package = "coca"), row.names = 1))
data[[3]] <- as.matrix(read.csv(system.file("extdata", "dataset3.csv",
  package = "coca"), row.names = 1))

# Build matrix of clusters
outputBuildMOC <- buildMOC(data, M = 3, K = 6, distances = "cor")

# Extract matrix of clusters
matrixOfClusters <- outputBuildMOC$moc