This function fills in a matrix of cluster labels that contains NAs by estimating each missing label from the available labels or from the other datasets. The predictive accuracy of the imputation can also be estimated via cross-validation.

fillMOC(clLabels, data, computeAccuracy = FALSE, verbose = FALSE)

Arguments

clLabels

N x M matrix containing cluster labels. Element (n,m) contains the cluster label for data point n in dataset m.

data

List of M datasets to be used for the label imputation.

computeAccuracy

Boolean. If TRUE, for each missing element, the performance of the predictive model used to estimate the corresponding missing label is computed. Default is FALSE.

verbose

Boolean. If TRUE, for each NA, the size of the matrix used to estimate its value is printed to the screen. Default is FALSE.
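
As a rough illustration of the input shapes described above, the following base R sketch builds a toy clLabels matrix and a matching list of datasets. The sizes, the random values, the row names and the choice of which entries are NA are all assumptions made for illustration; real inputs would normally come from buildMOC.

```r
set.seed(42)
N <- 8  # number of data points (hypothetical)
M <- 3  # number of datasets (hypothetical)

# List of M toy datasets, each with N rows and 4 covariates.
data <- lapply(seq_len(M), function(m) {
  matrix(rnorm(N * 4), nrow = N,
         dimnames = list(paste0("obs", seq_len(N)), NULL))
})

# N x M matrix of cluster labels, one column per dataset.
clLabels <- matrix(sample(1:2, N * M, replace = TRUE), nrow = N, ncol = M,
                   dimnames = list(paste0("obs", seq_len(N)), NULL))

# Mark two labels in the third column as missing.
clLabels[c(2, 5), 3] <- NA

# Positions of the entries that an imputation step would need to fill.
which(is.na(clLabels), arr.ind = TRUE)
```

Objects shaped like these are what the clLabels and data arguments describe; this toy matrix is not meant to produce meaningful imputations.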

Value

The output is a list containing:

fullClLabels

the same matrix of clusters as the input matrix clLabels, with NAs replaced by their estimates where possible.

nRows

matrix where the item in position (i,j) indicates the number of observations used by the predictive model that estimated the corresponding missing label in the fullClLabels matrix.

nColumns

matrix where the item in position (i,j) indicates the number of covariates used by the predictive model that estimated the corresponding missing label in the fullClLabels matrix.

accuracy

a matrix where each element corresponds to the predictive accuracy of the predictive model used to estimate the corresponding label in the cluster label matrix. This is only returned if the argument computeAccuracy is set to TRUE.

accuracy_random

This is computed in the same way as accuracy, but with the labels randomly shuffled. It provides a baseline against which the predictive accuracy of the imputation algorithm can be assessed, and is returned only if the argument computeAccuracy is set to TRUE.
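
The idea of comparing accuracy with a shuffled-label baseline can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the procedure implemented in fillMOC: the toy data, the helper looAccuracy, the leave-one-out scheme and the nearest-centroid classifier are all hypothetical stand-ins chosen for brevity.

```r
set.seed(1)

# Toy data: 60 observations, 2 covariates, 3 well-separated clusters.
labels <- rep(1:3, each = 20)
x <- cbind(rnorm(60, mean = 3 * labels), rnorm(60, mean = -3 * labels))

# Leave-one-out accuracy of a nearest-centroid classifier
# (a stand-in for illustration, not the model used by fillMOC).
looAccuracy <- function(x, y) {
  classes <- sort(unique(y))
  hits <- vapply(seq_len(nrow(x)), function(i) {
    # Class centroids computed without observation i.
    centroids <- t(vapply(classes, function(k) {
      colMeans(x[-i, , drop = FALSE][y[-i] == k, , drop = FALSE])
    }, numeric(ncol(x))))
    # Squared Euclidean distance from observation i to each centroid.
    d <- rowSums(sweep(centroids, 2, x[i, ])^2)
    classes[which.min(d)] == y[i]
  }, logical(1))
  mean(hits)
}

# Accuracy with the true labels, and with the labels randomly shuffled.
accuracy <- looAccuracy(x, labels)
accuracyRandom <- looAccuracy(x, sample(labels))
```

When the covariates carry real information about the labels, accuracy should sit well above accuracyRandom; values that are close to each other suggest the predictions are no better than chance.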

References

The Cancer Genome Atlas, 2012. Comprehensive molecular portraits of human breast tumours. Nature, 487(7407), pp.61–70.

Examples

# Load data
data <- list()
data[[1]] <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
                                            package = "coca"), row.names = 1))
data[[2]] <- as.matrix(read.csv(system.file("extdata", "dataset2.csv",
                                            package = "coca"), row.names = 1))
data[[3]] <- as.matrix(read.csv(system.file("extdata", "dataset3.csv",
                                            package = "coca"), row.names = 1))

# Build matrix of clusters
outputBuildMOC <- buildMOC(data, M = 3, K = 6, distances = "cor")

# Extract matrix of clusters
clLabels <- outputBuildMOC$clLabels

# Impute missing values using full datasets
outputFillMOC <- fillMOC(clLabels, data)

# Extract full matrix of cluster labels
clLabels2 <- outputFillMOC$fullClLabels