This function performs Kernel Learning Integrative Clustering (KLIC) on M data sets that describe the same observations. The similarities between the observations in each data set are summarised in M kernels, one per data set, which are then fed into a kernel k-means clustering algorithm. The output is a clustering of the observations that takes all the available data types into account, together with a set of weights that sum to one and indicate how much each data set contributed to the kernel k-means clustering.
klic(
  data,
  M,
  individualK = NULL,
  individualMaxK = 6,
  individualClAlgorithm = "kkmeans",
  globalK = NULL,
  globalMaxK = 6,
  B = 1000,
  C = 100,
  scale = FALSE,
  savePNG = FALSE,
  fileName = "klic",
  verbose = TRUE,
  annotations = NULL,
  ccClMethods = "kmeans",
  ccDistHCs = "euclidean",
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE
)
Argument | Description |
---|---|
data | List of M datasets, each of size N x P_m, m = 1, ..., M. |
M | Number of datasets. |
individualK | Vector containing the number of clusters in each dataset. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and individualMaxK are considered and the best value is chosen for each dataset by maximising the silhouette. |
individualMaxK | Maximum number of clusters considered for the individual data. Default is 6. |
individualClAlgorithm | Clustering algorithm used to cluster each dataset individually when the best number of clusters has to be found. Default is "kkmeans". |
globalK | Number of global clusters. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and globalMaxK are considered and the best value is chosen by maximising the silhouette. |
globalMaxK | Maximum number of clusters considered for the final clustering. Default is 6. |
B | Number of iterations for consensus clustering. Default is 1000. |
C | Maximum number of iterations for localised kernel k-means. Default is 100. |
scale | Boolean. If TRUE, each dataset is scaled so that each column has zero mean and unit variance. Default is FALSE. |
savePNG | Boolean. If TRUE, a plot of the silhouette is saved in the working folder. Default is FALSE. |
fileName | If |
verbose | Boolean. Default is TRUE. |
annotations | Data frame containing annotations for final plot. |
ccClMethods | The i-th element of this vector goes into the |
ccDistHCs | The i-th element of this vector goes into the |
widestGap | Boolean. If TRUE, the widest gap index is also computed to choose the best number of clusters. Default is FALSE. |
dunns | Boolean. If TRUE, Dunn's index is also computed to choose the best number of clusters. Default is FALSE. |
dunn2s | Boolean. If TRUE, the alternative Dunn's index is also computed to choose the best number of clusters. Default is FALSE. |
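The scaling and silhouette-based choice of the number of clusters described in the table above can be illustrated with a small sketch. It is not the code that klic runs internally: it assumes ordinary k-means and Euclidean distances as stand-ins for the kernel-based algorithms, uses made-up toy data, and relies on the `cluster` package being installed. The names `X`, `maxK`, `avg_sil` and `bestK` are hypothetical.

```r
library(cluster)  # provides silhouette(); assumed to be installed

set.seed(1)

# Toy data (hypothetical): two groups of 20 observations, 2 variables each
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

# What scale = TRUE is described as doing: zero mean, unit variance per column
X_scaled <- scale(X)

# Try every number of clusters between 2 and maxK and keep the one
# with the highest average silhouette width
maxK <- 6
D <- dist(X_scaled)
avg_sil <- sapply(2:maxK, function(k) {
  cl <- kmeans(X_scaled, centers = k, nstart = 10)$cluster  # stand-in algorithm
  mean(silhouette(cl, D)[, "sil_width"])
})
bestK <- (2:maxK)[which.max(avg_sil)]
bestK
```

Supplying `individualK` or `globalK` explicitly skips this kind of search, which is why the example at the end of this page sets both.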
The function returns a list containing:
an array containing one consensus matrix per data set.
a vector containing the weights assigned by the kernel k-means algorithm to each consensus matrix.
the weighted kernel matrix obtained by taking a weighted sum of all kernels, where the weights are those specified in the weights matrix.
a vector containing the cluster labels of the observations, according to kernel k-means clustering done on the kernel matrices.
a vector containing the best number of clusters between 2 and individualMaxK for each kernel. These are chosen so as to maximise the silhouette and are only returned if the number of clusters individualK is not provided.
the best number of clusters for the final (global) clustering. This is chosen so as to maximise the silhouette and is only returned if the final number of clusters globalK is not provided.
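To make the weighted kernel matrix in the list above concrete, the following sketch combines M kernels with one weight per kernel. It is an illustration under simplifying assumptions, not the klic implementation: the kernels are hypothetical linear kernels built from random data, and the weights in `w` are made up rather than learned by kernel k-means.

```r
set.seed(2)
N <- 5  # number of observations
M <- 3  # number of datasets / kernels

# Hypothetical stand-in kernels: one N x N linear kernel per dataset
kernels <- array(NA_real_, dim = c(N, N, M))
for (m in 1:M) {
  X <- matrix(rnorm(N * 2), nrow = N)
  kernels[, , m] <- tcrossprod(X)  # X %*% t(X), symmetric positive semi-definite
}

# Made-up weights, one per kernel, summing to one
w <- c(0.5, 0.3, 0.2)

# Weighted sum of the kernels
weighted <- matrix(0, N, N)
for (m in 1:M) {
  weighted <- weighted + w[m] * kernels[, , m]
}
weighted
```

A kernel with a larger weight has more influence on the combined matrix, and therefore on the clustering derived from it.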
Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.
if (requireNamespace("Rmosek", quietly = TRUE) &&
    (!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))) {

  # Load synthetic data
  data1 <- as.matrix(read.csv(system.file('extdata', 'dataset1.csv', package = 'klic'), row.names = 1))
  data2 <- as.matrix(read.csv(system.file('extdata', 'dataset2.csv', package = 'klic'), row.names = 1))
  data3 <- as.matrix(read.csv(system.file('extdata', 'dataset3.csv', package = 'klic'), row.names = 1))
  data <- list(data1, data2, data3)

  # Perform clustering with KLIC assuming to know the
  # number of clusters in each individual dataset and in
  # the final clustering
  klicOutput <- klic(data, 3, individualK = c(4, 4, 4), globalK = 4, B = 30, C = 5)

  # Extract cluster labels
  klic_labels <- klicOutput$globalClusterLabels

  cluster_labels <- as.matrix(read.csv(system.file('extdata', 'cluster_labels.csv', package = 'klic'), row.names = 1))

  # Compute ARI
  ari <- mclust::adjustedRandIndex(klic_labels, cluster_labels)
}
#> All datasets contain the same number of observations 100 .
#> We assume that the observations are the same in each dataset and that they are in the same order.
#> [1] "*** Generating similarity matrices ***"
#>   |                                                                      |   0%
#>   |=======================                                               |  33%
#>   |===============================================                       |  67%
#>   |======================================================================| 100%
#> [1] "*** Finding the global clustering ***"
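The example ends by computing the adjusted Rand index (ARI) between the KLIC labels and the reference labels. For readers unfamiliar with it, here is a base-R sketch of the standard ARI formula. The function name `ari_sketch` is hypothetical and the sketch is only meant to show what the number measures; it is not a replacement for `mclust::adjustedRandIndex`.

```r
# Adjusted Rand index between two label vectors of equal length.
# A value of 1 means the two partitions are identical up to relabelling;
# values near 0 mean agreement is about what chance would give.
ari_sketch <- function(a, b) {
  tab <- table(a, b)                         # contingency table of the two partitions
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))              # pairs placed together in both partitions
  sum_a <- sum(choose(rowSums(tab), 2))      # pairs placed together in a
  sum_b <- sum(choose(colSums(tab), 2))      # pairs placed together in b
  expected <- sum_a * sum_b / choose(n, 2)   # expected value under random labelling
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

ari_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1))  # 1: same partition, labels swapped
```

Because the index ignores the label values themselves, it suits comparisons where cluster numbering is arbitrary, as it is for the output of a clustering algorithm.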