Kernel learning integrative clustering

This function performs Kernel Learning Integrative Clustering (KLIC) on M datasets relating to the same observations. The similarities between the observations in each dataset are summarised into M different kernels, which are then fed into a kernel k-means clustering algorithm. The output is a clustering of the observations that takes all the available data types into account, together with a set of weights that sum to one and indicate how much each dataset contributed to the kernel k-means clustering.

klic(
  data,
  M,
  individualK = NULL,
  individualMaxK = 6,
  individualClAlgorithm = "kkmeans",
  globalK = NULL,
  globalMaxK = 6,
  B = 1000,
  C = 100,
  scale = FALSE,
  savePNG = FALSE,
  fileName = "klic",
  verbose = TRUE,
  annotations = NULL,
  ccClMethods = "kmeans",
  ccDistHCs = "euclidean",
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE
)

Arguments

data	List of M datasets, each of size N X P_m, m = 1, ..., M.
M	Number of datasets.
individualK	Vector containing the number of clusters in each dataset. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and individualMaxK are considered and the best value is chosen for each dataset by maximising the silhouette.
individualMaxK	Maximum number of clusters considered for the individual data. Default is 6.
individualClAlgorithm	Clustering algorithm used to cluster each dataset individually when the best number of clusters has to be found. Default is "kkmeans".
globalK	Number of global clusters. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and globalMaxK are considered and the best value is chosen by maximising the silhouette.
globalMaxK	Maximum number of clusters considered for the final clustering. Default is 6.
B	Number of iterations for consensus clustering. Default is 1000.
C	Maximum number of iterations for localised kernel k-means. Default is 100.
scale	Boolean. If TRUE, each dataset is scaled such that each column has zero mean and unit variance. Default is FALSE.
savePNG	Boolean. If TRUE, a plot of the silhouette is saved in the working folder. Default is FALSE.
fileName	If `savePNG` is TRUE, this is the name of the png file. Can be used to specify the folder path too. Default is "klic".
verbose	Boolean. Default is TRUE.
annotations	Data frame containing annotations for final plot.
ccClMethods	The i-th element of this vector goes into the `clMethod` argument of `consensusCluster()` for the i-th dataset. If only one string is provided, then the same method is used for all datasets. Default is "kmeans".
ccDistHCs	The i-th element of this vector goes into the `dist` argument of `consensusCluster()` for the i-th dataset. Default is "euclidean".
widestGap	Boolean. If TRUE, the widest gap index is also computed to choose the best number of clusters. Default is FALSE.
dunns	Boolean. If TRUE, Dunn's index is also computed to choose the best number of clusters. Default is FALSE.
dunn2s	Boolean. If TRUE, the alternative Dunn's index is also computed to choose the best number of clusters. Default is FALSE.
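
The arguments `B`, `scale` and `ccClMethods` all concern the consensus clustering step that produces one similarity matrix per dataset. The sketch below illustrates the general idea of a consensus matrix in base R: cluster random subsamples B times and record how often each pair of observations ends up in the same cluster. The helper `consensusSketch()`, its `propSample` argument, and the use of `stats::kmeans()` on 80% subsamples are assumptions made for illustration; they are not the package's `consensusCluster()` and may differ from it in the details.

```r
# Hypothetical helper (not part of klic): proportion of subsamples in which
# two observations are clustered together, out of those in which both appear.
consensusSketch <- function(x, K, B = 50, propSample = 0.8) {
  N <- nrow(x)
  together <- matrix(0, N, N)  # times i and j share a cluster
  sampled  <- matrix(0, N, N)  # times i and j are drawn in the same subsample
  for (b in seq_len(B)) {
    idx <- sample(N, size = floor(propSample * N))
    cl  <- stats::kmeans(x[idx, , drop = FALSE], centers = K)$cluster
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  cm <- together / pmax(sampled, 1)  # avoid division by zero for unseen pairs
  diag(cm) <- 1
  cm
}

# Toy data: two groups of 20 observations in two dimensions
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Column-wise standardisation, analogous to scale = TRUE
cm <- consensusSketch(scale(x), K = 2, B = 20)
dim(cm)
```

Each entry of `cm` lies between 0 and 1, so the matrix can be read as a similarity between observations.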

Value

The function returns a list containing:

consensusMatrices

an array containing one consensus matrix per data set.

weights

a vector containing the weights assigned by the kernel k-means algorithm to each consensus matrix.

weightedKM

the weighted kernel matrix obtained by taking a weighted sum of all kernels, where the weights are those specified in the weights matrix.

globalClusterLabels

a vector containing the cluster labels of the observations, according to kernel k-means clustering done on the kernel matrices.

bestK

a vector containing the best number of clusters between 2 and individualMaxK for each kernel. These are chosen so as to maximise the silhouette and are only returned if the number of clusters individualK is not provided.

globalK

the best number of clusters for the final (global) clustering. This is chosen so as to maximise the silhouette and is only returned if the final number of clusters globalK is not provided.
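
To make the relationship between the kernels, the weights and the weighted kernel matrix more concrete, here is a minimal base R sketch of a weighted sum of kernels. It assumes one scalar weight per kernel and uses made-up linear kernels; the objects `kernels`, `w` and `combined` are illustrative names only. Because the function relies on localised kernel k-means, the package may well use observation-specific weights instead, so this should be read as a simplified picture rather than the package's computation.

```r
set.seed(1)
N <- 5  # observations
M <- 3  # datasets / kernels

# Toy kernels: one N x N linear kernel per dataset, stored as an N x N x M array
kernels <- array(0, dim = c(N, N, M))
for (m in seq_len(M)) {
  z <- matrix(rnorm(N * 2), N, 2)
  kernels[, , m] <- tcrossprod(z)  # z %*% t(z), symmetric by construction
}

# Assumed scalar weights, one per kernel, summing to one
w <- c(0.5, 0.3, 0.2)

# Weighted sum of the M kernels
combined <- matrix(0, N, N)
for (m in seq_len(M)) {
  combined <- combined + w[m] * kernels[, , m]
}
```

A kernel with a larger weight contributes more to `combined`, which is the sense in which the weights indicate how much each dataset contributes.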

References

Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.

Examples

if(requireNamespace("Rmosek", quietly = TRUE) &&
(!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))){

# Load synthetic data
data1 <- as.matrix(read.csv(system.file('extdata',
'dataset1.csv', package = 'klic'), row.names = 1))
data2 <- as.matrix(read.csv(system.file('extdata',
'dataset2.csv', package = 'klic'), row.names = 1))
data3 <- as.matrix(read.csv(system.file('extdata',
'dataset3.csv', package = 'klic'), row.names = 1))
data <- list(data1, data2, data3)

# Perform clustering with KLIC, assuming that the number of
# clusters in each individual dataset and in the final
# clustering is known
klicOutput <- klic(data, 3, individualK = c(4, 4, 4),
globalK = 4, B = 30, C = 5)

# Extract cluster labels
klic_labels <- klicOutput$globalClusterLabels

cluster_labels <- as.matrix(read.csv(system.file('extdata',
'cluster_labels.csv', package = 'klic'), row.names = 1))
# Compute ARI
ari <- mclust::adjustedRandIndex(klic_labels, cluster_labels)
}
#> All datasets contain the same number of observations  100 .
#> We assume that the observations are the same in each dataset and that they are in the same order.
#> [1] "*** Generating similarity matrices ***"
#>   |======================================================================| 100%
#> [1] "*** Finding the global clustering ***"
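
The example above fixes the number of clusters in advance. As a complementary sketch, the call below leaves `individualK` and `globalK` at their default of NULL so that, according to the argument descriptions, the numbers of clusters are chosen by maximising the silhouette. The values `individualMaxK = 4` and `globalMaxK = 4` are arbitrary choices to keep the search small, and the variables `hasKlic`, `hasMosek` and `klicAuto` are illustrative names; this variant has not been run here, so no output is shown.

```r
# Guard mirrors the example above; klic itself must also be installed.
hasKlic <- requireNamespace("klic", quietly = TRUE)
hasMosek <- requireNamespace("Rmosek", quietly = TRUE) &&
  !is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION)

if (hasKlic && hasMosek) {
  # Load the three synthetic datasets shipped with the package
  files <- c("dataset1.csv", "dataset2.csv", "dataset3.csv")
  data <- lapply(files, function(f) {
    as.matrix(read.csv(system.file("extdata", f, package = "klic"),
                       row.names = 1))
  })

  # individualK and globalK are left as NULL, so they are selected
  # between 2 and the respective maximum
  klicAuto <- klic::klic(data, 3, individualMaxK = 4, globalMaxK = 4,
                         B = 30, C = 5)

  print(klicAuto$bestK)    # chosen number of clusters per dataset
  print(klicAuto$globalK)  # chosen number of global clusters
  print(klicAuto$weights)  # contribution of each dataset
}
```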