This function performs Kernel Learning Integrative Clustering (KLIC) on M data sets relating to the same observations. The similarities between the observations in each data set are summarised into M different kernels, which are then fed into a kernel k-means clustering algorithm. The output is a clustering of the observations that takes into account all the available data types, together with a set of weights that sum to one, indicating how much each data set contributed to the kernel k-means clustering.
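As a rough illustration of what "a set of weights that sum to one" means for combining kernels, the base-R sketch below forms a plain weighted sum of M toy kernel matrices. The toy kernels, the weights `w`, and all variable names are assumptions made for this example; the function learns its weights during clustering, and its internal combination step may differ from this simple sum.

```r
# Toy illustration only: M = 3 symmetric kernel matrices for N = 4
# observations, stored in an N x N x M array (all names are hypothetical).
N <- 4
M <- 3
set.seed(1)
kernels <- array(0, dim = c(N, N, M))
for (m in seq_len(M)) {
  X <- matrix(rnorm(N * 2), nrow = N)  # toy data set m
  kernels[, , m] <- tcrossprod(X)      # linear kernel X %*% t(X)
}

# Hypothetical weights: non-negative and summing to one.
w <- c(0.5, 0.3, 0.2)
stopifnot(all(w >= 0), isTRUE(all.equal(sum(w), 1)))

# Plain weighted sum of the M kernels.
combined <- matrix(0, N, N)
for (m in seq_len(M)) {
  combined <- combined + w[m] * kernels[, , m]
}
```

A larger weight means the corresponding kernel has more influence on the combined similarity between any two observations.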

klic(
  data,
  M,
  individualK = NULL,
  individualMaxK = 6,
  individualClAlgorithm = "kkmeans",
  globalK = NULL,
  globalMaxK = 6,
  B = 1000,
  C = 100,
  scale = FALSE,
  savePNG = FALSE,
  fileName = "klic",
  verbose = TRUE,
  annotations = NULL,
  ccClMethods = "kmeans",
  ccDistHCs = "euclidean",
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE
)

Arguments

data

List of M datasets, each of size N X P_m, m = 1, ..., M.

M

Number of datasets.

individualK

Vector containing the number of clusters in each dataset. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and individualMaxK are considered and the best value is chosen for each dataset by maximising the silhouette.

individualMaxK

Maximum number of clusters considered for the individual data. Default is 6.

individualClAlgorithm

Clustering algorithm used to cluster each dataset individually when the best number of clusters needs to be found. Default is "kkmeans".

globalK

Number of global clusters. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and globalMaxK are considered and the best value is chosen by maximising the silhouette.

globalMaxK

Maximum number of clusters considered for the final clustering. Default is 6.

B

Number of iterations for consensus clustering. Default is 1000.

C

Maximum number of iterations for localised kernel k-means. Default is 100.

scale

Boolean. If TRUE, each dataset is scaled such that each column has zero mean and unit variance. Default is FALSE.

savePNG

Boolean. If TRUE, a plot of the silhouette is saved in the working folder. Default is FALSE.

fileName

If savePNG is TRUE, this is the name of the png file. Can be used to specify the folder path too. Default is "klic".

verbose

Boolean. If TRUE, progress messages are printed. Default is TRUE.

annotations

Data frame containing annotations for final plot.

ccClMethods

The i-th element of this vector is passed to the clMethod argument of consensusCluster() for the i-th dataset. If only one string is provided, then the same method is used for all datasets. Default is "kmeans".

ccDistHCs

The i-th element of this vector is passed to the dist argument of consensusCluster() for the i-th dataset. Default is "euclidean".

widestGap

Boolean. If TRUE, the widest gap index is also computed to choose the best number of clusters. Default is FALSE.

dunns

Boolean. If TRUE, Dunn's index is also computed to choose the best number of clusters. Default is FALSE.

dunn2s

Boolean. If TRUE, the alternative Dunn's index is also computed to choose the best number of clusters. Default is FALSE.
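The input layout described for data, the column scaling described for scale, and the role of B can be sketched in base R as follows. This is only an illustration under stated assumptions: `consensus_sketch()` and its `prop` argument are hypothetical names invented here, the resampling scheme (k-means on random subsamples) is a generic one, and the package's own consensusCluster() may work differently.

```r
set.seed(42)

# Toy input in the layout described for `data`: a list of M matrices,
# each with the same N rows (observations) in the same order.
N <- 30
toy_data <- list(
  matrix(rnorm(N * 5), nrow = N),
  matrix(rnorm(N * 8), nrow = N)
)
stopifnot(length(unique(vapply(toy_data, nrow, integer(1)))) == 1)

# Column-wise standardisation, analogous to what scale = TRUE describes.
toy_scaled <- lapply(toy_data, scale)

# Hypothetical helper: co-clustering frequencies over B subsampled
# k-means runs, giving an N x N matrix with entries in [0, 1].
consensus_sketch <- function(X, K, B = 50, prop = 0.8) {
  n <- nrow(X)
  together <- matrix(0, n, n)  # times i and j fell in the same cluster
  sampled  <- matrix(0, n, n)  # times i and j were both subsampled
  for (b in seq_len(B)) {
    idx <- sample(n, size = floor(prop * n))
    labels <- kmeans(X[idx, , drop = FALSE], centers = K)$cluster
    same <- outer(labels, labels, "==")
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx] <- sampled[idx, idx] + 1
  }
  cm <- together / pmax(sampled, 1)  # pmax avoids division by zero
  diag(cm) <- 1
  cm
}

cm1 <- consensus_sketch(toy_scaled[[1]], K = 3)
```

Increasing B makes the co-clustering frequencies less noisy at the cost of more clustering runs, which is the usual trade-off behind a default as large as 1000.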

Value

The function returns a list containing:

consensusMatrices

an array containing one consensus matrix per data set.

weights

a vector containing the weights assigned by the kernel k-means algorithm to each consensus matrix.

weightedKM

the weighted kernel matrix obtained by taking a weighted sum of all kernels, where the weights are those given in the weights vector.

globalClusterLabels

a vector containing the cluster labels of the observations, according to kernel k-means clustering done on the kernel matrices.

bestK

a vector containing the best number of clusters between 2 and individualMaxK for each kernel. These are chosen so as to maximise the silhouette and are only returned if the number of clusters individualK is not provided.

globalK

the best number of clusters for the final (global) clustering. This is chosen so as to maximise the silhouette and only returned if the final number of clusters globalK is not provided.
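The rule "try every value from 2 up to a maximum and keep the one that maximises the silhouette" can be sketched in base R as below. This is a minimal illustration, not the package's code: `mean_silhouette()` is a hypothetical helper written for this example, and average-linkage hierarchical clustering stands in for the kernel k-means step.

```r
# Hypothetical helper: mean silhouette width for integer labels and a
# distance object or matrix.
mean_silhouette <- function(labels, D) {
  D <- as.matrix(D)
  n <- length(labels)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    if (sum(own) == 1) next               # singleton: silhouette taken as 0
    a <- sum(D[i, own]) / (sum(own) - 1)  # mean distance within own cluster
    others <- setdiff(unique(labels), labels[i])
    # Smallest mean distance to any other cluster.
    b <- min(vapply(others, function(k) mean(D[i, labels == k]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

set.seed(7)
# Toy data: two well-separated groups of 20 points each.
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
D <- dist(X)

max_k <- 6                 # plays the role of individualMaxK / globalMaxK
candidate_k <- 2:max_k
tree <- hclust(D, method = "average")
sil <- vapply(candidate_k, function(k) {
  mean_silhouette(cutree(tree, k = k), D)
}, numeric(1))

# Keep the candidate with the largest mean silhouette.
best_k <- candidate_k[which.max(sil)]
```

Supplying individualK or globalK directly skips this kind of search, which is why bestK and globalK are only returned when those arguments are left as NULL.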

References

Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.

Examples

if(requireNamespace("Rmosek", quietly = TRUE) &&
   (!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))){

  # Load synthetic data
  data1 <- as.matrix(read.csv(system.file('extdata', 'dataset1.csv',
                     package = 'klic'), row.names = 1))
  data2 <- as.matrix(read.csv(system.file('extdata', 'dataset2.csv',
                     package = 'klic'), row.names = 1))
  data3 <- as.matrix(read.csv(system.file('extdata', 'dataset3.csv',
                     package = 'klic'), row.names = 1))
  data <- list(data1, data2, data3)

  # Perform clustering with KLIC assuming to know the
  # number of clusters in each individual dataset and in
  # the final clustering
  klicOutput <- klic(data, 3, individualK = c(4, 4, 4),
                     globalK = 4, B = 30, C = 5)

  # Extract cluster labels
  klic_labels <- klicOutput$globalClusterLabels

  cluster_labels <- as.matrix(read.csv(system.file('extdata',
                              'cluster_labels.csv', package = 'klic'),
                              row.names = 1))

  # Compute ARI
  ari <- mclust::adjustedRandIndex(klic_labels, cluster_labels)
}
#> All datasets contain the same number of observations 100 .
#> We assume that the observations are the same in each dataset and that they are in the same order.
#> [1] "*** Generating similarity matrices ***"
#> | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
#> [1] "*** Finding the global clustering ***"