This function performs Kernel Learning Integrative Clustering (KLIC) on M data sets relating to the same observations. The similarities between the observations in each data set are summarised in M kernels, which are then fed into a kernel k-means clustering algorithm. The output is a clustering of the observations that takes all the available data types into account, together with a set of weights that sum to one and indicate how much each data set contributed to the kernel k-means clustering.

klic(
data,
M,
individualK = NULL,
individualMaxK = 6,
individualClAlgorithm = "kkmeans",
globalK = NULL,
globalMaxK = 6,
B = 1000,
C = 100,
scale = FALSE,
savePNG = FALSE,
fileName = "klic",
verbose = TRUE,
annotations = NULL,
ccClMethods = "kmeans",
ccDistHCs = "euclidean",
widestGap = FALSE,
dunns = FALSE,
dunn2s = FALSE
)

## Arguments

data

List of M datasets, each of size N x P_m, m = 1, ..., M.

M

Number of datasets.

individualK

Vector containing the number of clusters in each dataset. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and individualMaxK are considered and the best value is chosen for each dataset by maximising the silhouette.

individualMaxK

Maximum number of clusters considered for the individual data. Default is 6.

individualClAlgorithm

Clustering algorithm used to cluster each dataset individually when the best number of clusters has to be found. Default is "kkmeans".

globalK

Number of global clusters. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and globalMaxK are considered and the best value is chosen by maximising the silhouette.

globalMaxK

Maximum number of clusters considered for the final clustering. Default is 6.

B

Number of iterations for consensus clustering. Default is 1000.

C

Maximum number of iterations for localised kernel k-means. Default is 100.

scale

Boolean. If TRUE, each dataset is scaled such that each column has zero mean and unit variance. Default is FALSE.

savePNG

Boolean. If TRUE, a plot of the silhouette is saved in the working folder. Default is FALSE.

fileName

If savePNG is TRUE, this is the name of the png file. Can be used to specify the folder path too. Default is "klic".

verbose

Boolean. Default is TRUE.

annotations

Data frame containing annotations for final plot. Default is NULL.

ccClMethods

The i-th element of this vector goes into the clMethod argument of consensusCluster() for the i-th dataset. If only one string is provided, then the same method is used for all datasets. Default is "kmeans".

ccDistHCs

The i-th element of this vector goes into the dist argument of consensusCluster() for the i-th dataset. Default is "euclidean".

widestGap

Boolean. If TRUE, compute also widest gap index to choose best number of clusters. Default is FALSE.

dunns

Boolean. If TRUE, compute also Dunn's index to choose best number of clusters. Default is FALSE.

dunn2s

Boolean. If TRUE, compute also alternative Dunn's index to choose best number of clusters. Default is FALSE.
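When individualK or globalK is left as NULL, the number of clusters is chosen by maximising the silhouette over the values from 2 to the corresponding maximum. The sketch below illustrates that selection rule only. It uses plain k-means on a Euclidean distance matrix rather than the kernel k-means and consensus matrices used by klic(), and the helpers mean_silhouette() and choose_k() are hypothetical names introduced for this illustration, not part of the package.

```r
# Mean silhouette width of a hard clustering, computed from a distance matrix.
mean_silhouette <- function(d, labels) {
  d <- as.matrix(d)
  n <- nrow(d)
  clusters <- unique(labels)
  widths <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    if (sum(own) == 1) {
      widths[i] <- 0  # usual convention for singleton clusters
      next
    }
    # a: mean distance to the other members of the same cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(vapply(setdiff(clusters, labels[i]),
                    function(k) mean(d[i, labels == k]), numeric(1)))
    widths[i] <- (b - a) / max(a, b)
  }
  mean(widths)
}

# Try every k in 2..max_k and keep the one with the largest mean silhouette.
# standardise = TRUE mimics scale = TRUE (zero mean, unit variance per column).
choose_k <- function(x, max_k = 6, standardise = FALSE) {
  if (standardise) x <- scale(x)
  d <- dist(x)
  ks <- 2:max_k
  sil <- vapply(ks, function(k) {
    fit <- kmeans(x, centers = k, nstart = 10)
    mean_silhouette(d, fit$cluster)
  }, numeric(1))
  list(bestK = ks[which.max(sil)], silhouette = setNames(sil, ks))
}

set.seed(1)
# Toy data: three well-separated groups of 20 points in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)
result <- choose_k(toy, max_k = 6)
result$bestK
```

The silhouette element of the result holds the score for every candidate k, which can be useful for checking how clear-cut the choice is.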

## Value

The function returns a list containing:

consensusMatrices

an array containing one consensus matrix per data set.

weights

a vector containing the weights assigned by the kernel k-means algorithm to each consensus matrix.

weightedKM

the weighted kernel matrix obtained by taking a weighted sum of all kernels, where the weights are those given in the weights element.

globalClusterLabels

a vector containing the cluster labels of the observations, according to kernel k-means clustering done on the kernel matrices.

bestK

a vector containing the best number of clusters between 2 and individualMaxK for each kernel. These are chosen so as to maximise the silhouette and are only returned if the number of clusters individualK is not provided.

globalK

the best number of clusters for the final (global) clustering. This is chosen so as to maximise the silhouette and only returned if the final number of clusters globalK is not provided.
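To make the relationship between consensusMatrices, weights and weightedKM more concrete, here is a minimal base R sketch. It builds one consensus matrix per toy dataset by repeatedly clustering random subsamples, then combines them with weights that sum to one. Everything in it is an assumption made for illustration: consensus_matrix() is a hypothetical helper, the subsampling proportion and plain k-means are arbitrary choices, and the weights are fixed by hand, whereas klic() learns them and may combine the kernels differently internally.

```r
# Consensus matrix: proportion of resampling runs in which two observations,
# both drawn in that run, were assigned to the same cluster.
consensus_matrix <- function(x, k, B = 50, prop = 0.8) {
  n <- nrow(x)
  together <- matrix(0, n, n)  # times i and j were clustered together
  sampled  <- matrix(0, n, n)  # times i and j were both sampled
  for (b in seq_len(B)) {
    idx <- sample(n, size = floor(prop * n))
    cl <- kmeans(x[idx, , drop = FALSE], centers = k)$cluster
    sampled[idx, idx] <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  cm <- together / pmax(sampled, 1)  # avoid 0/0 for pairs never co-sampled
  diag(cm) <- 1
  cm
}

set.seed(42)
truth <- rep(1:2, each = 15)
# Dataset 1 separates the two groups; dataset 2 is pure noise.
data1 <- matrix(rnorm(60), ncol = 2) + 6 * (truth - 1)
data2 <- matrix(rnorm(60), ncol = 2)

cms <- list(consensus_matrix(data1, k = 2), consensus_matrix(data2, k = 2))

# Hypothetical weights, fixed by hand for this sketch; they must sum to one.
w <- c(0.8, 0.2)
stopifnot(abs(sum(w) - 1) < 1e-12)

# Weighted sum of the consensus matrices
weighted_km <- Reduce(`+`, Map(`*`, w, cms))
dim(weighted_km)
```

Giving most of the weight to the informative dataset keeps the block structure of its consensus matrix visible in the combined matrix, which is the intuition behind reporting the weights alongside the clustering.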

## References

Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.

## Examples

if (requireNamespace("Rmosek", quietly = TRUE) &&
    !is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION)) {

  # Load synthetic data
  data1 <- as.matrix(read.csv(system.file('extdata', 'dataset1.csv', package = 'klic'), row.names = 1))
  data2 <- as.matrix(read.csv(system.file('extdata', 'dataset2.csv', package = 'klic'), row.names = 1))
  data3 <- as.matrix(read.csv(system.file('extdata', 'dataset3.csv', package = 'klic'), row.names = 1))
  data <- list(data1, data2, data3)

  # Perform clustering with KLIC, assuming that the number of clusters
  # in each individual dataset and in the final clustering is known
  klicOutput <- klic(data, 3, individualK = c(4, 4, 4), globalK = 4, B = 30, C = 5)

  # Extract cluster labels
  klic_labels <- klicOutput$globalClusterLabels

  # Load the reference cluster labels (variable name cluster_labels is assumed)
  cluster_labels <- as.matrix(read.csv(system.file('extdata', 'cluster_labels.csv', package = 'klic'), row.names = 1))

  # Compute ARI (assumes the mclust package; skipped if it is not installed)
  if (requireNamespace("mclust", quietly = TRUE)) {
    ari <- mclust::adjustedRandIndex(klic_labels, cluster_labels)
  }
}
#> [1] "*** Finding the global clustering ***"
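The ARI mentioned in the example is the adjusted Rand index, which compares two partitions of the same observations and equals 1 when they agree up to relabelling. For readers who want to see what is being computed, the following base R sketch implements the standard formula from the contingency table of the two label vectors. The function name adjusted_rand_index() is hypothetical and is not part of klic.

```r
# Adjusted Rand index between two label vectors of equal length.
adjusted_rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)
  pairs <- function(x) sum(choose(as.vector(x), 2))
  index     <- pairs(tab)            # pairs placed together in both partitions
  row_pairs <- pairs(rowSums(tab))   # pairs placed together in partition a
  col_pairs <- pairs(colSums(tab))   # pairs placed together in partition b
  expected  <- row_pairs * col_pairs / choose(length(a), 2)
  max_index <- (row_pairs + col_pairs) / 2
  (index - expected) / (max_index - expected)
}

# Same partition with different label names
same <- adjusted_rand_index(c(1, 1, 2, 2, 3, 3), c(2, 2, 3, 3, 1, 1))

# Two partitions that cut across each other
crossed <- adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Note that the index is undefined when both partitions put all observations in a single cluster, because the denominator is then zero.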