This function chooses the number of clusters in a dataset based on the area under the curve of the empirical distribution function of the consensus matrix, calculated for different (consecutive) numbers of clusters, as explained in Monti et al. (2003), Section 3.3.1.

chooseKusingAUC(areaUnderTheCurve, savePNG = FALSE, fileName = "deltaAUC.png")

Arguments

areaUnderTheCurve

Vector of length maxK-1 containing the area under the curve of the empirical distribution function of the consensus matrices obtained with K varying from 2 to maxK.

savePNG

Boolean. If TRUE, a plot of the area under the curve for each value of K is saved as a png file. The file is saved in a subdirectory of the working directory, called "delta-auc". Default is FALSE.

fileName

Name of the png file, used only if savePNG is TRUE. It can also include a folder path. Default is "deltaAUC". The ".png" extension is automatically added to this string.
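
The areaUnderTheCurve argument has to be computed beforehand, once for each value of K. As a rough illustration only, the sketch below computes one such value from a single consensus matrix, assuming the sum A = sum over i = 2..m of (x_i - x_{i-1}) * CDF(x_i) over the sorted upper-triangular entries x_1, ..., x_m, which is how the area is defined in Monti et al. (2003). The helper name aucOfConsensusMatrix and the toy matrix are hypothetical and not part of this package, which may compute the area differently.

```r
# Hypothetical helper (not part of the package): area under the empirical CDF
# of the entries of a consensus matrix, assuming the sum
# A = sum_{i=2}^{m} (x_i - x_{i-1}) * CDF(x_i) over the sorted entries.
aucOfConsensusMatrix <- function(consensusMatrix) {
    # Use each pair of observations once (upper triangle, no diagonal), sorted
    entries <- sort(consensusMatrix[upper.tri(consensusMatrix)])
    # Empirical CDF of those entries (ecdf is in the stats package)
    empiricalCDF <- ecdf(entries)
    # Sum of rectangle areas between consecutive sorted entries
    sum(diff(entries) * empiricalCDF(entries[-1]))
}

# Toy 3 x 3 consensus matrix with pairwise consensus values 0, 0.5 and 1
toyConsensus <- diag(3)
toyConsensus[1, 3] <- toyConsensus[3, 1] <- 0.5
toyConsensus[2, 3] <- toyConsensus[3, 2] <- 1
aucOfConsensusMatrix(toyConsensus)
```

Applying such a helper to the consensus matrices obtained for K = 2, ..., maxK would give a vector of length maxK-1 in the format expected by areaUnderTheCurve.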

Value

This function returns a list containing:

deltaAUC

a vector of length maxK-1 where element i is the area under the curve for K = i+1 minus the area under the curve for K = i (for i = 2 this is simply the area under the curve for K = i)

K

the lowest among the values of K that are chosen by the algorithm.
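
To make the deltaAUC element more concrete, here is a minimal sketch of one reading of the description above that is consistent with the stated length maxK-1: the first element is the area under the curve for K = 2, and each later element is the area for that K minus the area for the previous K. The helper name deltaAUCSketch is hypothetical, the indexing is an assumption, and the rule the function uses to pick K from these values is not reproduced here.

```r
# Hypothetical helper (not part of the package): differences between
# consecutive areas under the curve, under the assumed indexing
# delta[1] = AUC(K = 2) and delta[j] = AUC(K = j + 1) - AUC(K = j) for j >= 2.
deltaAUCSketch <- function(areaUnderTheCurve) {
    # First element is the area for K = 2; the rest are consecutive differences
    c(areaUnderTheCurve[1], diff(areaUnderTheCurve))
}

# Areas under the curve for K = 2, ..., 10 (same values as in the Examples)
areaUnderTheCurve <- c(0.05, 0.15, 0.4, 0.5, 0.55, 0.56, 0.57, 0.58, 0.59)
deltaAUCSketch(areaUnderTheCurve)
```

With these values the differences become small from K = 7 onwards, which is the kind of flattening that the criterion in Monti et al. (2003) looks for.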

References

Monti, S., Tamayo, P., Mesirov, J. and Golub, T., 2003. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1-2), pp.91-118.

Examples

# Assuming that we want to choose among any value of K (number of clusters)
# between 2 and 10 and that the area under the curve is as follows:
areaUnderTheCurve <- c(0.05, 0.15, 0.4, 0.5, 0.55, 0.56, 0.57, 0.58, 0.59)

# The optimal value of K can be chosen with:
K <- chooseKusingAUC(areaUnderTheCurve)$K