Clustering.jl provides functionality in three areas related to data clustering:
- Clustering initialization, e.g. K-means++ seeding.
- Clustering algorithms, e.g. K-means, K-medoids, affinity propagation, DBSCAN, etc.
- Clustering evaluation, e.g. silhouettes and variation of information.
A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:
- Sample matrix `X`, where each column `X[:,i]` corresponds to an observed sample.
- Distance matrix `D`, where `D[i,j]` indicates the distance between samples `i` and `j`, or the cost of assigning one to the other.
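As a sketch of the two input forms, assuming the Distances.jl package is used to compute the pairwise distance matrix:

```julia
using Distances  # assumed dependency for computing pairwise distances

# Sample matrix: 5 observations of dimension 3, one per column
X = rand(3, 5)

# Distance matrix: D[i,j] is the Euclidean distance between X[:,i] and X[:,j]
D = pairwise(Euclidean(), X, dims=2)

size(D)  # (5, 5); D is symmetric with a zero diagonal
```

Algorithms such as K-means consume the sample matrix `X` directly, while algorithms such as K-medoids or affinity propagation operate on the distance matrix `D`.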
Many clustering algorithms are iterative procedures. The following options are usually provided to let users control the behavior of the algorithm:
- Maximum number of iterations.
- Tolerable change of the objective at convergence. The algorithm is considered to have converged when the change of the objective value between consecutive iterations falls below the specified value.
- The level of information to be displayed. This should be a symbol, which may take one of the following values:
  - `:none`: nothing is shown
  - `:final`: only shows a brief summary when the algorithm ends
  - `:iter`: shows progress at each iteration
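For instance, K-means exposes these controls as keyword arguments (the names `maxiter`, `tol`, and `display` shown here follow the package's conventions):

```julia
using Clustering

X = rand(3, 100)  # 100 samples in 3 dimensions, one per column

# Run K-means into 4 clusters with explicit control options:
R = kmeans(X, 4;
           maxiter = 200,     # maximum number of iterations
           tol     = 1.0e-6,  # tolerable change of objective at convergence
           display = :final)  # print a brief summary when the algorithm ends
```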
A clustering algorithm would return a struct that captures both the clustering results (e.g. assignments of samples to clusters) and information about the clustering procedure (e.g. the number of iterations or whether the iterative update converged).
Generally, the resultant struct is defined as an instance of a sub-type of `ClusteringResult`. The following generic methods are implemented for these subtypes (let `R` be an instance):
- `nclusters(R)`: Get the number of clusters.
- `assignments(R)`: Get a vector of assignments. If `a = assignments(R)`, then `a[i]` is the index of the cluster to which the `i`-th sample is assigned.
- `counts(R)`: Get the sample counts of the clusters. If `c = counts(R)`, then `c[k]` is the number of samples assigned to the `k`-th cluster.
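Putting these accessors together on a K-means result (a minimal sketch; `kmeans` returns a sub-type of `ClusteringResult`, so all three generic methods apply):

```julia
using Clustering

X = rand(3, 100)   # 100 samples, one per column
R = kmeans(X, 4)   # cluster into 4 groups

k = nclusters(R)   # number of clusters
a = assignments(R) # a[i]: cluster index of the i-th sample
c = counts(R)      # c[k]: number of samples in the k-th cluster

@assert length(a) == 100
@assert sum(c) == 100
@assert all(1 .<= a .<= k)
```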