Hierarchical Clustering¶

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

Functions

hclust(D; linkage=:single)¶

Perform hierarchical clustering on distance matrix D with specified cluster linkage function.

Parameters:	D – The pairwise distance matrix. `D[i,j]` is the distance between points `i` and `j`. linkage – A Symbol specifying how the distance between clusters (aka cluster linkage) is measured. It determines what clusters are merged on each iteration. Valid choices are:

:single: use the minimum distance between any of the members
:average: use the mean distance between any of the cluster’s members
:complete: use the maximum distance between any of the members.
:ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters.
:ward_presquared: same as :ward, but assumes that the distances in D are already squared.

The function returns an object of type Hclust with the fields

merges the sequence of subtree merges. Leafs are indicated by negative numbers, the ids of non-trivial subtrees refer to the rows in the merges matrix and the elements of the heights vector.
heights subtrees heights, i.e. the distances between left and right top branches of each subtree.
order indices of points ordered such that there are no intersecting branches on the dendrogram plot. This ordering brings points of the same cluster close together.
linkage the cluster linkage used.

Example:

D = rand(1000, 1000)
D += D'  # symmetric distance matrix (optional)
result = hclust(D, linkage=:single)

cutree(result; [k=nothing], [h=nothing])¶

Cuts the dendrogram to produce clusters at the specified level of granularity.

Parameters:	result – Object of type `Hclust` holding results of a call to `hclust()`. k – Integer specifying the number of desired clusters. h – Real specifying the height at which to cut the tree.

If both k and h are specified, it’s guaranteed that the number of clusters is ≥ k and their height ≤ h.

The output is a vector specifying the cluster index for each datapoint.