Hierarchical Clustering
Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.
Functions
hclust(D; linkage=:single)
Perform hierarchical clustering on the distance matrix D using the specified cluster linkage function.
Parameters:
- D – The pairwise distance matrix. D[i,j] is the distance between points i and j.
- linkage – A Symbol specifying how the distance between clusters (the cluster linkage) is measured. It determines which clusters are merged on each iteration. Valid choices are:
  - :single: use the minimum distance between any members of the two clusters.
  - :average: use the mean distance between the members of the two clusters.
  - :complete: use the maximum distance between any members of the two clusters.
  - :ward: the distance is the increase in the average squared distance of a point to its cluster centroid after merging the two clusters.
  - :ward_presquared: same as :ward, but assumes that the distances in D are already squared.
The function returns an object of type Hclust with the fields:
- merges – the sequence of subtree merges. Leaves are indicated by negative numbers; the ids of non-trivial subtrees refer to the rows of the merges matrix and the elements of the heights vector.
- heights – subtree heights, i.e. the distances between the left and right top branches of each subtree.
- order – indices of points ordered such that there are no intersecting branches on the dendrogram plot. This ordering brings points of the same cluster close together.
- linkage – the cluster linkage used.
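To make the :single, :complete and :average rules concrete, the sketch below computes the three cluster-to-cluster distances by hand for one pair of clusters. It does not call hclust; the toy matrix and the cluster memberships A and B are made up for illustration, and only the Statistics standard library is assumed.

```julia
using Statistics: mean   # standard library

# Toy symmetric distance matrix for four points (illustrative values)
D = [0.0 1.0 4.0 5.0;
     1.0 0.0 3.0 6.0;
     4.0 3.0 0.0 2.0;
     5.0 6.0 2.0 0.0]

# Two hypothetical clusters, given as vectors of point indices
A = [1, 2]
B = [3, 4]

# Distances between every member of A and every member of B
between = D[A, B]

single_dist   = minimum(between)   # :single   -> 3.0
complete_dist = maximum(between)   # :complete -> 6.0
average_dist  = mean(between)      # :average  -> 4.5

println((single_dist, complete_dist, average_dist))
```

The same pair of clusters is therefore 3.0, 4.5 or 6.0 apart depending on the linkage, which is why the choice of linkage changes which clusters get merged first.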
Example:
    using Clustering

    D = rand(1000, 1000)
    D += D'  # symmetric distance matrix (optional)
    result = hclust(D, linkage=:single)
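On a small input the returned fields are easier to read than on a 1000×1000 matrix. The following is a minimal sketch, assuming the Clustering package is installed and that the Hclust fields are named as described above; the four-point distance matrix is invented for illustration.

```julia
using Clustering

# Toy symmetric distance matrix for four points (illustrative values)
D = [0.0 1.0 4.0 5.0;
     1.0 0.0 3.0 6.0;
     4.0 3.0 0.0 2.0;
     5.0 6.0 2.0 0.0]

result = hclust(D, linkage=:single)

# One row per merge; negative entries are leaves (original points),
# positive entries refer to earlier rows of this matrix.
println(result.merges)

# Height of each merge, in the same order as the rows of `merges`
println(result.heights)

# Leaf ordering for drawing the dendrogram without crossing branches
println(result.order)

# The linkage that was used
println(result.linkage)
```

With n points there are n - 1 merges, so here merges has three rows and heights has three elements.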
cutree(result; [k=nothing], [h=nothing])
Cuts the dendrogram to produce clusters at the specified level of granularity.
Parameters:
- result – Object of type Hclust holding the results of a call to hclust().
- k – Integer specifying the number of desired clusters.
- h – Real specifying the height at which to cut the tree.
If both k and h are specified, it is guaranteed that the number of clusters is at least k and that the height of every cluster is at most h.
The output is a vector specifying the cluster index for each datapoint.
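The three ways of calling cutree can be compared on a small tree. This is a sketch under the same assumptions as the documented signature (Clustering package installed, keyword arguments k and h); the distance matrix is a made-up example in which points 1 and 2 are close, points 3 and 4 are close, and the two pairs are far apart.

```julia
using Clustering

# Toy symmetric distance matrix for four points (illustrative values)
D = [0.0 1.0 4.0 5.0;
     1.0 0.0 3.0 6.0;
     4.0 3.0 0.0 2.0;
     5.0 6.0 2.0 0.0]

result = hclust(D, linkage=:single)

# Ask for a fixed number of clusters
by_count = cutree(result, k=2)

# Cut at a fixed height: only merges at or below 1.5 are kept
by_height = cutree(result, h=1.5)

# Both constraints: at least k clusters, none taller than h
by_both = cutree(result, k=2, h=1.5)

println(by_count)
println(by_height)
println(by_both)
```

Each result has one entry per data point. Cutting at h=1.5 keeps only the merge of points 1 and 2, so it yields three clusters, which also satisfies the k=2 lower bound in the combined call.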