Clustering Initialization¶
A clustering algorithm usually relies on an initialization scheme to bootstrap the clustering procedure.
Seeding¶
Seeding is an important family of methods for clustering initialization, which generally refers to an procedure to select a few seeds from a data set, each serving as the initial center of a cluster.
Seeding functions¶
The packages provide two functions initseeds and initseeds_by_costs for seeding:
-
initseeds(algname, X, k)¶ Select
kseeds from a given sample matrixX.It returns an integer vector of length
kthat contains the indexes of chosen seeds.Here,
algnameindicates the seeding algorithm, which should be a symbol that may take either of the following values:algname description :randRandomly select a subset as seeds :kmppKmeans++ algorithm, i.e. choose seeds sequentially, the probability of an sample to be chosen is proportional to the minimum cost of assigning it to existing seeds.
Reference:
D. Arthur and S. Vassilvitskii (2007). K-means++: the Advantages of Careful Seeding. 18th Annual ACM-SIAM symposium on Discrete algorithms, 2007.
:kmcenChoose the ksamples with highest centrality as seeds.
-
initseeds_by_costs(algname, C, k)¶ Select
kseeds based on a cost matrixC.Here,
C[i,j]is the cost of binding samplesiandjto the same cluster. One may, for example, use the squared Euclidean distance between samples as the costs.The argument
algnamedetermines the choice of algorithm (see above).
In practice, we found that Kmeans++ is the most effective method for initial seeding. Thus, we provide specific functions to simply the use of Kmeans++ seeding:
-
kmpp(X, k)¶ Use Kmeans++ to choose
kseeds from a data set given by a sample matrixX.
-
kmpp_by_costs(C, k)¶ Use Kmeans++ to choose
kseeds based on a cost matrixC.
Internals¶
In this package, each seeding algorithm is represented by a sub-type of SeedingAlgorithm. Particularly, the random selection algorithm, Kmean++, and centrality-based algorithm are respectively represented by sub-types RandSeedAlg, KmppAlg, and KmCentralityAlg.
For each sub type, the following methods are implemented:
-
initseeds!(iseeds, alg, X) Select seeds from a given sample matrix
X, and write the results toiseeds.Parameters: - iseeds – An pre-allocated array to store the indexes of the chosen seeds.
- alg – The algorithm instance.
- X – The given data matrix. Each column of
Xis a sample.
Returns: iseeds
-
initseeds_by_costs!(iseeds, alg, C) Select seeds based on a given cost matrix
C, and write the results toiseeds.Parameters: - iseeds – An pre-allocated array to store the indexes of the chosen seeds.
- alg – The algorithm instance.
- C – The cost matrix. The value of
C[i,j]is the cost of binding samplesiandjinto the same cluster.
Returns: iseeds
Note: For both functions above, the length of iseeds determines the number of seeds to be selected.
To define a new seeding algorithm, one has to first define a sub type of SeedingAlgorithm and implement the two functions above.