DBSCAN¶

Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm is proposed by:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.

Density Reachability

DBSCAN’s definition of cluster is based on the concept of density reachability: a point $q$ is said to be directly density reachable by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be density reachable by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$ .

A cluster, which is a subset of the given set of points, satisfies two properties:

All points within the cluster are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a cluster, there exists a point $o$ sucht that both $p$ and $q$ are density reachable from $o$ .
If a point is density connected to any point of a cluster, it is also part of the cluster.

Functions

There are two different implementations of DBSCAN algorithm called by dbscan function in this package:

Using a distance (adjacency) matrix and is O(N^2) in memory usage. Note that the boundary points are not unique.

dbscan(D, eps, minpts)

Perform DBSCAN algorithm based on a given distance matrix.

Parameters:	D – The pairwise distance matrix. `D[i,j]` is the distance between points `i` and `j`. eps – The radius of a neighborhood. minpts – The minimum number of neighboring points (including self) to qualify a point as a density point.

The algorithm returns an instance of DbscanResult, defined as below:

type DbscanResult <: ClusteringResult
    seeds::Vector{Int}          # starting points of clusters, size (k,)
    assignments::Vector{Int}    # assignments, size (n,)
    counts::Vector{Int}         # number of points in each cluster, size (k,)
end

2. Using an adjacency list which is build on the fly. The performance is much better both in terms of runtime and memory usage. Also, the result is given in a DbscanCluster that provides the indices of all the core points and boundary points, such that boundary points can be associated with multiple clusters.

dbscan(points, radius, leafsize=20, min_neighbors=1, min_cluster_size=1)

Perform DBSCAN algorithm based on a collection of points.

Parameters:	points – matrix of points (column based) radius – The radius of a neighborhood. leafsize – number of points binned in each leaf node in the KDTree min_neighbors – minimum number of neighbors to be a core point min_cluster_size – minimum number of points to be a valid cluster

The algorithm returns an instance of DbscanCluster, defined as below:

immutable DbscanCluster <: ClusteringResult: size::Int # number of points in cluster core_indices::Vector{Int} # core points indices boundary_indices::Vector{Int} # boundary points indices

end