DBSCAN¶
Densitybased Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through densitybased expansion of seed points. The algorithm is proposed by:
Martin Ester, Hanspeter Kriegel, Jörg S, and Xiaowei Xu A densitybased algorithm for discovering clusters in large spatial databases with noise. 1996.
Density Reachability
DBSCAN’s definition of cluster is based on the concept of density reachability: a point is said to be directly density reachable by another point if the distance between them is below a specified threshold and is surrounded by sufficiently many points. Then, is considered to be density reachable by if there exists a sequence such that and is directly density reachable from .
A cluster, which is a subset of the given set of points, satisfies two properties:
 All points within the cluster are mutually densityconnected, meaning that for any two distinct points and in a cluster, there exists a point sucht that both and are density reachable from .
 If a point is density connected to any point of a cluster, it is also part of the cluster.
Functions
There are two different implementations of DBSCAN algorithm called by dbscan
function in this package:
 Using a distance (adjacency) matrix and is O(N^2) in memory usage. Note that the boundary points are not unique.

dbscan
(D, eps, minpts) Perform DBSCAN algorithm based on a given distance matrix.
Parameters:  D – The pairwise distance matrix.
D[i,j]
is the distance between pointsi
andj
.  eps – The radius of a neighborhood.
 minpts – The minimum number of neighboring points (including self) to qualify a point as a density point.
The algorithm returns an instance of
DbscanResult
, defined as below:type DbscanResult <: ClusteringResult seeds::Vector{Int} # starting points of clusters, size (k,) assignments::Vector{Int} # assignments, size (n,) counts::Vector{Int} # number of points in each cluster, size (k,) end
 D – The pairwise distance matrix.
2. Using an adjacency list which is build on the fly. The performance is much better both in terms of runtime and memory usage. Also, the result is given in a DbscanCluster that provides the indices of all the core points and boundary points, such that boundary points can be associated with multiple clusters.

dbscan
(points, radius, leafsize=20, min_neighbors=1, min_cluster_size=1) Perform DBSCAN algorithm based on a collection of points.
Parameters:  points – matrix of points (column based)
 radius – The radius of a neighborhood.
 leafsize – number of points binned in each leaf node in the KDTree
 min_neighbors – minimum number of neighbors to be a core point
 min_cluster_size – minimum number of points to be a valid cluster
The algorithm returns an instance of
DbscanCluster
, defined as below: immutable DbscanCluster <: ClusteringResult
 size::Int # number of points in cluster core_indices::Vector{Int} # core points indices boundary_indices::Vector{Int} # boundary points indices
end