This implementation and notebook is inspired from the original DBSCAN algorithm and article as given in DBSCAN Wikipedia.
Stands for Density-based spatial clustering of applications with noise .
DBSCAN is clustering algorithm that tries to captures the intuition that if two points belong to the same cluster they should be close to one another. It does so by finding regions that are densely packed together, i.e, the points that have many close neighbours.
This algorithm is way better than other clustering algorithms such as k-means whose only job is to find circular blobs. It is smart enough to figure out the number of clusters in the dataset on its own, unlike k-means where you need to specify 'k'. It can also find clusters of arbitrary shapes, not just circular blobs. Its too robust to be affected by outliers (the noise points) and isn't fooled by them, unlike k-means where the entire centroid get pulled thanks to pesky outliers. Plus, you can fine-tune its parameters depending on what you are clustering.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
We will take the moons dataset which is pretty good at showing the power of DBSCAN.
Lets generate 200 random points in the shape of two moons
from sklearn.datasets import make_moons
x, label = make_moons(n_samples=200, noise=0.1, random_state=19)
You will observe that the points are in the shape of two crescent moons.
The challenge here is to cluster the two moons.
plt.plot(x[:,0], x[:,1],'ro')
[<matplotlib.lines.Line2D at 0x11a00e588>]
The DBSCAN algorithm can be abstracted into the following steps:
Initially we label all the points in the dataset as undefined .
points is our database of all points in the dataset.
points = { (point[0],point[1]):{'label':'undefined'} for point in x }
def euclidean_distance(q, p):
"""
Calculates the Euclidean distance
between points P and Q
"""
a = pow((q[0] - p[0]), 2)
b = pow((q[1] - p[1]), 2)
return pow((a + b), 0.5)
def find_neighbors(db, q, eps):
"""
Finds all points in the DB that
are within a distance of eps from Q
"""
return [p for p in db if euclidean_distance(q, p) <= eps]
def plot_cluster(db, clusters):
"""
Extracts all the points in the DB and puts them together
as seperate clusters and finally plots them
"""
temp = []
noise = []
for i in clusters:
stack = []
for k, v in db.items():
if v["label"] == i:
stack.append(k)
elif v["label"] == "noise":
noise.append(k)
temp.append(stack)
color = iter(plt.cm.rainbow(np.linspace(0, 1, len(clusters))))
for i in range(0, len(temp)):
c = next(color)
x = [l[0] for l in temp[i]]
y = [l[1] for l in temp[i]]
plt.plot(x, y, "ro", c=c)
x = [l[0] for l in noise]
y = [l[1] for l in noise]
plt.plot(x, y, "ro", c="0")
Initialize an empty list, clusters = $[ ]$ and cluster identifier, c = 0
For each point p in our database/dict db :
1.1 Check if p is already labelled. If it's already labelled (means it already been associated to a cluster), continue to the next point,i.e, go to step 1
1.2. Find the list of neighbors of p , i.e, points that are within a distance of eps from p
1.3. If p does not have atleast min_pts neighbours, we label it as noise and go back to step 1
1.4. Initialize the cluster, by incrementing c by 1
1.5. Append the cluster identifier c to clusters
1.6. Label p with the cluster identifier c
1.7 Remove p from the list of neighbors (p will be detected as its own neighbor because it is within eps of itself)
1.8. Initialize the seed_set as a copy of neighbors
1.9. While the seed_set is not empty:
1.9.1. Removing the 1st point from seed_set and initialise it as q
1.9.2. If it's label is noise, label it with c
1.9.3. If it's not unlabelled, go back to step 1.9
1.9.4. Label q with c
1.9.5. Find the neighbours of q
1.9.6. If there are atleast min_pts neighbors, append them to the seed_set
def dbscan(db,eps,min_pts):
'''
Implementation of the DBSCAN algorithm
'''
clusters = []
c = 0
for p in db:
if db[p]["label"] != "undefined":
continue
neighbors = find_neighbors(db, p, eps)
if len(neighbors) < min_pts:
db[p]["label"] = "noise"
continue
c += 1
clusters.append(c)
db[p]["label"] = c
neighbors.remove(p)
seed_set = neighbors.copy()
while seed_set != []:
q = seed_set.pop(0)
if db[q]["label"] == "noise":
db[q]["label"] = c
if db[q]["label"] != "undefined":
continue
db[q]["label"] = c
neighbors_n = find_neighbors(db, q, eps)
if len(neighbors_n) >= min_pts:
seed_set = seed_set + neighbors_n
return db, clusters
eps = 0.25
min_pts = 12
db,clusters = dbscan(points,eps,min_pts)
plot_cluster(db,clusters)
I encourage you to try with different datasets and playing with the values of eps and min_pts.
Also, try kmeans on this dataset and see how it compares to dbscan.
I hope by now you are convinced about about how cool dbscan is. But it has its pitfalls.