Struct cogset::Dbscan
[-] [+]
[src]
pub struct Dbscan<P: RegionQuery + ListPoints> where P: Hash + Eq + Clone {
// some fields omitted
}
Clustering via the DBSCAN algorithm[1].
[DBSCAN] is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).wikipedia
An instance of Dbscan
is an iterator over clusters of
P
. Points classified as noise once all clusters are found are
available via noise_points
.
This uses the P::Point
yielded by the iterators provided by
ListPoints
and RegionQuery
as a unique identifier for each
point. The algorithm will behave strangely if the identifier is
not unique or not stable within a given execution of DBSCAN. The
identifier is cloned several times in the course of execution, so
it should be cheap to duplicate (e.g. a usize
index, or a &T
reference).
[1]: Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M., eds. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231.
Examples
A basic example:
use cogset::{Dbscan, BruteScan, Euclid}; let points = [Euclid([0.1]), Euclid([0.2]), Euclid([1.0])]; let scanner = BruteScan::new(&points); let mut dbscan = Dbscan::new(scanner, 0.2, 2); // get the clusters themselves let clusters = dbscan.by_ref().collect::<Vec<_>>(); // the first two points are the only cluster assert_eq!(clusters, &[&[0, 1]]); // now the noise let noise = dbscan.noise_points(); // which is just the last point assert_eq!(noise.iter().cloned().collect::<Vec<_>>(), &[2]);
A more complicated example that renders the output nicely:
use std::str; use cogset::{Dbscan, BruteScan, Euclid}; fn write_points<I>(output: &mut [u8; 76], byte: u8, it: I) where I: Iterator<Item = Euclid<[f64; 1]>> { for p in it { output[(p.0[0] * 30.0) as usize] = byte; } } // the points we're going to cluster, considered as points in ℝ // with the conventional distance. let points = [Euclid([0.25]), Euclid([0.9]), Euclid([2.0]), Euclid([1.2]), Euclid([1.9]), Euclid([1.1]), Euclid([1.35]), Euclid([1.85]), Euclid([1.05]), Euclid([0.1]), Euclid([2.5]), Euclid([0.05]), Euclid([0.6]), Euclid([0.55]), Euclid([1.6])]; // print the points before clustering let mut original = [b' '; 76]; write_points(&mut original, b'x', points.iter().cloned()); println!("{}", str::from_utf8(&original).unwrap()); // set-up the data structure that will manage the queries that // Dbscan needs to do. let scanner = BruteScan::new(&points); // create the clusterer: we need 3 points to consider a group a // cluster, and we're only looking at points 0.2 units apart. let min_points = 3; let epsilon = 0.2; let mut dbscan = Dbscan::new(scanner, epsilon, min_points); let mut clustered = [b' '; 76]; // run over all the clusters, writing each to the output for (i, cluster) in dbscan.by_ref().enumerate() { // since we used `BruteScan`, `cluster` is a vector of indices // into `points`, not the points themselves, so lets map back // to the points. let actual_points = cluster.iter().map(|idx| points[*idx]); write_points(&mut clustered, b'0' + i as u8, actual_points) } // now run over the noise points, i.e. points that aren't close // enough to others to be in a cluster. let noise = dbscan.noise_points(); write_points(&mut clustered, b'.', noise.iter().map(|idx| points[*idx])); // print the numbered clusters println!("{}", str::from_utf8(&clustered).unwrap());
Output:
x x x x x x x x x x x x x x x
0 0 0 . . 2 2 2 2 2 . 1 1 1 .
Methods
impl<P: RegionQuery + ListPoints> Dbscan<P> where P: Hash + Eq + Clone
fn new(points: P, eps: f64, min_points: usize) -> Dbscan<P>
Create a new DBSCAN instance, with the given eps
and
min_points
.
eps
is the maximum distance between points when creating
neighbours to construct clusters. min_points
is the minimum
of points for a cluster.
This does not perform any significant computation immediately;
clusters are found on the fly via the Iterator
instance.
fn noise_points(&self) -> &HashSet<P>
Points that have been classified as noise once the algorithm finishes.
This only makes sense to call once the iterator is exhausted, and will give unspecified nonsense if called earlier.