CLUSTERING DATA IN PYTHON
[code] Python (2.6), GPL

Previously, I demonstrated clustering approximate geographic points from OpenPaths to identify places of interest. Heres the code I used to accomplish that, which can be applied to any dataset, it doesnt have to be limited to two dimensions. The algorithm is known as agglomerative hierarchical clustering because it works by repeatedly grouping the closest two nodes together, starting with each data point as a node, and ending with the entire set in a single monster node. (“Closest” is defined by any distance metric; for many applications, it will be euclidean distance, but for geographic data, Im using the haversine distance formula.) Along the way, youve constructed a binary tree which represents the hierarchical relationships between the vectors in the set. The result is a kind of ad-hoc taxonomy, and is frequently used to hypothesize relatedness between images, documents, proteins, users, etc. (heres a nice diagram courtesy Razvan Musaloiu-E.)

To use agglomerative clustering as a classifier, however, what we really want is just a flat set of clusters. It’s akin to choosing the branches of the tree that best represent the natural divisions in the data. Cut too close to the trunk and the clusters will be too general; cut by the leaves and youve got too much noise. With geographic information from OpenPaths, at least, we have a heuristic to decide what is appropriate — the platform is only going to be accurate to a quarter-mile or so, so we can look for the largest branches that do not exceed that limit. The code provides a method, get_pruned, that takes such a parameter and returns a resulting set of clusters.

Clustering packages for python are out there, but I prefer touching the code and this is a dirt simple implementation that nonetheless should prove effective for most creative coding purposes.

→ 2012-01-10