CLUSTERING DATA IN PYTHON
[code] Python (2.6), GPL
Previously, I demonstrated clustering approximate geographic points from OpenPaths to identify places of interest. Heres the code I used to accomplish that, which can be applied to any dataset, it doesnt have to be limited to two dimensions. The algorithm is known as agglomerative hierarchical clustering because it works by repeatedly grouping the closest two nodes together, starting with each data point as a node, and ending with the entire set in a single monster node. (“Closest” is defined by any distance metric; for many applications, it will be euclidean distance, but for geographic data, Im using the haversine distance formula.) Along the way, youve constructed a binary tree which represents the hierarchical relationships between the vectors in the set. The result is a kind of ad-hoc taxonomy, and is frequently used to hypothesize relatedness between images, documents, proteins, users, etc. (heres a nice diagram courtesy Razvan Musaloiu-E.)
To use agglomerative clustering as a classifier, however, what we really want is just a flat set of clusters. It’s akin to choosing the branches of the tree that best represent the natural divisions in the data. Cut too close to the trunk and the clusters will be too general; cut by the leaves and youve got too much noise. With geographic information from OpenPaths, at least, we have a heuristic to decide what is appropriate — the platform is only going to be accurate to a quarter-mile or so, so we can look for the largest branches that do not exceed that limit. The code provides a method, get_pruned, that takes such a parameter and returns a resulting set of clusters.
Clustering packages for python are out there, but I prefer touching the code and this is a dirt simple implementation that nonetheless should prove effective for most creative coding purposes.
NYC VERTICES:
In several recent posts, I’ve talked about experiments with personal geographic data collected via OpenPaths. In those examples, location is treated in absolute terms, latitude and longitude.
However, I am working toward something here. Most of my past work has been concerned with the relative qualities of place, the psychogeography that isn’t necessarily keyable to coordinates (see our article in Urban Omnibus). Presently, I’m developing some analytics to try and bridge that gap.
The first order of business is to begin thinking about location in terms of place. Place is a concept that is relative to the context of the individual — but using geodata we can at least identify significant patterns that suggest loci of activity.
Starting from my path data, I used a clustering algorithm (I’ll post the code next) to construct a network graph of my arrivals and departures around the city. What you see here are the locations of all of my “significant” places around the city, over the last 6 months or so. NYT Labs is the big green point up top — home is purple toward the bottom. The lines show the strength of the connections between them (eg, from home I’m most likely to go to the lab).
The conceptual shift here, and I think it’s an important one, is to begin to treat location as behavior. More to come.
(drawn with python)





