CLUSTERING DATA IN PYTHON
[code] Python (2.6), GPL
Previously, I demonstrated clustering approximate geographic points from OpenPaths to identify places of interest. Heres the code I used to accomplish that, which can be applied to any dataset, it doesnt have to be limited to two dimensions. The algorithm is known as agglomerative hierarchical clustering because it works by repeatedly grouping the closest two nodes together, starting with each data point as a node, and ending with the entire set in a single monster node. (“Closest” is defined by any distance metric; for many applications, it will be euclidean distance, but for geographic data, Im using the haversine distance formula.) Along the way, youve constructed a binary tree which represents the hierarchical relationships between the vectors in the set. The result is a kind of ad-hoc taxonomy, and is frequently used to hypothesize relatedness between images, documents, proteins, users, etc. (heres a nice diagram courtesy Razvan Musaloiu-E.)
To use agglomerative clustering as a classifier, however, what we really want is just a flat set of clusters. It’s akin to choosing the branches of the tree that best represent the natural divisions in the data. Cut too close to the trunk and the clusters will be too general; cut by the leaves and youve got too much noise. With geographic information from OpenPaths, at least, we have a heuristic to decide what is appropriate — the platform is only going to be accurate to a quarter-mile or so, so we can look for the largest branches that do not exceed that limit. The code provides a method, get_pruned, that takes such a parameter and returns a resulting set of clusters.
Clustering packages for python are out there, but I prefer touching the code and this is a dirt simple implementation that nonetheless should prove effective for most creative coding purposes.
NYC VERTICES:
In several recent posts, I’ve talked about experiments with personal geographic data collected via OpenPaths. In those examples, location is treated in absolute terms, latitude and longitude.
However, I am working toward something here. Most of my past work has been concerned with the relative qualities of place, the psychogeography that isn’t necessarily keyable to coordinates (see our article in Urban Omnibus). Presently, I’m developing some analytics to try and bridge that gap.
The first order of business is to begin thinking about location in terms of place. Place is a concept that is relative to the context of the individual — but using geodata we can at least identify significant patterns that suggest loci of activity.
Starting from my path data, I used a clustering algorithm (I’ll post the code next) to construct a network graph of my arrivals and departures around the city. What you see here are the locations of all of my “significant” places around the city, over the last 6 months or so. NYT Labs is the big green point up top — home is purple toward the bottom. The lines show the strength of the connections between them (eg, from home I’m most likely to go to the lab).
The conceptual shift here, and I think it’s an important one, is to begin to treat location as behavior. More to come.
(drawn with python)
One thing I wanted to try with OpenPaths data is to reconstruct the paths I have taken. I’m using the as yet unreleased iPhone app, which records a point every time there is a “significant” location change, as determined by the iOS API (this runs in the background without much battery drain — the reason we aren’t using continuous GPS is that such an app would quickly burn through it).
This image shows all of my points around the city in a given time period (without a base map, to preserve some privacy, and it’s kind of more interesting that way). The lines are determined by grouping series of points that are within 10 minutes of each other and inferring the start point.
Additionally, I estimated the maximum speed of each path by looking at the fastest few segments. By clustering the speeds, using the k-means algorithm, the paths are classified by mode of transportation. Red is car or Amtrak, purple is bike (my primary mode), and green is walking.
The arc of my frequent rides between Brooklyn and midtown are pretty clear, and in general you get a sense of the areas that I habitually cover with the different modes. Note Prospect Park at the bottom. One thing that’s a significant omission is the bulk of my subway travel — any ideas on how I could infer this would be appreciated. And I certainly walk around more than is reflected here, but the location changes aren’t big or sustained enough to register.
A month ago I was hiking in the Fossil Ridge Wilderness Area. I wore several sensors over the four days — unfortunately, I lost the Fitbit and the Q-Sensor had a clock error which corrupted its data. Disappointing, but I did have two Garmin devices that held up, recording position and heartrate.
I finally managed to create a map from the data. I suppose I could have done this more efficiently with Google Earth, but I decided to parse the GPX files and draw my own paths with python, overlaying the result on a Google terrain map, which looks a lot nicer. The brighter red the path, the higher my heartrate — however, Im not sure that this ended up revealing anything particularly compelling. It is just, a la Certeau, a curious relic that substitutes for a rich experience.
Regardless, making it was a good exercise. Two complementary pieces of code were essential when doing this kind of thing outside of a mapping platform. First, when finding the geographic distance between two latitude/longitude pairs, Euclidean distance doesnt work, as we’re operating on an elliptical sphere. For that, there is the haversine formula:
”“” Convert the distance between two points, specified (lon, lat), \
to miles (or kilometers)
”“”
LON, LAT = 0, 1
pt0 = math.radians(pt0[LON]), math.radians(pt0[LAT])
pt1 = math.radians(pt1[LON]), math.radians(pt1[LAT])
lon_delta = pt1[LON] - pt0[LON]
lat_delta = pt1[LAT] - pt0[LAT]
a = math.sin(lat_delta / 2)**2 + math.cos(pt0[LAT]) * math.cos(pt1[LAT]) * math.sin(lon_delta / 2)**2
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
d = 6371 * c # radius of Earth in km
if miles:
d *= 0.621371192
return d
Secondly, in order to plot data on a two-dimensional plane and have it line up with an image from Google, I needed to use the formula for the Mercator projection. This one is compliments of OpenStreetMap. It gives you coordinates relative to the globe, which have to be further scaled.
”“” Project a (lon, lat) point to x,y space using
the Mercator projection
http://wiki.openstreetmap.org/wiki/Mercator#Python_Implementation
”“”
def merc_x(lon):
r_major = 6378137.000
return r_major * math.radians(lon)
def merc_y(lat):
if lat > 89.5:
lat = 89.5
if lat < -89.5:
lat = -89.5
r_major = 6378137.000
r_minor = 6356752.3142
temp = r_minor / r_major
eccent = math.sqrt(1 - temp**2)
phi = math.radians(lat)
sinphi = math.sin(phi)
con = eccent * sinphi
com = eccent / 2
con = ((1.0 - con) / (1.0 + con))**com
ts = math.tan((math.pi / 2 - phi) / 2) / con
y = 0-r_major * math.log(ts)
return y
return merc_x(pt[0]), merc_y(pt[1])







