It was a grand plan. I was going to learn all sorts of new things: Pentaho, R, Processing. All sorts of new things. In the end, I got caught up in getting something done and learned none of those things. Again.
Using trusty old Python with the beautiful NetworkX module and the shockingly fast, if a bit rough around the edges, graph visualization tool Gephi, I pulled Crunchbase data to create a social network map of how venture investors coinvest. You can skip straight on down if you want; playing with it is more fun than reading about it (and more fun than learning Pentaho, it turns out).
I can’t believe Crunchbase didn’t rate-limit me,* but most of the data is from their excellent database. I augmented it with info individuals have made available on AngelList. I didn’t include any non-public info, even though I know of several excellent angels who didn’t make the map because they’ve kept their activities under the radar.
I then had way too many nodes for any visualization to make sense. So I did two things. First, any person mentioned as an investor who was also a venture firm employee was folded into the firm. Second, I made some fixes I knew of (merging my friend Roger Ehrenberg’s IA Capital into IA Ventures, for instance). I know some venture partners invest as angels outside their firms, but since this is a map of social connections, I think the step not only makes sense, but weights individuals more accurately. (Roger Ehrenberg, for instance, would not get the weight his activities deserve if his investing activity were split among three entities.)
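For the curious, the folding step can be sketched roughly like this. This is not my actual script, just a minimal illustration using the standard library: the `ALIASES` map, the `DEALS` dictionary, the company names and "Hypothetical Partners" are all made up for the example, and the real Crunchbase data does not arrive in this shape.

```python
from collections import Counter
from itertools import combinations

# Hypothetical alias map: an individual or duplicate entity -> the firm it folds into.
ALIASES = {
    "Roger Ehrenberg": "IA Ventures",
    "IA Capital": "IA Ventures",
}

# Hypothetical input: company name -> investors listed on its rounds.
DEALS = {
    "ExampleCo": ["Roger Ehrenberg", "IA Capital", "Hypothetical Partners"],
    "SampleCorp": ["IA Ventures", "Hypothetical Partners"],
}


def coinvestment_counts(deals, aliases):
    """Fold aliases into canonical names, then count investments and coinvestments."""
    investments = Counter()  # canonical investor -> number of companies backed
    pairs = Counter()        # (investor_a, investor_b) -> number of shared companies
    for investors in deals.values():
        # A set removes duplicates created by folding; sorting keeps pair keys stable.
        merged = sorted({aliases.get(name, name) for name in investors})
        investments.update(merged)
        pairs.update(combinations(merged, 2))
    return investments, pairs


investments, pairs = coinvestment_counts(DEALS, ALIASES)
```

In the toy data, three separate names on ExampleCo collapse into a single IA Ventures node, which is the whole point of the step: the weight ends up in one place instead of being split three ways.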
Then, again to make it manageable, I took out any investors with fewer than five investments. I ran what was left through a Fruchterman-Reingold layout, colored venture firms red, people green and others (corporates, incubators) blue, and made node size proportional to number of investments.
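A rough sketch of the pruning and styling in NetworkX, under some assumptions: the node attributes `kind` and `investments`, the function name `prune_and_style` and the toy graph are all invented for illustration, and the attribute names Gephi actually wants for color and size may differ.

```python
import networkx as nx

# Colors follow the scheme above: firms red, people green, everything else blue.
COLORS = {"firm": "red", "person": "green"}


def prune_and_style(g, min_investments=5):
    """Return a styled copy of g without investors below the investment threshold."""
    keep = [
        node
        for node, data in g.nodes(data=True)
        if data.get("investments", 0) >= min_investments
    ]
    pruned = g.subgraph(keep).copy()  # copy so the original graph is left untouched
    for node, data in pruned.nodes(data=True):
        data["color"] = COLORS.get(data.get("kind"), "blue")
        data["size"] = float(data["investments"])  # size proportional to deal count
    return pruned


# Toy graph with hypothetical investors.
g = nx.Graph()
g.add_node("Firm A", kind="firm", investments=12)
g.add_node("Angel B", kind="person", investments=7)
g.add_node("Incubator C", kind="other", investments=5)
g.add_node("Angel D", kind="person", investments=2)
g.add_edges_from(
    [("Firm A", "Angel B"), ("Firm A", "Incubator C"), ("Angel B", "Angel D")]
)

styled = prune_and_style(g)
```

From there the layout can happen in either tool: `nx.spring_layout` is NetworkX's Fruchterman-Reingold implementation, and Gephi ships its own Fruchterman Reingold layout, which is the more practical choice once the graph gets big.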
The result is below, in Zoom.it. Some things that stand out:
– The network is incredibly connected. If you go into the “core”, where the Sand Hill Road firms are, there are so many edges that they are indistinguishable. Generally, in this visualization, the drawn edges are more or less decorative, because there are too many for them to make sense.
– Because of the dense interconnectivity, there are not many noticeable subnetworks from 50,000 feet. Here’s a map key, such as it is, showing some areas that are distinguishable. The separation between biotech and the core is no more noticeable, to my eye, than that between web 2.0 and the core. I do find that the further I get from my own node, the less I know about the investors.
I should note the usual caveats. Crunchbase data is not a complete record of investment activity; in fact, it tends to be severely self-selecting. I assume both non-US and non-Internet-tech are underrepresented. I know non-VC investment is underrepresented. Also, my few fixes are not all-encompassing. This was a project I had time for because of a couple of long train rides. I do have the raw dataset (Gephi, GraphML and pickled NetworkX graphs) for the entire network. If you want them, let me know.
Drag and zoom. Find your friends.
* Or maybe because I was hitting their API while on the Acela, they figured it couldn’t possibly be programmatic. In any case, to my fellow train passengers, I apologize for hogging the bandwidth.