Day 3 (the second day of talks) kicked off with an hour of short plenary talks, just like the previous day. The two I found most interesting were DJ Patil’s talk about “Innovating Data Teams” at LinkedIn, and Carol McCall’s talk entitled “Can Big Data Fix Healthcare?”. DJ Patil’s talk centered around the idea that data scientists need to be able to “ship” products, so at LinkedIn they made data science a top-level product team. This is an interesting approach, and I think one that will become more common as businesses realize that “data exhaust” – the byproduct of their traditional business – is actually a sensory feed. One interesting slide that DJ presented was a simple diagram of “Where did all the people from the failed banks go?”. As he points out, the list of companies absent from the diagram is as interesting as the companies that are named there.
Carol’s talk was actually rather inspiring for me. She presented some examples of how good data analysis could transform the health care industry, from reducing the number of adverse drug reactions (typically in seniors, who are taking lots of medicines) to lowering the overall cost of care. It’s really an area that is ripe for innovation, except that everyone is making so much money under the status quo. The call seems to be for data mining to deliver eye-opening revelations about where the industry is doing things poorly, and perhaps provide guidance for improvement.
The next two talks I attended were about government data. Virginia Carlson, from the Metro Chicago Information Center (MCIC), talked about the fact that although governments do have a lot of data, it’s less than what people would like, and it tends to be scattered across a variety of places and to be of a rather heterogeneous form. Most importantly, one should not believe that there is a good correlation between operational, administrative, and statistical data; in fact sometimes these numbers can be contradictory. Virginia has been very active in making public data available and useful to a wider audience, and it was really interesting to hear her perspective as an “insider” from the governmental side. Many of those who approach “Gov 2.0” from a technology angle either forget or willfully ignore the fact that there are legacy systems and bureaucracies that assemble and produce this data, and dealing with them is a nontrivial challenge.
Jon Bruner from Forbes gave a fun talk about how the Forbes data team processes the FEC political contributions database. He focused on a particular challenge they faced, namely, cleansing and normalizing the identities of contributors when a single person could report their identity in a number of ways. For instance, Bill Gates can identify himself as “Bill Gates”, “William Gates III”, etc. For his address, he might list any of several residences. Lastly, for his occupation, he can report himself as being “self-employed”, “retired”, “CEO & Chairman of Microsoft”, “Chairman of Bill & Melinda Gates Foundation”, etc. There is no regular structure to this stuff, and so a single person like him could show up as a couple dozen separate rows in the database.
The Forbes team tackled this problem by creating a “data-cleaning wizard” which displayed likely matches to researchers and let them manually mark which of those matches referred to the same person. Bruner reports that this worked well and gave them the cleaned data they needed to faithfully add political contribution data to this year’s Forbes 400 feature. However, I am surprised that a more automated approach was not used, since it seems like basic clustering would do a reasonable job (assuming you split up first and last names, and used common delimiters like “/” and such to separate occupation title from company name).
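To make the clustering idea concrete, here is a minimal sketch of what I have in mind, using Python’s standard-library `difflib.SequenceMatcher` as a crude string-similarity measure. This is not what the Forbes team did (they used the manual wizard described above), and the threshold and greedy one-pass strategy are my own simplifying assumptions; real entity resolution would also weigh address, occupation, and employer fields.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_contributors(names, threshold=0.6):
    """Greedy single-pass clustering: attach each name to the first
    cluster whose representative (first member) is similar enough,
    otherwise start a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

names = ["William Gates III", "Bill Gates",
         "William H. Gates", "Warren Buffett"]
print(cluster_contributors(names))
# → [['William Gates III', 'Bill Gates', 'William H. Gates'],
#    ['Warren Buffett']]
```

Even something this naive groups the Gates variants together while leaving the unrelated name alone, which suggests the manual wizard could at least have been seeded with automatic candidate matches.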
Jon blogged about his scripts for downloading and playing with the FEC data, and they were a somewhat helpful guide as I combed through the FEC data site at ftp.fec.gov/FEC. (If you want to play with this data yourself, I highly advise reading the FEC’s documentation about the data files.)
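For anyone following along at home, the files are delimited text, so getting them into a usable form is straightforward once you know the record layout. The snippet below is purely illustrative: the sample rows, the pipe delimiter, and the column names are my own placeholders, not the FEC’s actual schema, which is why reading the FEC’s documentation first is so important.

```python
import csv
import io

# Hypothetical sample of a delimited FEC-style extract. The field
# layout here is an assumption for illustration only; consult the
# FEC's data file documentation for the real record formats.
SAMPLE = """C00000001|GATES, WILLIAM|MEDINA|WA|SELF-EMPLOYED|2400
C00000002|BUFFETT, WARREN|OMAHA|NE|BERKSHIRE HATHAWAY/CEO|2400"""

FIELDS = ["committee_id", "name", "city", "state", "occupation", "amount"]

def load_records(text, delimiter="|"):
    """Parse delimited rows into dicts keyed by our assumed field names."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(FIELDS, row)) for row in reader]

records = load_records(SAMPLE)
print(records[0]["name"])  # → GATES, WILLIAM
```

Note the “BERKSHIRE HATHAWAY/CEO” occupation field in the sample: that is exactly the kind of free-form, slash-delimited value that makes the identity-normalization problem above so messy.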
The last formal talk I attended was J. J. Toothman’s “Data as Art” talk. It was somewhat difficult to decipher the point of the talk from the conference description, but I felt it would be interesting nonetheless. As it turns out, Toothman was really just showcasing some of the more interesting data-driven visuals he had seen. Even he admitted that some of these didn’t really qualify as “infographics”, but needed a different term since they are more about evoking an emotional response in the viewer than conveying a specific data-related message. (I shouted out “emographics” as a suggestion.)
While I found several of his examples quite interesting, I really am torn about this entire data-driven/algorithmic art phenomenon. On the one hand, when it works well, it’s marvelous. On the other hand, when its creator has to expend a whole lot of verbiage to explain the significance of the data or the algorithms that transform it, then it feels a bit like they’re trying too hard to be hip.
If the transformation of data is fairly linear, and the user/artist has reasonably fine-grained control over the output, then it’s a computational paintbrush, and perhaps a description of the artist’s method and motivations would be in order. But if the algorithm is highly nonlinear or randomized, and the person providing the data input cannot accurately control the output, then it’s just algorithmic art and very little exposition should be necessary. I suppose that I feel like the art should pretty much stand on its own; the effort of the data artist should be embodied in the algorithm and resultant visuals.
This observation actually generalizes to infographics and charting. The point of a chart or data graphic is to provide insight, not window dressing. If an infographic needs a lot of exposition about what various symbols, colors, shapes, and positions mean, then it has failed.