I’m at the Supercomputing 2011 conference and am very much looking forward to the workshops tomorrow. However, the main Workshop Agenda doesn’t list the detailed program for each workshop. It seems that the individual workshop organizers have placed detailed schedules in various places on the internet, so I have aggregated them here.
I wrote a couple of comments on there clarifying what I thought was novel about the emerging field of “data science”, namely, that:
Data Science is an imprecise term about an inchoate field, but it seems to me to refer to the art of creatively exploring very rich datasets whose sheer dimensionality and volume render them intractable with more traditional BI approaches. One has to use intuition to form hypotheses about correlations and relationships, and use those to drive a query-visualization-refine cycle on the data that may cross many dimensions and traditional data silos.
I also concede that existing BI tools perhaps have a more solid grounding than ad-hoc MapReduce jobs:
Ironically, I think that data scientists and the modern “big data” movement have a lot to gain from the hypercube and OLAP techniques that the BI community has refined over the years. I know that Hadoop/MapReduce is the popular approach right now, but I think that it’s suited to batch jobs at best. It’s very hard to scale batch-oriented workflows to suit the needs of interactive analysis. (Apache Hive is an effort in this direction, but architecturally it lacks the sophistication of things like Essbase or KDB+.) I think these are only the first iteration of a new generation of tools in this space, but they are good enough for a wide variety of problems people have right now.
Ultimately, though, I had the realization that Edd’s original allusion and comparison to the emergence of Linux was very insightful.
Linux’s grassroots adoption led to a commoditization of the server space, and the maturity of the LAMP stack for web servers basically made much of the modern web ecosystem possible. If everyone still had to pay a Windows NT tax for every server instance, you can bet the startup scene would look radically different. Likewise, legacy vendors of BI tools may have some good technology they’ve been honing over a few decades, but their high costs have rendered them inaccessible to a new crop of analysts and statisticians. Instead, those guys take what technologies they can find and piece them together to solve their business challenges, and in the process they create a new set of tools that not only affords the data exploration they desire, but also commoditizes the traditional software ecosystem around data warehousing, ETL, and reporting. As that commoditization happens, a whole new ecosystem of rich analytics and data-driven business will emerge, and the level of business insight that any company, in any industry, will have 10 years from now will be simply stunning.