Comments on “Why Do We Need Data Science?”

Edd Dumbill, the general manager for the Strata Conference, recently wrote a nice post on Google+ entitled “Why Do We Need Data Science?”

I left a couple of comments there to clarify what I think is novel about the emerging field of “data science”:

Data Science is an imprecise term about an inchoate field, but it seems to me to refer to the art of creatively exploring very rich datasets whose sheer dimensionality and volume render them intractable with more traditional BI approaches. One has to use intuition to form hypotheses about correlations and relationships, and use those to drive a query-visualization-refine cycle on the data that may cross many dimensions and traditional data silos.

I also concede that existing BI tools perhaps have a more solid grounding than ad-hoc MapReduce jobs:

Ironically, I think that data scientists and the modern “big data” movement have a lot to gain from the hypercube and OLAP techniques that the BI community has refined over the years. I know that Hadoop/MapReduce is the popular approach right now, but I think it is really only suited for batch jobs. It’s very hard to scale batch-oriented workflows to suit the needs of interactive analysis. (Apache Hive is an effort in this direction, but architecturally it lacks the sophistication of things like Essbase or KDB+.) I think these are only the first iteration of a new generation of tools in this space, but they are good enough for a wide variety of problems people have right now.

Ultimately, though, I realized that Edd’s original comparison to the emergence of Linux was very insightful.

Linux’s grassroots adoption led to a commoditization of the server space, and the maturity of the LAMP stack for web servers basically made much of the modern web ecosystem possible. If everyone still had to pay a Windows NT tax for every server instance, you can bet the startup scene would look radically different. Likewise, legacy vendors of BI tools may have some good technology they’ve been honing over a few decades, but their high costs have rendered them inaccessible to a new crop of analysts and statisticians. Instead, those guys take what technologies they can find and piece them together to solve their business challenges. In the process, they create a new set of tools that not only affords the data exploration they desire, but also commoditizes the traditional software ecosystem around data warehousing, ETL, and reporting. As that commoditization happens, a whole new ecosystem of rich analytics and data-driven business will emerge, and the level of business insight that any company, in any industry, will have 10 years from now will be simply stunning.

One Response to Comments on “Why Do We Need Data Science?”

  1. [...] merit with applying the scientific method to business activities.   Peter Wang of Streamitive commented on Dumbill’s post as well, and has some interesting [...]
