December 1, 2011

Do you find yourself doing this all the time?

$ export PYTHONPATH=$PYTHONPATH:/path/to/some/package

If so, then save yourself some keystrokes! Just put the following into your .bash_profile:

function pypath {
if [ "$1" = "" ]; then
export PYTHONPATH=$PYTHONPATH:`pushd -n $1; pwd -P; popd -n`

Now just type “pypath” inside any directory, and it gets appended to your PYTHONPATH. If you type “pypath /some/other/dir”, then that directory will get appended to your PYTHONPATH. Easy!

SC’11 Workshop Agendas

November 13, 2011

I’m at the Supercomputing 2011 conference and am very much looking forward to the workshops tomorrow. However, the main Workshop Agenda doesn’t list the detailed program for each workshop. It seems that the individual workshop organizers have placed detailed schedules on various places on the internet, so I have aggregated them here.

Comments on “Why Do We Need Data Science?”

November 11, 2011

Edd Dumbill, the general manager for the Strata Conference, recently wrote a nice post on Google+ entitled “Why Do We Need Data Science?

I wrote a couple of comments on there clarifying what I thought was novel about the emerging field of “data science”, namely, that:

Data Science is an imprecise term about an inchoate field, but it seems to me to refer to the art of creatively exploring a very rich dataset, whose sheer dimensionality and volume render them intractable with more traditional BI approaches. One has to use intuition to form hypotheses about correlations and relationships, and use those to drive a query-visualization-refine cycle on the data that may cross many dimensions and traditional data silos.

I also do concede that existing BI tools have perhaps a more solid grounding than ad-hoc MapReduce jobs:

Ironically, I think that data scientists and the modern “big data” movement have a lot to gain from the hypercube and OLAP techniques that the BI community has refined over the years. I know that Hadoop/MapReduce is the popular approach right now, but I think that it’s best suited for batch jobs at best. It’s very hard to scale batch-oriented workflows to suit the needs of interactive analysis. (Apache Hive is an effort in this direction, but architecturally it lacks the sophistication of things like Essbase or KDB+.) I think these are only the first iteration of a new generation of tools in this space, but they are good enough for a wide variety of problems people have right now.

Ultimately, though, I had the realization that Edd’s original allusion and comparison to the emergence of Linux was very insightful.

Linux’s grassroots adoption led to a commoditization of the server space, and the maturity of the LAMP stack for web servers basically made much of the modern web ecosystem possible. If everyone still had to pay a Windows NT tax for every server instance, you can bet the startup scene would look radically different. Likewise, legacy vendors of BI tools may have some good technology they’ve been honing over a few decades, but their high costs have rendered them inaccessible to a new crop of analysts and statisticians. Instead, those guys take what technologies they can find, piece them together to solve their business challenges, and in the process, they create a new set of tools that not only afford the data exploration they desire, but also commoditizes the traditional software ecosystem around data warehousing, ETL, and reporting. As that commoditization happens, a whole new ecosystem of rich analytics and data-driven business will emerge, and the level of business insight that any company, in any industry, will have 10 years from now will be simply stunning.

More Thoughts On Arrays in PyPy

October 19, 2011

My previous blog post was trying to diplomatically point out that Numpy is a whole lot more than merely a fast array library. Somehow the more subtle point seems to be lost on a number of people, so this follow-up post is going to be more blunt. (But not mean or angry – I was feeling rather angry when I first started this blog entry, but then I realized it was because I was hungry, so I had a Snickers bar and re-wrote it.)

I think people should re-read Travis’s blog post on porting Numpy to PyPy, and really read it carefully. Travis not only created Numpy and helped create Scipy, but in his professional life he has been invited to labs and universities all over the world to speak about the use of Python in scientific computing. He has seen, first-hand, what works and what doesn’t when trying to push Python into this space, so even though he makes many high-level points in his blog, he has been in the trenches enough to have a very good sense of what users are looking for. He does not use his words lightly, and he very much means what he says.

In looking over the discussion on this subject, I realized that Travis has basically already said all that needs to be said. However, certain things bear highlighting.

A Low-level API Is A Feature

Travis: “Most of the scientists and engineers who have come to Python over the past years have done so because it is so easy to integrate their legacy C/C++ and Fortran code into Python.

Most of these [Scipy,Matplotlib,Sage] rely on not just the Python C-API but also the NumPy C-API which you would have to have a story for to make a serious technical user of Python get excited about a NumPy port to PyPy.” (emphasis mine)

The bottom line: There is no such thing as “Numpy on PyPy” without a low-level API for extensions.

Call it “Array Intrinsics for PyPy” if you will. Call it “Fast Arrays for PyPy”. But to call it “Numpy Support for PyPy” without offering the integration ability that Travis alludes to, and without actually using any of Numpy’s code (for FFTs, linear algebra, etc.) is kind of false advertising. (David Cournapeau tries to gently make this point on the mailing list by saying that “calling it numpy is a bit confusing”.)

As I pointed out in a previous post, there are dynamic compilers and JITters for Numpy right now; however, none of them call themselves “PyPy for Numpy”, because there is a huge portion of the PyPy feature set they don’t support. Just because they do some JITting, they are not PyPy. Likewise, writing an array object with operator overloading does not mean you have Numpy.

Travis: “It’s not to say that there isn’t some value in re-writing NumPy in PyPy, it just shouldn’t be over-sold and those who fund it should understand what they aren’t getting in the transaction.” (emphasis mine)

On the pypy-dev mailing list, Jacob Hallen said: “We did a survey last spring, in which an overwhelming number of people asked for numpy support.” I am curious about the exact wording of this survey. By “Numpy support”, did respondents mean “a fast array library, with no hope of integration with Scipy, Scikits, Matplotlib, etc.”? If so, then the PyPy team may very well be able to meet their demand. However, I will wager that when most people ask for “Numpy support”, they also imply some level of compatibility or at least a reasonable timeline for integration with the rest of the Scipy stack. It might be worth re-polling those folks who wanted “Numpy support” and ask them how useful a fast array library in PyPy would be, if it could not be integrated with Scipy, Matplotlib, etc., and if there was no way to integrate their own extension modules.

In my experience, there is virtually no one who uses Numpy that does not also (directly or indirectly) use something that consumes its C-API. Most folks get at that API via Swig, Cython, Weave, f2py, etc., but nonetheless they are very much relying on it.

The Joy of Glue

On the PyPy’s donation page, there is the claim:

“bringing NumPy into the equation is a reasonable next step – as it’s a very convenient and popular tool for doing this kind of work. The resulting implementation should move Python in scientific world from being a merely glue language into being the main implementation language for a lot of people in the scientific/numeric worlds.”

This statement encapsulates most of the philosophical differences between the PyPy guys and the Scipy guys in this discussion. I’m pretty sure that while almost everyone in the Scipy world would rather be coding Python than wading through mounds of legacy C or FORTRAN, most of them have been in the trenches long enough to know that reimplementing libraries is simply not possible in most real-world cases. Even if they could make all new development happen in RPython, they would still need to interface with those legacy libraries. If the PyPy devs want to field their Python implementation as a serious contender in the scientific computing space, they absolutely have to have a way to use it as a “glue” language.

Furthermore, the use of the word “merely” in “merely glue language” really highlights the philosophic difference. I’m sure that for language purists who enjoy geeking out on compilers and VMs and Lambda-the-Ultimate, something as mundane as a low-level VM API for an external language as unsexy as FORTRAN deserves the sneer of “merely”. But for many tens of thousands of scientists and researchers around the world who rely on Python to get their jobs done, the ability to glue together disparate tools, frameworks, languages, and data is utterly priceless. I would even argue that this is one of its essential features; without it, to many people, Python would just be a slightly prettier MATLAB. (Actually, lacking an infix matrix multiplication operator, some would argue that it’s not even that much prettier.)

A High-Level Future

In truth, I am personally torn by this brouhaha with PyPy, because I actually agree with their ultimate goals. I, too, envision a better world in which people are writing high-level languages to achieve better performance than what is even possible at the C level. However, I have seen enough of the problem space to know that (1) any solution that doesn’t integrate with legacy code will be Dead On Arrival, and (2) JIT optimizations of the normal PyPy variety (i.e. approaches that work well for Algol-derived languages) have limited efficacy in the space of serious scientific and high-performance computing. As Travis says in his blog post:

“I am also a true-believer in the ability for high-level languages to achieve faster-than-C speeds. In fact, I’m not satisfied with a Python JIT. I want the NumPy constructs such as vectorization, fancy indexing, and reduction to be JIT compiled.”

It is very important to note that Travis says “JIT compiled” and not “JIT optimized“. In a subsequent post, I will discuss why JIT optimization is not the end-all and be-all for scientific computing, and what could be even better.

How Can PyPy and Numpy Work Together?

It occurs to me that it might help illuminate the discussion if we looked at a concrete example of an alternative approach to collaboration between Numpy and PyPy. A few years ago, Ilan Schnell implemented a very interesting “fast_vectorize” decorator for Numpy ufuncs, which used PyPy’s translator to dynamically generate a natively-compiled version of ufuncs written in Python. It only took him about a week, and I believe most of it was isolating out the RPython-to-C translation functionality.

Furthermore, at this year’s Scipy conference alone, I counted at least three projects that were doing AST walking and code munging to attempt to convert high-level Python code into kernels for fast evaluation, and not merely for Numpy. These included differential equation solvers and the like. Supporting efforts for dynamic compilation of Python would be massively useful for the Scipy community at large, and there is almost no one better poised to do this than the PyPy devs. I think that would be a far, far more useful application of their efforts.


I understand that speed has become PyPy’s raison d’etre. I also understand that they’ve had wonderfully encouraging results with optimizing vanilla procedural code, like what is found in the Python standard library. But I think it is unwise announce a big, loud foray into scientific computing while ignoring the wisdom of their fellow Python coders who have been toiling in this space for over a decade.

If the goal is to create a nifty little array module for users who are happy living within the walls of the PyPy interpreter, then I wish them luck and think they will do very well. But if their goal is have an impact on Python’s role in scientific computing at large, on the scale of what Numpy and Scipy have achieved, then they cannot ignore the integration question, and, IMHO, Cython currently looks to be the most promising avenue for that.

Numpy isn’t just about fast arrays

October 17, 2011

I’ve been following the rapidly heating discussion about the efforts to develop array intrinsics in PyPy (aka “Numpy on PyPy”), and realized that the two groups of very smart people – Scipy/Numpy folks on the one hand, and the PyPy devs on the other – are actually not in disagreement, but really just talking past one another.

The core problem is that they have very different concepts of what Numpy actually is. From the PyPy perspective, it appears to be primarily about doing fast array computation from a high-level language like Python. This is why they talk about translating RPython to C and making optimized ufuncs and automatically building SSE-optimized versions of fused loops. This is also why they don’t understand what the big deal is with talking to C/FORTRAN extensions – they can emit code for faster array processing with their compiler, and who wouldn’t want that?

I think I understand the perspective on the PyPy side. The low-hanging fruit of basic array computation in Numpy can definitely be improved, and it is a very tempting target. Lots of people feel this way – witness the auto-parallelism and loop fusion work done by the Numexpr and Theano guys, or the stream/generator fusion in my project Metagraph, or even the SSE optimization in CoreFunc.

For the Scipy guys, however, Numpy is not about fast arrays, although that is very nice. It is about *data*. Numpy is the lingua franca for data access and transmission between a widely disparate set of numerical and scientific libraries in Python (save for the notable exception of PIL). I think it would be fair to say that for most Scipy folks, the data connectivity is every bit as important – if not more so – than the absolute speed of the array computations. It means that instead of interfacing N^2 different libraries to each other, any external library only has to consume and expose a single interface.

This is why the Scipy folks keep harping about Cython – it’s rapidly becoming (or has already become) the lingua franca of exposing legacy libraries to Python. Their user base has tons of legacy code or external libraries that they need to interface, and most of the reason Python has had such a great adoption curve in that space is because Numpy has made the data portion of that interface easy. Cython makes the code portion quite painless, as well.

Now, those users don’t particularly care whether the tools they use to build interfaces target the CPython API, which is why the Cython approach is such a great one. It could serve as a common meeting point between folks that have to wrap extension modules, and folks that have to build fast (and safe) VMs.

Furthermore, Travis Oliphant has repeatedly expressed his desire for Numpy itself to move to a Cython implementation, and to mainline the efforts around making the core multidimensional array bits a pure C library that other projects (and languages) can use. That code is not merged in, but the lion’s share of the work has already been done.

Perhaps an analogy would be a locomotive company that decides to apply their engineering expertise to making a car with the same excellent fuel efficiency and towing capacity of a locomotive, but with the proviso that it can only run on railroad tracks. In contrast, the Numpy Jeep might not tow as much, but it can go through forests and over fields and into town, talking to any data from any library, on just about any platform (including supercomputers). That’s actually its most useful characteristic. IMHO Travis’s innovation was the dtype, not overloading __getitem__.

I think if the PyPy folks really want to do something interesting in this space, they should worry less about making their railcar run faster, and more on how to retrofit it with caterpillar tracks, so that it can cover all manner of terrain. Better yet, the Cython guys have been building tracked vehicles for Sage for years, and all of the Scipy world is moving in that direction anyway. We can all be a big happy family!

Update 10/19: Edited title to clarify that while performance is important to Numpy, it is far from being its only feature.  What the comments and Twitter feedback suggest is that some people have misunderstood.  TL;DR: The Numpy C-API is as much a feature as its fast array handling, and the PyPy folks seem to be treating it as an implementation detail, or a “nice to have”.

Coherent Abstractions

October 17, 2011

The key question to ask about a piece of software is, “what abstractions are supported by the deployment runtime environment?”

In general, there is an inverse relationship between the complexity of a program and its correctness.  Programmers of higher quality and skill are able to push this 1/x curve a bit higher along the correctness axis, but they are fundamentally limited by complexity as well.  The only way to reach a different zone of correctness is to step up to a higher level of abstraction.

But higher levels of abstraction are useless unless they form a coherent system.  This is because complexity is a function of the logical coherency of the abstractions you are dealing with.  In general, something is “complicated” because (1) you have to keep lots of concepts in your head at once in order to understand its function, or (2) you have to understand its underlying construction (down the stack of abstraction), or (3) both.

Programs are sets of abstractions built on top of the abstractions offered by the language, which usually are there to offer a coherent runtime to the programs.  But all programs of reasonable complexity will develop their own set of abstractions that have rules about proper interactions, contracts that need to be maintained, etc.  The approach of domain-specific languages or the “little languages” philosophy is that these abstractions and rules and contracts need to be bundled up into a coherent runtime of its own, on top of which the actual program is written.

But language design is very hard.  This is because of the relationship between expressiveness and complexity: the more expressive a language is, the more complex the language primitives are.  A good language designer is able to discover the minimum set of primitives needed to maximize expressiveness, but these craftsmen are rare.

I think part of the reason that statically-typed, heavily object-oriented languages like Java and C++ get a bad rap is because they force all software developers to extend the type system via creation of classes, and thereby become amateur language designers.  The problem is that not very many people are good at distilling an architecture down to a core, coherent set of abstractions, and furthermore, this is not always the most appropriate approach for a great number of actual, production programming challenges.  The next time you are wading through a giant codebase of FactoryManagers and StreamProxies and QueueHandlers, realize that the core shortcoming is one of language: the programmers were mostly likely taught to model objects (hence the proliferation of nouns ending in -er) but not to understand the verbs (i.e. operators).  Without verbs, you have a very poor syntax indeed.

So the question is: if we re-center the task of software development around this notion of building coherent sets of abstractions at higher and higher levels, until the application is easily expressible, then what does the basic project lifecycle look like?  What are the skills that a software developer needs?  What are the software processes? What does the testing framework look like?

Strata 2011 Review – Day 3

February 15, 2011

Day 3 (the second day of talks) kicked off with an hour of short plenary talks, just like the previous day. The two I found most interesting were DJ Patil’s talk about “Innovating Data Teams” at LinkedIn, and Carol McCall’s talk entitled “Can Big Data Fix Healthcare?“. DJ Patil’s talk centered around the idea that data scientists need to be able to “ship” products, so at LinkedIn they made the data sciences a top-level product team. This is an interesting approach, and I think one that will be more common as businesses realize that “data exhaust” – the byproduct of their traditional business – is actually a sensory feed. One interesting slide that DJ presented was a simple diagram of “Where did all the people from the failed banks go?“. As he points out, the list of companies absent from the diagram is as interesting as the companies that are named there.

Carol’s talk was actually rather inspiring for me. She presented some examples of how good data analysis could transform the health care industry, from reducing the number of adverse drug reactions (typically in seniors, who are taking lots of medicines) to lowering the overall cost of care. It’s really an area that is ripe for innovation, except that everyone is making so much money under the status quo. The call seems to be to deliver interesting insights from data mining that will provide real insight and eye-opening revelations about where the industry is doing things poorly, and maybe provide guidance for improvement.

I then attended Pablo Castro’s talk on the Open Data Protocol, a Microsoft-sponsored project for making various web services speak the same language and also provide a consistent querying API that can be accessed just through the URI. His demo page, What Can I Watch?, consists of a simple bit of Javascript that dynamically populates the body of the page with the JSON results of accessing the Netflix OpenData service. It all looks very neat, but it looks like a Python implementation of an OData producer is still missing.

The next two talks I attended were about government data. Virginia Carlson, from the Metro Chicago Information Center (MCIC), talked about the fact that although governments do have a lot of data, it’s less than what people would like, and it tends to be scattered in a variety of places and tends to be of a rather heterogenous form. Most importantly, one should not believe that there is a good correlation between operational, administrative, and statistical data; in fact sometimes these numbers can be contradictory. Virginia has been very active in making public data available and useful to a wider audience, and it was really interesting to hear her perspective as an “insider” from the governmental side. Many of those who approach “Gov 2.0” from a technology angle either forget or willfully ignore the fact that there are legacy systems and bureaucracies that assemble and produce this data, and dealing with them is a nontrivial challenge.

Jon Bruner from Forbes gave a fun talk about how the Forbes data team processes the FEC political contributions database. He focused on a particular challenge they faced, namely, cleansing and normalizing the identities of contributors when a single person could report their identity in a number of ways. For instance, Bill Gates can identify himself as “Bill Gates”, “William Gates III”, etc. For his address, he might list any of several residences. Lastly, for his occupation, he can report himself as being “self-employed”, “retired”, “CEO & Chairman of Microsoft”, “Chairman of Bill & Melinda Gates Foundation”, etc. There is no regular structure to this stuff, and so a single person like him could show up as a couple dozen separate rows in the database.

The Forbes team tackled this problem by creating a “data-cleaning wizard” which displayed likely matches to researchers and gave them the ability to manually indicate that a certain number of those matches were all the same person. Bruner reports that this worked well and gave them the cleaned data they needed to faithfully add political contribution data to this year’s Forbes 400 feature. However, I am surprised that a more automated approach was not used, since it seems like basic clustering would do a reasonable job (assuming you split up first and last names, and used common delimiters like “/” and such to separate occupation title from company name).

Jon blogged about his scripts for downloading and playing with the FEC data, and they were a somewhat helpful guide as I combed through the FEC data site at ftp.fec.gov/FEC. (If you want to play with this data yourself, I highly advise reading the FEC’s documentation about the data files.)

The last formal talk I attended was J. J. Toothman’s “Data as Art” talk. It was somewhat difficult to decipher the point of the talk from the conference description, but I felt it would be interesting nonetheless. As it turns out, Toothman was really just showcasing some of the more interesting data-driven visuals he had seen. Even he admitted that some of these didn’t really qualify as “infographics”, but needed a different term since they are more about evoking an emotional response in the viewer than conveying a specific data-related message. (I shouted out “emographics” as a suggestion.)

While I found several of his examples quite interesting, I really am torn about this entire data-driven/algorithmic art phenomenon. On the one hand, when it works well, it’s marvelous. On the other hand, when its creator has to expend a whole lot of verbiage to explain the significance of the data or the algorithms that transform it, then it feels a bit like they’re trying too hard to be hip.

If the transformation of data is fairly linear, and the user/artist has reasonably fine-grained control over the output, then it’s a computational paintbrush, and perhaps a description of the artist’s method and motivations would be in order. But if the algorithm is highly nonlinear or randomized, and the person providing the data input cannot accurately control the output, then it’s just algorithmic art and very little exposition should be necessary. I suppose that I feel like the art should pretty much stand on its own; the effort of the data artist should be embodied in the algorithm and resultant visuals.

This point actually generalizes to infographics and charting. The point of a chart or data graphic is to provide more insight, not to provide window dressing. If an infographic needs a lot of exposition about what various symbols, colors, shapes, and positions mean, then it has failed.


Get every new post delivered to your Inbox.