Numpy isn’t just about fast arrays

I’ve been following the rapidly heating discussion about the efforts to develop array intrinsics in PyPy (aka “Numpy on PyPy”), and realized that the two groups of very smart people – Scipy/Numpy folks on the one hand, and the PyPy devs on the other – are actually not in disagreement, but really just talking past one another.

The core problem is that they have very different concepts of what Numpy actually is. From the PyPy perspective, it appears to be primarily about doing fast array computation from a high-level language like Python. This is why they talk about translating RPython to C and making optimized ufuncs and automatically building SSE-optimized versions of fused loops. This is also why they don’t understand what the big deal is with talking to C/FORTRAN extensions – they can emit code for faster array processing with their compiler, and who wouldn’t want that?

I think I understand the perspective on the PyPy side. The low-hanging fruit of basic array computation in Numpy can definitely be improved, and it is a very tempting target. Lots of people feel this way – witness the auto-parallelism and loop fusion work done by the Numexpr and Theano guys, or the stream/generator fusion in my project Metagraph, or even the SSE optimization in CoreFunc.
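To make the "fused loops" idea concrete, here is a toy sketch in plain NumPy (variable names are made up). The naive ufunc expression materializes a temporary array for every intermediate result, while a fused loop computes the same answer in a single pass over the data; tools like Numexpr and Theano generate that fused loop for you:

```python
import numpy as np

a = np.linspace(0.0, 1.0, 1000)
b = np.linspace(1.0, 2.0, 1000)

# Unfused: each ufunc call walks the full arrays and allocates
# a temporary (a*a, then b*b, then their sum, then the sqrt).
unfused = np.sqrt(a * a + b * b)

# "Fused": one pass over the data, no intermediate arrays.
# (Written out in pure Python here only to show the idea; a
# fusing compiler would emit this as a single tight C/SSE loop.)
fused = np.empty_like(a)
for i in range(a.shape[0]):
    fused[i] = (a[i] * a[i] + b[i] * b[i]) ** 0.5

assert np.allclose(unfused, fused)
```

The pure-Python loop is of course much slower than the ufunc version; the point of the optimization work is to get the single-pass memory behavior *and* compiled-loop speed at the same time.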

For the Scipy guys, however, Numpy is not about fast arrays, although that is very nice. It is about *data*. Numpy is the lingua franca for data access and transmission between a widely disparate set of numerical and scientific libraries in Python (with the notable exception of PIL). I think it would be fair to say that for most Scipy folks, the data connectivity is every bit as important as – if not more important than – the absolute speed of the array computations. It means that instead of interfacing N^2 different libraries to each other, any external library only has to consume and expose a single interface.
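As a toy illustration of that single interface (the function name here is hypothetical): a library routine that accepts "anything array-like" only has to funnel its input through one entry point, and it can then consume data produced by any other library, a plain list, or raw bytes, without pairwise adapters:

```python
import numpy as np

def column_mean(data):
    """A hypothetical library routine: accepts any array-like
    object by funneling it through the one common interface."""
    arr = np.asarray(data, dtype=float)
    return arr.mean(axis=0)

# All of these "foreign" data sources work unchanged:
from_list = column_mean([[1.0, 2.0], [3.0, 4.0]])
from_array = column_mean(np.arange(4.0).reshape(2, 2))
from_bytes = column_mean(
    np.frombuffer(bytes(range(4)), dtype=np.uint8).reshape(2, 2)
)
```

Every library that does this gets interoperability with every other one for free – that is the N-versus-N^2 point.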

This is why the Scipy folks keep harping about Cython – it’s rapidly becoming (or has already become) the lingua franca of exposing legacy libraries to Python. Their user base has tons of legacy code or external libraries that they need to interface, and most of the reason Python has had such a great adoption curve in that space is because Numpy has made the data portion of that interface easy. Cython makes the code portion quite painless, as well.

Now, those users don’t particularly care whether the tools they use to build interfaces target the CPython API, which is why the Cython approach is such a great one. It could serve as a common meeting point between folks that have to wrap extension modules, and folks that have to build fast (and safe) VMs.

Furthermore, Travis Oliphant has repeatedly expressed his desire for Numpy itself to move to a Cython implementation, and to mainline the efforts around making the core multidimensional array bits a pure C library that other projects (and languages) can use. That code is not merged in, but the lion’s share of the work has already been done.

Perhaps an analogy would be a locomotive company that decides to apply their engineering expertise to making a car with the same excellent fuel efficiency and towing capacity as a locomotive, but with the proviso that it can only run on railroad tracks. In contrast, the Numpy Jeep might not tow as much, but it can go through forests and over fields and into town, talking to any data from any library, on just about any platform (including supercomputers). That’s actually its most useful characteristic. IMHO Travis’s innovation was the dtype, not overloading __getitem__.
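Since the dtype is singled out as the key innovation, a quick sketch of why (the field names below are made up): a dtype is a complete description of memory layout, down to the byte, so a record produced by C or FORTRAN code can be interpreted in place, with no copying:

```python
import numpy as np

# A dtype describing the C struct { int32 id; double x; double y; }
point_t = np.dtype([("id", np.int32), ("x", np.float64), ("y", np.float64)])

# Raw bytes as they might arrive from "some external library"...
raw = np.zeros(3, dtype=point_t).tobytes()

# ...can be viewed through the dtype with zero copying.
# That's the data connectivity: the layout travels with the data.
points = np.frombuffer(raw, dtype=point_t)

assert point_t.itemsize == 4 + 8 + 8  # packed layout, explicit per field
```

Fancy indexing via __getitem__ is convenient, but it is this layout description that lets a FORTRAN solver, a C image library, and pure Python code all agree on what a block of memory means.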

I think if the PyPy folks really want to do something interesting in this space, they should worry less about making their railcar run faster, and more on how to retrofit it with caterpillar tracks, so that it can cover all manner of terrain. Better yet, the Cython guys have been building tracked vehicles for Sage for years, and all of the Scipy world is moving in that direction anyway. We can all be a big happy family!

Update 10/19: Edited title to clarify that while performance is important to Numpy, it is far from being its only feature. What the comments and Twitter feedback suggest is that some people have misunderstood. TL;DR: The Numpy C-API is as much a feature as its fast array handling, and the PyPy folks seem to be treating it as an implementation detail, or a “nice to have”.


7 Responses to Numpy isn’t just about fast arrays

  1. Thanks Peter. I agree with everything you said in this well-articulated post.

  2. joehillen says:

    There is an effort to make Cython code compile to pure Python: https://github.com/hardshooter/CythonCTypesBackend

    Maybe this would be a good compromise, so that the PyPy team can focus on PyPy and the Numpy people can move towards a code base that would fit them better anyway.

  3. Samuel John says:

    Great article!

  4. Foo says:

    PyPy is a project to make a faster railcar engine. You can’t blame them for focusing on making advances in their area of expertise. There are plenty of other people qualified to work on the caterpillar track retrofit.

  5. Jacob Hallén says:

    This is an interesting point of view. Currently PyPy is catering to the people who want the fast railcar engine, largely because this is what our users asked for. There is a minority of numpy users who want more speed, and this is what PyPy is in a unique position to deliver.

    Once that is done, there will be a new makeup of the PyPy user community, with different needs. These are likely the numpy users who want a standardized way to store data. PyPy has been an agile project from the start, and we try to avoid doing things that people don’t want. The best way to know what is wanted is to ask at a point in time when we have resources to tackle the new challenges. The best way to make resources available is to become a PyPy developer. The second best one is to contribute financially to the project. While we would love to cater to all the needs of the scientific community, we are limited by available resources.

  6. Tobu says:

    PyPy is focusing on making numpy fast because that speed is the value add of PyPy. Being PyPy compatible will require a porting effort even if the libraries target something as simple as Cython with restrictions; that porting effort requires some kind of motivation, which for PyPy will be speed, and a well-defined, fast, and reasonably high-level interface, which PyPy’s numpy port will help define.

    • Peter Wang says:

      “PyPy is focusing on making numpy fast because that speed is the value add of PyPy.”

      I guess from my perspective, this is the classic example of having a hammer and seeing all the world as nails. If the PyPy folks want something to hammer on in the “Python scientific computing” space, there are plenty of real problems that they are best positioned to solve. (See the Conclusion of my follow-up post http://blog.streamitive.com/2011/10/19/more-thoughts-on-arrays-in-pypy/).

      However, this whole discussion has a feeling of “hey, we can hammer out a fast but simple array module that doesn’t talk to any other libraries, but it’s straightforward and it’ll be fast and what could possibly be more important than speed?”

      Apologies if the above is glib, but it’s the sentiment that I am getting from some of the PyPy devs. And at best, it’s overly ambitious, and at worst it’s naive.

      “that porting effort requires some kind of motivation, which for PyPy will be speed”

      But that motivation seems hardly enough, especially since, as Travis himself has laid out in his blog post (http://technicaldiscovery.blogspot.com/2011/10/thoughts-on-porting-numpy-to-pypy.html), Numpy will get a massive amount of new development in the next year. It may be of some benefit to existing PyPy users, but I seriously doubt it will get any adoption from the mainstream Scipy community until there is a story for integration with existing libraries. One thing it almost certainly *will* do, however, is make the Python for science ecosystem even more clouded for newcomers, who, when faced with the question of “CPython 2 or CPython 3 or Numpy-on-PyPy with Python 2 or Python 3”, will typically just answer, “MATLAB”.
