I’ve been following the rapidly heating discussion about the efforts to develop array intrinsics in PyPy (aka “Numpy on PyPy”), and realized that the two groups of very smart people – the Scipy/Numpy folks on the one hand, and the PyPy devs on the other – are not actually in disagreement, but are simply talking past one another.
The core problem is that they have very different concepts of what Numpy actually is. From the PyPy perspective, it appears to be primarily about doing fast array computation from a high-level language like Python. This is why they talk about translating RPython to C and making optimized ufuncs and automatically building SSE-optimized versions of fused loops. This is also why they don’t understand what the big deal is with talking to C/FORTRAN extensions – they can emit code for faster array processing with their compiler, and who wouldn’t want that?
I think I understand the perspective on the PyPy side. The low-hanging fruit of basic array computation in Numpy can definitely be improved, and it is a very tempting target. Lots of people feel this way – witness the auto-parallelism and loop fusion work done by the Numexpr and Theano guys, or the stream/generator fusion in my project Metagraph, or even the SSE optimization in CoreFunc.
For the Scipy folks, however, Numpy is not about fast arrays, although speed is very nice to have. It is about *data*. Numpy is the lingua franca for data access and interchange among a widely disparate set of numerical and scientific libraries in Python (with the notable exception of PIL). I think it would be fair to say that for most Scipy folks, this data connectivity is every bit as important as – if not more important than – the raw speed of the array computations. It means that instead of writing N^2 pairwise adapters between libraries, each external library only has to consume and expose a single interface.
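To make that concrete, here is a minimal sketch of what “a single interface” buys you. Any object that exposes its memory through the buffer protocol (or `__array_interface__`) can be viewed as a Numpy array without copying; the stdlib `array` module stands in here for an arbitrary external library producing data:

```python
import array

import numpy as np

# Data produced by some "external library" -- here, plain stdlib array.array.
raw = array.array("d", [1.0, 2.0, 3.0])

# Zero-copy view of the same memory through Numpy's common interface.
a = np.frombuffer(raw, dtype=np.float64)

# Because no copy was made, mutations through the Numpy view are visible
# to the original producer of the buffer:
a[0] = 99.0
print(raw[0])  # 99.0
```

Every library that speaks this one interface can exchange data with every other one, which is the N-vs-N^2 point above.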
This is why the Scipy folks keep harping about Cython – it’s rapidly becoming (or has already become) the lingua franca for exposing legacy libraries to Python. Their user base has tons of legacy code and external libraries that they need to interface with, and most of the reason Python has had such a great adoption curve in that space is because Numpy has made the data portion of that interface easy. Cython makes the code portion quite painless, as well.
Now, those users don’t particularly care whether the tools they use to build interfaces target the CPython API, which is why the Cython approach is such a great one. It could serve as a common meeting point between folks that have to wrap extension modules, and folks that have to build fast (and safe) VMs.
Furthermore, Travis Oliphant has repeatedly expressed his desire for Numpy itself to move to a Cython implementation, and to mainline the efforts around making the core multidimensional array bits a pure C library that other projects (and languages) can use. That code is not merged in, but the lion’s share of the work has already been done.
Perhaps an analogy would be a locomotive company that decides to apply its engineering expertise to making a car with the same excellent fuel efficiency and towing capacity as a locomotive – but with the proviso that it can only run on railroad tracks. In contrast, the Numpy Jeep might not tow as much, but it can go through forests and over fields and into town, consuming data from any library, on just about any platform (including supercomputers). That is actually its most useful characteristic. IMHO, Travis’s real innovation was the dtype, not the overloading of __getitem__.
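The dtype point deserves a short illustration. A structured dtype describes the byte-level layout of a record, which is what lets Numpy map directly onto C structs, binary file formats, and foreign memory – the `particle` record below is a made-up example:

```python
import numpy as np

# A structured dtype: the memory layout of one record, field by field.
particle = np.dtype([
    ("x", np.float64),
    ("y", np.float64),
    ("charge", np.int8),
])

data = np.zeros(3, dtype=particle)
data["x"] = [0.5, 1.5, 2.5]
data["charge"] = [1, -1, 1]

print(data["x"].mean())   # 1.5
print(particle.itemsize)  # 17 (packed: 8 + 8 + 1 bytes per record)
```

The indexing sugar is pleasant, but it is this machine-level description of data that any C, Fortran, or on-disk consumer can agree on.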
I think if the PyPy folks really want to do something interesting in this space, they should worry less about making their railcar run faster, and more about how to retrofit it with caterpillar tracks, so that it can cover all manner of terrain. Better yet, the Cython guys have been building tracked vehicles for Sage for years, and all of the Scipy world is moving in that direction anyway. We can all be one big happy family!
Update 10/19: Edited title to clarify that while performance is important to Numpy, it is far from being its only feature. Judging from the comments and Twitter feedback, some people have misunderstood this point. TL;DR: The Numpy C-API is as much a feature as its fast array handling, and the PyPy folks seem to be treating it as an implementation detail, or a “nice to have”.