My previous blog post was trying to diplomatically point out that Numpy is a whole lot more than merely a fast array library. Somehow the more subtle point seems to be lost on a number of people, so this follow-up post is going to be more blunt. (But not mean or angry – I was feeling rather angry when I first started this blog entry, but then I realized it was because I was hungry, so I had a Snickers bar and re-wrote it.)
I think people should re-read Travis’s blog post on porting Numpy to PyPy, and really read it carefully. Travis not only created Numpy and helped create Scipy, but in his professional life he has been invited to labs and universities all over the world to speak about the use of Python in scientific computing. He has seen, first-hand, what works and what doesn’t when trying to push Python into this space, so even though he makes many high-level points in his blog, he has been in the trenches enough to have a very good sense of what users are looking for. He does not use his words lightly, and he very much means what he says.
In looking over the discussion on this subject, I realized that Travis has basically already said all that needs to be said. However, certain things bear highlighting.
A Low-level API Is A Feature
Travis: “Most of the scientists and engineers who have come to Python over the past years have done so because it is so easy to integrate their legacy C/C++ and Fortran code into Python.
“Most of these [Scipy, Matplotlib, Sage] rely on not just the Python C-API but also the NumPy C-API which you would have to have a story for to make a serious technical user of Python get excited about a NumPy port to PyPy.” (emphasis mine)
The bottom line: There is no such thing as “Numpy on PyPy” without a low-level API for extensions.
Call it “Array Intrinsics for PyPy” if you will. Call it “Fast Arrays for PyPy”. But to call it “Numpy Support for PyPy” without offering the integration ability that Travis alludes to, and without actually using any of Numpy’s code (for FFTs, linear algebra, etc.) is kind of false advertising. (David Cournapeau tries to gently make this point on the mailing list by saying that “calling it numpy is a bit confusing”.)
As I pointed out in a previous post, there are dynamic compilers and JITters for Numpy right now; however, none of them call themselves “PyPy for Numpy”, because there is a huge portion of the PyPy feature set they don’t support. Just because they do some JITting, they are not PyPy. Likewise, writing an array object with operator overloading does not mean you have Numpy.
Travis: “It’s not to say that there isn’t some value in re-writing NumPy in PyPy, it just shouldn’t be over-sold and those who fund it should understand what they aren’t getting in the transaction.” (emphasis mine)
On the pypy-dev mailing list, Jacob Hallen said: “We did a survey last spring, in which an overwhelming number of people asked for numpy support.” I am curious about the exact wording of this survey. By “Numpy support”, did respondents mean “a fast array library, with no hope of integration with Scipy, Scikits, Matplotlib, etc.”? If so, then the PyPy team may very well be able to meet their demand. However, I will wager that when most people ask for “Numpy support”, they also expect some level of compatibility, or at least a reasonable timeline for integration, with the rest of the Scipy stack. It might be worth re-polling those folks who wanted “Numpy support” and asking them how useful a fast array library in PyPy would be if it could not be integrated with Scipy, Matplotlib, etc., and if there were no way to integrate their own extension modules.
In my experience, there is virtually no one who uses Numpy who does not also, directly or indirectly, use something that consumes its C-API. Most folks get at that API via Swig, Cython, Weave, f2py, etc., but they are nonetheless very much relying on it.
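For concreteness, here is a minimal Cython sketch of what that reliance looks like (the file and function names are mine, purely for illustration). The cimport numpy line is precisely the Numpy C-API dependency Travis is talking about: it compiles against numpy/arrayobject.h, and the typed buffer access below becomes raw pointer arithmetic on the array’s data.

```cython
# sumsq.pyx -- illustrative example, not from any particular project
import numpy as np
cimport numpy as cnp       # binds against the NumPy C-API at compile time

cnp.import_array()         # mandatory C-API initialization

def sumsq(cnp.ndarray[cnp.float64_t, ndim=1] x):
    """Sum of squares over a 1-D float64 array, at C speed."""
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]   # direct buffer access, no Python objects
    return total
```

An alternative interpreter that offers “Numpy” without that C-level surface leaves every module like this one – and every Swig, Weave, and f2py wrapper like it – behind.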
The Joy of Glue
On PyPy’s donation page, there is this claim:
“bringing NumPy into the equation is a reasonable next step – as it’s a very convenient and popular tool for doing this kind of work. The resulting implementation should move Python in scientific world from being a merely glue language into being the main implementation language for a lot of people in the scientific/numeric worlds.”
This statement encapsulates most of the philosophical differences between the PyPy guys and the Scipy guys in this discussion. I’m pretty sure that while almost everyone in the Scipy world would rather be coding Python than wading through mounds of legacy C or FORTRAN, most of them have been in the trenches long enough to know that reimplementing libraries is simply not possible in most real-world cases. Even if they could make all new development happen in RPython, they would still need to interface with those legacy libraries. If the PyPy devs want to field their Python implementation as a serious contender in the scientific computing space, they absolutely have to have a way to use it as a “glue” language.
Furthermore, the use of the word “merely” in “merely glue language” really highlights the philosophical difference. I’m sure that for language purists who enjoy geeking out on compilers and VMs and Lambda-the-Ultimate, something as mundane as a low-level VM API for an external language as unsexy as FORTRAN deserves the sneer of “merely”. But for the many tens of thousands of scientists and researchers around the world who rely on Python to get their jobs done, the ability to glue together disparate tools, frameworks, languages, and data is utterly priceless. I would even argue that this is one of Python’s essential features; without it, to many people, Python would just be a slightly prettier MATLAB. (Actually, lacking an infix matrix multiplication operator, some would argue that it’s not even that much prettier.)
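To make that parenthetical concrete: a product that MATLAB writes infix as D = A*B*C becomes nested function calls in Numpy, because * on arrays means elementwise multiplication. A trivial example, with throwaway matrices:

```python
import numpy as np

A = np.eye(3)
B = 2.0 * np.eye(3)
C = np.arange(9.0).reshape(3, 3)

# MATLAB: D = A * B * C
D = np.dot(np.dot(A, B), C)   # * on ndarrays is elementwise, not matrix, multiply
```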
A High-Level Future
In truth, I am personally torn by this brouhaha with PyPy, because I actually agree with their ultimate goals. I, too, envision a better world in which people write in high-level languages and achieve better performance than is even possible at the C level. However, I have seen enough of the problem space to know that (1) any solution that doesn’t integrate with legacy code will be Dead On Arrival, and (2) JIT optimizations of the normal PyPy variety (i.e., approaches that work well for Algol-derived languages) have limited efficacy in the space of serious scientific and high-performance computing. As Travis says in his blog post:
“I am also a true-believer in the ability for high-level languages to achieve faster-than-C speeds. In fact, I’m not satisfied with a Python JIT. I want the NumPy constructs such as vectorization, fancy indexing, and reduction to be JIT compiled.”
It is very important to note that Travis says “JIT compiled” and not “JIT optimized”. In a subsequent post, I will discuss why JIT optimization is not the be-all and end-all for scientific computing, and what could be even better.
How Can PyPy and Numpy Work Together?
It occurs to me that it might help illuminate the discussion if we looked at a concrete example of an alternative approach to collaboration between Numpy and PyPy. A few years ago, Ilan Schnell implemented a very interesting “fast_vectorize” decorator for Numpy ufuncs, which used PyPy’s translator to dynamically generate natively compiled versions of ufuncs written in Python. It took him only about a week, and I believe most of that time went into isolating the RPython-to-C translation functionality.
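I am reconstructing the interface from memory, so take the following as a rough sketch rather than Ilan’s actual code. The stand-in body here just falls back to numpy.frompyfunc so that it runs anywhere; the real decorator handed the function to PyPy’s RPython-to-C translator and wrapped the resulting native code instead.

```python
import numpy as np

def fast_vectorize(func):
    """Stand-in for the real decorator: Ilan's version translated `func`
    to C via PyPy's RPython toolchain; this fallback builds an
    object-dtype ufunc, so only the call signature matches."""
    ufunc = np.frompyfunc(func, 1, 1)
    def wrapped(x):
        return ufunc(x).astype(np.float64)
    return wrapped

@fast_vectorize
def sinc(x):
    # an example scalar kernel a user might want natively compiled
    return 1.0 if x == 0.0 else np.sin(x) / x

print(sinc(np.linspace(-2.0, 2.0, 5)))
```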
Furthermore, at this year’s Scipy conference alone, I counted at least three projects doing AST walking and code munging to convert high-level Python code into kernels for fast evaluation, and not merely for Numpy; these included differential equation solvers and the like. Supporting such efforts at dynamic compilation of Python would be massively useful for the Scipy community at large, and there is almost no one better poised to do this than the PyPy devs. I think that would be a far, far more useful application of their efforts.
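The front half of all of those projects looks roughly the same, and it is the part PyPy’s machinery could make routine: recover the source of a user-supplied function, parse it into an AST, and walk the tree. A minimal sketch follows; everything past the tree walk – the actual code generation – is where the projects diverge, so it is elided here.

```python
import ast
import inspect
import textwrap

def walk_kernel(func):
    """Parse a Python function into an AST and report its node types --
    the first step of the AST-walking projects described above.  A real
    project would transform this tree and emit C, LLVM IR, or the like."""
    source = textwrap.dedent(inspect.getsource(func))
    tree = ast.parse(source)
    for node in ast.walk(tree):
        print(type(node).__name__)

def axpy(a, x, y):
    # an example kernel: scalar multiply-add
    return a * x + y

walk_kernel(axpy)
```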
I understand that speed has become PyPy’s raison d’être. I also understand that they’ve had wonderfully encouraging results with optimizing vanilla procedural code, like what is found in the Python standard library. But I think it is unwise to announce a big, loud foray into scientific computing while ignoring the wisdom of their fellow Python coders who have been toiling in this space for over a decade.
If the goal is to create a nifty little array module for users who are happy living within the walls of the PyPy interpreter, then I wish them luck and think they will do very well. But if their goal is to have an impact on Python’s role in scientific computing at large, on the scale of what Numpy and Scipy have achieved, then they cannot ignore the integration question, and, IMHO, Cython currently looks to be the most promising avenue for that.