More Thoughts On Arrays in PyPy

October 19, 2011

My previous blog post was trying to diplomatically point out that Numpy is a whole lot more than merely a fast array library. Somehow the more subtle point seems to be lost on a number of people, so this follow-up post is going to be more blunt. (But not mean or angry – I was feeling rather angry when I first started this blog entry, but then I realized it was because I was hungry, so I had a Snickers bar and re-wrote it.)

I think people should re-read Travis’s blog post on porting Numpy to PyPy, and really read it carefully. Travis not only created Numpy and helped create Scipy, but in his professional life he has been invited to labs and universities all over the world to speak about the use of Python in scientific computing. He has seen, first-hand, what works and what doesn’t when trying to push Python into this space, so even though he makes many high-level points in his blog, he has been in the trenches enough to have a very good sense of what users are looking for. He does not use his words lightly, and he very much means what he says.

In looking over the discussion on this subject, I realized that Travis has basically already said all that needs to be said. However, certain things bear highlighting.

A Low-level API Is A Feature

Travis: “Most of the scientists and engineers who have come to Python over the past years have done so because it is so easy to integrate their legacy C/C++ and Fortran code into Python.

Most of these [Scipy,Matplotlib,Sage] rely on not just the Python C-API but also the NumPy C-API which you would have to have a story for to make a serious technical user of Python get excited about a NumPy port to PyPy.” (emphasis mine)

The bottom line: There is no such thing as “Numpy on PyPy” without a low-level API for extensions.

Call it “Array Intrinsics for PyPy” if you will. Call it “Fast Arrays for PyPy”. But to call it “Numpy Support for PyPy” without offering the integration ability that Travis alludes to, and without actually using any of Numpy’s code (for FFTs, linear algebra, etc.) is kind of false advertising. (David Cournapeau tries to gently make this point on the mailing list by saying that “calling it numpy is a bit confusing”.)

As I pointed out in a previous post, there are dynamic compilers and JITters for Numpy right now; however, none of them call themselves “PyPy for Numpy”, because there is a huge portion of the PyPy feature set they don’t support. Just because they do some JITting, they are not PyPy. Likewise, writing an array object with operator overloading does not mean you have Numpy.

Travis: “It’s not to say that there isn’t some value in re-writing NumPy in PyPy, it just shouldn’t be over-sold and those who fund it should understand what they aren’t getting in the transaction.” (emphasis mine)

On the pypy-dev mailing list, Jacob Hallen said: “We did a survey last spring, in which an overwhelming number of people asked for numpy support.” I am curious about the exact wording of this survey. By “Numpy support”, did respondents mean “a fast array library, with no hope of integration with Scipy, Scikits, Matplotlib, etc.”? If so, then the PyPy team may very well be able to meet their demand. However, I will wager that when most people ask for “Numpy support”, they also imply some level of compatibility or at least a reasonable timeline for integration with the rest of the Scipy stack. It might be worth re-polling those folks who wanted “Numpy support” and ask them how useful a fast array library in PyPy would be, if it could not be integrated with Scipy, Matplotlib, etc., and if there was no way to integrate their own extension modules.

In my experience, there is virtually no one who uses Numpy that does not also (directly or indirectly) use something that consumes its C-API. Most folks get at that API via Swig, Cython, Weave, f2py, etc., but nonetheless they are very much relying on it.

The Joy of Glue

On the PyPy’s donation page, there is the claim:

“bringing NumPy into the equation is a reasonable next step – as it’s a very convenient and popular tool for doing this kind of work. The resulting implementation should move Python in scientific world from being a merely glue language into being the main implementation language for a lot of people in the scientific/numeric worlds.”

This statement encapsulates most of the philosophical differences between the PyPy guys and the Scipy guys in this discussion. I’m pretty sure that while almost everyone in the Scipy world would rather be coding Python than wading through mounds of legacy C or FORTRAN, most of them have been in the trenches long enough to know that reimplementing libraries is simply not possible in most real-world cases. Even if they could make all new development happen in RPython, they would still need to interface with those legacy libraries. If the PyPy devs want to field their Python implementation as a serious contender in the scientific computing space, they absolutely have to have a way to use it as a “glue” language.

Furthermore, the use of the word “merely” in “merely glue language” really highlights the philosophic difference. I’m sure that for language purists who enjoy geeking out on compilers and VMs and Lambda-the-Ultimate, something as mundane as a low-level VM API for an external language as unsexy as FORTRAN deserves the sneer of “merely”. But for many tens of thousands of scientists and researchers around the world who rely on Python to get their jobs done, the ability to glue together disparate tools, frameworks, languages, and data is utterly priceless. I would even argue that this is one of its essential features; without it, to many people, Python would just be a slightly prettier MATLAB. (Actually, lacking an infix matrix multiplication operator, some would argue that it’s not even that much prettier.)

A High-Level Future

In truth, I am personally torn by this brouhaha with PyPy, because I actually agree with their ultimate goals. I, too, envision a better world in which people are writing high-level languages to achieve better performance than what is even possible at the C level. However, I have seen enough of the problem space to know that (1) any solution that doesn’t integrate with legacy code will be Dead On Arrival, and (2) JIT optimizations of the normal PyPy variety (i.e. approaches that work well for Algol-derived languages) have limited efficacy in the space of serious scientific and high-performance computing. As Travis says in his blog post:

“I am also a true-believer in the ability for high-level languages to achieve faster-than-C speeds. In fact, I’m not satisfied with a Python JIT. I want the NumPy constructs such as vectorization, fancy indexing, and reduction to be JIT compiled.”

It is very important to note that Travis says “JIT compiled” and not “JIT optimized“. In a subsequent post, I will discuss why JIT optimization is not the end-all and be-all for scientific computing, and what could be even better.

How Can PyPy and Numpy Work Together?

It occurs to me that it might help illuminate the discussion if we looked at a concrete example of an alternative approach to collaboration between Numpy and PyPy. A few years ago, Ilan Schnell implemented a very interesting “fast_vectorize” decorator for Numpy ufuncs, which used PyPy’s translator to dynamically generate a natively-compiled version of ufuncs written in Python. It only took him about a week, and I believe most of it was isolating out the RPython-to-C translation functionality.

Furthermore, at this year’s Scipy conference alone, I counted at least three projects that were doing AST walking and code munging to attempt to convert high-level Python code into kernels for fast evaluation, and not merely for Numpy. These included differential equation solvers and the like. Supporting efforts for dynamic compilation of Python would be massively useful for the Scipy community at large, and there is almost no one better poised to do this than the PyPy devs. I think that would be a far, far more useful application of their efforts.

Conclusion

I understand that speed has become PyPy’s raison d’etre. I also understand that they’ve had wonderfully encouraging results with optimizing vanilla procedural code, like what is found in the Python standard library. But I think it is unwise announce a big, loud foray into scientific computing while ignoring the wisdom of their fellow Python coders who have been toiling in this space for over a decade.

If the goal is to create a nifty little array module for users who are happy living within the walls of the PyPy interpreter, then I wish them luck and think they will do very well. But if their goal is have an impact on Python’s role in scientific computing at large, on the scale of what Numpy and Scipy have achieved, then they cannot ignore the integration question, and, IMHO, Cython currently looks to be the most promising avenue for that.


Numpy isn’t just about fast arrays

October 17, 2011

I’ve been following the rapidly heating discussion about the efforts to develop array intrinsics in PyPy (aka “Numpy on PyPy”), and realized that the two groups of very smart people – Scipy/Numpy folks on the one hand, and the PyPy devs on the other – are actually not in disagreement, but really just talking past one another.

The core problem is that they have very different concepts of what Numpy actually is. From the PyPy perspective, it appears to be primarily about doing fast array computation from a high-level language like Python. This is why they talk about translating RPython to C and making optimized ufuncs and automatically building SSE-optimized versions of fused loops. This is also why they don’t understand what the big deal is with talking to C/FORTRAN extensions – they can emit code for faster array processing with their compiler, and who wouldn’t want that?

I think I understand the perspective on the PyPy side. The low-hanging fruit of basic array computation in Numpy can definitely be improved, and it is a very tempting target. Lots of people feel this way – witness the auto-parallelism and loop fusion work done by the Numexpr and Theano guys, or the stream/generator fusion in my project Metagraph, or even the SSE optimization in CoreFunc.

For the Scipy guys, however, Numpy is not about fast arrays, although that is very nice. It is about *data*. Numpy is the lingua franca for data access and transmission between a widely disparate set of numerical and scientific libraries in Python (save for the notable exception of PIL). I think it would be fair to say that for most Scipy folks, the data connectivity is every bit as important – if not more so – than the absolute speed of the array computations. It means that instead of interfacing N^2 different libraries to each other, any external library only has to consume and expose a single interface.

This is why the Scipy folks keep harping about Cython – it’s rapidly becoming (or has already become) the lingua franca of exposing legacy libraries to Python. Their user base has tons of legacy code or external libraries that they need to interface, and most of the reason Python has had such a great adoption curve in that space is because Numpy has made the data portion of that interface easy. Cython makes the code portion quite painless, as well.

Now, those users don’t particularly care whether the tools they use to build interfaces target the CPython API, which is why the Cython approach is such a great one. It could serve as a common meeting point between folks that have to wrap extension modules, and folks that have to build fast (and safe) VMs.

Furthermore, Travis Oliphant has repeatedly expressed his desire for Numpy itself to move to a Cython implementation, and to mainline the efforts around making the core multidimensional array bits a pure C library that other projects (and languages) can use. That code is not merged in, but the lion’s share of the work has already been done.

Perhaps an analogy would be a locomotive company that decides to apply their engineering expertise to making a car with the same excellent fuel efficiency and towing capacity of a locomotive, but with the proviso that it can only run on railroad tracks. In contrast, the Numpy Jeep might not tow as much, but it can go through forests and over fields and into town, talking to any data from any library, on just about any platform (including supercomputers). That’s actually its most useful characteristic. IMHO Travis’s innovation was the dtype, not overloading __getitem__.

I think if the PyPy folks really want to do something interesting in this space, they should worry less about making their railcar run faster, and more on how to retrofit it with caterpillar tracks, so that it can cover all manner of terrain. Better yet, the Cython guys have been building tracked vehicles for Sage for years, and all of the Scipy world is moving in that direction anyway. We can all be a big happy family!

Update 10/19: Edited title to clarify that while performance is important to Numpy, it is far from being its only feature.  What the comments and Twitter feedback suggest is that some people have misunderstood.  TL;DR: The Numpy C-API is as much a feature as its fast array handling, and the PyPy folks seem to be treating it as an implementation detail, or a “nice to have”.


Coherent Abstractions

October 17, 2011

The key question to ask about a piece of software is, “what abstractions are supported by the deployment runtime environment?”

In general, there is an inverse relationship between the complexity of a program and its correctness.  Programmers of higher quality and skill are able to push this 1/x curve a bit higher along the correctness axis, but they are fundamentally limited by complexity as well.  The only way to reach a different zone of correctness is to step up to a higher level of abstraction.

But higher levels of abstraction are useless unless they form a coherent system.  This is because complexity is a function of the logical coherency of the abstractions you are dealing with.  In general, something is “complicated” because (1) you have to keep lots of concepts in your head at once in order to understand its function, or (2) you have to understand its underlying construction (down the stack of abstraction), or (3) both.

Programs are sets of abstractions built on top of the abstractions offered by the language, which usually are there to offer a coherent runtime to the programs.  But all programs of reasonable complexity will develop their own set of abstractions that have rules about proper interactions, contracts that need to be maintained, etc.  The approach of domain-specific languages or the “little languages” philosophy is that these abstractions and rules and contracts need to be bundled up into a coherent runtime of its own, on top of which the actual program is written.

But language design is very hard.  This is because of the relationship between expressiveness and complexity: the more expressive a language is, the more complex the language primitives are.  A good language designer is able to discover the minimum set of primitives needed to maximize expressiveness, but these craftsmen are rare.

I think part of the reason that statically-typed, heavily object-oriented languages like Java and C++ get a bad rap is because they force all software developers to extend the type system via creation of classes, and thereby become amateur language designers.  The problem is that not very many people are good at distilling an architecture down to a core, coherent set of abstractions, and furthermore, this is not always the most appropriate approach for a great number of actual, production programming challenges.  The next time you are wading through a giant codebase of FactoryManagers and StreamProxies and QueueHandlers, realize that the core shortcoming is one of language: the programmers were mostly likely taught to model objects (hence the proliferation of nouns ending in -er) but not to understand the verbs (i.e. operators).  Without verbs, you have a very poor syntax indeed.

So the question is: if we re-center the task of software development around this notion of building coherent sets of abstractions at higher and higher levels, until the application is easily expressible, then what does the basic project lifecycle look like?  What are the skills that a software developer needs?  What are the software processes? What does the testing framework look like?


Follow

Get every new post delivered to your Inbox.