It has become extraordinarily difficult to write software that performs close to optimally on complex modern microarchitectures. Particularly plagued are domains that are data-intensive and require complex mathematical computations, such as information retrieval, scientific simulations, graphics, communication, control, and multimedia processing. In these domains, performance-critical components are usually written in C (with possible extensions) and often even in assembly, carefully “tuned” to the platform’s architecture and microarchitecture. Specifically, the tuning includes optimization for the memory hierarchy and for different forms of parallelism. The result is usually long, rather unreadable code that needs to be rewritten or re-tuned with every platform upgrade. On the other hand, the performance penalty for relying on straightforward, non-tuned, more elegant implementations is typically a factor of 10, 100, or even more. The reasons for this large gap include (likely) inherent limitations of compilers, such as the lack of domain knowledge and the lack of an efficient mechanism to explore the usually large set of transformation choices. The recent end of CPU frequency scaling, and thus of free software speed-up, and the advent of mainstream parallelism with its increasing diversity of platforms further aggravate the problem.
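To make the memory-hierarchy part of this tuning concrete, the following sketch in C (the block size NB is a hypothetical choice, not a value from the talk) contrasts a straightforward triple-loop matrix multiplication with a cache-blocked variant of the kind hand-tuned code starts from:

    /* Straightforward matrix-matrix multiply (C += A*B for n x n matrices):
       elegant, but traverses B with large strides and reuses little cached data. */
    void mmm_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

    /* Cache-blocked variant: NB is a tuning parameter chosen for the target's
       cache sizes; real tuned kernels add vectorization, register blocking,
       and threading on top of this. */
    #define NB 64
    void mmm_blocked(int n, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += NB)
            for (int kk = 0; kk < n; kk += NB)
                for (int jj = 0; jj < n; jj += NB)
                    for (int i = ii; i < ii + NB && i < n; i++)
                        for (int k = kk; k < kk + NB && k < n; k++)
                            for (int j = jj; j < jj + NB && j < n; j++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }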
No promising general solution to this problem (besides extensive and expensive hand-coding) is on the horizon. One approach that has emerged from the numerical computing and compiler communities in the last decade is called automatic performance tuning, or autotuning. In its most common form it involves the consideration or enumeration of alternative implementations, usually controlled by parameters, coupled with search algorithms to find the fastest. However, the search space still has to be identified manually; it may be very different even for related functionality; it is not clear how to handle parallelism; and a new platform may require a complete redesign of the autotuning framework.
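A minimal sketch of this idea in C follows (the kernel, its single tuning parameter, and the candidate values are hypothetical; this illustrates parameter-controlled implementations plus search, not Spiral's mechanism):

    #include <stdio.h>
    #include <time.h>

    #define N 1024
    static double A[N][N], B[N][N];

    /* Hypothetical parameterized kernel: an out-of-place transpose whose
       blocking factor nb is the only tuning knob (N is divisible by all
       candidate values below). */
    static void transpose_blocked(int nb) {
        for (int ii = 0; ii < N; ii += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb; i++)
                    for (int j = jj; j < jj + nb; j++)
                        B[j][i] = A[i][j];
    }

    /* Exhaustive search over a small candidate set; real autotuners search much
       larger, structured spaces and use smarter strategies than enumeration.
       Warm-up, repetitions, and guarding against dead-code elimination are
       omitted for brevity. */
    int main(void) {
        const int candidates[] = {8, 16, 32, 64, 128};
        int best = candidates[0];
        double best_time = 1e30;
        for (int c = 0; c < 5; c++) {
            clock_t start = clock();
            transpose_blocked(candidates[c]);
            double t = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (t < best_time) { best_time = t; best = candidates[c]; }
        }
        printf("fastest block size: %d\n", best);
        return 0;
    }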
On the other hand, since the overall problem is one of productivity, maintainability, and quality (namely performance), it falls squarely into the domain of software engineering. However, even though a large body of sophisticated software engineering theory and tools exists, to date this community appears not to have focused much on mathematical computations or on performance in the detailed, close-to-optimal sense above. The reason for the latter may be that performance, unlike various aspects of correctness, is not syntactic in nature (and in reality is often even unpredictable and, well, messy).
The aim of this talk is to draw attention to the performance/productivity problem for mathematical applications and to make the case for a more interdisciplinary attack. As a set of thoughts in this direction, we offer some of the lessons we have learned in the last decade in our own research on Spiral. Spiral can be viewed as an automatic performance programming framework for a small but important class of functions called linear transforms. Key techniques used in Spiral include staged declarative domain-specific languages to express algorithm knowledge and algorithm transformations, platform-cognizant rewriting systems for parallelism and locality optimizations, and search and machine learning techniques to navigate the spaces of possible choices. Experimental results show that the code generated by Spiral competes with, and sometimes outperforms, the best available human-written code. Spiral has been used to generate parts of Intel's commercial libraries IPP and MKL.
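As one concrete example of such declaratively expressed algorithm knowledge, the Spiral literature writes the classic Cooley-Tukey factorization of the discrete Fourier transform as a breakdown rule over structured matrices (notation sketched here, not quoted verbatim):

    DFT_{nm} -> (DFT_n ⊗ I_m) T^{nm}_m (I_n ⊗ DFT_m) L^{nm}_n

where ⊗ denotes the Kronecker (tensor) product, T^{nm}_m is a diagonal matrix of twiddle factors, and L^{nm}_n is a stride permutation. Rewriting such expressions, guided by platform knowledge, is how parallelism and locality optimizations are applied, and the recursive choices of n and m are part of the space that search and learning navigate.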
Markus Püschel is a Professor and former Department Head of Computer Science at ETH Zurich, Switzerland. Before that, he was a Professor of Electrical and Computer Engineering at Carnegie Mellon University, where he still has adjunct status. He received his Diploma (M.Sc.) in Mathematics and his Doctorate (Ph.D.) in Computer Science, in 1995 and 1998, respectively, both from the University of Karlsruhe, Germany.