Matrix arithmetic performance in gretl

Notes, June 2008

Introduction

This may be of little interest unless you're almost as obsessed with performance as the gretl authors. But just in case you are... We have been experimenting lately with the ATLAS (Automatically Tuned Linear Algebra Software) version of LAPACK and BLAS, and have various points to pass on. Unfortunately we can't just say, "If you want best performance, you should do such-and-such". Too much depends on your machine, your operating system, and the size and type of the problems you typically throw at gretl.

But before delving into the details, here's an attempt at an Executive Summary: If your work with gretl typically involves very large datasets and the use of estimators other than OLS, you may find it worthwhile to install a suitable ATLAS LAPACK/BLAS (and ensure that gretl uses it to the max). On the other hand, if you typically use gretl to analyse small to moderately large datasets, most often via OLS, use of ATLAS is unlikely to be of benefit, and in fact will probably slow things down (though perhaps not enough to notice).

Background info

In gretl, matrix operations fall into two categories: those that are (by default) performed "in-house", using our own C code, and those farmed out to LAPACK.

The "in-house" code covers basic arithmetical operations on matrices (addition, subtraction, multiplication) plus some operations that occur frequently in econometric calculation, such as multiplication of a matrix into its own transpose and construction of quadratic forms. These operations are, we believe, about as well optimized as possible without making special assumptions about the machine on which they're executed (or in other words, while remaining portable across computer architectures).
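
By way of illustration, here's a small gretl script fragment exercising some of these in-house operations (just a sketch, assuming a dataset is in place, e.g. via nulldata, and that the qform function is available in your version of gretl):

  matrix X = mnormal(200, 5)   # 200 x 5 matrix of standard normals
  matrix A = X + X             # addition: in-house code
  matrix B = X'X               # X-transpose times X: in-house code
  matrix b = mnormal(1, 5)
  matrix Q = qform(b, B)       # quadratic form b*B*b': in-house code
  print B Q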

The operations we farm out to LAPACK include things like QR and SVD decomposition (and inversion), eigen-analysis, and LU solution of systems of equations. One operation that straddles the divide is Cholesky decomposition: we do this "in-house" in the context of OLS, but we pass it to LAPACK when such a decomposition is specifically requested by the user (as in the cholesky matrix function).
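
For reference, here's a comparable sketch of operations that end up in LAPACK (again, exact function availability depends on your gretl version):

  matrix A = mnormal(6, 6)
  matrix S = A'A + I(6)        # a symmetric positive definite matrix
  matrix L = cholesky(S)       # Cholesky factor: passed to LAPACK
  matrix lam = eigensym(S)     # eigenvalues of a symmetric matrix
  matrix sv = svd(A)           # singular values
  matrix Ai = inv(A)           # inversion
  print L lam sv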

As you may know, when LAPACK is called upon to perform some relatively high-level matrix operation (such as QR decomposition), the grunt work is in turn handed off to the BLAS (Basic Linear Algebra Subprograms). In most implementations, the LAPACK system as a whole comprises two libraries: on unix-type systems these are typically named liblapack.so and libblas.so.

Where ATLAS fits in

On most systems (but see below for special info on Apple's OS X), the default installation of LAPACK plus BLAS is the "reference" version, based on Fortran code from netlib.org. Like gretl's own matrix arithmetic code, this is intended to be portable and is not specially optimized for any particular processor. What ATLAS offers is (a) a library containing a set of highly optimized routines which can function as a drop-in replacement for the reference netlib BLAS, plus (b) a subset of the LAPACK routines, similarly optimized. Thus it is possible to build a LAPACK-plus-BLAS set of libraries which uses ATLAS LAPACK functionality where available (otherwise regular netlib code), in conjunction with a fully optimized BLAS. (In addition, it's possible to cajole gretl into using optimized BLAS routines in place of in-house code for matrix multiplication, as discussed below.)

Well, as some readers may know, the foregoing paragraph is a fudge. The ATLAS project (principal author and maintainer, R. Clint Whaley of the University of Texas at San Antonio) does not offer any libraries as such: what it really offers is a very clever piece of software which probes your machine in depth, and on the basis of that probe, builds libraries (BLAS and part of LAPACK) optimized for your machine. To get the full benefit of ATLAS, you'll have to configure and build it yourself, on your own computer. To be specific, the best ATLAS for your computer depends not only on the instruction set it supports (sse, sse2, and so on) but also, inter alia, on the sizes of its L1 and L2 caches.

That said, and not wishing to underestimate anyone's acumen, we suspect that few users of gretl will be building their own ATLAS. Doing so requires sophisticated computer skills: you will have to follow a quite densely argued HOWTO, and also to modify/update the instructions therein. For example (as of this writing, in June 2008), if you are using a current version of the GNU C compiler, gcc, it will be necessary (to obtain a system that works at all) to hand-edit ATLAS's Make.inc (Google is your friend, hint libgcc_s). You will also have to figure out for yourself the linker flags required to get a program such as gretl to use your new ATLAS libraries (without the dreaded "undefined symbol" error), since no guidance on this matter is provided in the ATLAS documentation.

If you are knowledgeable and determined enough not to be deterred by the previous paragraph then fine, you probably don't need much help from us (though we will give you the linker flags). Otherwise, let's move on to the easier options, which are platform-specific.

OS X

OS X is a special case: Apple maintains a tight liaison between hardware and software, and OS X ships with a version of LAPACK plus BLAS that is optimized for the supplied processor, namely the vecLib framework. You might perhaps be able to do better using ATLAS, but you're on your own! You already have something that is well in advance of basic netlib LAPACK and BLAS. The pre-built disk image (dmg) version of gretl uses Apple's vecLib. But you may wish to look at the section on substituting for gretl's native matrix multiplication, below.

MS Windows

Microsoft offers no support for LAPACK/BLAS or ATLAS. The gretl installer for Windows provides the reference netlib LAPACK and BLAS, cross-compiled for win32 using gcc on Linux. Beyond that, sorry, but you're on your own. If you're on Intel, you might wish to take a look at the Intel Math Kernel Library (MKL); if you're adventurous you might consider building ATLAS yourself.

Linux

We said above that for optimal results, ATLAS should be configured and built on your own machine. Nonetheless, it may be possible to get a substantial speed-up over the reference LAPACK/BLAS by selecting from a hierarchy of pre-built ATLAS libraries. You scan the hierarchy and select the highest ATLAS variant that is compatible with your processor. This sort of choice is offered by Ubuntu and, we suppose, other modern Linux distributions. As of this writing, the highest "generic" ATLAS variant on offer is likely to be one that supports the sse2 instruction set. Most current PCs will support this. If in doubt, try (at the command prompt in a terminal window)

      cat /proc/cpuinfo | grep sse2

If you get something back from this command, you're OK for an atlas-sse2 package (or whatever your distribution calls it). On Ubuntu, at any rate, if you install ATLAS the linker takes care of using its libraries instead of the reference LAPACK/BLAS. You don't need to do anything special when building gretl.

How much do you lose by following this approach, compared to building ATLAS on your own machine? Short answer: not a heck of a lot, but for details see Appendix B.

Substituting for gretl's native matrix multiplication

As we said earlier, by default gretl does matrix multiplication using native C code. Nonetheless, you can get gretl to use the BLAS for this task if you wish. The two basic operations affected are (a) general multiplication (native code versus the BLAS function dgemm) and (b) multiplication of a matrix into its own transpose, or vice versa (native code versus BLAS dsyrk).

Whether or not the BLAS are used depends on the variable blas_nmk_min, which can be modified via gretl's set command. (This is an experimental, undocumented feature in gretl 1.7.5.) The idea is that nmk denotes the product of the three dimensions operative in matrix multiplication (for an n x m matrix times an m x k matrix, the product n*m*k), and the BLAS will be used if this product is at least equal to the given minimum. If you set blas_nmk_min to 0, the BLAS will always be used. If you set this variable to, say, 5000, the BLAS will be used if and only if the product of the three dimensions is at least 5000. And if you set blas_nmk_min to a negative number, the BLAS will never be used; the current default is -1, meaning that native code is always used.
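
By way of example, here's how the setting might be used in a script (a sketch only; as noted, the feature is experimental and undocumented):

  set blas_nmk_min 0          # always use the BLAS (dgemm/dsyrk)
  matrix A = mnormal(300, 400)
  matrix B = mnormal(400, 200)
  matrix C = A*B              # n*m*k = 300*400*200 = 24,000,000: BLAS
  set blas_nmk_min 5000       # use the BLAS only when n*m*k >= 5000
  matrix D = mnormal(10, 10)
  matrix E = D*D'             # 10*10*10 = 1000, below 5000: native code
  set blas_nmk_min -1         # revert to the default: native code only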

Note that if your BLAS is the reference netlib version, you will not get any advantage from shifting blas_nmk_min off its default value. Our experiments show that native gretl code is at least as fast as the reference netlib code: it involves fewer function calls, and takes advantage of a few gretl-specific constraints.

Size matters!

The discussion above may suggest that if you can build for yourself, or otherwise get hold of, a LAPACK/BLAS that is tuned to your processor (whether via ATLAS, Apple's vecLib, or perhaps Intel's MKL), you'd be better off using that. And moreover, that you'd be better off setting gretl's blas_nmk_min to zero, so that your optimized BLAS handles all basic matrix multiplication.

Would that it were so easy! It's true that large matrix operations go much faster with ATLAS. But there's no free lunch, and the overhead associated with ATLAS can slow down small calculations.

To illustrate the best-case speed-up, which is certainly impressive, here are timings from an Intel Core 2 Duo (1.83GHz) for a sequence of matrix multiplications of increasing size: ATLAS does the job about 5 times faster. (See Appendix A for the script that generated these results.)

Using native matrix multiplication code:

Case 1 ------------------------------------
n =  512, time =   0.31 (speed =    43296)
n = 1024, time =   0.66 (speed =    40672)
n = 2048, time =   1.30 (speed =    41298)
Case 2 ------------------------------------
n =  512, time =   0.30 (speed =    44739)
n = 1024, time =   1.28 (speed =    41943)
n = 2048, time =   5.11 (speed =    42025)
Case 3 ------------------------------------
n =  512, time =   0.31 (speed =    43296)
n = 1024, time =   2.55 (speed =    42108)
n = 2048, time =  20.09 (speed =    42757)
    

Using ATLAS BLAS:

Case 1 ------------------------------------
n =  512, time =   0.06 (speed =   223696)
n = 1024, time =   0.12 (speed =   223696)
n = 2048, time =   0.24 (speed =   223696)
Case 2 ------------------------------------
n =  512, time =   0.05 (speed =   268435)
n = 1024, time =   0.24 (speed =   223696)
n = 2048, time =   0.93 (speed =   230912)
Case 3 ------------------------------------
n =  512, time =   0.06 (speed =   223696)
n = 1024, time =   0.46 (speed =   233422)
n = 2048, time =   3.50 (speed =   245427)
    

The largest of the above tests involves multiplication of two 2048 by 2048 matrices. In most econometric work, however, the relevant matrices are likely to be much smaller. Consider the T by k matrix of regressors, X. The number of observations, T, may sometimes be very large, but k is typically fairly small, and the key matrix X'X will rarely be anything like as large as the ones handled above. ATLAS achieves a lot of its speed-up by employing a blocking algorithm: matrix multiplications are split into blocks, sized to take best advantage of the machine's cache. This involves costs in set-up and clean-up; in relatively small problems such costs may dominate the timing.
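
To put some hypothetical numbers on this: with T = 10000 observations and k = 20 regressors, forming X'X involves an nmk product of 20*10000*20 = 4 million, against 2048 cubed (about 8.6 billion) for the largest test above. The following fragment just spells out that arithmetic:

  # hypothetical "typical" regression dimensions
  scalar T_obs = 10000
  scalar k = 20
  scalar nmk_xtx = k * T_obs * k        # X'X: (k x T) times (T x k)
  scalar nmk_test = 2048 * 2048 * 2048  # largest multiplication timed above
  printf "X'X nmk = %g, test nmk = %g, ratio = %.0f\n", nmk_xtx, nmk_test, nmk_test/nmk_xtx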

We haven't tested very systematically as yet, but do have some sense of how this plays out in practice. For example, we have a suite of "typical" gretl scripts that we use for regression testing. The execution time for a set of 150-plus such scripts on a Core 2 Duo machine is around 3 seconds, using the reference netlib LAPACK and BLAS, and with native matrix multiplication. When we switch to maximal use of ATLAS the time increases slightly, by something like 10 percent (see also Appendix C). A difference of this order is not going to be noticeable when you're running a single "typical" script, but it may be appreciable if you're doing a Monte Carlo experiment, or running a complex estimator on a moderate-sized dataset.

Hence the point made at the outset: Will you get a speed-up from using ATLAS with gretl? Unfortunately the answer is, It depends. We'd be glad to hear from anyone who tries experiments of their own.


Footnote: assuming you've installed the ATLAS shared libraries in the default location, /usr/local/atlas/lib, the appropriate linker invocation is:

	-L/usr/local/atlas/lib -llapack -lf77blas -lcblas -latlas
      

Appendix A: gretl matrix timing script

set echo off
set messages off
# The following line is activated only for ATLAS
set blas_nmk_min 0
nulldata 10
n0 = 512
pwr = 3

printf "Case 1 ------------------------------------\n"

n = n0
loop pwr
  fpms = n0*n*n0
  a = mnormal(n0,n)
  b = mnormal(n,n0)
  set stopwatch
  c = a*b
  f0 = $stopwatch
  printf "n = %4d, time = %6.2f (speed = %8.0f)\n", n, f0, fpms/(10000*f0)
  n *= 2
endloop

printf "Case 2 ------------------------------------\n"

n = n0
loop pwr
  fpms = n*n0*n
  a = mnormal(n,n0)
  b = mnormal(n0,n)
  set stopwatch
  c = a*b
  f0 = $stopwatch
  printf "n = %4d, time = %6.2f (speed = %8.0f)\n", n, f0, fpms/(10000*f0)
  n *= 2
endloop

printf "Case 3 ------------------------------------\n"

n = n0
loop pwr  
  fpms = n*n*n
  a = mnormal(n,n)
  b = mnormal(n,n)
  set stopwatch
  c = a*b
  f0 = $stopwatch
  printf "n = %4d, time = %6.2f (speed = %8.0f)\n", n, f0, fpms/(10000*f0)
  n *= 2
endloop

Appendix B: packaged versus hand-built ATLAS

We tried the following experiment with Ubuntu 8.04 "Hardy" running on a dual-core Pentium D machine: first, install the highest available Ubuntu ATLAS package, libatlas3gf-sse2, and time gretl as linked against the LAPACK/BLAS from that package. Then hand-build ATLAS (version 3.8.2, with LAPACK version 3.1.1, using gcc 4.2.3) and run the timing test again. The results are shown below. As you can see, the per-machine optimization produces an extra performance gain, but most of the gains over vanilla matrix multiplication are realized by the libatlas3gf-sse2 package.

Using native matrix multiplication code:

Case 1 ------------------------------------
n =  512, time =   0.42 (speed =    31957)
n = 1024, time =   0.82 (speed =    32736)
n = 2048, time =   1.64 (speed =    32736)
Case 2 ------------------------------------
n =  512, time =   0.42 (speed =    31957)
n = 1024, time =   1.73 (speed =    31033)
n = 2048, time =   7.21 (speed =    29785)
Case 3 ------------------------------------
n =  512, time =   0.44 (speed =    30504)
n = 1024, time =   3.38 (speed =    31768)
n = 2048, time =  27.73 (speed =    30977)

Using Ubuntu's libatlas3gf-sse2:

Case 1 ------------------------------------
n =  512, time =   0.09 (speed =   149131)
n = 1024, time =   0.17 (speed =   157903)
n = 2048, time =   0.33 (speed =   162688)
Case 2 ------------------------------------
n =  512, time =   0.10 (speed =   134218)
n = 1024, time =   0.37 (speed =   145100)
n = 2048, time =   1.47 (speed =   146087)
Case 3 ------------------------------------
n =  512, time =   0.10 (speed =   134218)
n = 1024, time =   0.68 (speed =   157903)
n = 2048, time =   5.11 (speed =   168100)

Using hand-built ATLAS 3.8.2:

Case 1 ------------------------------------
n =  512, time =   0.08 (speed =   167772)
n = 1024, time =   0.15 (speed =   178957)
n = 2048, time =   0.27 (speed =   198841)
Case 2 ------------------------------------
n =  512, time =   0.08 (speed =   167772)
n = 1024, time =   0.32 (speed =   167772)
n = 2048, time =   1.24 (speed =   173184)
Case 3 ------------------------------------
n =  512, time =   0.08 (speed =   167772)
n = 1024, time =   0.56 (speed =   191740)
n = 2048, time =   4.18 (speed =   205501)

Appendix C: yet more timings

The following table shows timings (the "real" value from the GNU time program) for running batches of gretl scripts on a Lenovo ThinkPad (Core 2 Duo, 1.83GHz). The vanilla timings are from gretl linked against netlib LAPACK 3.1.1 and the netlib BLAS (compiled using gfortran from gcc 4.3.1, with -O2 optimization), and with all gretl matrix multiplication done in-house. The atlas timings are from gretl linked against ATLAS 3.8.2, with blas_nmk_min set to 1000. The numbers in the last column show that the atlas runs were marginally slower in all cases apart from the arbond scripts (Arellano-Bond dynamic panel data modeling).

                             timings (seconds)
subdir     scripts              (1)       (2)
                             vanilla     atlas    (2)/(1)
top            157   run 1    3.082     3.378
                     run 2    3.111     3.408
                     mean     3.097     3.393      1.10
arbond          24   run 1    2.081     2.073
                     run 2    2.072     2.057
                     mean     2.077     2.065      0.99
mle             10   run 1    4.353     4.410
                     run 2    4.350     4.455
                     mean     4.352     4.433      1.02
sarma            6   run 1    4.155     4.418
                     run 2    4.163     4.354
                     mean     4.159     4.386      1.05
arma            31   run 1    4.025     4.287
                     run 2    4.039     4.314
                     mean     4.032     4.301      1.07
gmm             12   run 1    3.601     3.681
                     run 2    3.623     3.682
                     mean     3.612     3.682      1.02
vecm            16   run 1    0.277     0.325
                     run 2    0.275     0.319
                     mean     0.276     0.322      1.17
s_watson        14   run 1    1.069     1.119
                     run 2    1.066     1.169
                     mean     1.068     1.144      1.07