Notes, June 2008

This may be of little interest unless you're almost as obsessed with
performance as the gretl authors. But just in case you are... We
have been experimenting lately with the ATLAS (Automatically Tuned
Linear Algebra Software) version of LAPACK and BLAS, and have
various points to pass on. Unfortunately we can't just say, "If you
want best performance, you should do such-and-such". Too much
depends on your machine, your operating system, and the size and
type of the problems you typically throw at gretl. But prior to
delving into the details, here's an attempt at an Executive
Summary: *If your work with gretl typically involves very large
datasets and the use of estimators other than OLS, you may find it
worthwhile to install a suitable ATLAS lapack/blas (and ensure that
gretl uses it to the max). On the other hand, if you're typically
using gretl to analyse small to moderately large datasets, and most
often using OLS, use of ATLAS is unlikely to be of benefit, and in
fact will probably slow things down (though perhaps not enough to
notice).*

In gretl, matrix operations fall into two categories: those that are (by default) performed "in-house", using our own C code, and those farmed out to LAPACK.

The "in-house" code covers basic arithmetical operations on matrices (addition, subtraction, multiplication) plus some operations that occur frequently in econometric calculation, such as multiplication of a matrix into its own transpose and construction of quadratic forms. These operations are, we believe, about as well optimized as possible without making special assumptions about the machine on which they're executed (or in other words, while remaining portable across computer architectures).

The operations we farm out to LAPACK include things like QR and SVD
decomposition (and inversion), eigen-analysis, and LU solution of
systems of equations. One operation that straddles the divide is
Cholesky decomposition: we do this "in-house" in the context of OLS,
but we pass it to LAPACK when such a decomposition is specifically
requested by the user (as in the `cholesky` matrix function).
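
By way of comparison, here is a similar sketch of operations that are passed to LAPACK. Only `cholesky` is named above; `inv`, `eigensym` and `qrdecomp` are taken from the gretl function reference and appear here purely for illustration.

nulldata 10
matrix X = mnormal(40,5)
matrix S = X' * X
matrix L = cholesky(S)       # explicit Cholesky request: handed to LAPACK
matrix Si = inv(S)           # inversion: LAPACK
matrix lam = eigensym(S)     # eigen-analysis of a symmetric matrix: LAPACK
matrix Q = qrdecomp(X)       # QR decomposition (Q factor): LAPACK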

As you may know, when LAPACK is called upon to perform some
relatively high-level matrix operation (such as QR decomposition),
the grunt work is in turn handed off to the BLAS (Basic Linear
Algebra Subprograms). In most implementations, the LAPACK system as
a whole comprises two libraries: on unix-type systems these are
typically named `liblapack.so` and `libblas.so`.

On most systems (but see below for special info on Apple's OS X),
the default installation of LAPACK plus BLAS is the "reference"
version, based on Fortran code from `netlib.org`. Like
gretl's own matrix arithmetic code, this is intended to be portable
and is not specially optimized for any particular processor. What
ATLAS offers is (a) a library containing a set of highly optimized
routines which can function as a drop-in replacement for the
reference netlib BLAS, plus (b) a subset of the LAPACK routines,
similarly optimized. Thus it is possible to build a
LAPACK-plus-BLAS set of libraries which uses ATLAS lapack
functionality where available (otherwise regular netlib code), in
conjunction with a fully optimized BLAS. (In addition, it's
possible to cajole gretl into using optimized BLAS routines in place
of in-house code for matrix multiplication, as discussed below.)

Well, as some readers may know, the foregoing paragraph is a fudge.
The ATLAS project (principal author and
maintainer, R. Clint
Whaley of the University of Texas at San Antonio) does not offer
any libraries as such: what it really offers is a very clever piece
of software which probes *your machine* in depth, and on the
basis of that probe, builds libraries (BLAS and part of LAPACK)
optimized for *your machine*. To get the full benefit of ATLAS,
you'll have to configure and build it yourself, on your own
computer. To be specific, the best ATLAS for your computer depends
not only on the instruction set it supports
(`sse`, `sse2`, and so on) but also, *inter
alia*, on the size of L1 and L2 cache.

That said, and not wishing to underestimate anyone's acumen, we
suspect that few users of gretl will be building their own ATLAS.
Doing so requires sophisticated computer skills: you will have to
follow a
quite densely argued HOWTO, and also to modify/update the
instructions therein. For example (as of this writing, in June
2008), if you are using a current version of the GNU C
compiler, `gcc`, it will be necessary (to obtain a system
that works at all) to hand-edit ATLAS's `Make.inc` (Google is
your friend, hint `libgcc_s`). You will also have to figure
out for yourself the linker flags required to get a program such as
gretl to use your new ATLAS libraries (without the dreaded
"undefined symbol" error), since no guidance on this matter is
provided in the ATLAS documentation.

If you are knowledgeable and determined enough not to be deterred by the previous paragraph, then fine: you probably don't need much help from us (though we will give you the linker flags). Otherwise, let's move on to the easier options, which are platform-specific.

OS X is a special case: Apple maintains a tight liaison between
hardware and software, and OS X ships with a version of LAPACK plus
BLAS that is optimized for the supplied processor, namely
the `vecLib` framework. You might perhaps
be able to do better using ATLAS, but you're on your own! You
already have something that is well in advance of basic netlib
LAPACK and BLAS. The pre-built disk image (`dmg`) version of
gretl uses Apple's `vecLib`. But you may wish to look into
the issue of matrix multiplication below.

Microsoft offers no support for LAPACK/BLAS or ATLAS. The gretl
installer for Windows provides the reference netlib LAPACK and BLAS,
cross-compiled for win32 using `gcc` on Linux. Beyond that,
sorry, but you're on your own. If you're on Intel, you might wish
to take a look at the Intel Math Kernel Library (MKL); if you're
adventurous you might consider building ATLAS yourself.

We said above that for optimal results, ATLAS should be configured
and built on your own machine. Nonetheless, it may be possible to
get a substantial speed-up over the reference LAPACK/BLAS by
selecting from a hierarchy of pre-built ATLAS libraries. You scan
the hierarchy and select the highest ATLAS variant that is
compatible with your processor. This sort of choice is offered by
Ubuntu and, we suppose, other modern Linux distributions. As of
this writing, the highest "generic" ATLAS variant on offer is likely
to be one that supports the `sse2` instruction set. Most
current PCs will support this. If in doubt, try (at the command
prompt in a terminal window)

cat /proc/cpuinfo | grep sse2

If you get something back from this command, you're OK for an
`atlas-sse2` package (or whatever your distribution calls
it). On Ubuntu, at any rate, if you install ATLAS the linker takes
care of using its libraries instead of the reference LAPACK/BLAS.
You don't need to do anything special when building gretl.

How much do you lose by following this approach, compared to building ATLAS on your own machine? Short answer: not a heck of a lot, but for details see Appendix B.

As we said earlier, by default gretl does matrix multiplication
using native C code. Nonetheless, you can get gretl to use the BLAS
for this task if you wish. The two basic operations affected are (a)
general multiplication (native code versus the BLAS
function `dgemm`) and (b) multiplication of a matrix into its own
transpose, or vice versa (native code versus BLAS `dsyrk`).

Whether or not the BLAS are used depends on the
variable `blas_nmk_min`, which can be modified via
gretl's `set` command. (This is an experimental,
undocumented feature in gretl 1.7.5.) The idea is that
`nmk` denotes the product of the three dimensions operative
in matrix multiplication, and the BLAS will be used if this product
at least equals the given minimum. If you set `blas_nmk_min`
to 0, the BLAS will always be used. If you set this variable to,
say, 5000, then the BLAS will be used if and only if the product of
the three dimensions is at least 5000. If you
set `blas_nmk_min` to a negative number, the BLAS will never
be used; the current default is -1, meaning that native code is always used.
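
To give a sense of the magnitudes involved: for the product of an m x n matrix and an n x k matrix the relevant quantity is m*n*k, so forming X'X from a 2000 x 200 matrix X gives 200*2000*200 = 80,000,000, far above a threshold of 5000, whereas the same operation on a 20 x 5 matrix gives only 5*20*5 = 500 and would stay in-house. In script terms the switch looks something like the following sketch (the variable name is as described above; the matrices are arbitrary).

nulldata 10
set blas_nmk_min 5000            # use the BLAS when m*n*k >= 5000
matrix X = mnormal(2000, 200)
matrix big = X' * X              # m*n*k = 80 million: candidate for dsyrk
matrix P = X * mnormal(200, 50)  # m*n*k = 20 million: candidate for dgemm
matrix Z = mnormal(20, 5)
matrix small = Z' * Z            # m*n*k = 500: stays in-house
set blas_nmk_min -1              # restore the default: native code only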

Note that if your BLAS is the reference netlib version, you will not
get any advantage from shifting `blas_nmk_min` off its
default value. Our experiments show that native gretl code is at
least as fast as the reference netlib code: it involves fewer
function calls, and takes advantage of a few gretl-specific
constraints.

The discussion above may suggest that if you can build for yourself,
or otherwise get hold of, a LAPACK/BLAS that is tuned to your
processor (either via ATLAS or via Apple's `vecLib`, or
perhaps Intel's MKL), you'd be better off using that; and moreover,
that you'd be better off setting gretl's `blas_nmk_min` to zero,
so that your optimized BLAS handles all basic matrix
multiplication.

Would that it were so easy! It's true that *large* matrix
operations go much faster with ATLAS. But there's no free lunch,
and the overhead associated with ATLAS can slow down small
calculations.

To illustrate the best-case speed-up, which is certainly impressive, here are timings from an Intel Core 2 Duo (1.83GHz) for a sequence of matrix multiplications of increasing size: ATLAS does the job about 5 times faster. (See Appendix A for the script that generated these results.)

Using native matrix multiplication code:

Case 1 ------------------------------------
n =  512, time =   0.31 (speed =    43296)
n = 1024, time =   0.66 (speed =    40672)
n = 2048, time =   1.30 (speed =    41298)
Case 2 ------------------------------------
n =  512, time =   0.30 (speed =    44739)
n = 1024, time =   1.28 (speed =    41943)
n = 2048, time =   5.11 (speed =    42025)
Case 3 ------------------------------------
n =  512, time =   0.31 (speed =    43296)
n = 1024, time =   2.55 (speed =    42108)
n = 2048, time =  20.09 (speed =    42757)

Using ATLAS BLAS:

Case 1 ------------------------------------
n =  512, time =   0.06 (speed =   223696)
n = 1024, time =   0.12 (speed =   223696)
n = 2048, time =   0.24 (speed =   223696)
Case 2 ------------------------------------
n =  512, time =   0.05 (speed =   268435)
n = 1024, time =   0.24 (speed =   223696)
n = 2048, time =   0.93 (speed =   230912)
Case 3 ------------------------------------
n =  512, time =   0.06 (speed =   223696)
n = 1024, time =   0.46 (speed =   233422)
n = 2048, time =   3.50 (speed =   245427)

The largest of the above tests involves multiplication of two 2048
by 2048 matrices. In most econometric work, however, the relevant
matrices are likely to be much smaller. Consider the *T*
by *k* matrix of regressors, *X*. The number of
observations, *T*, may sometimes be very large, but *k* is
typically fairly small, and the key matrix *X'X* will rarely be
anything like as large as the ones handled above. ATLAS achieves a
lot of its speed-up by employing a blocking algorithm: matrix
multiplications are split into blocks, sized to take best advantage
of the machine's cache. This involves costs in set-up and clean-up;
in relatively small problems such costs may dominate the timing.
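
To put a number on that, here is an illustrative fragment: even with a fairly large dataset, the cross-product matrix needed for OLS is tiny by the standards of the timing tests above.

nulldata 10
matrix X = mnormal(5000, 8)   # T = 5000 observations, k = 8 regressors
matrix XtX = X' * X           # only 8 x 8, nothing like 2048 x 2048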

We haven't tested very systematically as yet, but do have some sense of how this plays out in practice. For example, we have a suite of "typical" gretl scripts that we use for regression testing. The execution time for a set of 150-plus such scripts on a Core 2 Duo machine is around 3 seconds, using the reference netlib LAPACK and BLAS, and with native matrix multiplication. When we switch to maximal use of ATLAS the time increases slightly, by something like 10 percent (see also Appendix C). A difference of this order is not going to be noticeable when you're running a single "typical" script, but it may be appreciable if you're doing a Monte Carlo experiment, or running a complex estimator on a moderate-sized dataset.

Hence the point made at the outset: Will you get a speed-up from
using ATLAS with gretl? Unfortunately the answer is, *It
depends*. We'd be glad to hear from anyone who tries
experiments of their own.

Footnote: assuming you've installed the ATLAS
shared libraries in the default
location, `/usr/local/atlas/lib`, the appropriate linker
invocation is:

-L/usr/local/atlas/lib -llapack -lf77blas -lcblas -latlas

Appendix A: the timing script

set echo off
set messages off
# The following line is activated only for ATLAS
set blas_nmk_min 0
nulldata 10
n0 = 512
pwr = 3
printf "Case 1 ------------------------------------\n"
n = n0
loop pwr
  fpms = n0*n*n0
  a = mnormal(n0,n)
  b = mnormal(n,n0)
  set stopwatch
  c = a*b
  f0 = $stopwatch
  printf "n = %4d, time = %6.2f (speed = %8.0f)\n", n, f0, fpms/(10000*f0)
  n *= 2
endloop
printf "Case 2 ------------------------------------\n"
n = n0
loop pwr
  fpms = n*n0*n
  a = mnormal(n,n0)
  b = mnormal(n0,n)
  set stopwatch
  c = a*b
  f0 = $stopwatch
  printf "n = %4d, time = %6.2f (speed = %8.0f)\n", n, f0, fpms/(10000*f0)
  n *= 2
endloop
printf "Case 3 ------------------------------------\n"
n = n0
loop pwr
  fpms = n*n*n
  a = mnormal(n,n)
  b = mnormal(n,n)
  set stopwatch
  c = a*b
  f0 = $stopwatch
  printf "n = %4d, time = %6.2f (speed = %8.0f)\n", n, f0, fpms/(10000*f0)
  n *= 2
endloop

Appendix B: pre-built versus hand-built ATLAS

We tried the following experiment with Ubuntu 8.04 "Hardy" running on
a dual-core Pentium D machine: first, install the highest available
Ubuntu ATLAS package, `libatlas3gf-sse2`, and time gretl as
linked against the LAPACK/BLAS from that package. Then hand-build
ATLAS (version 3.8.2, with lapack version 3.1.1, using gcc 4.2.3) and
run the timing test again. The results are shown below. As you can
see, the per-machine optimization produces an extra performance
gain, but most of the gains over vanilla matrix multiplication are
realized by the `libatlas3gf-sse2` package.

Using native matrix multiplication code:

Case 1 ------------------------------------
n =  512, time =   0.42 (speed =    31957)
n = 1024, time =   0.82 (speed =    32736)
n = 2048, time =   1.64 (speed =    32736)
Case 2 ------------------------------------
n =  512, time =   0.42 (speed =    31957)
n = 1024, time =   1.73 (speed =    31033)
n = 2048, time =   7.21 (speed =    29785)
Case 3 ------------------------------------
n =  512, time =   0.44 (speed =    30504)
n = 1024, time =   3.38 (speed =    31768)
n = 2048, time =  27.73 (speed =    30977)

Using Ubuntu's `libatlas3gf-sse2`:

Case 1 ------------------------------------
n =  512, time =   0.09 (speed =   149131)
n = 1024, time =   0.17 (speed =   157903)
n = 2048, time =   0.33 (speed =   162688)
Case 2 ------------------------------------
n =  512, time =   0.10 (speed =   134218)
n = 1024, time =   0.37 (speed =   145100)
n = 2048, time =   1.47 (speed =   146087)
Case 3 ------------------------------------
n =  512, time =   0.10 (speed =   134218)
n = 1024, time =   0.68 (speed =   157903)
n = 2048, time =   5.11 (speed =   168100)

Using hand-built ATLAS 3.8.2:

Case 1 ------------------------------------
n =  512, time =   0.08 (speed =   167772)
n = 1024, time =   0.15 (speed =   178957)
n = 2048, time =   0.27 (speed =   198841)
Case 2 ------------------------------------
n =  512, time =   0.08 (speed =   167772)
n = 1024, time =   0.32 (speed =   167772)
n = 2048, time =   1.24 (speed =   173184)
Case 3 ------------------------------------
n =  512, time =   0.08 (speed =   167772)
n = 1024, time =   0.56 (speed =   191740)
n = 2048, time =   4.18 (speed =   205501)

Appendix C: script-suite timings, vanilla versus ATLAS

The following table shows timings (the "real" value from the GNU `time`
program) for running batches of gretl scripts on a Lenovo Thinkpad (Core 2
Duo, 1.83GHz). The `vanilla` timings are from gretl linked against
netlib lapack 3.1.1 and netlib BLAS (compiled using gfortran from gcc 4.3.1,
with -O2 optimization), and with all gretl matrix multiplication done
in-house. The `atlas` timings are from gretl linked against ATLAS
3.8.2, with `blas_nmk_min` set to 1000. The numbers in the last column
show that the atlas runs were marginally slower in all cases apart from the
`arbond` scripts (Arellano-Bond dynamic panel data modeling).

Timings in seconds:

subdir     scripts            vanilla (1)   atlas (2)   (2)/(1)
top           157    run 1       3.082        3.378
                     run 2       3.111        3.408
                     mean        3.097        3.393        1.10
arbond         24    run 1       2.081        2.073
                     run 2       2.072        2.057
                     mean        2.077        2.065        0.99
mle            10    run 1       4.353        4.410
                     run 2       4.350        4.455
                     mean        4.352        4.433        1.02
sarma           6    run 1       4.155        4.418
                     run 2       4.163        4.354
                     mean        4.159        4.386        1.05
arma           31    run 1       4.025        4.287
                     run 2       4.039        4.314
                     mean        4.032        4.301        1.07
gmm            12    run 1       3.601        3.681
                     run 2       3.623        3.682
                     mean        3.612        3.682        1.02
vecm           16    run 1       0.277        0.325
                     run 2       0.275        0.319
                     mean        0.276        0.322        1.17
s_watson       14    run 1       1.069        1.119
                     run 2       1.066        1.169
                     mean        1.068        1.144        1.07