AMD BLAS/LAPACK Optimization in 2022

This is the first of a multi-part series on how to optimize linear algebra performance as of January 2022. We will go over how to optimize BLAS/LAPACK performance on AMD CPUs, focusing on R because it is most affected by this process. In Part 2, we will compare the performance of various software commonly used in economics.

Before we move on, let us first address the obvious question: why do we have to optimize specifically for AMD CPUs? The reason is that, as you will see in a moment, the fastest linear algebra library in recent years has been the Intel Math Kernel Library (MKL). MKL does work on AMD CPUs but, for competitive reasons, it checks whether the CPU is made by Intel and falls back to much slower routines if it is not. This check is quite rudimentary and can be circumvented easily.

In this post, we will be comparing the performance of the following BLAS/LAPACK libraries:

Built-in: These are the default libraries if you compile R from source without specifying a different linear algebra library. These libraries are not multithread-capable.
OpenBLAS 0.3.17: This is the default if you install R through conda right now.
MKL 2019: This is the library used by Microsoft R Open 3.5.3. For this version, the typical way to get around the Intel CPU check is to set the following environment variable:
```
MKL_DEBUG_CPU_TYPE=5
```
MKL 2022: This is the latest version of MKL. You can install it as part of Intel oneAPI or as part of the R installation through conda as follows:
```
  conda create -n r-mkl -c conda-forge r-recommended mkl blas=*=*mkl
```
For this version, the MKL_DEBUG_CPU_TYPE variable no longer has any effect. To get around the CPU check, one has to replace the check function with a custom one through LD_PRELOAD. See here for step-by-step instructions.
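Both workarounds can be sketched in a few lines of shell. The first half only exports the variable shown above and confirms that child processes (such as R) inherit it. The second half is an assumption on our part: it follows the commonly circulated shim that overrides a function named `mkl_serv_intel_cpu_true`, a symbol name that does not come from this post and should be checked against the linked instructions. The file names `fakeintel.c`, `libfakeintel.so`, and `bench.R` are hypothetical.

```shell
# --- MKL 2019: environment variable ---
# Export the variable so every process started from this shell inherits it.
export MKL_DEBUG_CPU_TYPE=5
child_value=$(sh -c 'printf "%s" "$MKL_DEBUG_CPU_TYPE"')
echo "child process sees MKL_DEBUG_CPU_TYPE=$child_value"

# --- MKL 2022: LD_PRELOAD shim (assumed symbol name, see text) ---
workdir=$(mktemp -d)
cat > "$workdir/fakeintel.c" <<'EOF'
/* Assumption: MKL calls mkl_serv_intel_cpu_true() to decide whether
   the CPU is an Intel one. Returning 1 always answers "yes". */
int mkl_serv_intel_cpu_true(void) {
    return 1;
}
EOF

# Build the shim as a shared library if a C compiler is available.
if command -v gcc >/dev/null 2>&1; then
    gcc -shared -fPIC -o "$workdir/libfakeintel.so" "$workdir/fakeintel.c"
fi

# Then preload it when launching R (bench.R is a hypothetical script):
# LD_PRELOAD="$workdir/libfakeintel.so" Rscript bench.R
```

Preloading only affects the process it is set for, so the shim can be tried on a single run before making it permanent in a shell profile.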

For each library, we will run the following tests:

OLS: Linear regression with 10 independent variables and five million randomly-generated observations.
MMul: Matrix multiplication of two 9000 x 9000 matrices.
Eigen: Eigenvalue and eigenvector computation of a 2000 x 2000 matrix.
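For readers who want to try something similar, here is a rough sketch of what such a benchmark script could look like. Only the problem sizes come from the list above. The specific R calls (`lm.fit`, `%*%`, `eigen`), the random seed, the use of a non-symmetric random matrix, and the file name `bench.R` are our assumptions and not necessarily what produced the timings in this post. The snippet only writes the script and leaves running it to you.

```shell
# Write a hypothetical benchmark script to a scratch directory.
benchdir=$(mktemp -d)
cat > "$benchdir/bench.R" <<'EOF'
# Assumed reconstruction of the three tests; sizes are from the post,
# the R calls themselves are illustrative choices.
set.seed(1)

# OLS: 10 regressors (plus an intercept), five million random observations
n <- 5e6; k <- 10
X <- matrix(rnorm(n * k), n, k)
y <- drop(X %*% rnorm(k)) + rnorm(n)
print(system.time(lm.fit(cbind(1, X), y)))

# MMul: product of two 9000 x 9000 matrices
A <- matrix(rnorm(9000^2), 9000, 9000)
B <- matrix(rnorm(9000^2), 9000, 9000)
print(system.time(A %*% B))

# Eigen: eigenvalues and eigenvectors of a 2000 x 2000 matrix
M <- matrix(rnorm(2000^2), 2000, 2000)
print(system.time(eigen(M)))
EOF
echo "wrote $benchdir/bench.R"

# Run it manually once R is installed. It needs a few GB of RAM and can
# take minutes with the built-in BLAS:
# Rscript "$benchdir/bench.R"
```

Running `sessionInfo()` in the same R installation prints the BLAS and LAPACK libraries in use, which is a quick way to confirm which library a given timing belongs to.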

Test platform: EPYC 7302 with 4 x 64 GB SK Hynix 3200 MHz ECC RDIMM on an ASRock Rack EPYCD8-2T.

MKL 2019 benchmarks were run on MRO 3.5.3. All other tests were run on R 4.1.2.

Here are the results, measured in seconds:

| BLAS/LAPACK | OLS (s) | MMul (s) | Eigen (s) |
|---|---|---|---|
| Built-in | 2.294 | >100 | 34.716 |
| OpenBLAS 0.3.17 | 1.642 | 2.654 | 3.190 |
| MKL 2019 | 1.724 | 9.853 | 4.698 |
| MKL 2019 + MKL_DEBUG_CPU_TYPE | 1.641 | 2.300 | 3.209 |
| MKL 2022.01 | 1.560 | 2.650 | 4.160 |
| MKL 2022.01 + LD_PRELOAD | 1.570 | 2.305 | 2.404 |

MKL 2022 with the workaround is essentially the fastest in all three benchmarks, with a particularly noticeable lead in eigenvalue computation, while OpenBLAS is barely competitive with MKL 2019. The importance of the Intel CPU workaround is very apparent: without it, MKL 2019 is slower than OpenBLAS in all three tests on this AMD CPU, and MKL 2022 merely matches OpenBLAS in matrix multiplication and falls well behind it in eigenvalue computation. To be fair, the difference between these libraries probably does not matter unless you multiply extremely large matrices for a living. The exception is the built-in library: frankly, there is no reason to use it on an x86 platform in 2022.
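To put numbers on the effect of the workaround, the ratios can be computed directly from the results table. The snippet below is just arithmetic on the timings reported above (time without the workaround divided by time with it) and assumes `awk` is available.

```shell
# Timings in seconds copied from the results table:
# without the workaround / with the workaround.
speedups=$(awk 'BEGIN {
    printf "MKL 2019 MMul: %.2fx\n",  9.853 / 2.300
    printf "MKL 2019 Eigen: %.2fx\n", 4.698 / 3.209
    printf "MKL 2022 MMul: %.2fx\n",  2.650 / 2.305
    printf "MKL 2022 Eigen: %.2fx\n", 4.160 / 2.404
}')
echo "$speedups"
# MKL 2019 MMul: 4.28x
# MKL 2019 Eigen: 1.46x
# MKL 2022 MMul: 1.15x
# MKL 2022 Eigen: 1.73x
```

The gap in matrix multiplication shrank considerably between MKL 2019 and MKL 2022, while eigenvalue computation still benefits a lot from bypassing the check.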