The video version of this post is available here.

We are glad to announce that, with generous funding support from the department, SCRP now offers access to NVIDIA A100 datacenter GPUs for faculty members, research staff and research postgraduate students.

The A100 is one of the fastest general-purpose accelerators available, and its main advantages are memory capacity and computational speed. Before the A100s arrived, SCRP offered two types of consumer-grade GPUs: the top-of-the-line RTX 3090 and the mainstream RTX 3060. The RTX 3090 has 24 gigabytes of memory, while the RTX 3060 has half of that. The A100, in contrast, has 80 gigabytes of memory, allowing you to load much larger models or more data.

GPU        VRAM   FP16*   FP32*   FP64*
RTX 3090   24GB   36      36      0.6
RTX 3060   12GB   13      13      0.2
A100       80GB   78      20      9.7

*Measure: TFLOPS

When it comes to speed, both the RTX 3090 and the RTX 3060 excel at single- and half-precision computation. The A100 is, however, even faster. This is particularly true for double-precision computation, which is often necessary for scientific computing. In that area, the A100 is over 15 times faster than the RTX 3090. For reference, Stata works in single precision by default, while MATLAB and R work in double precision.

If your computation is heavy on matrix multiplication, tensor core performance is the more relevant metric. Tensor cores are compute units dedicated to matrix multiplication, offering several times the throughput of ordinary floating-point operations. Here, once again, the A100 is much faster than the other options. It is worth noting that a single A100 is faster than the fastest supercomputer in the world from 20 years ago.

GPU        Tensor FP16*   Tensor FP32*
RTX 3090   142            36
RTX 3060   51             13
A100       312            156

*Measure: TFLOPS
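
In practice, you do not need to program the tensor cores directly. In PyTorch, for example, half-precision matrix multiplications use the tensor cores automatically, and TF32 tensor-core math can be switched on for single-precision matrix multiplications. The snippet below is a minimal sketch, assuming PyTorch with CUDA support is available; the matrix size is arbitrary.

    import torch

    # Allow TF32 tensor-core math for FP32 matrix multiplications.
    # (Whether this is on by default depends on the PyTorch version.)
    torch.backends.cuda.matmul.allow_tf32 = True

    a = torch.randn(8192, 8192, device="cuda")                       # FP32 matrices
    b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)  # FP16 matrices

    c32 = a @ a   # single-precision matmul, runs on TF32 tensor cores
    c16 = b @ b   # half-precision matmul, runs on FP16 tensor cores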

Finally, the A100 has several times the inter-GPU bandwidth of the other options, which can be an important consideration if your computation requires multiple GPUs.

GPU        Inter-GPU Bandwidth
RTX 3090   113GB/s
RTX 3060   64GB/s
A100       600GB/s
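
For reference, moving data between GPUs is a single call in frameworks such as PyTorch; higher inter-GPU bandwidth simply makes that copy faster. A minimal sketch, assuming two GPUs are visible to your job:

    import torch

    # Create a roughly 4GB matrix on the first GPU, then copy it to the second.
    x = torch.randn(32_768, 32_768, device="cuda:0")  # ~4GB in FP32
    y = x.to("cuda:1")                                # inter-GPU copy
    torch.cuda.synchronize()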

Benchmarks

To see how fast the A100 really is, I will now run through several scenarios. The numbers reported below are in seconds, averaged over three runs.

  • The first scenario simulates a situation where you pass data to the GPU for computation. We first generate a 20K x 20K matrix on the CPU, pass it to a GPU, perform two sets of matrix multiplications and then sum the results. (A rough code sketch of these scenarios appears after the observations below.)

    Device        FP16^   FP32^   FP64^
    Ryzen 5950X   -       23.2    53.4
    RTX 3090      1.1     1.6     59.5
    A100          0.9     1.1     3.7

    ^Measure: seconds

    Using a GPU is generally much faster than using the CPU, but the numeric precision you choose has a big impact. If you need double precision, only the A100 is going to accelerate your computation.

  • Next, we will spread the two matrix multiplications over two GPUs.

    Device        FP16^   FP32^   FP64^
    RTX 3090      1.3     1.6     31.6
    A100          1.4     1.3     3.3

    Will this result in a speedup? Perhaps surprisingly, it does only for double-precision computation. The reason is that the computation is bottlenecked by the time it takes to transfer data from main memory to the GPUs.

  • With that in mind, we will try generating the data on the GPU instead, followed by the same computation.

    Device        FP16^   FP32^   FP64^
    RTX 3090      0.23    0.5     29.6
    A100          0.1     0.2     1.0

    Moving data away from the CPU and main memory results in significantly faster speeds. What takes 23 seconds on a modern 16-core CPU takes only 200 milliseconds on an A100.

We therefore have three observations:

  1. For low-precision computation, any GPU is fine.
  2. For high-precision computation, you need the A100.
  3. CPU-to-GPU data transfer is a major bottleneck, so it should be avoided as much as possible.
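
To make the scenarios above concrete, here is a minimal PyTorch sketch of scenarios 1 and 3. The exact benchmark script is not reproduced in this post; the matrix size and operations follow the description above, and details such as how the results are combined are assumptions.

    import time
    import torch

    N = 20_000
    dtype = torch.float32   # swap in torch.float16 or torch.float64 to compare precisions

    def timed(fn):
        # Wall-clock time of fn, synchronizing the GPU before and after.
        torch.cuda.synchronize()
        start = time.time()
        fn()
        torch.cuda.synchronize()
        return time.time() - start

    # Scenario 1: generate the matrices on the CPU, then move them to the GPU.
    a_cpu = torch.randn(N, N, dtype=dtype)
    b_cpu = torch.randn(N, N, dtype=dtype)

    def scenario1():
        a = a_cpu.to("cuda")
        b = b_cpu.to("cuda")
        return ((a @ a) + (b @ b)).sum()

    # Scenario 3: generate the matrices directly on the GPU; no CPU-to-GPU transfer.
    a_gpu = torch.randn(N, N, dtype=dtype, device="cuda")
    b_gpu = torch.randn(N, N, dtype=dtype, device="cuda")

    def scenario3():
        return ((a_gpu @ a_gpu) + (b_gpu @ b_gpu)).sum()

    print(f"CPU-generated data: {timed(scenario1):.1f}s")
    print(f"GPU-generated data: {timed(scenario3):.1f}s")

Scenario 2 is the same computation with one matrix placed on cuda:0 and the other on cuda:1, so that the two multiplications run on separate GPUs.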

Finally, I would like to present some benchmarks of the A100's deep learning performance:

[Chart: deep learning training and inference benchmarks, A100 vs. RTX 3090]

For both training and inference, you can expect around an 80% speedup over the RTX 3090, due to a combination of more memory and faster computation.

Requesting A100 GPUs

The two A100 GPUs are installed in the new scrp-node-10 compute node. To request them with the compute command, specify --gpus-per-task=a100[:num]. For example:

compute --gpus-per-task=a100:2 python

will request both A100s. The compute command automatically allocates 8 CPU cores and 160GB RAM per A100.

If you use srun, you need to specify the a100 partition, plus your desired number of CPU cores and memory:

srun -p a100 --gpus-per-task=1 -c 16 --mem=160G [command] 

Finally, you can request A100s on JupyterHub (Slurm) for up to three hours.
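
Once your job starts, you can confirm which GPUs have been allocated to it, either with nvidia-smi or from Python if you use PyTorch. A minimal sketch, assuming PyTorch is installed in your environment (the exact device name string depends on the driver and card variant):

    import torch

    # List the GPUs visible to this job; an A100 job should report an "A100" device name.
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))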