NVIDIA A100 Tensor Core GPU
The video version of this post is available here.
We are glad to announce that, with generous funding support from the department, SCRP now offers access to NVIDIA A100 datacenter GPUs for faculty members, research staff and research postgraduate students.
As one of the fastest general-purpose accelerators, the main advantages of the A100 GPU are memory capacity and computational speed. Before the A100s arrived, SCRP offered two types of consumer-grade GPUs: the top-of-the-line RTX 3090 and the mainstream RTX 3060. The RTX 3090 has 24 gigabytes of memory, while the RTX 3060 has half of that. The A100, in contrast, has 80 gigabytes of memory, allowing you to load much larger models or more data.
| GPU      | VRAM | FP16* | FP32* | FP64* |
|----------|------|-------|-------|-------|
| RTX 3090 | 24GB | 36    | 36    | 0.6   |
| RTX 3060 | 12GB | 13    | 13    | 0.2   |
| A100     | 80GB | 78    | 20    | 9.7   |

*Measure: TFLOPS
When it comes to speed, both the RTX 3090 and RTX 3060 excel in single- and half-precision computation. The A100 is, however, even faster. This is particularly true for double-precision computation, which is often necessary for scientific computing. In this specific area, the A100 is over 15 times faster than the RTX 3090. For reference, Stata works in single precision by default, while MATLAB and R work in double precision.
If your computation is heavy on matrix multiplication, tensor core performance is the more relevant metric. Tensor cores are compute units dedicated to matrix multiplication, providing several times the throughput of ordinary GPU cores for this operation. Here, once again, the A100 is much faster than the other options. It is worth noting that a single A100 is faster than the fastest supercomputer in the world from 20 years ago.
| GPU      | Tensor FP16* | Tensor FP32* |
|----------|--------------|--------------|
| RTX 3090 | 142          | 36           |
| RTX 3060 | 51           | 13           |
| A100     | 312          | 156          |

*Measure: TFLOPS
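To benefit from the tensor cores, your software needs to opt in to the appropriate math mode. As a hedged sketch (the post does not name a framework; PyTorch is an assumption here), FP32 matrix multiplications can be routed through the TF32 tensor cores like this:

```python
import torch

# Assumption: PyTorch. "high" allows FP32 matrix multiplications to use
# TF32 tensor cores on Ampere-generation GPUs such as the A100, trading
# a few bits of mantissa precision for a large speed-up.
torch.set_float32_matmul_precision("high")

a = torch.randn(512, 512)  # FP32 by default
b = a @ a                  # can use TF32 tensor cores when run on an A100
```

TF32 applies to all the Ampere-generation GPUs listed above, so this setting helps on the RTX cards as well, though the A100's tensor cores are several times faster.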
Finally, the A100 has many times the inter-GPU bandwidth of the other options, which can be an important consideration if your computation requires the use of multiple GPUs.
| GPU      | Inter-GPU Bandwidth |
|----------|---------------------|
| RTX 3090 | 113GB/s             |
| RTX 3060 | 64GB/s              |
| A100     | 600GB/s             |
Benchmarks
To see how fast the A100 really is, I will now run through several scenarios. The numbers reported below are in seconds, averaged over three runs.

The first scenario simulates a situation where you pass data to the GPU for computation. We first generate a 20K x 20K matrix on the CPU, pass it to a GPU, perform two sets of matrix multiplications, and then sum the results.
| Device      | FP16^ | FP32^ | FP64^ |
|-------------|-------|-------|-------|
| Ryzen 5950X | -     | 23.2  | 53.4  |
| RTX 3090    | 1.1   | 1.6   | 59.5  |
| A100        | 0.9   | 1.1   | 3.7   |

^Measure: seconds
Using a GPU is generally much faster than using a CPU, but the numeric precision you choose has a big impact. If you need double precision, only the A100 is going to accelerate your computation.
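This first scenario can be sketched as follows. PyTorch is an assumption (the post does not specify a framework), and the matrix size is reduced from 20K x 20K for illustration:

```python
import time
import torch

# Fall back to the CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

def cpu_to_gpu_benchmark(n=2000, dtype=torch.float32):
    a = torch.randn(n, n, dtype=dtype)  # generated on the CPU
    start = time.perf_counter()
    a_dev = a.to(device)   # CPU-to-GPU transfer (a no-op on the CPU)
    b = a_dev @ a_dev      # first matrix multiplication
    c = a_dev @ a_dev      # second matrix multiplication
    total = (b + c).sum()  # sum the results
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work to finish
    return time.perf_counter() - start

print(f"FP32 on {device}: {cpu_to_gpu_benchmark():.2f}s")
```

Changing `dtype` to `torch.float16` or `torch.float64` reproduces the other columns of the table.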

Next, we will spread the two matrix multiplications over two GPUs.
| Device   | FP16^ | FP32^ | FP64^ |
|----------|-------|-------|-------|
| RTX 3090 | 1.3   | 1.6   | 31.6  |
| A100     | 1.4   | 1.3   | 3.3   |

Will this result in faster speeds? Perhaps surprisingly, it does only for double-precision computation. The reason is that the computation is bottlenecked by the time it takes to transfer data from main memory to the GPUs.
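A hedged sketch of the two-GPU variant, again assuming PyTorch and, this time, two visible CUDA devices (`cuda:0` and `cuda:1`):

```python
import torch

def two_gpu_matmul(n=2000, dtype=torch.float32):
    a = torch.randn(n, n, dtype=dtype)  # generated on the CPU, as before
    a0 = a.to("cuda:0")  # one copy per GPU
    a1 = a.to("cuda:1")
    # CUDA kernel launches are asynchronous, so the two multiplications
    # can run concurrently, one on each GPU.
    b = a0 @ a0
    c = a1 @ a1
    total = (b + c.to("cuda:0")).sum()  # move one result over to sum
    torch.cuda.synchronize()
    return total

if torch.cuda.device_count() >= 2:
    print(two_gpu_matmul())
```

Note that both GPUs still receive their data over the same path from main memory, which is why this version is not faster at lower precisions.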

With that in mind, we will try generating the data on the GPU instead, followed by the same computation.
| Device   | FP16^ | FP32^ | FP64^ |
|----------|-------|-------|-------|
| RTX 3090 | 0.23  | 0.5   | 29.6  |
| A100     | 0.1   | 0.2   | 1.0   |

Moving data away from the CPU and main memory results in significantly faster speeds. What takes 23 seconds on a modern 16-core CPU takes only 200 milliseconds on an A100.
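The change needed for this variant is small: the matrix is created directly on the device, so the CPU-to-GPU transfer disappears. A hedged sketch, again assuming PyTorch and a reduced matrix size:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def on_device_benchmark(n=2000, dtype=torch.float32):
    start = time.perf_counter()
    # The device= argument generates the data directly on the GPU,
    # so nothing needs to cross the CPU-to-GPU link.
    a = torch.randn(n, n, dtype=dtype, device=device)
    b = a @ a
    c = a @ a
    total = (b + c).sum()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"FP32 on {device}: {on_device_benchmark():.2f}s")
```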
We therefore have three observations:
- For low-precision computation, any GPU is fine.
- For high-precision computation, you need the A100.
- CPU-to-GPU data transfer is a major bottleneck, so it should be avoided as much as possible.
Finally, I would like to present some benchmarks on the deep learning performance of the A100:
For both training and inference, you can expect around an 80% speed-up over the RTX 3090, due to a combination of more memory and faster computation.
Requesting A100 GPUs
The two A100 GPUs are installed in the new scrpnode10 compute node. To request them with the `compute` command, specify `--gpus-per-task=a100[:num]`.
For example:

    compute --gpus-per-task=a100:2 python

will request both A100s. The `compute` command automatically allocates 8 CPU cores and 160GB of RAM per A100.
If you use `srun`, you need to specify the a100 partition, plus your desired number of CPU cores and memory:

    srun -p a100 --gpus-per-task=1 -c 16 --mem=160G [command]
Finally, you can request A100s on JupyterHub (Slurm) for up to three hours.