Data Loss on scrp-data: Preliminary Incident Report

During the June 2023 maintenance, the cluster suffered data loss on the distributed storage (/data and ~/large-data) due to software issues. This preliminary report details what happened and what we plan to do to avoid similar incidents in the future.

What is Distributed Storage?

Distributed storage refers to technology that spreads data across multiple servers in order to speed up file access beyond what is possible with a single file server. Most high-performance computing clusters are backed by some form of distributed storage. Running a distributed storage system requires specialised software such as Lustre or BeeGFS.

Design of the Current Distributed Storage

SCRP is currently on its second generation of distributed storage. The distributed storage is hosted on a single server, scrp-data, with multiple network interface cards (NICs) and solid-state drives (SSDs). To spread the workload evenly across the NICs and the SSDs, the SSDs on the server are grouped into a RAID 10 array and then split with LVM, with each NIC having access to its own LVM logical volume. Redundancy is provided by RAID 10, which in theory allows the storage system to lose up to half of its disks without losing data, provided that no mirrored set loses all of its disks.
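
The "up to half of its disks" figure depends on which disks fail. The following Python sketch illustrates this under a stated assumption: six disks arranged as three two-way mirrored pairs. The pairing in MIRROR_PAIRS is hypothetical and is not taken from the actual configuration of scrp-data; it only shows the general behaviour of a striped set of mirrors.

```python
from itertools import combinations

# Hypothetical layout (an assumption, not the real scrp-data configuration):
# six disks arranged as three two-way mirrored pairs.
MIRROR_PAIRS = [(0, 1), (2, 3), (4, 5)]


def raid10_survives(failed, mirror_pairs=MIRROR_PAIRS):
    """Return True if every mirrored pair still has at least one healthy disk."""
    failed = set(failed)
    return all(not (a in failed and b in failed) for a, b in mirror_pairs)


def count_survivable(n_failed, mirror_pairs=MIRROR_PAIRS):
    """Return (survivable, total) over all ways that n_failed disks can fail."""
    disks = sorted({d for pair in mirror_pairs for d in pair})
    combos = list(combinations(disks, n_failed))
    survivable = sum(raid10_survives(c, mirror_pairs) for c in combos)
    return survivable, len(combos)


if __name__ == "__main__":
    for n in range(1, 5):
        ok, total = count_survivable(n)
        print(f"{n} failed disk(s): {ok} of {total} combinations keep the data")
```

Under this assumed layout, any single failure is survivable, but only some two-disk and three-disk failures are, and no four-disk failure is. Note that this models hardware failure only; it says nothing about the software-level problem described in this report.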

What Happened During the June 2023 Maintenance?

scrp-data started off with six Intel P4510 8TB SSDs for storage. These six disks form one RAID 10 array with six LVM logical volumes. This array was not affected by the data loss.

On March 13, 2023, we added another six Intel P4510 8TB SSDs to increase the amount of available storage. The second array was set up in the same way as the first, and it worked without problems until the June 2023 maintenance. This expansion was done without rebooting the server, which turned out to be an issue.

During the June 2023 maintenance, we rebooted all servers as part of our standard procedure to apply upgrades. After scrp-data booted up, we discovered that the second array was missing. Because older files reside on the first array, only files created on /data after March 13, 2023 are at risk of being lost.
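
Users who want a rough idea of which of their files fall into the at-risk window could list files by date. The Python sketch below is one possible approach, not an official tool: the function name files_modified_since is hypothetical, and it uses the modification time as a stand-in for the creation time, which Linux file systems do not reliably expose. It can also only report files that are currently visible on the file system.

```python
import os
import sys
from datetime import datetime

# Cutoff taken from this report: the second array was added on March 13, 2023.
CUTOFF = datetime(2023, 3, 13).timestamp()


def files_modified_since(root, cutoff=CUTOFF):
    """Yield paths under root whose modification time is at or after cutoff.

    Modification time is only an approximation of creation time.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime >= cutoff:
                    yield path
            except OSError:
                # The file vanished or cannot be inspected; skip it.
                continue


if __name__ == "__main__":
    if len(sys.argv) == 2:
        for found in files_modified_since(sys.argv[1]):
            print(found)
    else:
        print("usage: python list_recent.py DIRECTORY")
```

The script name list_recent.py is a placeholder. The directory to scan is passed as the only argument, for example a project folder under /data.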

Upon investigation, we discovered that while the array can be rebuilt, it disappears again after a reboot. The cause will remain unknown until we can investigate further, which will require downtime.

Why Is There No Backup?

Because of their large capacity and their optimization for the fastest possible access, most distributed storage systems have no backup. Note that backup is different from redundancy: a backup is a second copy of files that is not kept in sync with the primary copy. scrp-data does have redundancy in the form of RAID 10, but it appears that the software controlling redundancy has failed.

What Are You Going to Do About It?

In light of the incident, we will accelerate the deployment of our third-generation distributed storage system. The new system will have two servers that are identical copies of each other. This provides complete redundancy up to the operating-system level. During future maintenance, we will ensure that the two servers are never powered down at the same time. This will provide significantly better protection against data loss.

While we strive to prevent data loss on all of the cluster’s storage, as a matter of good data security practice you should not put any file you cannot afford to lose in a location that has no backup. We advise users to put anything that is generated and too important to lose in ~/. For files too large to fit into ~/, you should keep your primary copy in ~/large-data and a backup copy in /archive.
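
As an illustration of keeping a backup copy, the Python sketch below copies one file to a destination directory and verifies the copy with a SHA-256 checksum. It is a minimal example under assumptions: the helper names sha256_of and backup_file are hypothetical, and the directory layout under /archive is not described in this report, so the destination must be a directory you are permitted to write to.

```python
import hashlib
import shutil
import sys
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(src, dest_dir):
    """Copy src into dest_dir and verify that the copy matches the original."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copies contents and, where possible, metadata
    if sha256_of(src) != sha256_of(dest):
        raise OSError(f"checksum mismatch for {dest}")
    return dest


if __name__ == "__main__":
    if len(sys.argv) == 3:
        print(backup_file(sys.argv[1], sys.argv[2]))
    else:
        print("usage: python backup_file.py SOURCE_FILE DEST_DIR")
```

The script name backup_file.py is a placeholder. For whole directory trees, an existing tool such as rsync is likely a better fit than this single-file sketch.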