Building a ZFS HA Cluster Part 1
Hello and welcome to my journey of building a dual-head-node, redundant ZFS cluster. Most of the hardware I showcase here will be second hand.
First up, Why am I doing this?
I love ZFS. I am running it on three NAS systems within my network, and all three of my daily drivers, two private & one at work, run ZFS on boot. That said, this build will not be for my home setup, but is a side project for work.
Our current setup at work is a single server for all our active research and archival data. As this data is accessed frequently by multiple researchers at the same time, we want something more reliable and higher performing. My plan is that this high availability setup will take over as the primary data storage. This covers storage of all our active data sets, some storage for VMs and a read-only archive for finished projects. The existing storage will move back one layer and only handle backups and archival of projects. I might also have the possibility of racking it in a different data center, which would solve my concerns about keeping all data at a single physical location.
With the use case out of the way, how do I want to realize this plan? The concept I want to use here is based on multiple SAS disk shelves. SAS has an awesome feature called multipath, which allows each SAS drive to be accessed via two different data paths. With a single SAS controller hooked up to the drive, both data paths can be aggregated to increase bandwidth. This does not really matter for HDDs yet; SSDs, however, can easily double their bandwidth using this feature.
Funnily enough, SSD vendors use this to inflate the read and write speeds on their datasheets. SAS4 has a theoretical bandwidth limit of 24Gbit/s, which is around 3GB/s ignoring overhead. The datasheet of the 3.84TB Kioxia PM7-R Enterprise drive lists a read speed of 4200MB/s. This exceeds the single-path speed of SAS and thus cannot be reached via a single path on most servers. I once bought some used SAS3 SSDs for cheap and was somewhat let down because the speeds in my Dell R630 did not match the reported numbers, so I feel obliged to highlight this. At least Kioxia mentions in the datasheet that these speeds can only be achieved in dual-port mode.

Section of the Kioxia PM7-R Datasheet highlighting the achievable transfer speeds in dual-port mode.
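To see whether a given drive is path-limited in practice, one can compare the single-path ceiling against a measured sequential read. A rough sketch; the fio invocation is left commented out because the device path is only a placeholder and fio being installed is an assumption:

```shell
# Back-of-envelope: a single 24 Gbit/s SAS4 path tops out around 3000 MB/s
# (decimal, ignoring encoding overhead) - below the 4200 MB/s datasheet value.
echo "$((24 * 1000 / 8)) MB/s single-path ceiling"

# Hypothetical measurement of what one path actually delivers
# (fio and /dev/sdX are assumptions - double-check the device first!):
#   fio --name=seqread --readonly --filename=/dev/sdX --rw=read --bs=1M \
#       --direct=1 --ioengine=libaio --runtime=30 --time_based
```

If the measured number sits close to the single-path ceiling rather than the datasheet figure, the drive is most likely running in single-port mode.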
Another way to use both data paths is to use two different controllers. If both controllers are in the same head server, we increase reliability, as we now have two dedicated paths - two controllers and two cables - to our disk. In this case, either a controller or a cable can fail without impacting the service. To increase reliability even more, we can move the second controller into a second head server, thus allowing us to survive the failure of an entire server as well. This, however, comes at the cost of increased complexity, and the filesystem needs to support such a setup.
ZFS supports this multi-head-server setup, so I want to try it here. We use ZFS snapshots extensively to transfer our data sets and do backups. I planned to acquire two NetApp disk shelves: one shelf should hold 24x 3.5" hard disks and the other 24x 2.5" hard disks or SSDs. Each NetApp shelf typically has two controllers with two SAS ports each, resulting in four ports per disk shelf in total. We strive for a setup with the highest possible reliability on this kind of budget, so we want to split the available ports between two servers. Each server is thus equipped with two SAS HBAs (Host Bus Adapters). In turn, each HBA is connected to one controller in each disk shelf. Thus, each component (server, HBA, SAS controller) can theoretically fail without impacting services. We cannot sustain the failure of an entire disk shelf. This is one limitation I am fine with, as doubling the disk shelves and drives would increase the total cost by too much. Further, all data will be mirrored to the backup server too, so a total failure should be recoverable. The following diagram illustrates the intended setup with cabling.

The planned cabling diagram for our test cluster if we can find all components as planned.
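On the ZFS side, the safeguard that makes such a dual-head layout workable is the multihost (MMP) pool property: with distinct hostids on both heads, a pool refuses to import while the other node still holds it active. A minimal sketch; the pool name and hostid below are placeholders:

```shell
# Each head needs a unique hostid before multihost protection works.
hostid                        # prints this node's current hex hostid

# The remaining steps touch a real pool, so they are left commented out:
#   zgenhostid 0x00c0ffee     # set an explicit hostid (OpenZFS >= 0.7)
#   zpool set multihost=on tank
# With multihost=on, "zpool import tank" on the second head fails as long
# as the pool is still actively imported on the first one.
```

The forced takeover path (`zpool import -f`) then becomes a deliberate operator decision instead of a silent double-import that would corrupt the pool.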
Hardware Acquisition
So, the general plan for this setup was two identical servers and two disk shelves: one shelf for 3.5" and one for 2.5" drives, so I can use high-capacity 3.5" drives and add some test disks and SAS SSDs to the other one. However, when hunting the second hand market for deals, one has to adapt to the actual auctions online and change the plan ;)
First, I found a decent deal on two identical ASUS RS700-E9-RS12. They were a four hour drive away, but the deal was too good to pass up. Both servers had identical configurations and were used in a 7 node CEPH cluster. Each server is 1U in height and has two Xeon Scalable 6124 Gold CPUs with 12 cores / 24 threads each. For RAM we got 4x 64GB DDR4 RDIMMs at 3200MT/s, which is way more than we would ever need in such a test system. We even got two Samsung 850 Evo M.2 disks as SATA boot drives. For networking, the servers each came with an Intel X550 dual-port 10Gbit RJ45 OCP2.0 card. With one of these servers, I had some difficulties getting the IPMI and BIOS updated, which you can read about in a previous blog post. The drive bays are fitted to hold 6 SAS/SATA as well as 4 U.2 NVMe drives.

One of the head servers on top of the DS4246 disk shelf.
As we want to use these servers as head servers for our storage needs, we need some additional parts for them. First, I found 4 identical Dell External 12Gbit SAS HBAs. Each HBA has 2 MiniSAS HD connectors and is adequate for the original plan with two disk shelves. Next up, for networking: the dual-port 10Gbit RJ45 card is not really usable in our data center; we would much rather have something more potent, but more importantly, something we can actually use. There are some really good deals on eBay for dual-port 25Gbit Mellanox ConnectX-4 OCP2.0 cards at 26€ per card. They came from China, but as I was in no rush, the wait was well worth the good price.
As for the disk shelves, I had to exercise more patience, as the selection on willhaben was rather slim. I set up a notification for the search term “netapp”, as these were the cheapest option to get some good-capacity shelves, and waited for results. I found a NetApp DS4243 rather quickly. Here I should explain: these shelves come with redundant and hot-swappable SAS controllers. The 3 at the end of DS4243 indicates IOM3 controllers with SAS 3Gbit connections. This was rather underwhelming; SAS 12Gbit has been standard for ages, and SAS4 with 24Gbit has also been available for some time now. The minimum for me was SAS 6Gbit. This would still saturate our 50Gbit of network connection, so faster SAS was not a priority for this setup until we upgrade the entire network to 200Gbit or so, which is out of my control :'(
Fortunately, the seller was friendly enough to add in two IOM6 modules for 5€ apiece. Some months later, I found two DS2246 shelves with 52 900GB 2.5" drives. This was an amazing deal for me, as it allows me to test the entire system without buying any drives. I plan on buying high-capacity 3.5" drives later down the road, plus some SSDs for higher-IOPS data sets and/or a ZFS SLOG. To attach the disk shelves to our head servers, we need some special cabling. NetApp uses QSFP connectors for their SAS connections. As we now have three shelves, we need to cascade two shelves one after another. For this, I found some original NetApp QSFP SAS cables, again on eBay. For the connections between the HBAs and the shelves, 10Gtek offers MiniSAS HD (SFF-8644) to QSFP (SFF-8436) cables. With that, we have all the hardware we need for a small test cluster.

The final cabling diagram for our test cluster.
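Once everything is cabled, it is worth verifying that the links actually negotiated the expected 6Gbit speed of the IOM6 modules. Assuming the HBA driver populates the Linux SAS transport class in sysfs, a small sketch:

```shell
# Summarize how many phys negotiated which SAS link rate.
summarize_rates() { sort | uniq -c | sed 's/^ *//'; }

# Real usage on a head node (sysfs paths provided by the SAS transport class):
#   cat /sys/class/sas_phy/*/negotiated_linkrate | summarize_rates
# Offline demonstration with sample values:
printf '6.0 Gbit\n6.0 Gbit\n12.0 Gbit\n' | summarize_rates
# prints: "1 12.0 Gbit" and "2 6.0 Gbit"
```

If phys show up at 3.0 Gbit here, a shelf is most likely still running its old IOM3 modules or a cable only trained at a lower rate.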
Drive Madness
The final thing we need to prepare are the disks. After cabling everything as displayed in the cabling diagram, all disks will show up twice in our Debian installation. This is to be expected and is a consequence of the SAS multipath feature and of us using two SAS HBAs to connect to each disk shelf. We do not care about that right now; we only want to prepare the disks for usage in Linux.
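Before touching anything, it is reassuring to confirm that the duplicates really are two paths to the same physical drives. A small sketch (the device names and WWNs below are made up) that groups lsblk output by WWN:

```shell
# Count how many device paths map to each WWN; with dual-path cabling
# every physical disk should appear exactly twice.
count_paths_per_wwn() {
  awk 'NF == 2 { c[$2]++ } END { for (w in c) print w, c[w] }' | sort
}

# Real usage:  lsblk -dno NAME,WWN | count_paths_per_wwn
# Offline demonstration with made-up names and WWNs:
printf '%s\n' \
  'sdb 0x5000cca01234abcd' \
  'sdq 0x5000cca01234abcd' \
  'sdc 0x5000cca0deadbeef' \
  'sdr 0x5000cca0deadbeef' | count_paths_per_wwn
# prints each WWN followed by 2
```

Any WWN with a count other than 2 points at a miscabled or failed path.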
All 900GB drives are NetApp branded drives and came with the disk shelves. NetApp uses 520 bytes as the logical sector size to store additional information about the array and data, e.g. checksums. This is not supported on a standard Debian installation, so we cannot use the drives in their current state. Thankfully, we can just reformat all disks to the more standard sector size of 512 bytes. Usually, I do that manually, but as we are talking about 52 drives at once, I invested some time to script it. Using our trusty lsblk we can query the disk paths on our machine. I have to admit, I was a bit startled at first when it showed me 130 disks attached to this machine. I had never seen so many disks at once and had to giggle like a small boy in a candy shop :D All of the 900GB disks showed up with a size of 0 bytes, a side effect of the 520 byte formatting.
The first step was to isolate the disks on one HBA, so we can be sure not to query / format a disk twice. As we are using PCIe-attached HBAs, we can use the path /dev/disk/by-path/pci-***. All disks should show up a single time with the controller's PCIe path as prefix. E.g. my controllers appeared at /dev/disk/by-path/pci-0000:87:00.0-sas-* with IDs 86 and 87, with all 64 disks shown for both controllers. My small script queries each device for its logical block size and its serial number. If the block size equals 520, we start a formatting process via sg_format within a screen. Each screen is named after the disk's serial number, to easily attribute any errors or longer durations to a specific disk. The script depends on the sg3-utils package on Debian. The following script does exactly that; it can also be found in my Codeberg repository linked later.
#!/bin/bash
# This script looks at all drives found via the SCSI controller located at
# pci-0000:87:00.0 - you might need to change that, it is where my controller was.
# WARNING: Script will format *ANY* drive encountered with blocksize 520
# and formats it to 512.
# Each format takes place in a screen named after the drive's serial number.
BASE_PATH=/dev/disk/by-path/pci-0000:87:00.0-sas-*

if [ "$EUID" -ne 0 ]; then
    echo "Please run as root"
    exit 1
fi

for i in $BASE_PATH; do
    byte_size=$(sg_readcap --long "$i" | grep "Logical block length" | tr -dc '0-9')
    serial_nr=$(lsblk -dno serial "$i")
    echo "Disk $serial_nr"
    if [ "$byte_size" -eq 520 ]; then
        echo ""
        echo "!!! Disk $serial_nr has 520byte !!!"
        echo "Formatting!"
        echo ""
        screen -S "$serial_nr" -dm bash -c "sg_format '$i' -v --format --size=512 -Q"
    else
        echo "Disk $serial_nr is fine"
    fi
done
I was lucky enough that all disks started formatting immediately. I have encountered disks before that contain some kind of write protection and need a full format before a reformat is accepted. Thankfully, that was not the case with any disk here.
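Once all the screens have exited, a quick re-check confirms that every disk now reports 512-byte blocks. A sketch reusing the same grep/tr trick as the formatting script:

```shell
# Extract the logical block size from sg_readcap output.
block_size_of() { grep "Logical block length" | tr -dc '0-9'; }

# Real usage after all formats have finished (paths as in the script above):
#   for i in /dev/disk/by-path/pci-0000:87:00.0-sas-*; do
#     echo "$i: $(sg_readcap --long "$i" | block_size_of)"
#   done
# Every disk should now report 512.

# Offline demonstration with a captured sg_readcap line:
printf 'Logical block length=512 bytes\n' | block_size_of
# prints: 512
```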
Outlook
This was it for Part 1 of my ZFS HA adventures. As this is a small side project, progress will probably be somewhat slow. All code will be published in my Codeberg repository for my Linux ZFS cluster. My next plan is to deploy the first head node with all the software I want for management and monitoring. This way, I can test all services and performance without the two-node complexity eating all my time. Thanks for reading :)
Last modified on 2024-12-16