---
layout: post
title: Setting up a homegrown NAS
date: 2021-07-02 14:25
author: neko
---

# Preface

About a year back I decided to install a new NAS in my lab and documented the process. The notes have been lying around on my wiki for a while in a pretty raw shape.

Since I was asked about it recently, I decided to rework them into a proper document.

Please let me know if any info here is incorrect or imprecise, or if you have any ideas on how to improve it!

# Intro

This document details the installation process of a homegrown NAS system running an `md`+`lvm` combination. `md` does the heavy lifting of the software RAID, and `lvm` keeps the volumes tidy.

In this system, the operating system runs off its own SSD and is not stored on the array, therefore no further steps need to be taken to make the md array bootable.

The decision to use software RAID is often an economic one, but I think there's more to it than just saving the money for a proper RAID controller.

Unless you are ready to pay for a proper server-grade RAID controller with enough cache and a working battery backup unit, software RAID is actually preferable in most situations. Software RAID is often called "slow", but the reality is that modern processors are more than fast enough to handle the processing the RAID requires. The gain in flexibility should also be taken into account. The only real drawback is the lack of a dedicated cache for the RAID.

Most off-the-shelf consumer-grade NAS devices use software RAID as well.

## Installation sequence

1. Install Debian
2. Set up md (create the array)
3. Set up the md config for safety
4. Initialise LVM
5. Add LVM volumes
6. Set up SMB
7. Set up exim and monitoring in mdadm
8. Set up hdparm settings and S.M.A.R.T. checks

## RAID setup with mdadm

First, initialise a new RAID 5 on the chosen devices:

`mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc`

The RAID metadata is written to the drives themselves, and the array will automatically be detected by the `md` module and assembled accordingly.

After the RAID has been initialised, `mdadm --detail --scan` will list the currently detected md arrays. For safety, copy the output into `mdadm.conf` to make sure the device node doesn't change one day[^conf].

Finally, set up md monitoring in the config (on Debian, the configuration should already contain placeholders for this)[^mon]:

```
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR notifications@example.com

# definitions of existing MD arrays
# this is a copy of mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 UUID=83a1806c:d3d5461e:550b91bb:cd59045b name=nas-test:0

# This configuration was auto-generated on Thu, 20 Feb 2020 09:58:14 +0000 by mkconf
MAILFROM nas@example.com
```

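For reference, a minimal sketch of how the scanned array definition can be persisted on Debian and the initial sync watched (the config path matches the one referenced later in this document; adjust to your setup):

```sh
# Watch the initial resync of the array
cat /proc/mdstat

# Append the detected array definition to the Debian mdadm config
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# Rebuild the initramfs so the updated config is available at early boot
update-initramfs -u
```
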
If you are not going to use LVM, this is where your RAID is up and running. You can now partition and format your md array, set up hdparm and system mounts, and finally your SMB server. (SMB setup is not covered in this documentation.) A non-interactive sketch of the partitioning follows after this list.

* `fdisk /dev/md0` to partition the array
* use `gdisk /dev/md0` instead if the RAID is bigger than 2 TB, to create a GPT table instead of an MBR table
* `mkfs.xfs /dev/md0p1` to format the first partition with XFS (recommended)

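The same partitioning can also be scripted; a minimal sketch using `sgdisk` (shipped with `gdisk`), assuming the array device is `/dev/md0`:

```sh
# Create one GPT partition (number 1) spanning the whole array, typed as "Linux filesystem";
# sgdisk aligns partitions to 2048 sectors (1 MiB) by default
sgdisk --new=1:0:0 --typecode=1:8300 /dev/md0

# Put an XFS filesystem on the new partition
mkfs.xfs /dev/md0p1
```
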
Now we need to change some parameters for the drives. This is tricky.

Since we are running software RAID, data to be written to the array is first cached in the system cache, at which point the journal of the file system does its job.

However, after that we have a second layer of cache: the hard drive cache. Since our journal is taking care of the md array and not the individual hard drive, we cannot be certain that data written to the drive is actually written onto the platter and not (more likely) still sitting in the cache.

In case of a power loss, it would not be possible to determine if a file has been lost before the transaction from the cache to the platter has occurred.

The safe route here is to disable the cache on all hard drives. This will cost us performance, since there is no dedicated RAID cache like a real hardware RAID controller would have, but data safety is usually more important. For more info, check out this [link (serverfault.com)](https://serverfault.com/questions/134136/how-much-does-hdd-cache-matter-with-linux-softraid).

Add the hard drives you want to disable the cache for to `hdparm.conf` (do this for every drive in the RAID array). This will disable write caching on the drive:

```
/dev/sdX {
    write_cache = off
}
```

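To check and apply the setting immediately without a reboot, `hdparm`'s `-W` flag toggles the drive's write cache directly (a sketch; repeat for every member drive, `/dev/sdX` being the same placeholder as above):

```sh
# Query the current write-cache state
hdparm -W /dev/sdX

# Disable the write cache right away; hdparm.conf makes the setting persistent across reboots
hdparm -W 0 /dev/sdX
```
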
Now, finally, create a systemd mount unit (a `.mount` file) so the mount is automatically set up by the system (this replaces the former `fstab` method).

```
[Unit]
Description=Mount RAID partition

[Mount]
What=/dev/disk/by-uuid/86fef3b2-bdc9-47fa-bbb1-4e528a89d222
Where=/your/mount/point
Type=xfs
Options=defaults

[Install]
WantedBy=multi-user.target
```

A few things to note here:

* `What=` should be the path of the disk by UUID, not by device file like `/dev/md0p1`. Those names can change; the UUID will not. You can find the UUID by running `blkid`. Make sure you use the ID of the partition, not the drive.
* The mount file must be named after the mount folder. So if you mount to `/mnt/storage`, the file should be named `/etc/systemd/system/mnt-storage.mount`. Replace the `/` in your path with `-`.

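If you are unsure about the escaping, `systemd-escape` can generate the unit name for you; for example, for the hypothetical mount point `/mnt/storage`:

```sh
# Prints "mnt-storage.mount"
systemd-escape --path --suffix=mount /mnt/storage
```
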
After setting up the mount file, reload the systemd daemon, start the mount, and enable it (so it is mounted again after a reboot):

```
systemctl daemon-reload
systemctl start mountfile.mount
systemctl enable mountfile.mount
```

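To verify that the unit is active and the filesystem is mounted where expected (again using the `/mnt/storage` example):

```sh
systemctl status mnt-storage.mount
findmnt /mnt/storage
```
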
* Note: When using this in combination with SMB, add a `Before=` clause to the mount file and specify that the mount should start before `smbd.service`, for safety reasons. If the mountpoint is not available by the time Samba starts, the service will fail to start.[^smbmnt]

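With that clause added, the `[Unit]` section of the mount file from above would look like this:

```
[Unit]
Description=Mount RAID partition
Before=smbd.service
```
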
[^smbmnt]: This might not be strictly needed, since local filesystems are always mounted before network services are started, as indicated by a plot from `systemd-analyze plot`.

[^mon]: This is useful in case a drive fails; you want to know as soon as possible. Even better: set up a monitoring system like InfluxDB.

[^conf]: The array configuration does not strictly need to be stored in the config file; it is written to the drives themselves. Having it in the config, however, makes sure your RAID doesn't one day randomly get renamed from md0 to md1, for example (this may only happen if a new RAID array is autodetected).

## LVM configuration

In case you decide to run a more elaborate partition setup on your array, LVM is highly recommended. While it increases the complexity of the disk setup and might be harder to restore in case of catastrophic failure (you should keep off-site backups no matter your setup, anyway), it will give you a lot of flexibility when it comes to managing the data stored on the array.

In my case, I want to have a single partition (a logical volume, in LVM terminology) per SMB share.

First, install LVM:

`apt install lvm2`

Now initialise a physical volume. A physical volume (PV) is the device that will contain the data stored on the logical volumes. Note that this requires the device to be free of any prior file system, since the initialisation data and configuration of the PV are written to the beginning of the device (so again, no configuration file is needed here):

`pvcreate /dev/md0`

If `pvcreate` fails due to a filter, it's because `pvcreate` detected a filesystem structure. This is a safety check to make sure you don't accidentally erase a file system off a drive.

To wipe all remnants of a filesystem you can use: `wipefs -a /dev/md0`

Next, create a volume group. A volume group (VG) is a named entity consisting of one or more physical volumes. Every logical volume has to be associated with a volume group.

`vgcreate name /dev/md0`

`name` is the name for the volume group. It is used to identify the volume group later.

Now create your first logical volume (LV):

`lvcreate -L 30G -n partname name`

`partname` is the name of the partition you want. Change this to reflect what you will store on it. `name` is the name of your volume group.

Once this is finished, you can use the logical volume just like you would use any partition on a hard drive. So now, we need to make a new filesystem on it:

`mkfs.xfs /dev/name/partname`

Note that instead of using device file names, you use the readable names entered when creating the VG and the LV respectively.

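One of the flexibility benefits mentioned earlier is that a logical volume can be grown later while it stays mounted. A sketch, using the placeholder names from above and assuming the volume is mounted at `/your/mount/point`:

```sh
# Add 10 GiB to the logical volume
lvextend -L +10G /dev/name/partname

# Grow the XFS filesystem online to fill the new space (xfs_growfs takes the mount point)
xfs_growfs /your/mount/point
```
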
## Monitoring

When running a RAID you always want to monitor your hard drives for failure and for signs of an impending failure (via S.M.A.R.T., covered later). In case of a drive failure, the defective drive has to be replaced as soon as possible to avoid data loss.

`mdadm` on Debian automatically monitors the configured arrays. To make this a bit more flexible, set up a mail address in the config `/etc/mdadm/mdadm.conf`. If you followed the instructions above, this will already be somewhere in your `mdadm.conf`:

```
MAILADDR notifications@example.com
MAILFROM nas@example.com
```

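To check that alerts actually reach the mailbox, `mdadm` can send a test message for every configured array:

```sh
# Sends a "TestMessage" alert for each array to the configured MAILADDR
mdadm --monitor --scan --oneshot --test
```
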
It is recommended to use a smarthost configuration with exim4 to relay these emails to an external mailbox. (Not covered in this documentation.)

In addition to monitoring the drives for failure with `mdadm`, one should always run S.M.A.R.T. checks on the drives to determine their current health. This can help identify a future drive failure beforehand.

On Linux, `smartmontools` (`smartd`) provides this functionality. The details are not covered in this documentation, since the configuration file is very complex. A manual is installed with every copy of `smartmontools`.

In my configuration, I added the following:

```
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify
/dev/sdc -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify
/dev/sdd -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify
```

This will run a short test on every drive every night, as well as a long test every Saturday night.

Notifications are sent via the shell script referenced by the `-M exec` directive. The script looks as follows:

```sh
#!/bin/sh
# Send email (smartd exports the SMARTD_* variables for the alert)
printf 'Subject: %s\n\n%s\n' "$SMARTD_FAILTYPE" "$SMARTD_MESSAGE" | msmtp "$SMARTD_ADDRESS"
# Notify logged-in users
wall "$SMARTD_MESSAGE"
```

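The script has to be executable for smartd to run it. To verify the whole chain, smartd's `-M test` directive can be added temporarily next to the existing `-M exec` on a drive line; it sends a test mail on startup (remove it again once the mail arrives):

```sh
chmod +x /usr/local/bin/smartdnotify
```
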
## Benchmarking tips

* `gnome-disk-utility` has a benchmark, but it misbehaves with mdadm RAID arrays, showing unrealistically slow write speeds
* `hdparm -T /dev/sda` benchmarks a drive with caches
* `hdparm -t /dev/sda` benchmarks a drive without caches
* mind that hdparm uses raw disk operations and therefore does not work on RAID arrays from mdadm or LVM
* `dd` works fine, but `oflag=sync` should be specified to bypass caching (HDD caching isn't the only type of file caching on Linux); note that these example writes destroy any data on the target device
* `dd if=/dev/zero of=/dev/sda oflag=sync bs=512 count=10000`
* `dd if=/dev/zero of=/dev/sda oflag=sync bs=1M count=1000`
* `dd if=/dev/zero of=/dev/sda oflag=sync bs=10M count=100`
* `dd if=/dev/zero of=/dev/sda oflag=sync bs=100M count=10`
* `dd if=/dev/zero of=/dev/sda oflag=sync bs=2G count=1`

## Partition alignment details

When using `pvcreate` on a raw device like `/dev/md0`, you might want to check that the physical alignment of the partition was done correctly. All modern versions of `lvm2` should account for this automatically, as do all modern partition management tools.

The issue arises when reading a 512 B sector of a drive running 512e (4K physical sectors, but 512 B sectors exposed externally). Reading a single 512 B sector causes 4 KiB to be read, and the drive controller uses processing power to expose only the required 512 B.

Another issue arises when blocks in the file system cover more than one sector. Imagine a filesystem storing data in 4 KiB blocks. On a 512e drive with a misaligned partition, reading 4 KiB off the drive (eight 512 B sectors) can cause two 4K physical sectors to be read, because some of the 512 B sectors sit in one 4K sector and some in another. This can lead to heavy performance loss on 512e drives.

You can manually make sure block alignment for disks with advanced format (512e) is correct. Partitions should start at a logical sector number that is a multiple of 8[^512e].

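One way to double-check this by hand via sysfs, assuming the first partition on `/dev/sda`:

```sh
# Physical vs. logical sector size; a 512e drive reports 4096 and 512 respectively
cat /sys/block/sda/queue/physical_block_size
cat /sys/block/sda/queue/logical_block_size

# Start sector of the first partition; a multiple of 8 means it is aligned to the 4K physical sectors
cat /sys/block/sda/sda1/start
```
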
`gdisk` can also help by showing your hard drive's sector layout when run as `gdisk -l` on your disk:

```
root@nas-test:~# gdisk -l /dev/sda
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries.
Disk /dev/sda: 7814037168 sectors, 3.6 TiB
Model: TOSHIBA HDWQ140
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): A16EAA69-C571-436B-9C02-041A7B761572
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 7814037101 sectors (3.6 TiB)
```

Mind the second to last line, stating the partition alignment.

To check the alignment of an LVM physical volume, use `pvs -o +pe_start --units m` (the unit `m` makes sure the number you see is in MiB, i.e. base-2 IEC units rather than SI-prefixed ones):

```
root@nas:~# pvs -o +pe_start --units m
  PV         VG    Fmt  Attr PSize       PFree       1st PE
  /dev/md127 share lvm2 a--  7630636.00m 3895084.00m   1.00m
```

The `1st PE` column shows that the first physical extent is placed at 1 MiB.

The standard 1 MiB alignment done by most tools is perfectly aligned with 4K sectors[^sa1] as well as with conventional 512 B sectors.

[^sa1]: 2048/8 = 256 (8 being the number of 512 B logical sectors per physical 4K sector).

[^512e]: Eight 512 B sectors are 4096 B, i.e. a single 4K sector.

Most modern tools align partitions correctly, since 4K alignment does no harm on 512 B-sector drives, apart from losing a single MiB at the beginning of the disk and between partitions.