I think at this stage in computing, RAID needs no introduction. It's a common component of servers, workstations, desktops, and even small embedded devices. Still, there is some confusion between the different types of RAID (not to be confused with RAID levels, e.g. 0, 1...). So I'll begin with a comparison of these types, then move on to examples of software RAID in Linux and how it performs.
Index
- Quick Introduction
- Hardware RAID
- FakeRAID
- Software RAID
- Starting Out
- The Setup
- Getting started with Software RAID
- Preparing the hard disks
- Creating Software RAID Arrays
- Monitoring a Software RAID Array
- Recovering from a drive failure
I. Hardware RAID
Hardware RAID, to put it simply, is expensive. Hardware RAID cards commonly come with their own processor and RAM to offload a lot of the RAID logic from the host's CPU. They have their own BIOS modules to configure and manage them, and are typically only found in high performance servers and some workstations where the cost is justified. Adaptec, LSI Logic, and 3ware make true hardware RAID cards.
II. FakeRAID
FakeRAID is an epidemic in consumer workstations, desktops, and embedded systems. It masquerades as hardware RAID due to the BIOS interface it provides, but in reality, it's software RAID that requires a special driver (which is why you need to hit F6 and insert a driver floppy when installing Windows XP). Most chipsets from Intel, nVidia, VIA, and others that tout "on board" RAID fall under this category. Cheap (sub-$100 USD) RAID PCI(e) cards are usually FakeRAID as well, such as those made by Promise Tech. and HighPoint.
The only real benefit to using FakeRAID is if you are dual booting Windows and Linux, which can then "share" the FakeRAID array.
III. Software RAID
True software RAID exists inside of the operating system. Software RAID arrays cannot be shared between operating systems, and they are managed with special software. In Linux, the userland software that manages software RAID arrays is called mdadm - more on this later. Software RAID uses the host's CPU to perform RAID logic, and can be an effective solution on systems with multiple cores.
Starting Out
0. The Setup
Start with 2 (RAID 0 or 1), 4 (RAID 0, 10, or 5), or more hard drives, ideally all of equal size. For simplicity's sake (and because of their prevalence) we'll assume they're all SATA drives. In the examples below, I'll assume you have 2 or 4 identical SATA drives, hooked up to ports SATA0-SATA3 on your motherboard.
Before you continue with software RAID, you might want to benchmark a single disk's performance. The dd command can be used for this, but note that writing to /dev/sda like this overwrites whatever is on the disk:
# dd if=/dev/zero bs=8k count=100000 of=/dev/sda oflag=sync
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 16.4693 s, 49.7 MB/s
Keep that figure (49.7 MB/s) in mind.
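The MB/s figure is simply bytes divided by seconds, in decimal megabytes. As a rough sketch, the calculation can be reproduced with a small helper, which is handy when comparing runs later. The function name mb_per_s is my own and not part of dd:

```shell
#!/bin/sh
# Hypothetical helper (not part of dd): convert a byte count and an elapsed
# time in seconds into decimal megabytes per second, the unit dd reports.
mb_per_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

# Byte count and elapsed time taken from the dd run above.
mb_per_s 819200000 16.4693
```

Feeding it the byte count and elapsed time from any later dd run gives a number that can be compared directly with this baseline.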
1. Getting started with Software RAID
Unfortunately, it's non-trivial to migrate a system onto a software RAID setup, so if you wish to follow along with the steps outlined below, you may want to use a spare machine, or perhaps a virtualization suite such as VMware Workstation. A lot of Linux LiveCDs exist with software RAID tools on them, so I'll be using one of them (specifically, I'll be using the Gentoo 2007.0 AMD64 Minimal LiveCD). Attach your hard drives, boot your system, and you're ready to go!
2. Preparing the hard disks
Before you can actually create the arrays, you'll need to partition the drives. The Gentoo CD provides cfdisk for this task, but if that's unavailable, you can use fdisk or even parted instead.

Fig. 1: Partitioning /dev/sda
The most important caveat when using Linux software RAID is remembering that you cannot boot from a RAID0 or RAID5 partition. Instead, I recommend that the first partition you create be RAID1, which will be your /boot partition. A size of 256M should be sufficient:
Fig. 2: Creating the /boot partition.
Now create the rest of the partitions how you normally would. Keep in mind that the kernel will stripe across swap partitions for you (much like RAID0) when they have the same priority, so you'll probably only need a 256-512 MB swap partition on each disk.
Fig. 3: Finished partitioning!
Once you've created all the partitions, be sure to mark the first partition (which will be RAID1, /boot) with the Boot flag. Then, change the partition type of all the partitions (except the swap partition) to be of type 'fd' (software RAID) if you haven't already done so.
Fig. 4: The final product, to be mirrored to the other drives.
To save ourselves the trouble of having to repeat that process over and over, we'll use the dd command to copy the partition layout to each drive. So assuming you just partitioned /dev/sda:
# dd if=/dev/sda bs=1k count=1 of=/dev/sdb
Repeat the above command for any other hard disks that will be in your array, i.e. /dev/sdc, /dev/sdd, etc. Now we'll tell the kernel to recheck the partition tables on the drives to make sure the correct partition devices are created:
# hdparm -z /dev/sd?
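Because dd overwrites the start of whatever disk it is pointed at, it can be worth generating the commands and reading them over before running anything. The following is only a sketch of that idea: print_clone_commands is a hypothetical helper that prints the dd and hdparm commands from this section without executing them.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them, because dd will
# overwrite the start of whatever disk it is pointed at.
# print_clone_commands is a hypothetical helper name.
# Arguments: the source disk, then one or more target disks.
print_clone_commands() {
    src="$1"
    shift
    for dst in "$@"; do
        echo "dd if=$src bs=1k count=1 of=$dst"
    done
    # Ask the kernel to re-read the partition tables on all disks involved.
    echo "hdparm -z $src $*"
}

print_clone_commands /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Once the printed lines look right, they can be pasted into a root shell one at a time.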
Finally, we're ready to actually create the arrays!
3. Creating Software RAID Arrays
Let's begin by creating that RAID1 array which will be our /boot partition that I spoke of earlier:
# mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: array /dev/md0 started.
If you only have two disks, simply omit the /dev/sdc1 and /dev/sdd1 partitions and change the --raid-devices= to 2.
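The one thing to get right here is that --raid-devices must match the number of partitions listed. As a minimal sketch, a hypothetical helper (print_create, my own name) can build the command line from a list of disks so the count cannot drift. It only prints the command and does not run mdadm:

```shell
#!/bin/sh
# Sketch only: prints an mdadm command rather than running it.
# print_create is a hypothetical helper. Arguments: md device, RAID level,
# partition number, then the member disks.
print_create() {
    md="$1"; level="$2"; part="$3"
    shift 3
    count=$#
    members=""
    for disk in "$@"; do
        members="$members $disk$part"
    done
    echo "mdadm --create $md --level=$level --raid-devices=$count$members"
}

# Two-disk variant of the /boot mirror.
print_create /dev/md0 1 1 /dev/sda /dev/sdb
```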
Now for some fun - let's create a two disk RAID0 array and see how its performance matches up to our earlier single disk benchmark.
Again, we'll create an array from the component partition devices (be sure to use --level=0!):
# mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sda3 /dev/sdb3
And benchmark it!
# dd if=/dev/zero of=/dev/md1 bs=8k count=100000
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 7.50833 s, 109 MB/s
Nice! Now we'll stop the array and create a 4 disk RAID0 array in its place.
# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
# mdadm --create /dev/md1 --raid-devices=4 --level=0 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
mdadm: /dev/sda3 appears to be part of a raid array:
level=raid0 devices=2 ctime=Sat Oct 6 17:24:05 2007
mdadm: /dev/sdb3 appears to be part of a raid array:
level=raid0 devices=4 ctime=Sat Oct 6 16:24:45 2007
mdadm: /dev/sdc3 appears to be part of a raid array:
level=raid0 devices=2 ctime=Sat Oct 6 17:24:05 2007
mdadm: /dev/sdd3 appears to be part of a raid array:
level=raid0 devices=4 ctime=Sat Oct 6 16:24:45 2007
Continue creating array? y
mdadm: array /dev/md1 started.
Let's try various block sizes (bs) to see how that impacts performance:
# dd if=/dev/zero of=/dev/md1 bs=4k count=10000 oflag=sync
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 1.5172 s, 27.0 MB/s
# dd if=/dev/zero of=/dev/md1 bs=64k count=10000 oflag=sync
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 7.56629 s, 86.6 MB/s
# dd if=/dev/zero of=/dev/md1 bs=128k count=10000 oflag=sync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 9.05751 s, 145 MB/s
# dd if=/dev/zero of=/dev/md1 bs=256k count=10000 oflag=sync
10000+0 records in
10000+0 records out
2621440000 bytes (2.6 GB) copied, 12.0935 s, 217 MB/s
# dd if=/dev/zero of=/dev/md1 bs=512k count=10000 oflag=sync
10000+0 records in
10000+0 records out
5242880000 bytes (5.2 GB) copied, 28.4811 s, 184 MB/s
Eventually, increasing the block size stops helping: throughput peaks at 217 MB/s with 256k blocks and falls back to 184 MB/s at 512k. That peak is a little over 4 times our original single-disk throughput of 49.7 MB/s, about what four striped disks can be expected to deliver. The big benefits of RAID0 here are going to be reading and writing large files and the extra space it provides. Performance begins to change at file sizes within an order of magnitude or so of the stripe size.
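To put the runs side by side, each result can be divided by the single-disk baseline. This is just a sketch of the arithmetic: speedup is a hypothetical helper name, and the rates are the ones dd reported above.

```shell
#!/bin/sh
# Hypothetical helper: ratio of an array's throughput to the single-disk baseline.
speedup() {
    awk -v array="$1" -v single="$2" 'BEGIN { printf "%.2f\n", array / single }'
}

baseline=49.7   # single-disk write figure from the first benchmark
# Rates reported by dd for bs=4k, 64k, 128k, 256k and 512k, in that order.
for rate in 27.0 86.6 145 217 184; do
    echo "$rate MB/s -> $(speedup "$rate" "$baseline")x"
done
```

Keep in mind that the baseline was measured with bs=8k, so the ratios are indicative rather than exact.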
Compared to write performance, raw read performance is fairly consistent across block sizes. You shouldn't have any problems achieving between 210 and 240 MB/s. This is actually comparable to the read speeds of a 3ware 9500S hardware RAID card with 8 hard disks in RAID5!
Ok, so now we've seen the performance of a RAID0 array. Let's say we want to balance redundancy and disk space: RAID5. Again, we'll stop the array, then recreate it:
# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
# mdadm --create /dev/md1 --raid-devices=4 --level=5 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
mdadm: /dev/sda3 appears to be part of a raid array:
level=raid0 devices=4 ctime=Sat Oct 6 17:29:51 2007
mdadm: /dev/sdb3 appears to be part of a raid array:
level=raid0 devices=4 ctime=Sat Oct 6 17:29:51 2007
mdadm: /dev/sdc3 appears to be part of a raid array:
level=raid0 devices=4 ctime=Sat Oct 6 17:29:51 2007
mdadm: /dev/sdd3 appears to be part of a raid array:
level=raid0 devices=4 ctime=Sat Oct 6 17:29:51 2007
Continue creating array? y
mdadm: array /dev/md1 started.
If you look at your hard drive light, you might see a flurry of activity. What's going on? In order for the RAID5 array to be redundant, it must synchronize the information on the hard disks. Think of this as establishing an initial state: it brings all the data on the hard drives to a known good state in preparation for your use. During this time, you can still use the /dev/md1 device, because it can "keep track" of where you write to. In the meantime, however, your disk performance will be severely impacted. You'll need to wait for it to finish synchronizing before you can benchmark it, but how will you know when it's done?
4. Monitoring a software RAID array
The poor man's method is to watch a special proc filesystem file, /proc/mdstat. Reading this file will give you the down and dirty status of all software RAID devices on your system. Reading it is easy:
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md1 : active raid5 sdd3[4] sdc3[2] sdb3[1] sda3[0]
232950336 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
[=>...................] recovery = 8.8% (6850560/77650112) finish=45.8min speed=25708K/sec
unused devices: <none>
Gonna take a while to finish. Might want to throw on a movie or watch a TV show while it synchronizes. Note that if you reboot during a [re]synchronization, you'll lose your progress and the synchronization will have to start over. To avoid having to reissue the cat command, we'll use the watch utility to reissue the command after a set interval of time.
# watch -n 1 'cat /proc/mdstat'
Every 1.0s: cat /proc/mdstat Sat Oct 6 17:57:45 2007
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md1 : active raid5 sdd3[4] sdc3[2] sdb3[1] sda3[0]
232950336 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
[===>.................] recovery = 17.8% (13891968/77650112) finish=42.1min speed=25236K/sec
unused devices: <none>
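If you'd rather have a script pick out the progress figure than watch the screen, the percentage can be extracted from the same text. The following is a minimal sketch that assumes the line format shown above. md_progress is a hypothetical helper, not something mdadm provides:

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# percentage from the first "recovery = N%" or "resync = N%" line.
# Prints nothing if no such line is present.
md_progress() {
    awk '/recovery *=|resync *=/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%$/, "", $i); print $i; exit }
    }'
}

# On a live system you would run: md_progress < /proc/mdstat
# Here the sample is the output shown earlier in this section.
md_progress <<'EOF'
md1 : active raid5 sdd3[4] sdc3[2] sdb3[1] sda3[0]
      232950336 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................] recovery = 8.8% (6850560/77650112) finish=45.8min speed=25708K/sec
EOF
```

An empty result means no rebuild line was found, which a wrapper loop could treat as "done".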
The mdadm utility provides a more robust method to monitor your software RAID arrays. You can use mdadm to monitor the array and send you an email if its status changes, or send an alert to a program (such as a shell script). It's also possible to run the mdadm command periodically via a cron job. Most Linux distributions allow you to specify an email address to send alerts to in the /etc/mdadm.conf file and will launch mdadm to monitor the RAID arrays on system startup. Finally, if you simply want to check the status of your array using a script, you don't need to process the output of cat /proc/mdstat; mdadm provides the --test flag (used together with --detail in misc mode), which will "exit status 0 if ok, 1 if degrade, 2 if dead, 4 if missing."
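A script can branch on that exit status directly. Here is a small sketch that maps the four documented values to a message. describe_md_status is a hypothetical helper, and the commented-out mdadm line assumes the flag is combined with --detail, as the mdadm manual describes:

```shell
#!/bin/sh
# Hypothetical helper: turn the exit status quoted above into a message.
describe_md_status() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "degraded" ;;
        2) echo "dead" ;;
        4) echo "missing" ;;
        *) echo "unknown status $1" ;;
    esac
}

# Intended use on a real system (needs root and an existing array):
#   mdadm --detail --test /dev/md1 >/dev/null 2>&1
#   describe_md_status $?
describe_md_status 1
```

From a cron job, anything other than "ok" could be mailed to whoever looks after the machine.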
Once the array has finally finished synchronizing, we can benchmark it:
# dd if=/dev/zero of=/dev/md1 bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 9.51529 s, 68.9 MB/s
# dd if=/dev/zero of=/dev/md1 bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 15.1312 s, 86.6 MB/s
# dd if=/dev/zero of=/dev/md1 bs=256k count=10000
10000+0 records in
10000+0 records out
2621440000 bytes (2.6 GB) copied, 29.2271 s, 89.7 MB/s
Once again, increasing the block size increased write throughput, but the maximum appears to be much lower this time, under 100 MB/s.
Checking read speed, we again see a fairly constant throughput independent of block size, at around 170 MB/s, which is roughly 20-30% lower than the 210-240 MB/s we saw with RAID0:
# dd if=/dev/md1 of=/dev/null bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 3.89226 s, 168 MB/s
# dd if=/dev/md1 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 7.63301 s, 172 MB/s
# dd if=/dev/md1 of=/dev/null bs=256k count=10000
10000+0 records in
10000+0 records out
2621440000 bytes (2.6 GB) copied, 15.3274 s, 171 MB/s
We've taken a severe write penalty and a mild read penalty for using RAID5. Of course, we've gained redundancy, which can be invaluable. Now that we have redundancy, what happens when a drive fails?
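The space side of that trade can be put in numbers as well. RAID5 gives up one device's worth of capacity to parity, so four members yield three members' worth of usable space. As a quick sketch, with raid_usable being a hypothetical helper and the per-device size taken from the /proc/mdstat output earlier:

```shell
#!/bin/sh
# Hypothetical helper: usable size of an array built from equally sized
# members. Arguments: RAID level, number of devices, size of one device.
raid_usable() {
    level="$1"; n="$2"; size="$3"
    case "$level" in
        0) echo $(( n * size )) ;;          # striping: all space is usable
        1) echo "$size" ;;                  # mirroring: one device's worth
        5) echo $(( (n - 1) * size )) ;;    # one device's worth goes to parity
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

# Four members of 77650112 blocks each, as reported by /proc/mdstat.
raid_usable 5 4 77650112
```

The result matches the 232950336 blocks that /proc/mdstat reported for md1, whereas the same four partitions in RAID0 would have offered all four members' worth.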
5. Recovering from a drive failure
While we could simply unplug a hard drive to simulate a failure, mdadm provides a mechanism via the --manage --fail flags to mark a device as failed. Once the device has failed, we can remove the device and hot-add it back into the array.
First we'll mark /dev/sdc3 as failed:
# mdadm --manage --fail /dev/md1 /dev/sdc3
mdadm: set /dev/sdc3 faulty in /dev/md1
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md1 : active raid5 sdd3[3] sdc3[4](F) sdb3[1] sda3[0]
232950336 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
unused devices: <none>
Now we'll remove /dev/sdc3 from the array completely:
# mdadm --manage --remove /dev/md1 /dev/sdc3
mdadm: hot removed /dev/sdc3
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md1 : active raid5 sdd3[3] sdb3[1] sda3[0]
232950336 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
unused devices: <none>
If we were to actually replace a dead drive at this step, it would be necessary to first partition that hard drive. As before, you can do this by using dd to copy the partition table (first 1 KB of the disk) from another disk in the array to the new disk. After issuing the hdparm -z /dev/sdX command, the partition devices should appear and you can add the new disk into the array. In this case, we'll simply add the /dev/sdc3 partition back into the array:
# mdadm --manage --add /dev/md1 /dev/sdc3
mdadm: re-added /dev/sdc3
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md1 : active raid5 sdc3[2] sdd3[3] sdb3[1] sda3[0]
232950336 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
[>....................] recovery = 0.0% (25472/77650112) finish=101.4min speed=12736K/sec
unused devices: <none>
As you may have suspected, the array will need to resynchronize itself after the new drive is added. As before, the array can still be used while it's being resynchronized.
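In the status output above, the underscore in [UU_U] is what marks the missing member. A script can use that to spot degraded arrays without any help from mdadm. The following is only a sketch that assumes the /proc/mdstat layout shown in this article, and md_degraded is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# name of every array whose member map (e.g. [UU_U]) contains an underscore.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }    # remember which array we are in
        {
            # Find a member map such as [UUUU] or [UU_U] on this line.
            if (match($0, /\[[U_]+\]/) && substr($0, RSTART, RLENGTH) ~ /_/)
                print name
        }
    '
}

# On a live system you would run: md_degraded < /proc/mdstat
# Here the sample is the output shown after removing /dev/sdc3.
md_degraded <<'EOF'
md1 : active raid5 sdd3[3] sdb3[1] sda3[0]
      232950336 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
EOF
```

Note that an array that is still rebuilding also shows an underscore until the recovery completes, so this flags it as degraded for the duration.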
Thanks
Hi, just wanted to say thanks for a very nice hands-on software RAID tutorial. I finished reading it as my own software RAID 5 was being built. Thanks for your time and effort on this.