April 12, 2006

Recovering RAID on Linux in Rescue Mode

The software RAID on Linux works very well.  Our backup machine (still using Red Hat 9) uses a RAID‐5 with 6 drives: 4 hot and 2 reserve.  When a main drive fails, a reserve is auto‐magically brought in (thanks to the “md” daemon).

But last Sunday we had a hard crash.  Instead of a sector failure that would cause “md” to bring in a reserve drive, my “hdg” drive generated an Interrupt 15 that caused the kernel to panic and the system to halt.  Rebooting allowed the system to come back up, but sure enough, after 3 hours of resynching the disks, another interrupt 15 occurred.

So I did the next logical step:  since hdg is causing the problem, I unplugged the power to it, expecting the RAID to come up in degraded mode, bring in a new drive, and recover gracefully.  Alas, it failed to even boot.  In the end, I had to:

  • Use “linux rescue” from a CD
  • Use “mdadm” to reassemble the disks
  • Use “mdadm” to fail one disk and bring in a new one
  • Speed up the RAID resynching process
  • Reinitialize grub (the boot loader)
  • Convert an ext2 file system to ext3 (because I accidentally ran “fsck” not “fsck.ext3” on the disk)

I found commands for all of this in different locations, so I wanted to consolidate them here:

Use “linux rescue” from a CD

The first disk of a Red Hat distribution also includes a rescue mode that can be started by typing “linux rescue” at the “boot:” prompt.  It still puts you through some configuration screens (like setting your keyboard type).  In my case I didn’t turn on networking, and since my RAID devices were unhappy, having it try to detect my file systems caused it to hang, so I had to say “no” to file system detection and reassemble the RAID file systems by hand.

Use “mdadm” to reassemble the disks

In my configuration, I have a /dev/md0 that is RAID‐1 and contains “/boot” (made from /dev/hda1 through /dev/hdg1), and /dev/md1 that is RAID‐5 and contains “/” (made from /dev/hda2 through /dev/hdg2). 

Once at the rescue prompt, I was able to use mdadm to reassemble the RAID devices manually:

Using vi, I created an /etc/mdadm.conf file with the following lines:

DEVICE /dev/hd[abcdefg]1
DEVICE /dev/hd[abcdefg]2

This tells mdadm the devices to scan from which it can reconstruct the RAID devices.  You may need to use “sda1” and so on if you are using SATA drives.
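As a side note, newer versions of mdadm (newer than what ships with Red Hat 9, so treat this as an assumption for old rescue environments) accept a single catch-all line instead of enumerating devices, which scans everything listed in /proc/partitions:

DEVICE partitions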

After mdadm.conf is initialized, I ran

mdadm --examine --scan >> /etc/mdadm.conf

This adds entries in the mdadm.conf file for any RAID devices that it finds on those disks (in my case /dev/md0 and /dev/md1).
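For reference, the appended lines look roughly like the following.  This is illustrative only; the level, device count, and UUID values come from your own disks’ superblocks, so don’t copy these literally:

ARRAY /dev/md0 level=raid1 num-devices=6 UUID=...
ARRAY /dev/md1 level=raid5 num-devices=4 UUID=...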

I was then able to assemble the disks using:

mdadm --assemble --scan /dev/md0
mdadm --assemble --scan /dev/md1

To check the status of the disks, use:

cat /proc/mdstat

Use “mdadm” to fail one disk and bring in a new one

In my case /dev/hdg was causing all the problems.  I was able to remove hdg from the RAID devices using:

mdadm /dev/md0 --fail /dev/hdg1 --remove /dev/hdg1
mdadm /dev/md1 --fail /dev/hdg2 --remove /dev/hdg2

and then bring in a new drive using:

mdadm /dev/md0 --add /dev/hde1
mdadm /dev/md1 --add /dev/hde2

Looking at “/proc/mdstat” then showed that it was actually working…
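To keep an eye on the rebuild without retyping the command, something like the following works if “watch” is present in your rescue environment (I’m not certain it ships on the Red Hat 9 rescue disk):

watch -n 5 cat /proc/mdstat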

Speed up the RAID resynching process

Every time I tried something (and failed) I had to wait for the disks to resync.  This would take hours.  I could reduce it to 90 minutes by telling “md” to speed up the process:

echo 50000 > /proc/sys/dev/raid/speed_limit_min
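The value is in KB/s per device, so 50000 asks for roughly 50 MB/s.  There is also a matching ceiling that may need raising if the resync still crawls; 200000 here is just an illustrative value, so pick something your drives can actually sustain:

echo 200000 > /proc/sys/dev/raid/speed_limit_max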

Reinitialize grub (the boot loader)

After all of this I was able to mount my RAID drives in rescue mode:

mkdir /mnt/big
mount /dev/md1 /mnt/big

But for some reason, while my file systems appeared to be happy, grub (the boot loader) would hang on “Grub loading Stage2”.  I learned that I had to reinitialize grub.  Once I mounted the disk with my root file system, I found the grub command‐line configuration tool in /sbin/grub (/mnt/big/sbin/grub with the way that I mounted it).

After running “grub”, I was greeted with a “grub >” prompt and ran:

device (hd0) /dev/hda
root (hd0,0)
setup (hd0)

I was able to take such a minimalist approach because my “boot/” file system (in /dev/md0) already had the grub configuration set to boot off of /dev/md1.  So I was simply rewriting the MBR of /dev/hda to tell grub where to go.
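Since “/boot” is mirrored across several drives, the same few grub commands can, I believe, be repeated for each one, mapping each drive as hd0 in turn so that its own MBR points at its own copy of /boot.  A hypothetical example for a second drive at /dev/hdc:

device (hd0) /dev/hdc
root (hd0,0)
setup (hd0)

That way the machine can still boot if the first drive dies outright.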

Convert an ext2 file system to ext3

I made an error and ran “fsck” on my ext3 disk, which converted it to an ext2 disk (I should have run “fsck.ext3” instead).  The net result was that when the machine tried to reboot, it looked in /etc/fstab, saw that /dev/md1 should have been an ext3 disk, found it set up as ext2, and refused to mount it.  So I had to convert /dev/md1 back to an ext3 disk.  In this case, it was very simple (after assembling the /dev/md1 disk with mdadm in rescue mode):

umount /dev/md1
tune2fs -j /dev/md1
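To confirm that the journal was actually created (i.e., that the file system is ext3 again), you can list the file system features and look for “has_journal” in the output:

tune2fs -l /dev/md1 | grep features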

After this, I was able to reboot and have the system finally come up clean.

I hope you never have to manually set up RAID disks in rescue mode, but if you ever do, I hope this helps.

8 responses to “Recovering RAID on Linux in Rescue Mode”

  1. When re‐assembling arrays, you probably don’t need to create the mdadm.conf file with the DEVICE lines.
    Instead, you can probably just get away with:
    # mdadm --examine --scan >> /etc/mdadm.conf
    # mdadm --assemble --scan
    Which will automatically assemble any existing RAID devices. (Tested with the CentOS5 install DVD in “rescue” mode.)

  2. You should put the boot loader on the mbr of all your hard drives. With your setup, if hda would fail, your system would not be able to boot.

  3. Thanks for a great writeup, it was extremely helpful! My ‘linux rescue’ session did not create mdX devices above md0, so I had to create them manually before running the “mdadm --assemble --scan /dev/mdX” commands:
    mknod /dev/md1 b 9 1
    mknod /dev/md2 b 9 2
    mknod /dev/md3 b 9 3
    … etc.

  4. Hi,
    Thanks for your great article. It is a real life saver.
    Just for more info, to gather the key points together:
    If you are stuck in rescue mode with unregistered LVMs,
    try: lvm vgchange -ay
    and the magic happens in /dev.

Comments are closed.