This is what happened to me the other day on my RAIDZ-1:
$ sudo zpool status apool -x
  pool: apool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 3h53m with 0 errors
config:

        NAME        STATE     READ WRITE CKSUM
        apool       DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            b       ONLINE       0     0     0
            c       FAULTED      0   140     0  too many errors
            d       ONLINE       0     0    11

errors: No known data errors
It is as bad as it looks. I had a ZFS pool with 3 disks and one decided to fail. I had seen some SMART warnings but never gave them the attention they needed ... my bad ... now it needs to be taken care of.
Identifying the device
The first step is to identify the device. A first method is to get each device's serial number using smartctl. Obviously, if your disk is no longer reachable by smartctl, you'll have to get the healthy ones' serials and work by deduction from there.
Here's the smartctl command to get a device's serial:
$ sudo smartctl -i /dev/sdXXX | grep -i 'Serial Number'
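If you want the serial of every disk at once, a quick shell loop does the trick too (a rough sketch, assuming your disks show up as /dev/sd? — adjust the glob to your setup):
for disk in /dev/sd?; do
    echo -n "$disk: "
    sudo smartctl -i "$disk" | grep -i 'Serial Number'
done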
Another way is to use ledctl from the ledmon package, a small piece of software that lets you control storage LEDs and thus physically identify your device.
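For example, to make the suspect drive blink (assuming here that the failing disk sits at /dev/sdc and that your enclosure or backplane is supported by ledmon):
$ sudo ledctl locate=/dev/sdc
$ sudo ledctl locate_off=/dev/sdc
The first command lights the drive's locate LED, the second turns it off once you have found it.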
Here's how the pool looks once the failing disk has been removed:
$ sudo zpool status
  pool: apool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 4h5m with 0 errors
config:

        NAME                      STATE     READ WRITE CKSUM
        apool                     DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            b                     ONLINE       0     0     0
            11743263287665849900  FAULTED      0     0     0  was /dev/mapper/c
            d                     ONLINE       0     0     0

errors: No known data errors
It is still working, but in a degraded state, which means that hopefully no other disk goes sick!
Replacing the disk in the pool
Once you get your new device, here's the process to replace it in the zfs pool.
$ sudo zpool status
  pool: apool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 3h55m with 0 errors
config:

        NAME                      STATE     READ WRITE CKSUM
        apool                     DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            b                     ONLINE       0     0     0
            11743263287665849900  UNAVAIL      0     0     0  was /dev/mapper/c
            d                     ONLINE       0     0     0

errors: No known data errors
First put the old device offline:
sudo zpool offline apool 11743263287665849900
And finally replace it with the newly installed disk:
sudo zpool replace apool /dev/mapper/c
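Here the new disk was set up under the same /dev/mapper/c path as the old one, so zpool replace only needs the single device argument. If your new disk shows up at a different path, you would pass both the old and the new device, something like this (hypothetical device name):
sudo zpool replace apool 11743263287665849900 /dev/mapper/newdisk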
Now the pool is rebuilding using the new disk (resilvering in the zfs world):
$ sudo zpool status
  pool: apool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress
        116G scanned out of 4.53T at 264M/s, 4h52m to go
        38.5G resilvered, 2.49% done
config:

        NAME                        STATE     READ WRITE CKSUM
        apool                       DEGRADED     0     0     0
          raidz1-0                  DEGRADED     0     0     0
            b                       ONLINE       0     0     0
            replacing-1             OFFLINE      0     0     0
              11743263287665849900  OFFLINE      0     0     0  was /dev/mapper/c/old
              c                     ONLINE       0     0     0  (resilvering)
            d                       ONLINE       0     0     0

errors: No known data errors
With the above command, you can see the progress (38.5G resilvered, 2.49% done) and the expected duration (4h52m to go).
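If you don't feel like re-running the command by hand, watch can refresh it for you (just a convenience, here every 60 seconds):
$ sudo watch -n 60 zpool status apool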
Hopefully, you will end up with something like this: scan: resilvered 1.51T in 5h2m with 0 errors.
My resilvering took a bit more than 5 hours. After that, my pool was back in shape. Thanks zfs!