Endless Resilvering
Jason Gullickson
2013-06-21 15:57:41 UTC
I have a large (100TB+) ZFS pool that has been through some contortions and
is now recovering from them. At the moment things are looking better, but
for some reason it keeps restarting the resilver process, and this is
giving me reservations about putting it back into production.

The short version of the story is that the pool experienced two drive
failures close together in time. When the first disk failed, I "replaced"
it with a spare. A second disk then failed, which I replaced with another
spare, but the second disk failed before the resilvering for the first had
completed.

I took the system down to physically replace the failed disks, and when it
came up, I used "replace" to replace the failed disks with the new ones (I
use the disk/by-id device names so the new disks have new names). The
problem is that it seems to resilver for a few days, get within perhaps a
day of being finished (based on the estimated completion time) and then the
next day I find that it's re-started the resilvering process. Each time
the estimated completion time seems to be going down, but it's not clear to
me if it's ever going to finish, and I'm concerned that replacing the two
drives at the same time may have created a scenario where the system is
"chasing its tail" or something.

At this point it looks to me like one of the spares has been completely
resilvered and integrated into the RAIDZ, but the second spare doesn't look
caught up (I could be misinterpreting this). I'm considering a few options
to try and get the resilver to finish (faster), including moving some data
to another array (it's at 85% capacity at the moment), or off-lining one of
the spares to reduce the amount of recovery work being done, etc., but I
want to get some feedback before making any changes that might impact the
currently stable state of the data, or slow down the overall recovery
process.

Any feedback is appreciated!



Here's the current output of zpool status:

pool: pod
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver in progress for 20h24m, 18.32% done, 90h59m to go
config:

NAME                                                   STATE     READ WRITE CKSUM
pod                                                    DEGRADED     0     0   645
  raidz1-0                                             DEGRADED     0     0 1.26K
    disk/by-id/wwn-0x5000c5003f216f9a                  ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWPK      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQAM      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPVD      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ2Y      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CVA3      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQHC      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPWW      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09X3Z      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ87      ONLINE       0     0     0
    spare-10                                           DEGRADED     0     0     0
      replacing-0                                      DEGRADED     0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ9N  UNAVAIL      0     0     0  cannot open
        disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F20T1K  ONLINE       0     0     0  436G resilvered
      disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09BJN    ONLINE       0     0     0  436G resilvered
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQG7      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQKM      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQEH      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09C7Y      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWRF      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ7Y      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0C7LN      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQAD      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CBRC      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPZM      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPT9      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ0M      ONLINE       0     0     0
    spare-23                                           DEGRADED     0     0    26
      replacing-0                                      DEGRADED     0     0     0
        disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWB3  UNAVAIL      0     0     0  cannot open
        disk/by-id/scsi-SATA_ST3000DM001-1CH_W1F226B4  ONLINE       0     0     0  436G resilvered
      disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CCMV    ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0D6NL      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWA1      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CVL6      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0D6TT      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BPVX      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09BGJ      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0C9YA      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09B50      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0AZ20      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BKJW      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F095Y2      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F08YLD      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQGQ      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0B2YJ      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQBY      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0C9WZ      ONLINE       0     0     4  1.60M resilvered
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQGE      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0BQ5C      ONLINE       0     0     0
    disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CWWH      ONLINE       0     0     0
spares
  disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F0CCMV        INUSE     currently in use
  disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F09BJN        INUSE     currently in use

errors: No known data errors
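
One way to confirm that the resilver really is restarting, rather than relying on a daily glance at the estimate, would be to sample the "scrub:" line periodically and compare the elapsed time between samples. The following is only a minimal sketch: it assumes the progress line keeps the exact format shown in the output above (other ZFS versions print it differently), and the function names and the second sample line are made up for illustration.

```python
import re

# Matches the progress line format from the zpool status output above, e.g.
# "scrub: resilver in progress for 20h24m, 18.32% done, 90h59m to go".
# Assumption: other ZFS versions may word this line differently.
PROGRESS_RE = re.compile(
    r"resilver in progress for (\d+)h(\d+)m, ([\d.]+)% done, (\d+)h(\d+)m to go"
)


def parse_progress(status_text):
    """Return (elapsed_minutes, percent_done, remaining_minutes), or None
    if no resilver progress line is found in the text."""
    match = PROGRESS_RE.search(status_text)
    if match is None:
        return None
    eh, em, pct, rh, rm = match.groups()
    return int(eh) * 60 + int(em), float(pct), int(rh) * 60 + int(rm)


def resilver_restarted(previous, current):
    """Treat elapsed time going backwards between two samples as a restart."""
    return current[0] < previous[0]


if __name__ == "__main__":
    # First sample is the line from the post; the second is a hypothetical
    # later sample in which the elapsed time has dropped back.
    earlier = parse_progress(
        "scrub: resilver in progress for 20h24m, 18.32% done, 90h59m to go"
    )
    later = parse_progress(
        "scrub: resilver in progress for 2h10m, 1.95% done, 108h30m to go"
    )
    print(earlier, later, resilver_restarted(earlier, later))
```

Fed with the text of `zpool status pod` from a periodic job and logged with a timestamp, this would show exactly when each restart happens, which can then be lined up against system logs.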