Checksum errors on a drive currently being replaced

Discussion:

Tan Chee Eng

2013-10-11 09:51:53 UTC

Hi,

I've tried my best to search for the answer for this, but there was too
much noise -- in pretty much all the discussions on checksum errors and
replacing drives, people are asking about replacing a drive after seeing
checksum errors. I'm hoping somebody here would be able to help me
understand this.

One of my pools experienced two hard drive failures (fortunately from two
different vdevs). I've replaced the hard drives, but I've been seeing CKSUM
errors popping up on the device that's being replaced. Here's a dump of
"zpool status" on my machine:

pool: archivepool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h1m, 0.02% done, 107h55m to go

NAME                                                            STATE     READ WRITE CKSUM
archivepool                                                     DEGRADED     0     0     0
  raidz1-0                                                      DEGRADED     0     0     0
    replacing-0                                                 DEGRADED     0     0   277
      6312875129823442296                                       UNAVAIL      0     0     0  was /dev/disk/by-id/usb-WD_Ext_HDD_1021_574D415A4132393938343534-0:0
      disk/by-id/usb-Seagate_Backup+_Desk_NA5KKM31-0:0          ONLINE       0     0     0
    disk/by-id/usb-WD_Ext_HDD_1021_5743415A4132353337343532-0:0 ONLINE       0     0     0
    disk/by-id/usb-WD_Ext_HDD_1021_5743415A4133343235393936-0:0 ONLINE       0     0     0
    disk/by-id/usb-WD_Ext_HDD_1021_5743415A4133323837353135-0:0 ONLINE       0     0     0
  raidz1-1                                                      DEGRADED     0     0     0
    disk/by-id/usb-BUFFALO_External_HDD_01092000209C-0:0        ONLINE       0     0     0
    disk/by-id/usb-BUFFALO_HD-CXU2_0010100702091C720-0:0        ONLINE       0     0     0
    disk/by-id/usb-BUFFALO_HD-CXU2_00101007020C94C80-0:0        ONLINE       0     0     0
    replacing-3                                                 DEGRADED     0     0   132
      4051679456373771497                                       UNAVAIL      0     0     0  was /dev/disk/by-id/usb-BUFFALO_HD-CXU2_0010100702071DD80-0:0
      disk/by-id/usb-Seagate_Backup+_Desk_NA5KKM3P-0:0          ONLINE       0     0     0

errors: No known data errors
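
For anyone who wants to pull the non-zero CKSUM counters out of output like this, here is a minimal Python sketch. It assumes the usual one-row-per-device layout (NAME STATE READ WRITE CKSUM) with plain integer counters; the sample text and its shortened device names (old-disk, new-disk) are hypothetical, and nonzero_cksum is just an illustrative helper, not part of any ZFS tooling.

```python
import re

# Hypothetical sample in the one-row-per-device layout, device names shortened.
SAMPLE = """\
NAME              STATE     READ WRITE CKSUM
archivepool       DEGRADED     0     0     0
  raidz1-0        DEGRADED     0     0     0
    replacing-0   DEGRADED     0     0   277
      old-disk    UNAVAIL      0     0     0  was /dev/disk/by-id/old
      new-disk    ONLINE       0     0     0
"""

# name, state, then three integer counters; the header row does not match
# because "READ" is not a number.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")

def nonzero_cksum(status_text):
    """Return {name: cksum} for every row whose CKSUM counter is non-zero."""
    hits = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(5)) > 0:
            hits[m.group(1)] = int(m.group(5))
    return hits

print(nonzero_cksum(SAMPLE))
```

On the sample, only the replacing-0 row is reported; the disks underneath it and the parent vdev all show zero.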

--
--
To post to this group, send email to zfs-fuse-/***@public.gmane.org
To visit our Web site, click on http://zfs-fuse.net/
---
You received this message because you are subscribed to the Google Groups "zfs-fuse" group.
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-fuse+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit https://groups.google.com/groups/opt_out.

Ethan

2013-10-24 20:14:32 UTC

I don't know if you found your answer yet (I don't look at this list very
often), but from what I understand, you should be fine. I'm no expert, just
a user who has encountered occasional device failure and drive corruption,
so take this with a grain of salt.

The fact that you have "errors: No known data errors" means you haven't
lost data. If you had lost data, you'd see a cksum error on the vdev and/or
pool, that message would say "errors: 149 data errors, use '-v' for a
list", and the status would say "status: One or more devices has
experienced an error resulting in data corruption. Applications may be
affected." (I don't know if it would say this during the resilvering or
after, though.)
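
If it helps, the check on the "errors:" line can be sketched in Python. This is only an illustration based on the two message forms quoted in this thread; data_error_count is a made-up helper, and other wordings of that line are not handled.

```python
import re

def data_error_count(status_text):
    """Return 0 for 'No known data errors', N for 'N data errors',
    or None if no recognised 'errors:' line is present."""
    for line in status_text.splitlines():
        line = line.strip()
        if not line.startswith("errors:"):
            continue
        msg = line[len("errors:"):].strip()
        if msg.startswith("No known data errors"):
            return 0
        m = re.match(r"(\d+) data errors", msg)
        if m:
            return int(m.group(1))
    return None

print(data_error_count("errors: No known data errors"))
print(data_error_count("errors: 149 data errors, use '-v' for a list"))
```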

I don't know why you have checksum errors on the replacing devices, and I
can't recall whether I have seen that myself when replacing devices, but
I'd wait for the resilver to finish (which I suppose it has by now), do a
scrub, and see if everything looks clear. Keep an eye on it in the future
and scrub from time to time. If the drive shows further checksum errors,
I'd replace it.

Post by Tan Chee Eng
As you can see, the pool is resilvering, but I'm seeing checksum errors on
replacing-0 and replacing-3. I understand what checksum errors on the pool,
vdev, or device mean, but what do they mean on a drive that isn't even a
part of the vdev yet? Is my new drive bad, or did I just lose data?
Regards,
Chee Eng

Tan Chee Eng

2013-10-25 01:38:28 UTC

Hi Ethan,

Thanks for replying. As an update on my situation: the resilver was
extremely slow and finally crashed zfs-fuse a few days later. I switched
to ZFS on Linux, at which point I realised something: the pool was created
for drives with 512-byte sectors (ashift=9), but the drives I bought to
replace them are Advanced Format drives (4096-byte physical sectors). It
looks like my only option now is to back up my data somewhere and recreate
the pool with ashift=12.
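
To make the mismatch concrete, here is a small Python sketch of the arithmetic. ashift is the base-2 logarithm of the sector size the pool assumes, so ashift=9 means 512 bytes and ashift=12 means 4096 bytes. The helper names and the example offsets are made up for illustration, and nothing in this thread establishes that the mismatch caused the checksum errors; it only shows why writes sized for 512-byte sectors need not line up with 4096-byte physical sectors.

```python
def sector_size(ashift):
    """Sector size in bytes implied by an ashift value (2 ** ashift)."""
    return 1 << ashift

def is_aligned(offset, length, physical_sector=4096):
    """True if a write starts and ends on physical-sector boundaries."""
    return offset % physical_sector == 0 and length % physical_sector == 0

print(sector_size(9), sector_size(12))

# A 512-byte write at byte offset 512 is a whole sector for ashift=9,
# but covers only part of one 4096-byte physical sector.
print(is_aligned(512, sector_size(9)))

# The same offset scaled to ashift=12 granularity is aligned.
print(is_aligned(4096, sector_size(12)))
```

A drive that receives a write covering only part of a physical sector generally has to read the sector, modify it, and write it back, which is the usual explanation for poor performance in this configuration.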

I'll need to find some place to borrow 8TB+ of hard drives, though... that
might be a bit difficult.

Regards,
Chee Eng
