Discussion: Failed replace -- now cannot import
Matt Jacobsen
2008-12-15 18:19:22 UTC
Permalink
Can anyone help me here? I was replacing a drive (as I've done many
times), and after I executed zpool replace Media <old> <new>, zpool
hung on me: no drive activity, nothing. I Ctrl-C'd out and couldn't
get access. After rebooting, I can't get ZFS to import the pool. I'm
running on Ubuntu Hardy.

I've tried to import it several different ways, but zpool seems to
think the raidz array is busy replacing and won't import it. Nothing
is actually happening on the drives, though. I've rebooted a couple of
times to no avail.

Any help would be appreciated. I really would rather not lose the
raidz array.
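
For reference, the replace was issued roughly like this (same
placeholders as above); normally it kicks off a resilver you can watch
with zpool status:

sudo zpool replace Media <old> <new>
sudo zpool status Media     # would normally show 'replacing' with resilver progress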

I've tried the following:

***@Closet:~$ sudo zpool import -d /dev/disk/by-id
pool: Media
id: 6147641630498163037
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-72
config:

    Media                                            FAULTED   corrupted data
      raidz1                                         ONLINE
        replacing                                    ONLINE
          disk/by-id/ieee1394-0090a909cc248a03:1:0   ONLINE
          disk/by-id/ieee1394-0090a991e0100395:1:0   ONLINE
        disk/by-id/ieee1394-0090a948c2a43ebe:1:0     ONLINE
        disk/by-id/ieee1394-0090a98bdf25207c:1:0     ONLINE
        disk/by-id/ieee1394-0090a94cc8ee15b4:0:0     ONLINE
        disk/by-id/ieee1394-0090a9cefba6023a:1:0     ONLINE
***@Closet:~$ sudo zpool status
no pools available
***@Closet:~$ sudo zpool import -f -d /dev/disk/by-id
pool: Media
id: 6147641630498163037
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-72
config:

    Media                                            FAULTED   corrupted data
      raidz1                                         ONLINE
        replacing                                    ONLINE
          disk/by-id/ieee1394-0090a909cc248a03:1:0   ONLINE
          disk/by-id/ieee1394-0090a991e0100395:1:0   ONLINE
        disk/by-id/ieee1394-0090a948c2a43ebe:1:0     ONLINE
        disk/by-id/ieee1394-0090a98bdf25207c:1:0     ONLINE
        disk/by-id/ieee1394-0090a94cc8ee15b4:0:0     ONLINE
        disk/by-id/ieee1394-0090a9cefba6023a:1:0     ONLINE
***@Closet:~$ sudo zpool import -f Media
cannot import 'Media': I/O error
***@Closet:~$ sudo zpool import
pool: Media
id: 6147641630498163037
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-72
config:

    Media          FAULTED   corrupted data
      raidz1       ONLINE
        replacing  ONLINE
          sdd      ONLINE
          sdb      ONLINE
        sdf        ONLINE
        sdg        ONLINE
        sde        ONLINE
        sdc        ONLINE


***@Closet:~$ sudo zdb -l /dev/sdb
--------------------------------------------
LABEL 0
--------------------------------------------
version=10
name='Media'
state=0
txg=3890606
pool_guid=6147641630498163037
hostid=8323329
hostname='Closet'
top_guid=8580373564493692998
guid=14077944779403832088
vdev_tree
type='raidz'
id=0
guid=8580373564493692998
nparity=1
metaslab_array=14
metaslab_shift=32
ashift=9
asize=2500515594240
is_log=0
children[0]
type='replacing'
id=0
guid=5918939669881855399
whole_disk=0
children[0]
type='disk'
id=0
guid=2446188385678725189
path='/dev/sdc'
whole_disk=0
DTL=511
children[1]
type='disk'
id=1
guid=14077944779403832088
path='/dev/sdg'
whole_disk=0
DTL=598
children[1]
type='disk'
id=1
guid=13690338599323198328
path='/dev/sde'
whole_disk=0
DTL=510
children[2]
type='disk'
id=2
guid=14976928031112057161
path='/dev/sdf'
whole_disk=0
DTL=509
children[3]
type='disk'
id=3
guid=4297515819690964824
path='/dev/sdd'
whole_disk=0
DTL=553
children[4]
type='disk'
id=4
guid=14093232984823371266
path='/dev/sdb'
whole_disk=0
DTL=507
--------------------------------------------
LABEL 1
--------------------------------------------
version=10
name='Media'
state=0
txg=3890606
pool_guid=6147641630498163037
hostid=8323329
hostname='Closet'
top_guid=8580373564493692998
guid=14077944779403832088
vdev_tree
type='raidz'
id=0
guid=8580373564493692998
nparity=1
metaslab_array=14
metaslab_shift=32
ashift=9
asize=2500515594240
is_log=0
children[0]
type='replacing'
id=0
guid=5918939669881855399
whole_disk=0
children[0]
type='disk'
id=0
guid=2446188385678725189
path='/dev/sdc'
whole_disk=0
DTL=511
children[1]
type='disk'
id=1
guid=14077944779403832088
path='/dev/sdg'
whole_disk=0
DTL=598
children[1]
type='disk'
id=1
guid=13690338599323198328
path='/dev/sde'
whole_disk=0
DTL=510
children[2]
type='disk'
id=2
guid=14976928031112057161
path='/dev/sdf'
whole_disk=0
DTL=509
children[3]
type='disk'
id=3
guid=4297515819690964824
path='/dev/sdd'
whole_disk=0
DTL=553
children[4]
type='disk'
id=4
guid=14093232984823371266
path='/dev/sdb'
whole_disk=0
DTL=507
--------------------------------------------
LABEL 2
--------------------------------------------
version=10
name='Media'
state=0
txg=3890606
pool_guid=6147641630498163037
hostid=8323329
hostname='Closet'
top_guid=8580373564493692998
guid=14077944779403832088
vdev_tree
type='raidz'
id=0
guid=8580373564493692998
nparity=1
metaslab_array=14
metaslab_shift=32
ashift=9
asize=2500515594240
is_log=0
children[0]
type='replacing'
id=0
guid=5918939669881855399
whole_disk=0
children[0]
type='disk'
id=0
guid=2446188385678725189
path='/dev/sdc'
whole_disk=0
DTL=511
children[1]
type='disk'
id=1
guid=14077944779403832088
path='/dev/sdg'
whole_disk=0
DTL=598
children[1]
type='disk'
id=1
guid=13690338599323198328
path='/dev/sde'
whole_disk=0
DTL=510
children[2]
type='disk'
id=2
guid=14976928031112057161
path='/dev/sdf'
whole_disk=0
DTL=509
children[3]
type='disk'
id=3
guid=4297515819690964824
path='/dev/sdd'
whole_disk=0
DTL=553
children[4]
type='disk'
id=4
guid=14093232984823371266
path='/dev/sdb'
whole_disk=0
DTL=507
--------------------------------------------
LABEL 3
--------------------------------------------
version=10
name='Media'
state=0
txg=3890606
pool_guid=6147641630498163037
hostid=8323329
hostname='Closet'
top_guid=8580373564493692998
guid=14077944779403832088
vdev_tree
type='raidz'
id=0
guid=8580373564493692998
nparity=1
metaslab_array=14
metaslab_shift=32
ashift=9
asize=2500515594240
is_log=0
children[0]
type='replacing'
id=0
guid=5918939669881855399
whole_disk=0
children[0]
type='disk'
id=0
guid=2446188385678725189
path='/dev/sdc'
whole_disk=0
DTL=511
children[1]
type='disk'
id=1
guid=14077944779403832088
path='/dev/sdg'
whole_disk=0
DTL=598
children[1]
type='disk'
id=1
guid=13690338599323198328
path='/dev/sde'
whole_disk=0
DTL=510
children[2]
type='disk'
id=2
guid=14976928031112057161
path='/dev/sdf'
whole_disk=0
DTL=509
children[3]
type='disk'
id=3
guid=4297515819690964824
path='/dev/sdd'
whole_disk=0
DTL=553
children[4]
type='disk'
id=4
guid=14093232984823371266
path='/dev/sdb'
whole_disk=0
DTL=507
Matt Jacobsen
2008-12-15 20:48:12 UTC
Permalink
So, I've updated to the latest trunk via Mercurial, built, and
installed. That didn't seem to help.

I've played around with removing some of the disks to see if I can
coerce zpool into importing the raidz array (which has 5 disks). No
luck. After removing the new disk and returning to just the original
5, I'm back at this point:

***@Closet:~$ sudo zpool import
pool: Media
id: 6147641630498163037
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-72
config:

    Media          FAULTED   corrupted data
      raidz1       DEGRADED
        replacing  DEGRADED
          sdc      ONLINE
          sdg      UNAVAIL   cannot open
        sde        ONLINE
        sdf        ONLINE
        sdd        ONLINE
        sdb        ONLINE
***@Closet:~$ sudo zpool import -f Media
cannot import 'Media': I/O error
***@Closet:~$ sudo fdisk -l

Disk /dev/sda: 30.0 GB, 30005821440 bytes
255 heads, 63 sectors/track, 3648 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000d229d

Device Boot Start End Blocks Id System
/dev/sda1 * 1 3492 28049458+ 83 Linux
/dev/sda2 3493 3648 1253070 5 Extended
/dev/sda5 3493 3648 1253038+ 82 Linux swap /
Solaris

Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdb1 1 60802 488386552+ af Unknown

Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdc1 1 60802 488386552+ af Unknown

Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x8d399bc0

Device Boot Start End Blocks Id System
/dev/sdd1 1 60800 488375968+ c W95 FAT32
(LBA)

Disk /dev/sde: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sde1 1 60802 488386552+ af Unknown

Disk /dev/sdf: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdf1 1 60802 488386552+ af Unknown
***@Closet:~$

It shows that /dev/sdg is unavailable because it's not plugged in, but
plugging it in doesn't trigger the replace/resilver, nor does it allow
the import. Any ideas?

Thanks,
Matt
Tobbe Lundberg
2008-12-15 20:57:32 UTC
Permalink
You could try burning an OpenSolaris LiveCD and see if you can import
the pool from there. If you can't, you could always ask for help in
the #opensolaris IRC channel. They're very helpful :) (Just don't say
you're using zfs-fuse ;)
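
On the LiveCD it would be something like this (pool name as above; the
-R altroot keeps it from mounting over the live environment -- adjust
if your devices show up somewhere unusual):

pfexec zpool import
pfexec zpool import -f -R /mnt Media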

//Tobbe
Jonathan Schmidt
2008-12-15 20:54:52 UTC
Permalink
Hi Matt,

What happens if you unplug both sdc and sdg, leaving sde, sdf, sdd, sdb
available? That should be enough info to bring up the pool in degraded
mode.

If not, I think you may have woven yourself into a corner of the zpool
state machine that was not designed for recovery. You might need
someone familiar with the ZFS code to comment on your situation.

As a general reminder to everyone, these types of problems are not
specific to ZFS or ZFS-FUSE. Every file system can have problems, and
for important data you always need a backup. There is a big difference
between redundancy and backups, and you usually need a bit of both.

Jonathan

Matt Jacobsen wrote:
> So, I've updated to the latest in the trunk via mercurial, built, and
> installed. That didn't seem to help.
>
> I've played around with removing some of the disks to see if I can
> coerce zpool into importing the raidz array (of which there are 5
> disks). No luck. After removing the new disk and returning to just the
> original 5, I'm back at this point:
>
> ***@Closet:~$ sudo zpool import
> pool: Media
> id: 6147641630498163037
> state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
> The pool may be active on another system, but can be imported using
> the '-f' flag.
> see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
> Media FAULTED corrupted data
> raidz1 DEGRADED
> replacing DEGRADED
> sdc ONLINE
> sdg UNAVAIL cannot open
> sde ONLINE
> sdf ONLINE
> sdd ONLINE
> sdb ONLINE
> ***@Closet:~$ sudo zpool import -f Media
> cannot import 'Media': I/O error
> ***@Closet:~$ sudo fdisk -l
>
> Disk /dev/sda: 30.0 GB, 30005821440 bytes
> 255 heads, 63 sectors/track, 3648 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x000d229d
>
> Device Boot Start End Blocks Id System
> /dev/sda1 * 1 3492 28049458+ 83 Linux
> /dev/sda2 3493 3648 1253070 5 Extended
> /dev/sda5 3493 3648 1253038+ 82 Linux swap /
> Solaris
>
> Disk /dev/sdb: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x00000000
>
> Device Boot Start End Blocks Id System
> /dev/sdb1 1 60802 488386552+ af Unknown
>
> Disk /dev/sdc: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x00000000
>
> Device Boot Start End Blocks Id System
> /dev/sdc1 1 60802 488386552+ af Unknown
>
> Disk /dev/sdd: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x8d399bc0
>
> Device Boot Start End Blocks Id System
> /dev/sdd1 1 60800 488375968+ c W95 FAT32
> (LBA)
>
> Disk /dev/sde: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x00000000
>
> Device Boot Start End Blocks Id System
> /dev/sde1 1 60802 488386552+ af Unknown
>
> Disk /dev/sdf: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Disk identifier: 0x00000000
>
> Device Boot Start End Blocks Id System
> /dev/sdf1 1 60802 488386552+ af Unknown
> ***@Closet:~$
>
> It shows that /dev/sdg is unavailable because it's not plugged in. But
> plugging it in does not precipitate the replace/resliver nor allow
> import. Any ideas?
>
> Thanks,
> Matt
> >
>
>
Matt Jacobsen
2008-12-15 22:02:27 UTC
Permalink
Thanks for your help/suggestions.

> What happens if you unplug both sdc and sdg, leaving sde, sdf, sdd, sdb
> available?  That should be enough info to bring up the pool in degraded
> mode.

You would think. But this is what I get when I unplug the two drives
that were involved in the replacement:

***@Closet:~$ sudo zpool import
pool: Media
id: 6147641630498163037
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-72
config:

    Media          FAULTED   corrupted data
      raidz1       DEGRADED
        replacing  UNAVAIL   insufficient replicas
          sdc      UNAVAIL   cannot open
          sdg      UNAVAIL   cannot open
        sdb        ONLINE
        sde        ONLINE
        sdd        ONLINE
        sdf        ONLINE
***@Closet:~$ sudo zpool import -f Media
cannot import 'Media': I/O error

> If not, I think you may have weaved yourself into a corner of the zpool
> state machine that was not designed for recovery.  You might need
> someone familiar with the ZFS code to comment on your situation.

I hope there's a recovery strategy here. I mean, all I did was execute
a zpool replace. That's not that crazy of an operation (i.e. not a lot
of weaving). Before executing the replace, the 4 other drives were
acting normally; I was only getting read/write/checksum errors on the
drive that I executed the replace for.

---
> You could try burning an OpenSolaris LiveCD and see if you can import
> the pool from there.

That would be great, except that it appears none of the 3 widely
distributed ZFS implementations (zfs-fuse, OpenSolaris, Mac OS X) are
compatible. I tried importing a Mac OS X raidz pool in both
Ubuntu/zfs-fuse and OpenSolaris, and an OpenSolaris raidz pool on
Ubuntu/zfs-fuse and Mac OS X, with no success. Granted, I didn't try
an Ubuntu/zfs-fuse pool on OpenSolaris or Mac OS X. But since the pool
version supported by Mac OS X is lower than either zfs-fuse or
OpenSolaris, I doubt that'll work. It is, however, *possible* that a
zfs-fuse pool (version 10) is importable/mountable by OpenSolaris. But
getting OpenSolaris to run on my P4 seems to depend on how many
fingers I have crossed at the moment.

The net is that none of the ZFS implementations really have much to do
with each other, AFAICT.

>If you can't you could always ask for help in the #opensolaris irc channel.

I might, but I can't see how that would help, since a replace is
supposed to terminate automatically (according to the docs) when the
replacement drive is removed.

It seems more likely that I might be able to "cancel" the replace
somehow, since it never started but seems to be the reason why I can't
import the pool now.
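
From what I've read, if I could get the pool imported at all, the
documented way to cancel an in-progress replace is to detach the new
device, i.e. something like (placeholder as before):

sudo zpool detach Media <new>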

Thanks,
Matt
s***@public.gmane.org
2008-12-15 22:32:50 UTC
Permalink
>
> The net is that none of the ZFS impls really have much to do with each
> other AFAICT.
>
I'm not sure that is fair. I think it is very clear which versions are
merged where (see the respective change logs).
The real problem is that relatively many features require a different
on-disk format (ODF). I think that is rather impractical from the
end-user perspective. It might be better if new features were not
enabled by default on newly created pools.


My trick is to specify a sufficiently low version explicitly:

zpool create -o version=N poolname

That avoids all the trouble.
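
For example (pool and device names here are made up), to keep a pool
at version 6 so that older implementations can still import it:

zpool upgrade -v                          # list the versions this build supports
zpool create -o version=6 tank raidz sdb sdc sdd
zpool get version tank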

Seth

Matt Jacobsen wrote:
> Thanks for your help/suggestions.
>
>
>> What happens if you unplug both sdc and sdg, leaving sde, sdf, sdd, sdb
>> available? That should be enough info to bring up the pool in degraded
>> mode.
>>
>
> You would think. But this what I get when I unplug the two drives that
> were involved in the replacement:
>
> ***@Closet:~$ sudo zpool import
> pool: Media
> id: 6147641630498163037
> state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
> The pool may be active on another system, but can be imported using
> the '-f' flag.
> see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
> Media FAULTED corrupted data
> raidz1 DEGRADED
> replacing UNAVAIL insufficient replicas
> sdc UNAVAIL cannot open
> sdg UNAVAIL cannot open
> sdb ONLINE
> sde ONLINE
> sdd ONLINE
> sdf ONLINE
> ***@Closet:~$ sudo zpool import -f Media
> cannot import 'Media': I/O error
>
>
>> If not, I think you may have weaved yourself into a corner of the zpool
>> state machine that was not designed for recovery. You might need
>> someone familiar with the ZFS code to comment on your situation.
>>
>
> I hope there's a recovery strategy here. I mean all I did was execute
> a zpool replace. That's not that crazy of an operation (i.e. not a lot
> of weaving). Before executing the replace the 4 other drives were
> acting normally. I was only getting read/write checksum errors on the
> drive that I executed the replace for.
>
> ---
>
>> You could try burning an OpenSolaris LiveCD and see if you can import
>> the pool from there.
>>
>
> That would be great. Except that it appers none of the 3 widely
> distributed ZFS implementations (zfs-fuse, open solaris, mac os x) are
> compatible. I tried importing a Mac OS X raidz pool in both Ubuntu/zfs-
> fuse and Open Solaris and an Open Solaris raidz pool on Ubuntu/zfs-
> fuse and Mac OS X with no success. Granted, I didn't try an Ubuntu/zfs-
> fuse on Open Solaris or Mac OS X. But since the version number
> supported by Mac OS X is lower than either zfs-fuse or Open Solaris, I
> doubt that'll work. It is however *possible* that zfs-fuse (ver 10) is
> importable/mountable by Open Solaris. But getting Open Solaris to run
> on my P4 seems to be dependent on how many fingers I have crossed at
> the moment.
>
> The net is that none of the ZFS impls really have much to do with each
> other AFAICT.
>
>
>> If you can't you could always ask for help in the #opensolaris irc channel.
>>
>
> I might, but I can't see how that would help since a replace is
> supposed to terminate automatically (according to docs) when the
> replacement drive is removed.
>
> It seems more likely that I might be able to "cancel" the replace
> somehow since it never started, but seems to be the reason why I can
> import the pool now.
>
> Thanks,
> Matt
>
> >
>
>
>
Matt Jacobsen
2008-12-15 22:47:24 UTC
Permalink
I can appreciate your tip and how you'd like to argue for it, but this
isn't the best topic for that. I shouldn't have mentioned that much
detail from my experience so far with the 3 (major) ZFS
implementations.

I could, however, really use some help with this faulted zpool
problem. I've reconnected all 6 drives again (including the newly
partitioned drive that's ready to use as the replacement), but I still
have this:

***@Closet:~$ sudo zpool import
pool: Media
id: 6147641630498163037
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-72
config:

    Media          FAULTED   corrupted data
      raidz1       ONLINE
        replacing  ONLINE
          sdb      ONLINE
          sdd      ONLINE
        sdf        ONLINE
        sdg        ONLINE
        sde        ONLINE
        sdc        ONLINE

And yet I can't import the pool, which means I can't get the pool to
replace the drive. This is very frustrating.

Thanks,
Matt

On Dec 15, 2:32 pm, "sghee...-***@public.gmane.org" <sghee...-***@public.gmane.org>
wrote:
> > The net is that none of the ZFS impls really have much to do with each
> > other AFAICT.
>
> I'm not sure that is fair. I think it is very clear which versions are
> merged where (see respective change logs).
> The real problem is that relatively many features require different ODF.
> I think that is rather unpractical from the end-user perspective. It
> might be better that new features not be enabled by default on newly
> created pools.
>
> My trick is to use sufficiently low versions explicitely:
>
> zpool create version=n poolname
>
> That will avoid all the troubles.
>
> Seth
>
Matt Jacobsen
2008-12-15 23:05:12 UTC
Permalink
Some more info.
I recompiled zdb with debug=2 and then tried to checksum the pool via:

***@Closet:~$ sudo zdb -cv -e 6147641630498163037
zdb: can't open 6147641630498163037: Input/output error

Here 6147641630498163037 is the pool_guid for the Media zpool. After
deleting the zpool.cache, I tried again with debug=on:

***@Closet:~$ sudo rm -rf /etc/zfs/zpool.cache
***@Closet:~$ sudo zdb -cv -e 6147641630498163037 debug=on
kernel_init: physmem = 128920 pages (0.49 GB)
cv_timedwait: thread -1245025392 is at cv_timedwait at 2038.03 with
delta 1.00 secs
cv_timedwait: thread -1253418096 is at cv_timedwait at 2038.03 with
delta 1.00 secs
hdr_recl: hdr_recl called
cv_timedwait: thread -1245025392 exited cv_timedwait at 2038.65 (rem =
0.38)
cv_timedwait: thread -1245025392 is at cv_timedwait at 2038.65 with
delta 1.00 secs
vdev_queue_io_to_issue: read T=3790 off= 21000 agg= 17 old=
400 new= 4400
vdev_queue_io_to_issue: read T=3790 off= 26000 agg= 49 old=
400 new= c400
vdev_queue_io_to_issue: read T=3790 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off= 32400 agg= 55 old=
400 new= dc00
vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off= 22000 agg=120 old=
400 new=1e000
vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off= 60000 agg=119 old=
400 new=1dc00
vdev_queue_io_to_issue: read T=3790 off= 7e400 agg= 7 old=
400 new= 1c00
vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg= 25 old=
400 new= 6400
vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470ba7400 agg= 99 old=
400 new=18c00
vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg= 80 old=
400 new=14000
vdev_queue_io_to_issue: read T=3790 off= 21000 agg= 25 old=
400 new= 6400
vdev_queue_io_to_issue: read T=3790 off= 27800 agg= 24 old=
400 new= 6000
vdev_queue_io_to_issue: read T=3790 off= 2dc00 agg= 73 old=
400 new=12400
vdev_queue_io_to_issue: read T=3790 off= 60000 agg= 54 old=
400 new= d800
vdev_queue_io_to_issue: read T=3790 off=7470bf4800 agg= 46 old=
400 new= b800
vdev_queue_io_to_issue: read T=3790 off= 6e400 agg= 39 old=
400 new= 9c00
vdev_queue_io_to_issue: read T=3790 off= 78800 agg= 30 old=
400 new= 7800
vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
400 new=20000
dmu_objset_open_impl: reading [L0 DMU objset] 400L/200P DVA[0]
=<0:23200ff1000:400> DVA[1]=<0:13100d13c00:400> DVA[2]
=<0:1cf002da000:400> fletcher4 lzjb LE contiguous birth=3890606
fill=599 cksum=c13fda5a6:4ea1843394e:104008426824e:246313826d92d1
dbuf_create: ds=mos obj=mdn lvl=1 blkid=0 db=0x83886e78
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
arc_evict: only evicted 1024 bytes from 81ac940arc_evict_ghost: only
deleted 1024 bytes from 0x81ac9a0arc_evict_ghost: only deleted 0 bytes
from 0x81aca60spa_load: spa_load(): error 5 in dsl_pool_open()
spa_load: spa_load(): error 5
vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
400 new=20000
cv_timedwait: thread -1253418096 exited cv_timedwait at 2039.03 (rem =
0.00)
cv_timedwait: thread -1253418096 is at cv_timedwait at 2039.03 with
delta 1.00 secs
vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
400 new=20000
dmu_objset_open_impl: reading [L0 DMU objset] 400L/200P DVA[0]
=<0:23200ff1000:400> DVA[1]=<0:13100d13c00:400> DVA[2]
=<0:1cf002da000:400> fletcher4 lzjb LE contiguous birth=3890606
fill=599 cksum=c13fda5a6:4ea1843394e:104008426824e:246313826d92d1
dbuf_create: ds=mos obj=mdn lvl=1 blkid=0 db=0x83886e78
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
vdev_mirror_io_done: retrying i/o (err=52) on child raidz
vdev_raidz_io_done: rereading
vdev_raidz_io_done: rereading
arc_evict: only evicted 1024 bytes from 81ac940arc_evict_ghost: only
deleted 1024 bytes from 0x81ac9a0arc_evict_ghost: only deleted 0 bytes
from 0x81aca60spa_load: spa_load(): error 5 in dsl_pool_open()
spa_load: spa_load(): error 5
zdb: can't open 6147641630498163037: Input/output error


I apologize for the length, but does anyone know what spa_load() does?
It seems to be the problem here. I'm wondering if this Input/output
error is the same one that is preventing the import.

Thanks,
Matt
Ricardo Correia
2008-12-15 23:34:56 UTC
Permalink
Hi Matt,

I apologize for not replying to you before (or to the other threads
being discussed), but it's been really hard for me to find the time to
go through my considerable e-mail backlog.

The spa_load()/dsl_pool_load() functions are responsible for loading the
pool.

It seems your pool somehow got corrupted on some critical piece of
metadata, which is preventing the ZFS code from successfully loading
it/importing it. It's hard to tell which piece of metadata got
corrupted, without doing some more extensive investigation.

To be honest, I don't know how this happened. AFAICT, it seems you
didn't do anything wrong.

One thing you might want to try is reverting back to the previous txg
number, which will require patching zfs-fuse. This might be able to save
your pool, but there are no guarantees.

I have written some details of how to do that in the thread titled
"Metadata Corrupted", back in November:

http://groups.google.com/group/zfs-fuse/browse_frm/month/2008-11
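
In outline, the idea from that thread is roughly this (<pool_guid> and
<earlier_txg> are placeholders; use your pool's GUID and a txg a bit
older than the one zdb reports):

zdb -u -e <pool_guid>                     # note the txg of the active uberblock
zdb -u -t <earlier_txg> -e <pool_guid>    # with the patch, probe an earlier txg
zdb -cv -t <earlier_txg> -e <pool_guid>   # if it opens, verify checksums at that txg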

Another thing I should probably start suggesting is that zfs-fuse users
should create pools on partitions, instead of using partition-less
disks.

This would allow people to use the OpenSolaris LiveCD for recovery
purposes, which could prove to be quite helpful (e.g. for using DTrace
to find out exactly what is failing, or to use mdb to revert back to an
older txg without patching the code, or to use a more updated ZFS code
version, which with a recent improvement is now slightly more resistant
against corrupted metadata...).
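
For example (device names purely hypothetical), creating a pool on
explicit partitions rather than bare disks would look something like:

zpool create Media raidz /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1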

Best regards,
Ricardo

On Seg, 2008-12-15 at 15:05 -0800, Matt Jacobsen wrote:
> Some more info.
> I recompiled zdb with debug=2 and then tried to checksum the pool via:
>
> ***@Closet:~$ sudo zdb -cv -e 6147641630498163037
> zdb: can't open 6147641630498163037: Input/output error
>
> Here 6147641630498163037 is the pool_guid for the Media zpool. After
> deleting the zpool.cache, I tried again with debug=on:
>
> ***@Closet:~$ sudo rm -rf /etc/zfs/zpool.cache
> ***@Closet:~$ sudo zdb -cv -e 6147641630498163037 debug=on
> kernel_init: physmem = 128920 pages (0.49 GB)
> cv_timedwait: thread -1245025392 is at cv_timedwait at 2038.03 with
> delta 1.00 secs
> cv_timedwait: thread -1253418096 is at cv_timedwait at 2038.03 with
> delta 1.00 secs
> hdr_recl: hdr_recl called
> cv_timedwait: thread -1245025392 exited cv_timedwait at 2038.65 (rem =
> 0.38)
> cv_timedwait: thread -1245025392 is at cv_timedwait at 2038.65 with
> delta 1.00 secs
> vdev_queue_io_to_issue: read T=3790 off= 21000 agg= 17 old=
> 400 new= 4400
> vdev_queue_io_to_issue: read T=3790 off= 26000 agg= 49 old=
> 400 new= c400
> vdev_queue_io_to_issue: read T=3790 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off= 32400 agg= 55 old=
> 400 new= dc00
> vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off= 22000 agg=120 old=
> 400 new=1e000
> vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off= 60000 agg=119 old=
> 400 new=1dc00
> vdev_queue_io_to_issue: read T=3790 off= 7e400 agg= 7 old=
> 400 new= 1c00
> vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3790 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg= 25 old=
> 400 new= 6400
> vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470ba7400 agg= 99 old=
> 400 new=18c00
> vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg= 80 old=
> 400 new=14000
> vdev_queue_io_to_issue: read T=3790 off= 21000 agg= 25 old=
> 400 new= 6400
> vdev_queue_io_to_issue: read T=3790 off= 27800 agg= 24 old=
> 400 new= 6000
> vdev_queue_io_to_issue: read T=3790 off= 2dc00 agg= 73 old=
> 400 new=12400
> vdev_queue_io_to_issue: read T=3790 off= 60000 agg= 54 old=
> 400 new= d800
> vdev_queue_io_to_issue: read T=3790 off=7470bf4800 agg= 46 old=
> 400 new= b800
> vdev_queue_io_to_issue: read T=3790 off= 6e400 agg= 39 old=
> 400 new= 9c00
> vdev_queue_io_to_issue: read T=3790 off= 78800 agg= 30 old=
> 400 new= 7800
> vdev_queue_io_to_issue: read T=3790 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3790 off=7470be0000 agg=128 old=
> 400 new=20000
> dmu_objset_open_impl: reading [L0 DMU objset] 400L/200P DVA[0]
> =<0:23200ff1000:400> DVA[1]=<0:13100d13c00:400> DVA[2]
> =<0:1cf002da000:400> fletcher4 lzjb LE contiguous birth=3890606
> fill=599 cksum=c13fda5a6:4ea1843394e:104008426824e:246313826d92d1
> dbuf_create: ds=mos obj=mdn lvl=1 blkid=0 db=0x83886e78
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> arc_evict: only evicted 1024 bytes from 81ac940arc_evict_ghost: only
> deleted 1024 bytes from 0x81ac9a0arc_evict_ghost: only deleted 0 bytes
> from 0x81aca60spa_load: spa_load(): error 5 in dsl_pool_open()
> spa_load: spa_load(): error 5
> vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
> 400 new=20000
> cv_timedwait: thread -1253418096 exited cv_timedwait at 2039.03 (rem =
> 0.00)
> cv_timedwait: thread -1253418096 is at cv_timedwait at 2039.03 with
> delta 1.00 secs
> vdev_queue_io_to_issue: read T=3791 off= 21000 agg=124 old=
> 400 new=1f000
> vdev_queue_io_to_issue: read T=3791 off= 60000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470ba0000 agg=128 old=
> 400 new=20000
> vdev_queue_io_to_issue: read T=3791 off=7470be0000 agg=128 old=
> 400 new=20000
> dmu_objset_open_impl: reading [L0 DMU objset] 400L/200P DVA[0]
> =<0:23200ff1000:400> DVA[1]=<0:13100d13c00:400> DVA[2]
> =<0:1cf002da000:400> fletcher4 lzjb LE contiguous birth=3890606
> fill=599 cksum=c13fda5a6:4ea1843394e:104008426824e:246313826d92d1
> dbuf_create: ds=mos obj=mdn lvl=1 blkid=0 db=0x83886e78
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> vdev_mirror_io_done: retrying i/o (err=52) on child raidz
> vdev_raidz_io_done: rereading
> vdev_raidz_io_done: rereading
> arc_evict: only evicted 1024 bytes from 81ac940arc_evict_ghost: only
> deleted 1024 bytes from 0x81ac9a0arc_evict_ghost: only deleted 0 bytes
> from 0x81aca60spa_load: spa_load(): error 5 in dsl_pool_open()
> spa_load: spa_load(): error 5
> zdb: can't open 6147641630498163037: Input/output error
>
>
> I apologize for the length, but does anyone know what spa_load() does?
> It seems to be the problem here. I'm wondering if this Input/output
> error is the same that is preventing the import.
>
> Thanks,
> Matt
> >
Matt Jacobsen
2008-12-16 00:34:57 UTC
Permalink
Ricardo,

Thanks for addressing this issue.

> The spa_load()/dsl_pool_load() functions are responsible for loading the
> pool.
>
> It seems your pool somehow got corrupted on some critical piece of
> metadata, which is preventing the ZFS code from successfully loading
> it/importing it. It's hard to tell which piece of metadata got
> corrupted, without doing some more extensive investigation.
>
> To be honest, I don't know how this happened. AFAICT, it seems you
> didn't do anything wrong.
>
> One thing you might want to try is reverting back to the previous txg
> number, which will require patching zfs-fuse. This might be able to save
> your pool, but there are no guarantees.
>
> I have written some details of how to do that in the thread titled
> "Metadata Corrupted", back in November:
>
> http://groups.google.com/group/zfs-fuse/browse_frm/month/2008-11

I can't believe this is the result of a replace. I have 5 working
drives (one was getting flaky) and a mysterious problem with pool
metadata has rendered all the redundancy worthless. That's very
surprising/frustrating.

I patch the current trunk with your patchfile (from the previous
post). I still can open the pool. But here's the output from zdb with
debug=on:

***@Closet:~/trunk/src$ sudo zdb -cv -t 3640288 -e
6147641630498163037 debug=on
kernel_init: physmem = 128920 pages (0.49 GB)
cv_timedwait: thread -1244972144 is at cv_timedwait at 7280.24 with
delta 1.00 secs
cv_timedwait: thread -1253364848 is at cv_timedwait at 7280.24 with
delta 1.00 secs
vdev_queue_io_to_issue: read T=13537 off= 21000 agg= 16 old=
400 new= 4000
hdr_recl: hdr_recl called
cv_timedwait: thread -1244972144 exited cv_timedwait at 7280.85 (rem =
0.39)
cv_timedwait: thread -1244972144 is at cv_timedwait at 7280.85 with
delta 1.00 secs
vdev_queue_io_to_issue: read T=13537 off= 25400 agg=107 old=
400 new=1ac00
vdev_queue_io_to_issue: read T=13537 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13537 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13537 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13537 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13537 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13537 off= 60000 agg= 92 old=
400 new=17000
vdev_queue_io_to_issue: read T=13538 off= 78000 agg= 32 old=
400 new= 8000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13538 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg=128 old=
400 new=20000
spa_load: spa_load(): vdev guid sum doesn't match the uberblock
spa_load: spa_load(): error 6
vdev_queue_io_to_issue: read T=13538 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13538 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13538 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13538 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13538 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13538 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off= 21000 agg=124 old=
400 new=1f000
vdev_queue_io_to_issue: read T=13538 off= 60000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470ba0000 agg=128 old=
400 new=20000
vdev_queue_io_to_issue: read T=13538 off=7470be0000 agg= 62 old=
400 new= f800
vdev_queue_io_to_issue: read T=13538 off=7470bef800 agg= 66 old=
400 new=10800
dmu_objset_open_impl: reading [L0 DMU objset] 400L/200P DVA[0]
=<0:12001c03800:400> DVA[1]=<0:22e2292d800:400> DVA[2]
=<0:1c3003b0000:400> fletcher4 lzjb LE contiguous birth=3598437
fill=598 cksum=9cc3cb215:3ed823ccac9:ccaf397f3c19:1c367f2461f490
dbuf_create: ds=mos obj=mdn lvl=1 blkid=0 db=0x86aefe78
dbuf_create: ds=mos obj=mdn lvl=0 blkid=0 db=0x86aefd70
dbuf_create: ds=mos obj=1 lvl=0 blkid=0 db=0x86aefc68
dbuf_create: ds=mos obj=4 lvl=0 blkid=0 db=0x86aefa58
dbuf_create: ds=mos obj=11 lvl=0 blkid=0 db=0x86aef740
dsl_dir_close: dd=6147641630498163037/$MOS
dsl_dir_close: dd=6147641630498163037
dsl_dir_close: dd=6147641630498163037
arc_evict: only evicted 1024 bytes from 81ac940arc_evict_ghost: only
deleted 51712 bytes from 0x81ac9a0arc_evict_ghost: only deleted 0
bytes from 0x81aca60cv_timedwait: thread -1253364848 exited
cv_timedwait at 7281.24 (rem = -0.01)
cv_timedwait: thread -1253364848 is at cv_timedwait at 7281.24 with
delta 1.00 secs
spa_load: spa_load(): rvd->vdev_state <= VDEV_STATE_CANT_OPEN
spa_load: spa_load(): error 6
zdb: can't open 6147641630498163037: No such device or address

It looks like spa_load() is now giving some error info (along with a
different error number, 6). I'm eager to investigate this further if
you can point me in the right direction.

> Another thing I should probably start suggesting is that zfs-fuse users
> should create pools on partitions, instead of using partition-less
> disks.
>
> This would allow people to use the OpenSolaris LiveCD for recovery
> purposes, which could prove to be quite helpful (e.g. for using DTrace
> to find out exactly what is failing, or to use mdb to revert back to an
> older txg without patching the code, or to use a more updated ZFS code
> version, which with a recent improvement is now slightly more resistant
> against corrupted metadata...).

Actually, my pool is composed of partitions (1 per disk). Are you
suggesting that the OpenSolaris LiveCD might be able to
recognize/recover it?

Thanks,
Matt
Matt Jacobsen
2008-12-16 00:39:59 UTC
Permalink
The previous post should have read:

I patchED the current trunk with your patchfile (from the previous
post). I still can'T open the pool. But here's the output from zdb
with
debug=on:

Sorry about that. Typing too quickly.
Matt Jacobsen
2008-12-16 01:37:43 UTC
Permalink
So from reading

https://www.opensolaris.org/jive/thread.jspa?threadID=76960&tstart=30

and

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html

it seems like all the data should be intact, but the most recent
uberblock (or all 3 copies of it) is corrupt. I've noticed that the
patch adds some support for uberblock txg IDs (or something)? Is there
any way I can invalidate the most recent uberblock and revert back to
a previous one?

Thanks,
Matt

On Dec 15, 4:39 pm, Matt Jacobsen <matt.jacob...-***@public.gmane.org> wrote:
> The previous post should have read:
>
> I patchED the current trunk with your patchfile (from the previous
> post). I still can'T open the pool. But here's the output from zdb
> with
> debug=on:
>
> Sorry about that. Typing too quickly.
s***@public.gmane.org
2008-12-15 23:13:48 UTC
Permalink
I'm afraid I don't have much to say on that core question... sorry.
Frankly, all that raidz stuff has been over my head most of the time.
I love it, but I can't get my head around it.

All I can say is that the webpage (http://www.sun.com/msg/ZFS-8000-72)
doesn't bode well for your scenario:
>
> Even though all the devices are available, the on-disk data has been
> corrupted such that the pool cannot be opened. All data within the
> pool is lost, and the pool must be destroyed and restored from an
> appropriate backup source. ZFS includes built-in metadata replication
> to prevent this from happening even for unreplicated pools, but
> running in a replicated configuration will decrease the chances of
> this happening in the future.
>
> If this error is encountered during 'zpool import', the pool is
> unrecoverable and cannot be imported. The pool must be restored from
> an appropriate backup source.
>
I don't remember whether that had been quoted before. I do recall more
people being rather pessimistic about the issue you're faced with.

I do hope you find ways to recover, best of luck,
Seth


Matt Jacobsen wrote:
> I can appreciate your tip and how you'd like to argue for it. But this
> is not the best topic for that. I shouldn't have mentioned that much
> detail from my experience thus far with the 3 (major) ZFS impls.
>
> I could however really use some help with this faulted zpool problem
> though. I've reconnected all 6 drives again (including the newly
> partitioned, ready to use as a replacement, drive). But still have
> this:
>
> ***@Closet:~$ sudo zpool import
> pool: Media
> id: 6147641630498163037
> state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
> The pool may be active on another system, but can be imported using
> the '-f' flag.
> see: http://www.sun.com/msg/ZFS-8000-72
> config:
>
> Media FAULTED corrupted data
> raidz1 ONLINE
> replacing ONLINE
> sdb ONLINE
> sdd ONLINE
> sdf ONLINE
> sdg ONLINE
> sde ONLINE
> sdc ONLINE
>
> And yet I can't import the pool which means I can't get the pool to
> replace the drive. This is very frustrating.
>
> Thanks,
> Matt
>
> On Dec 15, 2:32 pm, "sghee...-***@public.gmane.org" <sghee...-***@public.gmane.org>
> wrote:
>
>>> The net is that none of the ZFS impls really have much to do with each
>>> other AFAICT.
>>>
>> I'm not sure that is fair. I think it is very clear which versions are
>> merged where (see respective change logs).
>> The real problem is that relatively many features require different ODF.
>> I think that is rather unpractical from the end-user perspective. It
>> might be better that new features not be enabled by default on newly
>> created pools.
>>
>> My trick is to use sufficiently low versions explicitely:
>>
>> zpool create version=n poolname
>>
>> That will avoid all the troubles.
>>
>> Seth
>>
>>
>
> >
>
>
>
Steve Freitas
2008-12-16 08:22:51 UTC
Permalink
On Mon, 2008-12-15 at 14:02 -0800, Matt Jacobsen wrote:
> But getting Open Solaris to run
> on my P4 seems to be dependent on how many fingers I have crossed at
> the moment.

A very silly and time-consuming idea, but if it comes to it, you could
run Nexenta in a Xen domU (or VMware) container on Linux and give it
direct access to the drives in question. It might be easier to do that
than to futz with getting OpenSolaris installed properly -- I wasted
many days of my life on that.

Steve
Tobbe Lundberg
2008-12-16 10:56:30 UTC
Permalink
> A very silly and time-consuming idea, but if it comes to it, you could
> run Nexenta in a Xen domU (or Vmware) container on Linux and give it
> direct access to the drives in question, it might be easier to do that
> than futz with getting OpenSolaris installed properly -- I wasted many
> days of my life on that.

Who said anything about installing? Just run it off the LiveCD. And
the P4 shouldn't be a problem as long as you have 1 GB of RAM or more.
s***@public.gmane.org
2008-12-16 11:20:51 UTC
Permalink
Steve Freitas wrote:
> On Mon, 2008-12-15 at 14:02 -0800, Matt Jacobsen wrote:
>
>> But getting Open Solaris to run
>> on my P4 seems to be dependent on how many fingers I have crossed at
>> the moment.
>>
>
> A very silly and time-consuming idea, but if it comes to it, you could
> run Nexenta in a Xen domU (or Vmware) container on Linux and give it
> direct access to the drives in question, it might be easier to do that
>
BE VERY CAREFUL: VMware is known to have issues with ZFS corrupting
its metadata (sic!). You *absolutely* need to enable RDM (Raw Device
Mapping) or you will get into a lot more trouble.

See http://opensolaris.org/jive/thread.jspa?messageID=248730 and lots of
other places.

AFAIK RDM is a feature of ESX only, so it might not be available on
your lowly Linux boxen.

Cheers,
Seth

> than futz with getting OpenSolaris installed properly -- I wasted many
> days of my life on that.
>
> Steve
>
>
> >
>
>
>


Matt Jacobsen
2008-12-16 16:54:27 UTC
Permalink
Thanks for the tips, but I don't intend to throw a VM in the mix at
this point.

What I have found is that I'm now able to read the current uberblock:

***@Closet:~/trunk/src$ sudo zdb -u -e 6147641630498163037
Uberblock

magic = 0000000000bab10c
version = 10
txg = 3890606
guid_sum = 10442127170067653826
timestamp = 1229360966 UTC = Mon Dec 15 09:09:26 2008

and I'm able to "open" the pool to traverse blocks and checksum them:

***@Closet:~/trunk/src$ sudo zdb -cv -e 6147641630498163037

Traversing all blocks to verify checksums and verify nothing
leaked ...

but this will no doubt take some time.
Interestingly, I cannot find a previous uberblock using the max txg id
of current-1:

***@Closet:~/trunk/src$ sudo zdb -u -t 3890605 -e
6147641630498163037
zdb: can't open 6147641630498163037: No such device or address

I suppose this is good news. But zpool still reports unavailable:

***@Closet:~/trunk/src/lib/libzpool$ sudo zpool import
pool: Media
id: 6147641630498163037
state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
see: http://www.sun.com/msg/ZFS-8000-5E
config:

    Media          UNAVAIL   insufficient replicas
      raidz1       FAULTED   corrupted data
        replacing  DEGRADED
          sdc      FAULTED   corrupted data
          sdg      ONLINE
        sde        ONLINE
        sdf        ONLINE
        sdd        FAULTED   corrupted data
        sdc        FAULTED   corrupted data

You'll notice that it says a lot of the devices are corrupted. That
doesn't alarm me much, because prior to this listing zpool import was
displaying 4 online drives, 1 faulted, and 1 unavailable. I plugged in
the unavailable one and zpool started reporting this. Given my
experiences with adding and removing drives in the last few days, this
is normal behavior: until a reboot occurs, zpool import won't get it
right.

Thanks,
Matt
Matt Jacobsen
2008-12-16 16:59:39 UTC
Permalink
Since I'm seeing the 4 (good) drives from my raidz array spinning (the
5th not-so-good one is not plugged in) and zdb traversing the blocks,
it seems like zdb is able to read the pool. Can someone tell me how to
get zdb to copy the data off the pool? This seems like the only way at
this point to read the data out of the pool (perhaps using zdb -R ?).
I'm clinging to this as a sign that there's hope yet.

Thanks,
Matt
Ricardo M. Correia
2008-12-16 17:06:06 UTC
Permalink
Matt,

Wow, there's definitely something funky going on there!

The zdb behavior is encouraging, but I'm still very worried about the
zpool import output.

Are you sure that you haven't mixed up binaries from an older zfs-fuse
version and a current one?

If you are not sure, can you please remove all zfs-fuse, zpool, zfs and
zdb binaries that are in /usr/bin or /usr/local/bin (or anywhere else in
your path, for that matter), and then reinstall from source?

After that, instead of rebooting, can you try killing zfs-fuse,
removing /etc/zfs/zpool.cache, starting zfs-fuse again (with the
--no-daemon option, to catch any error output), and then running 'zpool
import' again?

This should make zfs-fuse start with a clean plate and remove all
possible confusion.
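
Roughly (binary locations and the exact way you stop zfs-fuse may
differ on your Ubuntu setup):

sudo pkill zfs-fuse               # stop the running daemon
sudo rm /etc/zfs/zpool.cache
sudo zfs-fuse --no-daemon         # run in the foreground in one terminal
sudo zpool import                 # then run this from a second terminal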

Also, do you see any errors in syslog related to devices?

Curiously, I don't see how zdb would work but zpool import
wouldn't...

Thanks,
Ricardo


Matt Jacobsen
2008-12-16 17:40:03 UTC
Permalink
I really appreciate all your help on this.

> The zdb behavior is encouraging, but I'm still very worried about the
> zpool import output.

Agreed. Hey, is there any way to get the data off the drives using zdb
though? It's like I have this golden moment right now and I don't want
to screw it up.

> Are you sure that you haven't mixed up binaries from an older zfs-fuse
> version and a current one?

Reasonably sure, I recently just installed the patched trunk head via:
sudo scons debug=on install

But I'll remove all the binaries and do that again (as soon as the zdb
-cv finishes).

> After that, instead of rebooting, can you try killing zfs-fuse,
> removing /etc/zfs/zpool.cache, starting zfs-fuse again (with the
> --no-daemon option, to catch any error output), and then running 'zpool
> import' again?

I didn't know about the -no-daemon option printing out errors. I'll
try that.

> Also, do you see any errors in syslog related to devices?

I did a check upon boot up before. I didn't see anything out of the
ordinary. I do see a lot of messages when trying to access the drives
after they've gone to sleep. But I expect that's probably normal
because these are external usb/firewire drives that take several
seconds to spin back up.

> Curiously, I'm not seeing how zdb would work, and zpool import
> wouldn't...

Is it the case that zdb does an import before it can traverse the
blocks? Because zpool reports no pools imported.

Thanks,
Matt
Ricardo M. Correia
2008-12-16 17:56:41 UTC
Permalink
On Ter, 2008-12-16 at 09:40 -0800, Matt Jacobsen wrote:
> > The zdb behavior is encouraging, but I'm still very worried about the
> > zpool import output.
>
> Agreed. Hey, is there any way to get the data off the drives using zdb
> though? It's like I have this golden moment right now and I don't want
> to screw it up.

Not that I'm aware of (at least not easily). But if zdb works, we can
get the data out of the pool with zfs-fuse.

> > Are you sure that you haven't mixed up binaries from an older zfs-fuse
> > version and a current one?
>
> Reasonably sure, I recently just installed the patched trunk head via:
> sudo scons debug=on install

Ok, I assume you haven't previously installed zfs-fuse by other means
(such as a binary package).

> But I'll remove all the binaries and do that again (as soon as the zdb
> -cv finishes).

Ok, that would be great.

> > After that, instead of rebooting, can you try killing zfs-fuse,
> > removing /etc/zfs/zpool.cache, starting zfs-fuse again (with the
> > --no-daemon option, to catch any error output), and then running 'zpool
> > import' again?
>
> I didn't know about the -no-daemon option printing out errors. I'll
> try that.

(BTW, it's --no-daemon, not -no-daemon) :-)

> > Also, do you see any errors in syslog related to devices?
>
> I did a check upon boot up before. I didn't see anything out of the
> ordinary. I do see a lot of messages when trying to access the drives
> after they've gone to sleep.
> But I expect that's probably normal
> because these are external usb/firewire drives that take several
> seconds to spin back up.

This could be causing problems. If reads time out while the drives are
spinning back up, the resulting errors might be interfering with the
import of the pool.

Please try and see if you can disable hard drive sleeping (maybe
hdparm/sdparm/blktool might help?).
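
For example (just a sketch -- the device names are taken from your
earlier listing, and many USB/FireWire bridges ignore or reject these
commands), something like this might disable the spin-down timeout:

for d in /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg; do
    sudo hdparm -S 0 "$d"    # -S 0 turns off the standby (spin-down) timeout
done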

Another point to mention is that (as far as I'm aware) zfs-fuse does not
flush the write cache in USB or firewire drives, which lends credibility
to my theory of write reordering problems when you killed zfs-fuse due
to the resilvering hang.

I think this is a limitation of most USB/firewire interfaces, but I'm
not an expert on this subject.

By the way, it would be good if you could verify if zfs-fuse emitted a
warning in syslog when you started it, saying that it couldn't flush the
write cache of some of the disks.
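
One quick way to check (the exact wording of the warning may differ
between versions, so this is only a guess at a useful filter) would be
something like:

grep -i zfs /var/log/syslog | grep -iE 'flush|cache'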

> > Curiously, I'm not seeing how zdb would work, and zpool import
> > wouldn't...
>
> Is it the case that zdb does an import before it can traverse the
> blocks? Because zpool reports no pools imported.

It uses the same code, but it doesn't actually import
into /etc/zfs/zpool.cache, it imports into a temporary file (or just
into memory, I'm not sure).

There's also a minor difference in that it imports the pool in
read-only mode, but this should make no difference as to the success of
the import.

Cheers,
Ricardo
Ricardo M. Correia
2008-12-16 17:46:48 UTC
Permalink
On Ter, 2008-12-16 at 08:54 -0800, Matt Jacobsen wrote:
> You'll notice that it says a lot of the devices are corrupted. That
> doesn't alarm me much because prior to this listing, zpool import was
> displaying 4 online drives, 1 faulted, and 1 unavailable. I plugged in
> the unavailable and zpool started reporting this.

BTW, I should mention that this suggests that the drive that you plugged
in caused a different uberblock to be used, which apparently might not
be the best one.

So I would suggest that when you run 'zpool import', that you have the
exact same drives plugged in as when you ran the successful zdb command.

If the same drives are plugged in, there should be no difference in
behavior.

> Given my experiences
> with adding and removing drives in the last few days, this is normal
> behavior.

I wouldn't say this is normal :-)

It seems that something wrong happened on your particular setup, but it
might be hardware related or software related.

It seems to me that this could have happened if some of your drives are
not flushing the write cache correctly, because this would cause blocks
to be reordered and the uberblock to be written before all the other
blocks under the block tree.

It is also worth mentioning that if the write caches are indeed not
being flushed, then the more drives you have in the same pool, the more
likely this problem is to happen.

I think that these write reordering problems can happen not just when
you reboot your machine, but if you have multiple disks on the same
pool, then they can also cause problems just by killing zfs-fuse. But of
course, this can only happen if write cache flushing is not working for
some reason.

This is not a bug in zfs-fuse or the ZFS code, this is actually a
problem either in the storage stack or in the disks themselves.

Anyway, if indeed this is the problem, then it should always be
recoverable by simply going back to a previous uberblock. In the future,
ZFS/zfs-fuse will be able to recover from this problem in a much easier
way.

> Until a reboot occurs, zpool import won't get it correct.

Rebooting shouldn't cause any change in behavior in zpool import (except
for possible drive renumbering).

If something like this is happening, then there is something else going
on..

Regards,
Ricardo
s***@public.gmane.org
2008-12-16 20:03:10 UTC
Permalink
Ricardo M. Correia wrote:
> After that, instead of rebooting, can you try killing zfs-fuse,
> removing /etc/zfs/zpool.cache, starting zfs-fuse again (with the
> --no-daemon option, to catch any error output), and then running 'zpool
> import' again?
and later:
> Rebooting shouldn't cause any change in behavior in zpool import (except
> for possible drive renumbering).
>


Now I hate to question any of your expert advice... But I really doubt
this is always true on linux. In my experience this highly depends on
kernel configuration and loaded modules. I make it a habit to manually
'sudo partprobe' after any change in disk (mapper) device (nodes). Still
I found that sometimes a reboot was in order, when (part of a) disk or
partition (table) was being used/had been in use by any kernel module at
all (apparently locks are held and caches are unflushable (from
userland?)).

Personally I feel really comfortable using partprobe to tell the kernel
to rescan the disks, but then again I also (subconsciously) learned to
avoid certain configs (like mixing LVM/ZFS/dm-raid on a single device).
So, it might just be that it works for me since I stopped playing with
overly complex setups.
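
For reference, a typical rescan looks something like this (the single
disk named here is only an example):

sudo partprobe                       # ask the kernel to re-read all partition tables
sudo blockdev --rereadpt /dev/sdb    # or target one disk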

Just my 20pence

Seth
Ricardo M. Correia
2008-12-16 20:30:10 UTC
Permalink
On Ter, 2008-12-16 at 21:03 +0100, sgheeren-***@public.gmane.org wrote:
> Now I hate to question any of your expert advice... But I really doubt
> this is always true on linux. In my experience this highly depends on
> kernel configuration and loaded modules. I make it a habit to manually
> 'sudo partprobe' after any change in disk (mapper) device (nodes). Still
> I found that sometimes a reboot was in order, when (part of a) disk or
> partition (table) was being used/had been in use by any kernel module at
> all (apparently locks are held and caches are unflushable (from
> userland?)).

Yes, if you make a change to the partitioning scheme and the kernel is
not able to update the devices because one of the partitions was in use,
then you might have to reboot in order for the new partitions to be
set up correctly (e.g. I think this would be necessary if you changed the
partitioning scheme of the disk where your root filesystem is stored,
because you wouldn't be able to unmount it, even if you wanted to).

But I don't think this applies to Matt's case, because we are not
changing the partitioning scheme.

So again, I think in his case, the output of 'zpool import' prior to a
reboot should not be any different than afterwards, except that the
disks may be renumbered. Otherwise, some other problem may be happening,
which we may be missing.

- Ricardo

> Personally I feel really comfortable using partprobe to tell the kernel
> to rescan the disks, but then again I also (subconsciously) learned to
> avoid certain configs (like mixing LVM/ZFS/dm-raid on a single device).
> So, it might just be that it works for me since I stopped playing with
> overly complex setups.
>
> Just my 20pence
>
> Seth
>
> >
Matt Jacobsen
2008-12-18 07:22:58 UTC
Permalink
Well the traversal checked out ok (running on 4 out of the 5 original
disks):

***@Closet:~/trunk/src$ sudo zdb -cv -e 6147641630498163037

Traversing all blocks to verify checksums and verify nothing
leaked ...

No leaks (block sum matches space maps exactly)

bp count: 14933437
bp logical: 1940755398144 avg: 129960
bp physical: 1939333102592 avg: 129865 compression: 1.00
bp allocated: 2425270340608 avg: 162405 compression: 0.80
SPA allocated: 2425270340608 used: 97.02%

and I'm able to read the current uberblock and dataset:

***@Closet:~/trunk/src/cmd/zpool$ sudo zdb -u -e 6147641630498163037
Uberblock

magic = 0000000000bab10c
version = 10
txg = 3890606
guid_sum = 10442127170067653826
timestamp = 1229360966 UTC = Mon Dec 15 09:09:26 2008

***@Closet:~/trunk/src$ sudo zdb -d -e 6147641630498163037
Dataset mos [META], ID 0, cr_txg 4, 7.40M, 599 objects
Dataset 6147641630498163037 [ZPL], ID 5, cr_txg 4, 1.76T, 29655
objects

The problem still seems to be that zpool import is doing something
very different from zdb when it's trying to open the pool. I've spent
the last 2 days learning the code and have debugged it down to the
point where it seems that both zdb and zpool import are calling
spa_load() with the same spa. Here's a snippet of printf debug code:

spa_load 1, 6147641630498163037 spa_config_parse error: 0
spa_load 2, 6147641630498163037 vdev_open error: 0
spa_load 3, 6147641630498163037 vdev_validate error: 0
spa_load 5, 6147641630498163037 uberblock ok
spa_load 5.5, dsl_pool_open, 6147641630498163037, spa_first_txg:
3890607
dsl_pool_open 1: err 0
dbuf_hold_impl 0, level 0, blkid 0
dbuf_hold_impl 1, db == NULL 1
dbuf_findbp 0, calling with: level: 0, blkid: 0, fail_sparse: 0
dbuf_hold_impl 0, level 1, blkid 0
dbuf_hold_impl 1, db == NULL 1
dbuf_findbp 0, calling with: level: 1, blkid: 0, fail_sparse: 0
dbuf_hold_impl 2, err 0
dbuf_findbp 2, recursive call to dbuf_hold_impl returned err 0
dbuf_findbp 3, dbuf_read err 0

This is from zdb. It goes on to read the uberblock or do a
traversal -- basically, it opens the spa without problems. The difference
with zpool import is that it stops, with the last line from above instead
reading:

dbuf_findbp 3, dbuf_read err 6

This means the call to read a dnode via dbuf_read returned an error
value of 6. AFAICT the spa, objset, etc. are all the same on both
calls. I can't understand what exactly is different (or where).

My current approach is to see if I can update zpool import to open the
spa exactly how zdb does. Alternatively, updating zdb to actually
import the spa might be easier. It appears that the *magic* that zdb
has to open the spa successfully is the following:

char *vdev_dir = NULL;
nvlist_t *exported_conf = NULL;
int errr = find_exported_zpool("6147641630498163037", &exported_conf,
    vdev_dir);
if (errr == 0) {
        nvlist_t *nvl = NULL;
        if (nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0) != 0)
                errr = ENOMEM;
        else if (nvlist_add_string(nvl,
            zpool_prop_to_name(ZPOOL_PROP_ALTROOT), vdev_dir) != 0)
                errr = ENOMEM;
        printf("errr 1: %d\n", errr);
        if (errr == 0)
                errr = spa_import_faulted("6147641630498163037",
                    exported_conf, nvl);
        printf("errr 2: %d\n", errr);
        nvlist_free(nvl);
}
spa_open("6147641630498163037", &spa, FTAG);


Yes, I've hard coded the pool guid. Unfortunately, figuring out how
zpool import works seems to be less than obvious. That might take me
a few days to sort out. Any high level tips on how to achieve either
of my two approaches would be much appreciated (an explanation on what
zpool import does extra to get the pool imported/mounted -- or even
where it does the spa_load() would be really really nice at this
point).

Thanks,
Matt
Matt Jacobsen
2008-12-19 20:35:26 UTC
Permalink
I've figured out roughly how zdb is opening the pool:

pools = zpool_find_import_activeok(g_zfs, 0, NULL);
config = find_pool("6147641630498163037", pools);
spa_import_faulted("6147641630498163037", config, NULL);

The process to import via zpool import is much the same except that
zpool simply ioctl's commands through a pipe to zfs-fuse, which then
executes the operations. I've adjusted zpool import to retrieve the
pools and find the config exactly as zdb does. It then sends this
config to zfs-fuse which then executes:

spa_import_faulted("6147641630498163037", config, NULL);

I've verified that the config is exactly the same in zfs-fuse, zpool,
and zdb. Thus, the call to spa_import_faulted() has exactly the same
parameters. A long debug trace (below) shows that the API calls are
exactly the same until one line inside lib/libzpool/dbuf.c, in
function dbuf_read():

err = zio_wait(zio);

When called from zdb the zio struct reads:
dbuf_read 2.4, zio->io_txg: 0, zio->io_size: 0, zio->io_bp: 0, zio->io_stage: 23, zio->io_error: 0, err: 0

When called from zfs-fuse (with the *exact same arguments*), the zio struct reads:
dbuf_read 2.4, zio->io_txg: 0, zio->io_size: 0, zio->io_bp: 0, zio->io_stage: 23, zio->io_error: 6, err: 6

So it seems zfs-fuse sets up some extra state that zdb does not.

The question then becomes, does anyone have any idea what zio->io_error value 6 means? Exactly? This is really quite frustrating.

Thanks,
Matt

---------------------------------
Debug output (for zfs-fuse call):

spa_config_parse for child: 0
spa_config_parse for child: 0
spa_config_parse for child: 0
spa_config_parse for child: 1
spa_config_parse for child: 1
spa_config_parse for child: 2
spa_config_parse for child: 3
spa_config_parse for child: 4
spa_load 1, 6147641630498163037 spa_config_parse error: 0
spa_load 2, 6147641630498163037 vdev_open error: 0
spa_load 3, 6147641630498163037 vdev_validate error: 0
spa_load 5, 6147641630498163037 uberblock ok
spa_load spa: -1210193920
spa_load spa->spa_name: 6147641630498163037
spa_load spa->spa_config: 0
spa_load spa->spa_config_syncing: 0
spa_load spa->spa_config_txg: 0
spa_load spa->spa_sync_pass: 0
spa_load spa->spa_state: 0
spa_load spa->spa_inject_ref: 0
spa_load spa->spa_traverse_wanted: 0
spa_load spa->spa_sync_on: 0
spa_load spa->spa_load_state: 2
spa_load spa->spa_zio_issue_taskq: -1210193832
spa_load spa->spa_zio_intr_taskq: -1210193808
spa_load spa->spa_dsl_pool: 0
spa_load spa->spa_normal_class: -1210184560
spa_load spa->spa_log_class: -1210184576
spa_load spa->spa_first_txg: 3890607
spa_load spa->spa_final_txg: 18446744073709551615
spa_load spa->spa_freeze_txg: 18446744073709551615
spa_load spa->spa_meta_objset: 0
spa_load spa->spa_root_vdev: id 0, guid: 6147641630498163037,
guid_sum: 10442127170067653826, state: 6, prevstate: 0, children: 1,
spa: -1210193920
spa_load spa->spa_load_guid: 6147641630498163037
spa_load spa->spa_config_object: 0
spa_load spa->spa_syncing_txg: 0
spa_load spa->spa_sync_bplist_obj: 0
spa_load spa->spa_ubsync: magic: 12235020, ver: 10, txg: 3890606,
guid_sum: 10442127170067653826, timestamp: 1229360966, rootbp: 3890606
spa_load spa->spa_uberblock: magic: 12235020, ver: 10, txg:
3890606, guid_sum: 10442127170067653826, timestamp: 1229360966,
rootbp: 3890606
spa_load spa->spa_scrub_inflight: 0
spa_load spa->spa_scrub_maxinflight: 60
spa_load spa->spa_scrub_errors: 0
spa_load spa->spa_scrub_active: 0
spa_load spa->spa_scrub_type: 0
spa_load spa->spa_scrub_finished: 0
spa_load spa->spa_scrub_started: 0
spa_load spa->spa_scrub_reopen: 0
spa_load spa->spa_async_thread: 0
spa_load spa->spa_async_suspended: 0
spa_load spa->spa_async_tasks: 0
spa_load spa->spa_root: (null)
spa_load spa->spa_ena: 0
spa_load spa->spa_last_open_failed: 0
spa_load spa->spa_errlog_last: 0
spa_load spa->spa_errlog_scrub: 0
spa_load spa->spa_deflate: 0
spa_load spa->spa_history: 0
spa_load spa->spa_pending_vdev: 0
spa_load spa->spa_pool_props_object: 0
spa_load spa->spa_bootfs: 0
spa_load spa->spa_delegation: 0
spa_load spa->spa_failmode: 0
spa_load spa->spa_aio_ctx: -1210184592
spa_load spa->spa_import_faulted: 1
spa_load spa->spa_is_root: 0
spa_load spa->spa_minref: 0
spa_load spa->spa_log_state: 0
spa_load spa->spa_refcount: 0
spa_load 5.5, dsl_pool_open, 6147641630498163037, spa_first_txg:
3890607
dsl_pool_open 1: err 0
dsl_pool_open 1.5, objset checksum: 7
dbuf_hold_impl 0, level 0, blkid 0
dbuf_hold_impl 1, db == NULL 1
dbuf_findbp 0, calling with: level: 0, blkid: 0, fail_sparse: 0
dbuf_hold_impl 0, level 1, blkid 0
dbuf_hold_impl 1, db == NULL 1
dbuf_findbp 0, calling with: level: 1, blkid: 0, fail_sparse: 0
dbuf_hold_impl 2, err 0
dbuf_findbp 2, recursive call to dbuf_hold_impl returned err 0
debuf_findbp 2.5, parentp: 0, db size 16384
dbuf_read 2, db->db_state == DB_UNCACHED
dbuf_read 2.1, zio == NULL
dbuf_read 2.2, zio->io_txg: 0, zio->io_size: 0, zio->io_bp: 0
dbuf_read 2.3, zio->io_txg: 0, zio->io_size: 0, zio->io_bp: 0, zio-
>io_stage: 0, zio->io_error: 0
dbuf_read 2.4, zio->io_txg: 0, zio->io_size: 0, zio->io_bp: 0, zio-
>io_stage: 23, zio->io_error: 0, err: 0
dbuf_findbp 3, dbuf_read err 0
dbuf_hold_impl 2, err 0
dnode_hold_impl 1, db == NULL 0
dbuf_read 2, db->db_state == DB_UNCACHED
dbuf_read 2.1, zio == NULL
s***@public.gmane.org
2008-12-19 20:46:18 UTC
Permalink
I don't have direct answers.

Just triggering: have you checked that
users/permission/ulimit/environment are equivalent between the
zdb/zfs-fuse(from zpool) approaches?

Seth

PS. great job for persevering. You might be helping quite a few people
in the future *and* you're almost getting me to the point of thinking:
hey I understand this a bit, I could spend some time on zfs-fuse dev
someday...

Matt Jacobsen
2008-12-20 00:44:03 UTC
Permalink
Thanks for the encouragement.
I've dug deeper into the zio_wait() issue. It is actually a call to
execute a zio read (in this case) via a pipelined, state-machine-like
execution. The net is that right now I have 5 drives plugged in: the 4
original (good) ones and the currently empty replacement drive. When I
traced the call that's failing, I discovered that it fails in
lib/libzpool/vdev_raidz.c, in vdev_raidz_io_done(). This is one of the
vdev-specific functions that gets called during the pipeline (it
corresponds to io_done). This vdev_raidz_io_done() function is pretty
big, but here's the difference between the calls.

For the zdb based call that works:

zio_execute 3, -2076599920 zio_pipeline[16](zio), 0
vdev_raidz_io_done 1, checking rm 5 cols
vdev_raidz_io_done 3, zio->io_numerrors: 0, parity_errors: 0,
data_errors: 0, unexpected_errors: 0, parity_untried: 1
vdev_raidz_io_done 5, zio->io_numerrors: 0
vdev_raidz_io_done 6, trying correctable number of errors
vdev_raidz_io_done 7, data_errors: 0
vdev_raidz_io_done 7.1, zio_checksum_error(zio) == 0
vdev_raidz_io_done 17, zio->io_error: 0
vdev_raidz_io_done 17.1, zio->io_error: 0
vdev_raidz_io_done 18, zio->io_error: 0
vdev_raidz_io_done 18.1, zio->io_error: 0
zio_execute 4, -2076599920 rv: 256, zio->io_error: 0

And for the zfs-fuse based call that fails:

zio_execute 3, -1613796544 zio_pipeline[16](zio), 0
vdev_raidz_io_done 1, checking rm 5 cols
vdev_raidz_io_done 2, rc->rc_error 6
vdev_raidz_io_done 2.1, zio->io_error now: 6
vdev_raidz_io_done 3, zio->io_numerrors: 1, parity_errors: 0,
data_errors: 1, unexpected_errors: 0, parity_untried: 0
vdev_raidz_io_done 5, zio->io_numerrors: 1
vdev_raidz_io_done 6, trying correctable number of errors
vdev_raidz_io_done 8, data_errors: 1
vdev_raidz_io_done 8.1, column with error: 3
vdev_raidz_io_done 8.2, vdev_raidz_reconstruct_p
vdev_raidz_io_done 11, read every block
vdev_raidz_io_done 11.1, rm->rm_col[0].rc_tried: 1
vdev_raidz_io_done 11.1, rm->rm_col[1].rc_tried: 1
vdev_raidz_io_done 11.1, rm->rm_col[2].rc_tried: 1
vdev_raidz_io_done 11.1, rm->rm_col[3].rc_tried: 1
vdev_raidz_io_done 11.1, rm->rm_col[4].rc_tried: 1
vdev_raidz_io_done 12, zio->io_numerrors: 1 >= rm-
>rm_firstdatacol 1
vdev_raidz_io_done 17, zio->io_error: 6
vdev_raidz_io_done 17.1, zio->io_error: 6
vdev_raidz_io_done 18, zio->io_error: 6
vdev_raidz_io_done 18.1, zio->io_error: 6
zio_execute 4, -1613796544 rv: 256, zio->io_error: 6

Now mind you this is with the same exact code and configuration.
By examining the source I can see that a data error is found on one of
the drives (i.e. raidz columns) when zfs-fuse reads the data, but the
error is considered a parity_untried error when zdb reads it. Or more
specifically, the raidz column has no error set when zdb reads the
data.

Let me stress again, these are the same drives, same config found,
same library calls made. It seems like zdb isn't checking something
and therefore *not* screwing things up such that the pool can't be
imported.

Any helpful hints, especially from those who've written some of this,
would be greatly appreciated.

Thanks,
Matt
Matt Jacobsen
2008-12-21 18:35:55 UTC
Permalink
After spending a week going through the source and trying to figure
out what the problem is, I've met with success.

While I still can't answer the questions I posted regarding the
difference between how zdb imports the pool vs. how zfs-fuse imports
the pool, I was able to identify the problems in the zfs-fuse code
that were causing the import & mount to fail.

First, I plugged in all the drives (even the empty one that was to
replace the faulted one). Then I had to effectively disable any writes
to the drives, even those that come in the form of updating the config.
This turned out to be the big issue. By having spa_config_update_common()
(in lib/libzpool/spa_config.c) never create a transaction and try
to sync it, I avoided a lot of problems. Then I found that some of my
metaslab_t types (for my vdevs -- in my case vdev_disk's) were
created without initialized locks (for synchronization). This means
that I had metaslab_t's that sometimes would simply not have a pointer
to a valid mutex lock, and a call to mutex_enter would seg fault. I'm
inclined to think this is a bug. I simply avoided freeing metaslabs
that were "incomplete". Lastly, I forced the import process to use
spa_import_faulted() (in lib/libzpool/spa.c). This is different from
spa_import() because it allows faulted pools to be imported. This is
the same call that zdb uses to import. To force the use of
spa_import_faulted(), you need to run zpool import -F. The -F switch is
an undocumented feature.
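
In other words, the import ends up looking something like this (the -d
path and pool name are the same ones used earlier in this thread, and -F
is the undocumented switch just described, not a standard option):

sudo zpool import -F -d /dev/disk/by-id Media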

As a side note, I discovered a copy & paste error in vdev_raidz_io_done()
(in the portion of code that tries to reconstruct data from p & q
-- the identification of columns c & c1 is incorrect). Luckily my
raidz array can only reconstruct from 1 column.

At this point, I'm copying off all the data from the pool onto 2 new/
fresh drives (note, only the 4 "good" drives in the raidz array are
working towards this). Given what I've learned of zfs-fuse, I think it
has a lot of potential. But (and I think I've earned the right to say
this), I feel like it's too fragile to be used in raidz configuration.
The addition of more "recovery-style" code (not simply just asserts)
and better tools to handle the guesswork that happens when reading
configs off the vdevs (i.e. off each disk) is IMO absolutely necessary
before it's robust enough to consider for any sort of real use.

I know people will be quick to say, "I've been using it for x years
with no problem." Well, so had I, but apparently you're one bad write
away from losing access to your entire pool (see the links if you think
I'm exaggerating, https://www.opensolaris.org/jive/thread.jspa?threadID=76960&tstart=30
and http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html).

Others may be just as quick to say "You shouldn't have trusted beta
software to store your files..." This is a silly statement. Indeed, I
take responsibility for not having adequate backups. But I'm a grad
student (not flush with $$ or resources) and was looking for a way to
store data with redundancy without breaking the bank. My experience
with zfs on Mac OS X was very favorable. It turns out that zfs-fuse
was not the right choice (it may be after it's "baked" for a few more
years). For those using zfs-fuse, are you expecting to lose all your
data? I'd wager no. Otherwise, why even bother using zfs-fuse, right?
I'd also like to point out that while Google Groups isn't beta per se,
Gmail still is (officially). We've grown to trust that we won't lose
*all* our email with one errant disk write. It's often hard to tell
what level of stability the "beta" moniker implies.

The take-away (for my part) is that filesystems have a higher level
of expectation associated with them. Despite all the great features,
zfs-fuse shouldn't sacrifice recovery (or risk catastrophic loss) for
cool features.

Going forward, I'm sorry to say this experience has soured me on zfs-
fuse. Perhaps when more robust recovery has been seriously addressed,
I'll reconsider. Until then, I'm going to use a more traditional
filesystem. Despite their failings, I'll at least have better options
for recovery in the event of hardware failure (without having to spend
a week learning/tweaking the code).

Thanks,
Matt
Jonathan Schmidt
2008-12-21 19:45:09 UTC
Permalink
> Going forward, I'm sorry to say this experience has soured me on zfs-
> fuse. Perhaps when more robust recovery has been seriously addressed,
> I'll reconsider. Until then, I'm going to use a more traditional
> filesystem. Despite their failings, I'll at least have better options
> for recovery in the event of hardware failure (without having to spend
> a week learning/tweaking the code).

I agree with much of what you said, and essentially it really
re-emphasizes the need to keep a separate backup of your data. The
concepts of redundancy and backups are completely separate, and you need a
bit of both. Wherever your filesystem journey takes you, please keep that
in mind. You are always one software glitch away from losing everything.

It's sad to now lose someone who is obviously so well versed in the ZFS code
base. However, if you must go, PLEASE post patches for the bugs you found,
or at least a good enough description of where the problem occurs that
another developer can get in and fix it. This is exactly how this project
will mature -- learning from the errors it has caused.

Thanks for telling the tale of your heroic recovery, and congratulations :)
s***@public.gmane.org
2008-12-21 20:21:43 UTC
Permalink
Matt,

great job. I'm stumped by both the bugs you uncovered and the fact that
you overcame them.

The fact of the matter is, I had already cooled down a bit on zfs-fuse
once I decided it needed a fair bit of work on the performance side, and
I don't have the time to throw at it.

Now I'm going to assume that the bug in vdev_raidz_io_done also affects
ZFS on Solaris? That adds quite a bit of essential worry.

Regards,
Seth

Jonathan Schmidt
2008-12-21 20:41:39 UTC
Permalink
> Now I'm going to assume that the bug in vdev_raidz_io_done also affects
> ZFS on Solaris? That adds quite a bit of essential worry.

A bug that is found is light years better than a bug that you don't know
exists yet. All software is buggy in some respect -- finding them is the
key. I feel relief when bugs are found, not worry.
s***@public.gmane.org
2008-12-21 23:43:31 UTC
Permalink
In general, that is a simple truth and the simple ones are the most
important.

To me, it fully depends on the nature of the bug found, and the nature
of the find. Judging that is a pretty subjective call...[1]

Particularly when hunting for bug A, if there is a significant
'side-catch' I might be easily worried. It's years of professional
coding experience talking. I don't like the feeling if code just 'tells'
you more bugs on simple reading/review (that is in the category of
CodeSmell). I'd need to read a fair number of lines of code that doesn't
trigger any alarms to balance that, and I think that is quite fair
[again, being subjective of course].

Now, to highlight the statement of mine that you picked to respond to:
would you agree that the potential impact is objectively greater when
this[2] happens in the part of the code that is common to all ZFS
ports/flavours than if it were in zfs-fuse-specific parts?

I guess 'essential' was more central to my thought than 'worry' there.

Regards,
Seth

[1] I'm not at all worried about the 'problem' (if it is one) that prevented
this particular user from accessing a pool; however I'm slightly worried
by the fact that there *seem* to be spurious 'copy/paste'-type errors
right in the heart of the logic. Now we still have to review the actual
patch... but my character tells me to be slightly worried till proven
otherwise...
[2] i.e. stumbling on bugs

Jonathan Schmidt
2008-12-22 07:33:16 UTC
Permalink
Agreed.

TBH I haven't read much of the ZFS or zfs-fuse code, so I shouldn't
comment on the code quality. I like having this message board though --
makes me feel like problems are getting addressed and that people actually
care.

If I weren't typing this from my mobile I'd probably dig into the Sun ZFS
code and see if the same copy/paste error exists. Oh I guess I'd also need
an approximate location of where the bug exists in zfs-fuse as a starting
point.

I remember reading somewhere that raidz and mirror are better tested than
raidz2, which may partially excuse the problem if it only affects
raidz2...

s***@public.gmane.org
2008-12-22 08:37:06 UTC
Permalink
>
> I remember reading somewhere that raidz and mirror are better tested than
> raidz2, which may partially excuse the problem if it only affects
> raidz2...
Agreed. It is not all that well-advertised then, is it? It seems it is
true that it might apply to raidz2 only. The code pointer there is
vdev_raidz_io_done for you. I wouldn't think there is a large number of
functions by that same name.

Cheers,
Seth


Matt Jacobsen
2008-12-22 16:32:07 UTC
Permalink
The aforementioned copy and paste error is in lib/libzpool/vdev_raidz.c
on line 972 under the comment that reads "Find the two
columns that reported errors." This section contains two for loops to
find columns c and c1 (the two columns that reported errors). The
first loop works fine (as it's a copy from the single parity case
above). The second loop doesn't correctly change over the c variables
to c1. Thus, this would affect only those recovering from double
parity faults.

Thanks,
Matt

Ricardo M. Correia
2008-12-22 16:44:20 UTC
Permalink
Hi Matt,

On Seg, 2008-12-22 at 08:32 -0800, Matt Jacobsen wrote:
> The aforementioned copy and paste error is in lib/libzpool/
> vdev_raidz.c on line 972 under the comment that reads "Find the two
> columns that reported errors." This section contains two for loops to
> find columns c and c1 (the two columns that reported errors). The
> first loop works fine (as it's a copy from the single parity case
> above). The second loop doesn't correctly change over the c variables
> to c1. Thus, this would affect only those recovering from double
> parity faults.

AFAICS, I don't think this is a copy/paste error.

I agree that this is a bit confusing, but if you look more closely,
you'll see that the c variable is being used to loop over all the
columns, and the c1 variable is used to save the first column that had
errors.

So in the end (after the two "for" loops), c1 will contain the first
column with errors and c will contain the second column.

BTW, this code is part of the upstream ZFS source code:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

- Ricardo
Matt Jacobsen
2008-12-22 18:00:32 UTC
Permalink
I stand corrected. You're quite right. I naturally assumed c was only
the starting column for c1 and that the initialization in the for loop
was not serving the purpose of "saving off" the value of c.

My apologies for declaring a bug and causing a stir.

Thanks,
matt

s***@public.gmane.org
2008-12-22 18:37:19 UTC
Permalink
sgheeren-***@public.gmane.org wrote:
> [...] however I'm slightly worried
> at the fact that the fact that there *seem* to be spurious
> 'copy/paste'-type errors right in the heart of the logic. Now we still
> have to review the actual patch... but my character tells me to be
> slightly worried till proven otherwise...
Ricardo M. Correia wrote:
> AFAICS, I don't think this is a copy/paste error.
>
> [.......] and c will contain the second column.
So I go from 'slightly worried till proven otherwise' to 'neutral'!

Thanks for diving into that question first,

Seth
Ricardo M. Correia
2008-12-16 20:17:56 UTC
Permalink
On Ter, 2008-12-16 at 18:46 +0100, Ricardo M. Correia wrote:
> I think that these write reordering problems can happen not just when
> you reboot your machine, but if you have multiple disks on the same
> pool, then they can also cause problems just by killing zfs-fuse. But of
> course, this can only happen if write cache flushing is not working for
> some reason.

After thinking a bit more about this, I think I need to make a
correction - I believe the statement I made above is not correct.

Killing zfs-fuse should not cause any kind of corruption problems, even
if zfs-fuse can't flush the disks' write caches.

Matt, you also mentioned that you had partitions on your disks.

Just as an FYI, it appears that your pool was not created on the
partitions (sdx1) - it appears that it was created on the raw disks
(sdx).

So I don't think using the OpenSolaris LiveCD would work.

- Ricardo
Steve Freitas
2008-12-17 09:40:47 UTC
Permalink
On Tue, 2008-12-16 at 12:20 +0100, sgheeren-***@public.gmane.org wrote:
> BE VERY CAREFUL: VmWare is known to have issues with ZFS corrupting
> it's metadata (sic!). You *absolutely* need to enable RDM (Raw Device
> Mapping) or you will get in a lot more trouble.

Thanks for the tip, Seth! I'm planning to build my own home fileserver
on Debian with a ZFS raidz setup, but until ZFS-FUSE gets more broad
usage and speed upgrades, I'm intending to use Nexenta in a Xen domU to
run the disks. I'll look around to see if anyone's having a similar
problem.

Steve