[zfs-fuse] zfs-fuse: sending ioctl 2285 to a partition

Discussion:

Gordan Bobic

2015-05-28 16:56:41 UTC

I've been looking at the above issue. I have seen references to it having
been reported a long time ago to:

http://zfs-fuse.net/issues/143

which is long gone.

As far as I can tell, the offending code is in function flushSCSIwc in
lib/libzpool/flushwc.c

Please forgive me if this is a stupid question, but is there a reason
why instead of
ioctl(fd, SG_IO, &io_hdr)

it would not be appropriate to use something like

fsync(fd)

Would the latter not work appropriately on a raw block device?
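
A minimal sketch of the fsync() alternative asked about here. The helper name flush_with_fsync is hypothetical and not part of zfs-fuse. It only shows the call and its error handling. Whether fsync() on a block device node also reaches the drive's own write cache depends on the kernel and driver, so treat that part as an open assumption.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper, not taken from zfs-fuse.
 * Asks the kernel to write out everything buffered for fd.
 * Returns 0 on success or an errno value on failure. */
int flush_with_fsync(int fd)
{
    if (fsync(fd) == -1)
        return errno;
    return 0;
}
```

Unlike SG_IO, this involves no SCSI pass-through, so the same call can be made on a whole-disk node or a partition node.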


Gordan Bobic

2015-05-29 09:18:46 UTC

Looking at this flushwc.c a little further, it looks suspiciously like only
SCSI and ATA disk cases are considered. I have it running on an SD card and
it seems to work, but according to the code in that file, if it's not an
IDE or SCSI disk:

default:
//Unknown block device driver. Can't flush the write cache.
return ENOTSUP;

Whatever it does in this case clearly isn't fatal. But looking at how this
works, I think this entire file could be simplified to a single function:
***
int flushwc(vnode_t *vn) {
    // We can only flush the write cache of a block device.
    if (!S_ISBLK(vn->v_stat.st_mode))
        return ENOTSUP;

    // BLKFLSBUF is declared in <linux/fs.h>.
    // Note: ioctl() returns -1 and sets errno on failure.
    return ioctl(vn->v_fd, BLKFLSBUF, 0);
}
***

That should work universally for flushing caches on any block device (it's
a higher level ioctl call than for a raw disk), which means that SCSI and
ATA specific functions could be removed.

Would anyone care to voice an opinion, or at least willingness to test on
some scratch data under various harsh conditions (e.g. pulling power while
under heavy write load) vs. the behaviour with the original implementation?

Many thanks.

Gordan

Gordan Bobic

2015-05-29 16:23:40 UTC

So I've been reading up on the background of this issue. The core of the
problem is that the disk flush SCSI or ATA command is being sent through to
a partition. When raw disks are used with zfs-fuse, this doesn't happen
because the command then gets sent to the fd that is the raw block device
rather than a partition.
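
For context, the "2285" in the subject line is the hexadecimal value of SG_IO (0x2285 in <scsi/sg.h>). The sketch below shows roughly what a SCSI SYNCHRONIZE CACHE request through SG_IO looks like. It is an illustration only and is not copied from flushwc.c. The helper name scsi_sync_cache, the sense buffer size and the timeout are assumptions.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Hypothetical helper for illustration; not the zfs-fuse implementation.
 * Sends SCSI SYNCHRONIZE CACHE(10) (opcode 0x35) through the SG_IO ioctl.
 * Returns 0 on success or an errno value on failure. */
int scsi_sync_cache(int fd)
{
    unsigned char cdb[10] = { 0x35 };  /* remaining bytes are zero */
    unsigned char sense[32];
    sg_io_hdr_t io_hdr;

    memset(&io_hdr, 0, sizeof(io_hdr));
    io_hdr.interface_id = 'S';               /* required by the sg interface */
    io_hdr.dxfer_direction = SG_DXFER_NONE;  /* command has no data phase */
    io_hdr.cmd_len = sizeof(cdb);
    io_hdr.cmdp = cdb;
    io_hdr.mx_sb_len = sizeof(sense);
    io_hdr.sbp = sense;
    io_hdr.timeout = 60000;                  /* milliseconds; assumed value */

    if (ioctl(fd, SG_IO, &io_hdr) == -1)
        return errno;

    /* Only the SCSI status byte is checked here; a fuller version would
     * also inspect host_status and driver_status. */
    return io_hdr.status == 0 ? 0 : EIO;
}
```

When fd refers to a partition node rather than the whole disk, this is the request that triggers the kernel warning discussed in this thread.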

What is not clear (to me at least) is what happens when the SCSI flush
cache command gets sent to the partition, i.e. whether the kernel issues it
to the disk, or whether it is discarded with a warning. If the latter is
what happens, then the current implementation has broken flushing anyway
when partitions are used. It is worth pointing out at this stage that at
least Solaris and ZoL default to setting up partitions and using those,
even when the raw disk device nodes are specified at pool creation time.

So, options:

1) If the disk flush command is ignored, then something like the attached
patch would make the situation a little less bad.

What it does is check whether the minor device node number implies the
device is a partition, and if it is a partition it does:
fsync(fd);
ioctl(fd, BLKFLSBUF, 0);

It also does this if the block device is neither SCSI nor ATA, since those
calls should be generically supported on all block devices.

Neither of the above calls issues a flush of the hardware disk caches, but it
should be better than nothing, and it ought to do away with the warning
that results from sending a SCSI command to a partition.

2) Infer the disk node itself, preferably at pool import time, and store it
somehow/somewhere (maybe in the vnode in memory, assuming that doesn't
upset reflections between in-memory and on-disk data formats), and invoke
the cache write-out ioctl on that. Workable but a little dirty and
diverging further from other implementations.

3) Try to mimic more closely what other implementations do. Unfortunately,
different forks seem to disagree on what to do.

FreeBSD doesn't appear to issue a cache flush here at all:
https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c

ZoL does, but it does so via some SPL (Solaris Porting Layer) magic by
dispatching a vdev_file_io_fsync task to the vdev_file_taskq:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_file.c

Adopting the required SPL functionality into zfs-fuse may be an option, but I
would have to track through the code to see just how much I would need to pull
out for all the dependencies of this path to work and, more importantly, to
figure out whether there could be side effects of that code running from
userspace rather than from within the kernel. In short: complicated.
Perhaps too complicated for somebody who has only just started looking at
ZFS code for the first time.
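
Options 1 and 2 can be sketched as below. This is not the attached patch, and both function names are hypothetical. The first function assumes the conventional Linux numbering (SCSI disks on major 8 with 16 minors per disk, IDE disks on major 3 with 64 minors per disk) and gives up on any other major. The second assumes sysfs is mounted on /sys and relies on a partition's sysfs directory sitting inside its parent disk's directory.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>

/* Option 1: guess from the device number whether a node is a partition.
 * Returns 1 for a partition, 0 for a whole disk, and -1 when the major
 * number is not one of the two layouts assumed here. */
int minor_implies_partition(dev_t dev)
{
    switch (major(dev)) {
    case 8:  /* SCSI disk: 16 minors per disk, minor % 16 == 0 is the disk */
        return (minor(dev) % 16) != 0;
    case 3:  /* IDE disk: 64 minors per disk, minor % 64 == 0 is the disk */
        return (minor(dev) % 64) != 0;
    default:
        return -1;
    }
}

/* Option 2: ask sysfs for the disk that owns a partition.
 * /sys/dev/block/MAJ:MIN is a symlink into the partition's directory, so
 * "../dev" is the parent disk's "MAJ:MIN" file. Returns 0 and fills *disk
 * on success, -1 otherwise (including when part is not a partition). */
int parent_disk_of(dev_t part, dev_t *disk)
{
    char path[64];
    unsigned int maj, min;
    int fields;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/../dev",
             (unsigned int)major(part), (unsigned int)minor(part));
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    fields = fscanf(f, "%u:%u", &maj, &min);
    fclose(f);
    if (fields != 2)
        return -1;
    *disk = makedev(maj, min);
    return 0;
}
```

The sysfs lookup avoids hard-coding per-driver minor layouts, which matters for devices such as SD cards that use neither of the two majors handled by the first function.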

Any thoughts on this? Especially the part above regarding whether my patch
to handle partitions differently (by not issuing a SCSI/ATA command to a
partition) will make the cache flushing situation worse instead of better.

Gordan
