Discussion: ashift and 4k disk alignment... Is ZFS cancelling out our partitioning efforts?
Daniel Smedegaard Buus
2011-07-20 12:30:40 UTC
Hello :)

Stumbled upon this by accident while reading up on stuff regarding ZFS
on Linux (on a side note, progress seems to be massive over there):
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

What exactly does this mean to us owners of 4k drives? Does this mean
that performance with ZFS on these drives is the same (as in sub-
optimal) regardless of whether our partitions are properly or
improperly aligned? Or does it mean that we (owners of 4k drives)
should actually do both - that is, properly align our partitions AND
modify ZFS to use this similar 4k alignment internally? Or is it all
just FUD?

Please, some thoughts from those of you more informed than me :)

Cheers,
Daniel :)
sgheeren
2011-07-20 12:50:20 UTC
Post by Daniel Smedegaard Buus
Hello :)
Stumbled upon this by accident while reading up on stuff regarding ZFS
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
Also, see this
https://github.com/zfsonlinux/zfs/issues/289#issuecomment-1495456
in relation to SSDs and erase blocks. It is not over yet, ashift=12
might not be enough for you.
Post by Daniel Smedegaard Buus
What exactly does this mean to us owners of 4k drives? Does this mean
that performance with ZFS on these drives is the same (as in sub-
optimal) regardless of whether our partitions are properly or
improperly aligned? Or does it mean that we (owners of 4k drives)
should actually do both - that is, properly align our partitions AND
modify ZFS to use this similar 4k alignment internally? Or is it all
just FUD?
It's not FUD, however I have no data on it, other than that my Solaris
nv147 pool runs nicely on whole disks with EADS as well as EARS drives (I
purposely mixed the mirrored pools so as to avoid wearing both disks out
at around the same time). zdb -C tells me all three pools are ashift=9...
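For anyone who wants to check their own pools, something along these
lines should do it (the pool name "tank" is just a placeholder):

  # dump the cached pool config and pick out the per-vdev ashift
  zdb -C tank | grep ashift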
Manuel Amador (Rudd-O)
2011-07-20 19:54:28 UTC
With SSDs, it's all over the place. Using ashift=12 helps a bit, but
what helps MUCH MORE is starting-sector alignment (aligned to the erase
block size). We're talking about going from 25 to 90 MB/s write
performance, and a similar boost in read performance.

I know this because I tested this. I posted a data.ods file on
zfsonlinux earlier when I was redoing my laptop SSD and I measured it.
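In practice the alignment part just means making the first partition
start on an erase-block boundary, roughly like this (the device name is
a placeholder, and 1MiB is simply a convenient multiple of the usual
128KiB-512KiB erase block sizes):

  # fresh GPT label (wipes the existing table!)
  parted /dev/sdX mklabel gpt
  # start the first partition at 1MiB (sector 2048)
  parted -a optimal /dev/sdX mkpart primary 1MiB 100%
  # sanity check: show partition starts in sectors
  parted /dev/sdX unit s print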
Post by sgheeren
Post by Daniel Smedegaard Buus
Hello :)
Stumbled upon this by accident while reading up on stuff regarding ZFS
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
Also, see this
https://github.com/zfsonlinux/zfs/issues/289#issuecomment-1495456
in relation to SSDs and erase blocks. It is not over yet, ashift=12
might not be enough for you.
Post by Daniel Smedegaard Buus
What exactly does this mean to us owners of 4k drives? Does this mean
that performance with ZFS on these drives is the same (as in sub-
optimal) regardless of whether our partitions are properly or
improperly aligned? Or does it mean that we (owners of 4k drives)
should actually do both - that is, properly align our partitions AND
modify ZFS to use this similar 4k alignment internally? Or is it all
just FUD?
It's not FUD, however I have no data on it, other than that my Solaris
nv147 pool runs nicely on whole disk with EADS as well as EARS disks (I
purposely mixed mirrored pools so as to avoid wearing both disks out at
around the same time). zdb -C tells me all three pools are ashift=9...
Daniel Smedegaard Buus
2011-07-21 11:42:17 UTC
One more thing: If I were to use a "hacked" zfs build with ashift=12
hardcoded into it, could I just continue using the pool with "normal"
builds such as the official zfs-fuse build and native FreeBSD version?
Or is that "undetermined"?
Manuel Amador (Rudd-O)
2011-07-21 23:06:47 UTC
It won't work. You'll likely corrupt the data if you force the ashift
12 in the internal data structures.
Post by Daniel Smedegaard Buus
One more thing: If I were to use a "hacked" zfs build with ashift=12
hardcoded into it, could I just continue using the pool with "normal"
builds such as the official zfs-fuse build and native FreeBSD version?
Or is that "undetermined"?
sgheeren
2011-07-22 08:46:47 UTC
Post by Manuel Amador (Rudd-O)
It won't work. You'll likely corrupt the data if you force the ashift
12 in the internal data structures.
What other change is necessary? Where does that change come from
(upstream onnv-gate latest version or LLNL?)

In the case of upstream, it should already have been merged into
unstable and perhaps testing. I'd like to check, though, so any
specifics would be welcome
sgheeren
2011-07-22 08:53:50 UTC
Post by sgheeren
Post by Manuel Amador (Rudd-O)
It won't work. You'll likely corrupt the data if you force the ashift
12 in the internal data structures.
What other change is necessary? Where does that change come from
(upstream onnv-gate latest version or LLNL?)
In the case of upstream, it should already have been merged into
unstable and perhaps testing.
For example, this is what we have now
<http://gitweb.zfs-fuse.net/?p=official&a=search&h=refs%2Fheads%2Funstable&st=pickaxe&s=ashift>:

git log -Sashift unstable


commit af19acde5f7cd5791d158012bcef1f4aace4ef73
Author: Victor Latushkin <Victor.Latushkin-UdXhSnd/***@public.gmane.org>
Date: Sun Feb 21 22:58:08 2010 +0100

hg commit 11725:6720637 want zdb -l option to dump uberblock
arrays as well

commit db2f633064b5b229ddc26b1003dadff3dbfcab85
Author: Mark J Musante <Mark.Musante-UdXhSnd/***@public.gmane.org>
Date: Wed Feb 17 15:19:58 2010 +0100

hg commit 11422:PSARC/2009/511 zpool split
5097228 provide 'zpool split' to create new pool by breaking all
mirrors
6880831 memory leak in zpool add
6891438 zfs_ioc_userspace_upgrade could reference uninitialised
error variable
6891441 zvol_create_minor sets local variable zv but never
references it
6891442 spa_import() sets local variable spa but never references it
6895446 vdevs left open after removing slogs or offlining
device/file

commit 5cdd8cf8067a48b121e39a6a1766238bfa8b98b2
Author: Jeff Bonwick <Jeff.Bonwick-UdXhSnd/***@public.gmane.org>
Date: Tue Nov 10 15:02:11 2009 +0100

hg commit 10922:PSARC 2009/571 ZFS Deduplication Properties
6677093 zfs should have dedup capability

commit c8e9062d8679f9a30fbdb826ac7d9f8857f35e06
Author: Adam Leventhal <adam.leventhal-xsfywfwIY+***@public.gmane.org>
Date: Wed Nov 4 13:55:58 2009 +0100

hg commit 10105:6854612 triple-parity RAID-Z
6854621 RAID-Z should mind the gap on writes
Fajar A. Nugraha
2011-07-22 09:03:14 UTC
Post by sgheeren
Post by Manuel Amador (Rudd-O)
It won't work. You'll likely corrupt the data if you force the ashift
12 in the internal data structures.
What other change is necessary? Where does that change come from
(upstream onnv-gate latest version or LLNL?)
In the case of upstream, it should already have been merged into
unstable and perhaps testing. I'd like to check, though, so any
specifics would be welcome
It depends on how you define '"hacked" zfs build with ashift=12
hardcoded into it'.

If, like Rudd-O implies, you start with an existing pool and edit the
on-disk data to somehow force ashift=12, then most likely it would
lead to corruption.

However, if you mean "some implementation of zfs/zpool that can force
the use of ashift=12 at pool creation time", then the resulting pool
should be accessible in other implementations.

For example, zfsonlinux has ashift as a zpool create option, which you
can easily use to create a pool with ashift=12. In (open)solaris you'd
need a workaround: use a device which reports a 4k sector size as the
top-level vdev (iscsi is easiest), see
http://www.mail-archive.com/zfs-discuss-***@public.gmane.org/msg46498.html
or http://opensolaris.org/jive/thread.jspa?threadID=139316 for
examples. The resulting pool (whichever implementation/method was used
to create it) should be readable on other implementations (as long as
they're capable of reading the pool version).
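On zfsonlinux that would look roughly like this (pool name and device
names are placeholders):

  # force 4k alignment at pool creation time
  zpool create -o ashift=12 tank mirror \
      /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B
  # verify what actually ended up in the pool config
  zdb -C tank | grep ashift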
--
Fajar
sgheeren
2011-07-22 10:10:14 UTC
Post by Fajar A. Nugraha
If, like Rudd-O implies, you start with an existing pool and edit the
on disk data to somehow force ashift=12, then most likely it would
lead to corruption.
Ok, I get it. I'm not sure that he was implying such shotgun surgery but
good points on the consequences of course!
Daniel Smedegaard Buus
2011-07-23 08:17:07 UTC
Post by sgheeren
Post by Fajar A. Nugraha
If, like Rudd-O implies, you start with an existing pool and edit the
on disk data to somehow force ashift=12, then most likely it would
lead to corruption.
Ok, I get it. I'm not sure that he was implying such shotgun surgery but
good points on the consequences of course!
Sorry for the confusion, guys. What I meant was creating a fresh pool
with ashift=12 using a hacked binary, then throwing away that binary
and installing more generic ones, and whether or not that would be
possible. Seems it would, thanks :)

I think the confusion stems from adding my other question - whether in
time it's plausible or even possible that a future set of utils would
be able to "upgrade" a sub-optimal pool configuration of ashift=9 to
ashift=12. This is the question that haunts me the most, because the
answer might mean I'll have to re-create my newly created pool once
more and that's a pretty darn time-consuming process :)

Thanks for all your help!
Christ Schlacta
2011-07-23 09:50:01 UTC
Post by Daniel Smedegaard Buus
Post by sgheeren
Post by Fajar A. Nugraha
If, like Rudd-O implies, you start with an existing pool and edit the
on disk data to somehow force ashift=12, then most likely it would
lead to corruption.
Ok, I get it. I'm not sure that he was implying such shotgun surgery but
good points on the consequences of course!
Sorry for the confusion guys, what I meant was creating a fresh pool
with ashift=12 using a hacked binary, then throwing away the binary
and installing more generic ones, and whether or not that would be
possible. Seems it would, thanks :)
I think the confusion stems from adding my other question - whether in
time it's plausible or even possible that a future set of utils would
be able to "upgrade" a sub-optimal pool configuration of ashift=9 to
ashift=12. This is the question that haunts me the most, because the
answer might mean I'll have to re-create my newly created pool once
more and that's a pretty darn time-consuming process :)
Thanks for all your help!
zfs send | zfs recv isn't so bad; you can even use -R. If you have
enough drive space at your disposal to simply do it, then excellent,
and if not, you need to find a way to store the data while you
re-create the pool... however, it is indeed quite simple, and a "set it
and forget it" process.
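Roughly like this, assuming the new pool already exists (created with
ashift=12) and the pool/snapshot names are placeholders:

  # recursive snapshot of everything on the old pool
  zfs snapshot -r oldpool@migrate
  # replicate the whole hierarchy (datasets, snapshots, properties)
  zfs send -R oldpool@migrate | zfs recv -F -d newpool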
Daniel Smedegaard Buus
2011-07-21 11:40:16 UTC
Thanks, both of you. Feels terrible having just created a new 19-drive
38 TB RAIDZ3 pool and mirrored back 10 TB of data, only to discover
this :( I was very careful in creating GPTs with proper 4k alignment
(all drives are 4k, most of them EADS/EARS), and now this... At least
it seems the performance penalty with ashift=9 isn't necessarily that
massive:

http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/

Would be nice to see some more numbers, though.

What do you think about the possibility of a future zfs revision
offering an "ashift upgrade" option? I'm guessing slim to no chance in
hell :)
Manuel Amador (Rudd-O)
2011-07-20 19:52:39 UTC
Do BOTH. I have those disks, and I get a halving in write performance
with partitions misaligned, even with ashift=12. The good news is that
with today's fdisk, even if the disk reports its sector size badly,
you'd have to PURPOSEFULLY misalign your partitions to a non-multiple
of 4K to get it wrong. In the olden days the default first sector was
63; now it's 2048.
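If you want to double-check an existing disk (device and partition
number are placeholders), the start sector should be a multiple of 8
(8 x 512-byte sectors = 4KiB):

  # list partitions with their start sectors
  fdisk -lu /dev/sdX
  # or let parted check the alignment of partition 1 on a GPT disk
  parted /dev/sdX align-check optimal 1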
Post by Daniel Smedegaard Buus
Hello :)
Stumbled upon this by accident while reading up on stuff regarding ZFS
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
What exactly does this mean to us owners of 4k drives? Does this mean
that performance with ZFS on these drives is the same (as in sub-
optimal) regardless of whether our partitions are properly or
improperly aligned? Or does it mean that we (owners of 4k drives)
should actually do both - that is, properly align our partitions AND
modify ZFS to use this similar 4k alignment internally? Or is it all
just FUD?
Please, some thoughts from those of you more informed than me :)
Cheers,
Daniel :)