Discussion:
Something has killed zfs-fuse performance
DC
2010-11-29 22:07:26 UTC
OK, something odd is going on - performance on zfs-fuse seems to have
just died really badly. There have been a couple of infrastructure
changes to the box, but I can't see how they should affect things this
much.

Original system config :

Debian Lenny/stable
ii  fuse-utils   2.8.1-1.2                Filesystem in USErspace (utilities)
ii  libfuse-dev  2.8.1-1.2                Filesystem in USErspace (development files)
ii  libfuse2     2.8.1-1.2                Filesystem in USErspace library
ii  zfs-fuse     0.6.0+critical20100301-  ZFS on FUSE

Athlon II X2 240 (dual-core, 2.8GHz)
4G ram
3x 2TB drives, each split into 2x 1TB partitions
4x 1TB drives, each used as a single 1TB partition
One pool containing 2x raidz1 vdevs, each using one partition from each 2TB
drive plus 2x 1TB drives:

         2T    2T    2T    1T    1T    1T    1T
raidz1   sda1  sdb1  sdc1  sdd1  sde1
raidz1   sda2  sdb2  sdc2              sdf1  sdg1

Performance was OK. Saw about 50MB/s for rsync transfers to/from the
zfs
box, coming from another Debian Lenny box using software raid. Local
cp/mv
within a dataset or between datasets was better, about 45MB/s.

Just recently upgraded the system, replacing cpu and the 1TB drives.

Athlon II X4 955 (quad-core, 3.2GHz)

Replaced the 4x 1TB drives with 2x 2TB drives, using a 1TB partition from
each new 2TB drive to replace each 1TB drive:

         2T    2T    2T    2T    2T
raidz1   sda1  sdb1  sdc1  sdd1  sde1
raidz1   sda2  sdb2  sdc2  sdd2  sde2

The resilvering process took forever, about 30 hours if I recall. I
did the
four zfs replace commands one right after another, so all four were
running at
once. Each one seemed to restart the resilvering process(es) that
were already
running BTW.

Reading seems to be OK. Running an rsync FROM the ZFS box to the
other Lenny
box saw the same 50MB/s. Writing is poor though, rsync TO the ZFS box
results
in only about 15MB/s. Ugh. If there's other activity going on (other
copies etc)
then that drops down even further.

But worst is local cp/mv/rsync in ZFS. 8.9MB/s. That's within a
dataset or
between datasets, and even if it's the only IO going on right then.

What happened ? Where can I look for the bottleneck ? I know
collapsing the 4
1TB spindles down to 2 2TB ones will have *some* impact, but THAT
MUCH ? That's
just going from 7 spindles to 5, should not have been much of a
difference. There
must be something else going on somewhere.
sgheeren
2010-11-29 22:24:56 UTC
Woah, long story. Hard to tell if there is a question.
So many details that it raises a zillion questions. At the moment, I
can't even tell whether you have just recently added 4 disks, or removed 2?!

So, to cut a long story short:
(a) when was performance ok/for how long did that continue to be ok
(b) when did it change/when did you first notice
Post by DC
Each one seemed to restart the resilvering process(es) that
were already
running BTW
Seemed is the right word. Take a look at zpool history -i to find the
exact parameters with which the scrubs (resilvers) are being run. The
percentage starts at 0% again, but the remaining job is smaller.
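Something along these lines (substitute your own pool name):

# zpool history -i yourpool

should show each replace/resilver as it was logged and (re)started, so you
can see what actually happened there.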
Post by DC
         2T    2T    2T    1T    1T    1T    1T
raidz1   sda1  sdb1  sdc1  sdd1  sde1
raidz1   sda2  sdb2  sdc2              sdf1  sdg1
This is not clear. Are you really running separate legs of a ZFS stripe
(or mirror, that part is unclear) on the same hard disks?
If the first and second line are in one pool you will be striping across
[sdb1, sdb2], [sda1, sda2] and [sdc1, sdc2]. This is a sure way to
maximize seek times?

Hint: consider providing zpool commands with output to clearly document
what you have. E.g. zpool list and zpool status -v would be helpful in
addition to the prose version.
Post by DC
[...] Writing is poor though, rsync TO the ZFS box
results
in only about 15MB/s.
That's not unusually low. Writing is much slower than reading on zfs-fuse. My best
guess is that the addition of the drives killed write performance. I
know, it's not intuitive, but I've seen that behaviour in the past. Have you
got a way to retest?
Post by DC
But worst is local cp/mv/rsync in ZFS. 8.9MB/s. That's within a
dataset or
between datasets, and even if it's the only IO going on right then.
Without any other clues, I'm going to guess you have enabled dedup (zfs
get -r dedup -s local,received poolname) and the DDT has grown to exceed
available resident memory. See the "Evil Tuning Guide" and "Dedup FAQ"
on Solaris.

Another contributing factor could be that rsync is simply getting slower
because it is not the first time (a first-time run will effectively copy
as if --inplace and --whole-file were active; a second run will _read_
the target data by default). Use --whole-file to mitigate this on a local
system, or anywhere where network bandwidth is copious.
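E.g. something like (paths made up):

rsync -av --whole-file /some/source/dir/ /mnt/zfsbox/dataset/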
DC
2010-11-30 00:50:54 UTC
Heh, sorry about the novel. OK, some more detail ...

I originally had 3x 2TB and 4x 1TB drives, 7 total. Most efficient
way to utilize the space
was to do 2x raidz1 vdevs, each using 5x 1TB partitions, 3 from the
2TB and 2 full 1TB drives.
This is with the Athlon II X2 240 dual-core 2.8GHz.

Performance here was pretty reasonable. From my notes: created a pool
called ZFStest, and two datasets within it, dir1 and dir2. Target files did
not exist, so these were new full copies.

Copying a file within a dataset:
-rwxrw-rw- 1 21 root 1743311693 2009-12-06 04:10 File1.mkv
***@Lenny:/mnt/ZFStest# time cp -av dir1/File1.mkv dir1/File1test.mkv
`File1.mkv' -> `File1test.mkv'
real 1m7.546s
1743311693 bytes / 67.5 s = ~25,800,000 bytes/s, i.e. roughly 25 MB/s

Copying a file between datasets:
-rwxrw-rw- 1 21 root 1743311693 2009-12-06 04:10 File1.mkv
***@Lenny:/mnt/ZFStest# time mv dir1/File1.mkv dir2
real 0m37.528s
1743311693 bytes / 37.5 s = ~46,400,000 bytes/s, i.e. roughly 45-46 MB/s

Next, upgraded :) Replaced cpu with Athlon II X4 955 quad-core
3.2GHz. Replaced the
4x 1TB drives with 2x 2TB drives. I know it's not an ideal disk
layout, but the only way to
use the 2TB disks as single full devices was a complete rebuild of the
pool. I will if I have
to, but ick.

Actual layout looks like this :

# zpool status
  pool: stuff
 state: ONLINE
 scrub: none requested
config:

        NAME                                                             STATE     READ WRITE CKSUM
        stuff                                                            ONLINE       0     0     0
          raidz1                                                         ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS722020ALA330_JK1131YAG9W9AV-part1  ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS722020ALA330_JK1131YAG9PLTV-part1  ONLINE       0     0     0
            disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0101673-part1    ONLINE       0     0     0
            disk/by-id/ata-WDC_WD20EARS-00J2GB0_WD-WCAYY0289676-part1    ONLINE       0     0     0
            disk/by-id/ata-WDC_WD20EARS-00J2GB0_WD-WCAYY0289805-part1    ONLINE       0     0     0
          raidz1                                                         ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS722020ALA330_JK1131YAG9W9AV-part2  ONLINE       0     0     0
            disk/by-id/ata-Hitachi_HDS722020ALA330_JK1131YAG9PLTV-part2  ONLINE       0     0     0
            disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0101673-part2    ONLINE       0     0     0
            disk/by-id/ata-WDC_WD20EARS-00J2GB0_WD-WCAYY0289676-part2    ONLINE       0     0     0
            disk/by-id/ata-WDC_WD20EARS-00J2GB0_WD-WCAYY0289805-part2    ONLINE       0     0     0

errors: No known data errors

# zpool upgrade
This system is currently running ZFS pool version 16.

All pools are formatted using this version.

So, no dedup on this version, right ?

Immediately I noticed "poorer" performance on the box. Hrm, 4 cores
now running faster than the original
2 cores - that's not right. Shuffling files around I noticed that
things took quite a bit longer, and started
digging around. And that led me here ... :)

There is a later version of zfs-fuse available ...

# apt-cache policy zfs-fuse
zfs-fuse:
  Installed: 0.6.0+critical20100301-5
  Candidate: 0.6.9-1
  Version table:
     0.6.9-1 0
        400 http://http.us.debian.org testing/main Packages
        300 http://http.us.debian.org unstable/main Packages
 *** 0.6.0+critical20100301-5 0
        100 /var/lib/dpkg/status
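When the time comes, something like

# apt-get -t testing install zfs-fuse

should pull 0.6.9-1 from testing despite the pinning, if I'm reading that
policy output right.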

Next step will be to upgrade, but not before I have a verified
backup. Waiting on an
additional sata controller for the backup box, which is where the 4x
1TB went to live.

BTW, thanks for the amazingly rapid reply.

D.
Post by sgheeren
Woah, long story. Hard to tell if there is a question.
So many details, that it raises a zillion questions. At the moment, I
can't even tell whether you have just recently added 4 disks, or removed 2?!
sgheeren
2010-11-30 01:57:01 UTC
Post by DC
Heh, sorry about the novel. OK, some more detail ...
"Sorry about the novel, I meant it to be an epic novel" ? :)
Well, I asked for it :)

[...]
Post by DC
use the 2TB disks as single full devices was a complete rebuild of the
pool. I will if I have
to, but ick.
Well I see the point in not wanting to rebuild, but you could always
have created a separate pool on the other halves of the large disks. As
it stands, you are actively telling ZFS to stripe across physical
partitions on the same disk. This is really going to harm performance,
even with proper kernel disk write ordering (e.g. cfq or elevator
scheduling).

The ideal would be to use whole disks and disable scheduling by the
kernel (noop, although in my experience cfq is ok as well).

The good news is, all slices being equal sized (I hope!) you can
probably 'zpool replace' your way out of this if you accept running a
degraded pool during the transition. If you have a spare disk (or
multiple) you could either speed it up or avoid running in degraded mode
(by temporarily attaching mirrors)
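The individual moves would be built from commands of this shape (device
names invented here; the exact sequence depends on what spares you have):

# zpool offline stuff disk/by-id/ata-SomeDisk-part2
# zpool replace stuff disk/by-id/ata-SomeDisk-part1 disk/by-id/ata-SpareDisk
# zpool online stuff disk/by-id/ata-SomeDisk-part2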
Post by DC
[...]
# zpool upgrade
This system is currently running ZFS pool version 16.
All pools are formatted using this version.
So, no dedup on this version, right ?
Clear. No dedup. The version may well be the culprit, btw. Somehow you
got an old version of zfs-fuse. Could it be that
(a) the old installation held a newer version
(b) you forgot to copy e.g. /etc/zfs/zfsrc over?

That could explain a loss of performance over your old setup.
Post by DC
Next step will be to upgrade, but not before I have a verified
backup.
+1 good idea
Post by DC
Waiting on an
additional sata controller for the backup box, which is where the 4x
1TB went to live.
If you do, why not rebuild the pool (zfs send to a backup device,
recreate pool with other layout, zfs receive it back)
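In rough strokes, assuming a big enough pool 'backuppool' on the backup box
(names invented, untested):

# zfs snapshot -r stuff@migrate
# zfs send -R stuff@migrate | ssh backupbox zfs receive -Fdv backuppool
  ... destroy 'stuff' and recreate it on whole disks ...
# ssh backupbox zfs send -R backuppool@migrate | zfs receive -Fdv stuff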
Post by DC
BTW, thanks for the amazingly rapid reply.
D.
HTH,
Seth
DC
2010-11-30 02:43:50 UTC
Post by sgheeren
Post by DC
Heh, sorry about the novel. OK, some more detail ...
[...]
use the 2TB disks as single full devices was a complete rebuild of the
pool. I will if I have to, but ick.
Each 2TB disk is split into two 1TB partitions. All the partition1's
make up one vdev,
all the partition2's make up the 2nd vdev. The two vdevs make up the
pool.
Post by sgheeren
Well I see the point in not wanting to rebuild, but you could always
have created a separate pool on the other halves of the large disks. As
it stands, you are actively telling ZFS to stripe across physical
partitions on the same disk. This is really going to harm performance,
even with proper kernel disk write ordering (e.g. cfq or elevator
scheduling).
You're saying it would have been better to have two pools, one using
the partition1's
and the other using the partition2's ? If you're copying a large file
from one to the
other, how is that different from the existing setup ? Not being
pissy here, I feel
like I'm missing something fundamental.
Post by sgheeren
The ideal would be to use whole disks and disable scheduling by the
kernel (noop, although in my experience cfq is ok as well).
Is that something I should do now, until (and after) I can rebuild the
pool ?
Post by sgheeren
The good news is, all slices being equal sized (I hope!) you can
probably 'zpool replace' your way out of this if you accept running a
degraded pool during the transition. If you have a spare disk (or
multiple) you could either speed it up or avoid running in degraded mode
(by temporarily attaching mirrors)
Yup - all partitions 1TB, see above. I thought you could only replace
a raidz1 member
with the same size or larger, and you can't remove members. Wouldn't
this require
another 5x 2TB drives, winding up with (drool) 20TB ?
Post by sgheeren
Post by DC
# zpool upgrade
This system is currently running ZFS pool version 16.
All pools are formatted using this version.
So, no dedup on this version, right ?
Clear. No dedup. The version may well be the culprit, btw. Somehow you
got an old version of zfs-fuse. Could it be that
(a) the old installation held a newer version
(b) you forgot to copy e.g. /etc/zfs/zfsrc over?
Umm, never had a /etc/zfs/zfsrc ... Ain't got one now.

System was built as a Lenny/stable box back in March I think.
0.6.0+critical20100301-5
was the version of zfs-fuse available then in the testing and unstable
releases. I'm using
pinning to be able to install packages from testing and unstable, so
they don't upgrade
automagically. I'm assuming that the release managers upgraded from
0.6.0 to 0.6.9 at
one point in the last few months ...
Post by sgheeren
That could explain a loss of performance over your old setup.
Post by DC
Next step will be to upgrade, but not before I have a verified backup.
+1 good idea
Post by DC
Waiting on an
additional sata controller for the backup box, which is where the 4x
1TB went to live.
If you do, why not rebuild the pool (zfs send to a backup device,
recreate pool with other layout, zfs receive it back)
Yup, I think that's the right path forward. Time to get a couple of
extra gigE nics and
enable LACP. Hrm, would the disk subsystem be able to keep up ? What
sort of
throughput should be reasonably expected for a setup like this ?

Athlon II X4 955 quad-core 3.2GHz
4G ddr2-2000 ram
5x 2TB sata300 disks

# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
stuff  4.29T  2.83T  59.1K  /stuff

Might not seem like much, but even over a single gigE link, that takes
a LONG
time to transfer.

Many thanks for the time -

D.
sgheeren
2010-11-30 08:46:19 UTC
Post by DC
If you're copying a large file
from one to the
other, how is that different from the existing setup ? Not being
pissy here, I feel
like I'm missing something fundamental.
*If* that is the case, *then* there is no difference. The difference is:
you can be smart and avoid that.
Stuffing it all in the one pool makes that impossible, and you're
*guaranteed* to end up with high seek activity on write. It doesn't matter to
which dataset you are writing, because it is all in one pool.

My point is: with two pools you can have
(a) the same total capacity
(b) no seek penalty when running operations on the individual pools
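
I.e. something along these lines (pool names invented; not saying you
should redo it right now, just to illustrate):

# zpool create pool1 raidz1 sda1 sdb1 sdc1 sdd1 sde1
# zpool create pool2 raidz1 sda2 sdb2 sdc2 sdd2 sde2

A copy that stays inside pool1 then never touches the -part2 slices
(a copy from pool1 to pool2 of course still would).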

$0.02
Post by DC
Is that something I should do now, until (and after) I can rebuild the
pool ?
Of course using whole disks cannot be done unless you rebuild the pool.
Also note that for maximum portability
(a) create an MSDOS disk label (standard linux fdisk style) with a
single partition
(b) consider using a specific (not latest) pool version e.g. for use
with FreeBSD/Solaris-clones.

You can manage the elevator algorithm on the fly (/etc/rc.local or
sysctl.d are common spots):

cat /sys/block/sd[abcde]/queue/scheduler

or

echo cfq > /sys/block/sda/queue/scheduler

(echo cfq | sudo tee /sys/block/sda/queue/scheduler if you are a sudo
fan).
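To make it stick across reboots, a line like this in /etc/rc.local does the
trick (drive letters assumed):

for f in /sys/block/sd[abcde]/queue/scheduler; do echo noop > "$f"; done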
Post by DC
Yup - all partitions 1TB, see above. I thought you could only replace
a raidz1 member
with the same size or larger, and you can't remove members. Wouldn't
this require
another 5x 2TB drives, winding up with (drool) 20TB ?
Important point is that you cannot replace dev A by dev B if A is say
2kilobytes larger than B.
No you wouldn't actually need all that extra capacity if you are willing
to temporarily degrade your pool (have backups).
With a raidz1, you have some redundancy. You should be able to
transparently offline/remove 1 physdev from both raidz1 nodes. This
might give you just enough headroom to shuffle things around.

Of course having spare disks is
(a) easier
(b) more efficient
(c) safer

(and you get to keep the backup copy if you so choose)

$0.02
sgheeren
2010-11-30 08:52:01 UTC
Post by DC
Yup, I think that's the right path forward. Time to get a couple of
extra gigE nics and
enable LACP. Hrm, would the disk subsystem be able to keep up ? What
sort of
throughput should be reasonably expected for a setup like this ?
There is no need to speculate. You have been measuring 50MB/s read
speeds. So that is the reasonably expected rate for a setup like this. I
advise you to have a simple/linear pool setup on the 'temporary' end for
max efficiency.
You can, alternatively, store it to a flat file (use no compression or
you will be CPU bound)

zfs send -Rv ***@rebuilding | ssh ***@host 'dd bs=1M of=zfs_rebuilding.stream'

That way you eliminate the overhead, potential instability under load,
and even the need to have a ZFS-capable system on the receiving end. In
my mind 1GbE (dedicated) is appropriate for read speeds up to roughly 100 MB/s [1].
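Getting it back later is just the reverse (again untested, names as above):

ssh ***@host 'dd if=zfs_rebuilding.stream bs=1M' | zfs receive -Fdv stuff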
sgheeren
2010-11-30 19:02:26 UTC
Post by sgheeren
of=zfs_rebuilding.stream'
Oh, I don't know why I forgot to mention: if you really want to tweak
throughput, mix in a little mbuffer on the receiving end and substitute
netcat for ssh. (Of course assuming a secure network.)
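Roughly (port and buffer size picked out of thin air; depending on which
netcat you have, the -p may need to go):

receiver$ nc -l -p 9090 | mbuffer -m 512M | dd bs=1M of=zfs_rebuilding.stream
sender$   zfs send -Rv ***@rebuilding | nc receivinghost 9090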
Aneurin Price
2010-11-30 15:27:39 UTC
Post by sgheeren
The ideal would be to use whole disks and disable scheduling by the
kernel (noop, although in my experience cfq is ok as well).
Wow. Maybe this should be in big letters on the main web page - I just
tried setting all my ZFS disks to noop (from cfq) and got a huge boost
in speed - in the case of files with a good spread across disks my
read speed went from ~25MB/s to ~50MB/s.

Nye
sgheeren
2010-11-30 19:00:57 UTC
Post by Aneurin Price
Post by sgheeren
The ideal would be to use whole disks and disable scheduling by the
kernel (noop, although in my experience cfq is ok as well).
Wow. Maybe this should be in big letters on the main web page - I just
tried setting all my ZFS disks to noop (from cfq) and got a huge boost
in speed - in the case of files with a good spread across disks my
read speed went from ~25MB/s to ~50MB/s
Nye
Hmmm... interesting. I'm pretty sure that cfq should be no worse than
noop in practice (though I forgot the science/details).

On what distribution was this?
Aneurin Price
2010-11-30 20:19:31 UTC
Post by sgheeren
Post by Aneurin Price
Post by sgheeren
The ideal would be to use whole disks and disable scheduling by the
kernel (noop, although in my experience cfq is ok as well).
Wow. Maybe this should be in big letters on the main web page - I just
tried setting all my ZFS disks to noop (from cfq) and got a huge boost
in speed - in the case of files with a good spread across disks my
read speed went from ~25MB/s to ~50MB/s
Nye
Hmmm... interesting. I'm pretty sure that cfq should be no worse than
noop in practice (though I forgot the science/details).
On what distribution was this?
This is Debian stable, though I'm running a 2.6.36 kernel from
experimental. It's a pool consisting of 3 mirrored pairs of disks,
where one is SATA and 5 are USB, which might be a factor.
Daniel Smedegaard Buus
2010-12-12 20:19:35 UTC
Hmmm, I realize this is a seriously aging thread, but joining the
party late, I feel that I could've given the OP some good advice. So
here goes anyway, if someone wants it ;)

First of all, to clarify what sgheeren is pointing out about the
problem with the dual-vdev on different partitions on the same
physical drives: The issue is that when you're assembling two vdevs
into the same pool, ZFS is going to see those vdevs as being
completely unrelated, and it assumes they are comprised of
different hardware components. Therefore, the optimal thing to do when
reading and writing is to distribute every piece of
data between those two vdevs. Compare it to RAID-0. If ZFS thinks it
has two independent sets of hardware in its pool, it'll assume it can
double the throughput by distributing data between them.

It does not know that, in fact, these two vdevs are partly on the same
hard drives. Partly, because if you had 1TB disks all the way, still
with two vdevs, you *would* get a massive throughput increase. But, as
the three 2TB drives are members of both vdevs, you'll see the exact
opposite: a dramatic performance penalty as the heads on the 2TB disks
will be seeking about *half the platter* for every sector they're
writing; ZFS will try to write to both vdevs simultaneously, i.e. both
partitions simultaneously, i.e. head thrashing galore. This is not
only hard on performance, it's also wearing out your drives!

The main reason I'm thinking I should answer this is because I've
tested this quite a bit, due to the fact that I'm doing basically the
same as you (DC). That is, assembling a large pool from different
sized drives. I raised this question a while back on a Solaris forum
("Question about multiple RAIDZ vdevs using slices on the same disk",
http://web.archiveorange.com/archive/v/LmwutMbI7CdWikmGbohM). This was
my first pool, and the idea was to create it like yours using 4 vdevs
distributed over a total of 16 drives: 2x2TB with 4 slices and vdev
memberships each, 6x1TB with 2 memberships, and 8x½TB with just 1
membership each.

I ended up just testing it, and I can tell you that the performance
hit was massive. Head thrashing from here to eternity. I opted for a
different approach altogether: creating underlying mdadm linear (JBOD)
devices by assembling the smaller disks to make up 2TB vdev members
for ZFS. This works amazingly well and seems to create negligible
extra resource consumption. Keep in mind that md is a kernel module,
and JBOD/linear does not do parity checking, so this is a very very
fast layer to introduce. You're not making yourself any more
vulnerable with regards to data security. It also has a few
interesting advantages:

1) If you have the controllers for it, you can use striping on the
smaller disks (which are usually of older build and slower than their
larger, more recent counterparts) to further enhance sustained read/
write performance. This would of course require enough bandwidth on
the controllers to not already be saturated reading from just one
drive. E.g. in my setup, I use three PCI32 4-port SATA cards to hook
up the eight 500GB drives and two of the 1TB drives, and the onboard
ICH10R controller drives the remaining disks. In this setup, the PCI
bus was already almost completely saturated in JBOD mode, so striping
didn't yield any significant increase in throughput.
2) If you *don't* stripe the smaller drives but JBOD them instead like
I do, you're putting less wear on each drive, as it would only be hit
by something like every (number_of_members_in_mdadm)'th request. This
could also help reduce seek penalties when doing simultaneous reads
and writes.
3) You can expand your zpool without replacing disks and without
adding new vdevs! As you know, ZFS doesn't (yet) support adding disks
to an existing vdev. It does, however, support expanding itself when
its existing members grow. This means that for instance you could have
a zpool with 6 members made of underlying mdadm arrays, and you could
simply buy another controller and six additional drives, add them to
the mdadm arrays, grow the partitions on them, and watch ZFS
automatically claim the additional newfound space. No downtime.
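
For reference, the md side of this is nothing more than something like
(device names made up, mine differ):

# mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/sdf /dev/sdg
# mdadm --detail --scan >> /etc/mdadm/mdadm.conf

and then /dev/md0 goes into the pool as an ordinary 2TB vdev member.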

Had I the option now to redo anything about my setup, I'd have used
LVM instead of mdadm, and set it up from the beginning for every vdev
member, even the 2TB drives that make up a whole member. This would
make it very easy to expand my zpool later as described in advantage
3). Also, I believe you can port LVM devices to Solaris if I ever
wanted to go that route. Seems that's not possible with mdadm. As it
is now, partitions on my 2TB drives are raw members of the vdev, so
expanding those will require a bit of dd'ing and additional mdadm/LVM
setup.

But, you live and you learn, right ;)

Recently, I had *three* disks failing in my zpool. This is RAIDZ-2, so
it would normally spell doom, but because they were smaller drives,
and their failing sectors distributed fortunately, no data loss
occurred. This also means my setup changed a bit, so that it's now
4x2TB, 2x(2x1TB), 1x(4x½TB). Though this is really a completely
different disk setup, ZFS is none the wiser about this, as the members
it sees are abstractions of the actual drives. It really is a quite
flexible setup :)
sgheeren
2010-12-13 00:47:25 UTC
Thanks for the writeup, mate, it is always nice to see how others tinker
and optimize.
All in all, I would personally vote for anything that reduces system
complexity (using multiple volume management layers goes against that
for me).
Post by Daniel Smedegaard Buus
You're not making yourself any more
vulnerable with regards to data security
Well in a way you are. If a single drive is a single point of failure,
then a JBOD disk comprised of 3 drives is logically 3x more likely to
fail. There is no usable guarantee that the OS is going to continue to
have access to the other parts of a linear volume when one failed
(although most situations should be recoverable to a degree).

Of course, there is no real damage as long as you are using ZFS for
redundancy anyway, but keep in mind that
(a) MTBF is going to be cut down
(b) It will be harder (if at all possible) to access your pools portably

In my book, this is reducing data security (marginally, but still)
Post by Daniel Smedegaard Buus
I'd have used
LVM instead of mdadm,
AFAIK, using an lvm2 logical volume will _not_ allow ZFS to reliably
sync IO transactions. So, better have a UPS and be able to afford to lose
up to 5-10 seconds' worth of data in case of hardware failure.
Post by Daniel Smedegaard Buus
and set it up from the beginning for every vdev
member, even the 2TB drives that make out a whole member.
Remember to optimize for block alignment (e.g. use --metadatasize=250k
or something like that on your physical volumes). Otherwise, just
interposing lvm2 volumes between ZFS and the physical disk might mess
up any alignment policies that ZFS tries to maintain for maximum
throughput. I'm not so sure how much this matters with spindle disks,
but it matters a lot with common SSD technology, for example.
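On lvm2 that would look something like this (names and the exact metadata
size are just placeholders):

# pvcreate --metadatasize 250k /dev/sdf /dev/sdg
# vgcreate vg_small /dev/sdf /dev/sdg
# lvcreate -l 100%FREE -n zmember vg_small

and then you hand /dev/vg_small/zmember to ZFS instead of a raw partition.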
Post by Daniel Smedegaard Buus
3). Also, I believe you can port LVM devices to Solaris if I ever
wanted to go that route
Can you tell me how, because I've been looking for this for ages. I
haven't found it yet

Seth
Daniel Smedegaard Buus
2010-12-13 11:16:53 UTC
Post by sgheeren
All in all, I would personally vote for anything that reduces system
complexity (using multiple volume management layers goes against that
for me).
Well, the purist in me says, "me too", and had I been a sysadm and
walked in on a setup like mine in production somewhere, you'd probably
have read about this on thedailywtf.com instead, heh ;)

The other part of me, the dirty cheapskate lateral thinking crazy
person loves Linux because you get to do stuff like this. And in all
honesty, when I read about other people here complaining about
performance problems, my face is one big question mark. I have
wonderful performance across the board (even writing is really good by
my standards, and this is using the cheap Pentium E2200 processor),
case in point:
***@lnxsrv:/titanic/archive$ dd if=/dev/zero of=/titanic/archive/zeroes.ddimg bs=1M count=10000 && dd if=/titanic/archive/zeroes.ddimg of=/dev/null bs=1M
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 202,221 s, 51,9 MB/s
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 64,1314 s, 164 MB/s
(note: 4G system ram)
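(If you want the read figure without the kernel page cache helping, I
believe something like

echo 3 > /proc/sys/vm/drop_caches

as root between the two runs would do it - though zfs-fuse's own ARC would
still be warm. I didn't bother here.)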

Besides, the ZFS-FUSE "anti-ZFS" recommendation of using partitions
rather than raw disks as recommended on OS was kinda my "green card"
to start doing it like this ;) Plus, hey, such great things I learn!
Post by sgheeren
Post by Daniel Smedegaard Buus
You're not making yourself any more
vulnerable with regards to data security
Well in a way you are. If a single drive is a single point of failure,
than a JBOD disk comprised of 3 drives is logically 3x more likely to
fail. There is no usable guarantee that the OS is going to continue to
have access to the other parts of a linear volume when one failed
(although most situations should be recoverable to a degree).
Hm. I see your point. Though - and this is completely unsubstantiated
gut feelings - I have a suspicion it might be just as much the
opposite. I have about 50 hard drives in my apartment (never throwing
away dying drives, I just leave damaged sectors occupied by "don't
touch" data and dump non-vital files on them, then use them as plug
units for my external USB drive bay). In my experience, the larger and
more recent the drives are, the shorter their lifespan has been. (I'd
like to take this pause to remind about the gut feeling disclaimer)
Also, I'm really liking my theory about the JBODs lessening wear on
each underlying device. This, of course, has the prerequisite that ZFS
doesn't try to scatter writes across the pool when writing, but
prefers to just try to write to contiguous blocks. This I do not
know.

Besides, I don't mind so much about MTBF so long as the failures don't
affect my pool. And the alternative (counting out multiple vdevs)
would be to move a lot of perfectly healthy drives to my "external
drive bay faction" and honestly, there's only so much porn one man can
download ;) (Just kidding!)

Last I experienced the triple-drive failure, it was quite simple to
rebuild (though time-consuming with lots of downtime, something that
would not be acceptable for a professional system). I exported the
pool, kept the JBODs running and dd'd my data onto two new 2TB drives.
Re-import the pool, scrub, ZFS fixed data corruption where bad sectors
had been present, done and happy.

But yeah, it should be pointed out that this is pretty f'in far from
best practice. But I had the feeling that DC wasn't going that way
anyway ;)
Post by sgheeren
Of course, there is no real damage as long as you are using ZFS for
redundancy anyway, but keep in mind that
(a) MTBF is going to be cut down
(b) It will be harder (if at all possible) to access your pools portably
Very very true. It is also advised to keep backups of /etc/md* and
/etc/zfs*.
Post by sgheeren
Post by Daniel Smedegaard Buus
I'd have used
LVM instead of mdadm,
AFAIK, using a lvm2 logical volume will _not_ allow ZFS to reliably
synch IO transactions. So, better have UPS and be able to afford up to
5-10 seconds worth of data in case of hardware failure.
I was under the impression that using partitions had the same effect?
That is, screwing up ZFS's ability to reliably use caches. How does
this apply to md arrays, btw? I don't even know if caches are in play
here.
Post by sgheeren
Post by Daniel Smedegaard Buus
 and set it up from the beginning for every vdev
member, even the 2TB drives that make out a whole member.
Remember to optimize for block alignment (e.g. use --metadatasize=250k
or something like that on your physical volumes). Otherwise, just
interposing lvm2 volumes between ZFS and the physical disk, might mess
up any alignment policies that ZFS tries to maintain for maximum
throughput. I'm not so sure how much this matters with spindle disks,
but it matters a lot with common SSD technology, e.g.
Thanks for the tip. I've not yet had any real incentive to do
performance tweaking (ref. dd output above). I'm guessing I would do
this for each md array as their optimal settings would differ (2x1TB
vs. 4x½TB)?
Post by sgheeren
Post by Daniel Smedegaard Buus
3). Also, I believe you can port LVM devices to Solaris if I ever
wanted to go that route
Can you tell me how, because I've been looking for this for ages. I
haven't found it yet
See, this is why I rarely regret talking crap online. Because
someone's gonna call you on that, and then you find out you're
mistaken; lesson learned. For the life of me, I cannot find anything
to back me up on that. I believe I must have short-circuited something
in my brain, possibly from the fact that the LVM tools on Solaris are
ported from the LVM tools on Linux.

Thanks for your feedback, Seth! :)
sgheeren
2010-12-13 14:58:38 UTC
Post by Daniel Smedegaard Buus
The other part of me, the dirty cheapskate lateral thinking crazy
person loves Linux because you get to do stuff like this
+1 but I can certainly recommend throwing a little money at it and
buying some peace of mind

I know I liked it when I finally built my NAS somewhere last year. Up to
then it was 10+ years of toying with all these technologies :)
Daniel Smedegaard Buus
2010-12-13 16:38:09 UTC
Post by sgheeren
Post by Daniel Smedegaard Buus
The other part of me, the dirty cheapskate lateral thinking crazy
person loves Linux because you get to do stuff like this
+1 but I can certainly recommend throwing a little money at it and
buying some peace of mind
I know I liked it when I finally built my NAS somewhere last year. Up to
then it was 10+ years of toying with all these technologies :)
I bet you miss it too, though ;) Besides, if everyone was using proper
storage solutions and following best practices to the letter, who
would catch that next stray bug that arose when combining lvm + mdadm
+ cromfs with loop images + zfs-fuse on a hacked 2.7 kernel? I'm doing
community service here, man! Hehe :D
sgheeren
2010-12-13 15:13:29 UTC
Post by Daniel Smedegaard Buus
I'm really liking my theory about the JBODs lessening wear on
each underlying device.
You might have a point. Didn't that google study on harddisk durability
have any data points on that topic? (more todo items for browsing... )
Post by Daniel Smedegaard Buus
This, of course, has the prerequisite that ZFS
doesn't try to scatter writes across the pool when writing, but
prefers to just try to write to contiguous blocks.
It is log-structured -> sequential; perhaps a bit too single-minded in
that respect (quoting from memory Jeff Bonwick's own comment that the
block allocator is somewhat simplistic but they will change it 'sometime
in the future, when the need arises' :))
Post by Daniel Smedegaard Buus
I was under the impression that using partitions had the same effect?
That is, screwing up ZFS's ability to reliably use caches. How does
this apply to md arrays, btw? I don't even know if caches are in play
here.
It's not about caching. It's about guarantees that certain writes have
hit the disk, and in the desired order.
If there is no sync capability on the underlying device you will see a
warning at pool activation in syslog

By the way, I think you are actually referring to ZFS not being the 'bus
master' so to say if using partitions. That depends on what other
activity goes to same disk. I think that if left alone, there will be
little/no difference.

Do check cat /sys/block/sda/queue/scheduler, though, to prevent the
kernel from interfering (too much). Use 'noop' or 'cfq' for performance.
Post by Daniel Smedegaard Buus
LVM tools on Solaris are
Hmmm... I'm still interested in a link, if you can find it :)
Post by Daniel Smedegaard Buus
ported from the LVM tools on Linux.
Daniel Smedegaard Buus
2010-12-13 17:02:03 UTC
Post by sgheeren
Didn't that google study on harddisk durability
have any data points on that topic (more todo items for brosing... )
The headlines I remember from that study was that drives that would
fail tended to do so quite early (I'm looking at you, 4-month old WD
2TB drive with SMART warnings about 3 relocated bad sectors, hrmph!),
and that the difference between high-end SCSI (and equivalent) drives
and consumer-level drives weren't as pronounced as everyone's saying.
And then something about temps not having to be particularly low, but
that at a particular threshold @ really high values you'd be looking
at death raining quickly from above... But my memory is flaky at best,
so...
Post by sgheeren
Post by Daniel Smedegaard Buus
This, of course, has the prerequisite that ZFS
doesn't try to scatter writes across the pool when writing, but
prefers to just try to write to contiguous blocks.
It is logstructured -> sequential; perhaps a bit too single-handed in
that respect (quoting from memory Jeff Bonwicks own comment that the
block allocator is somewhat simplistic but they will change it 'sometime
in the future, when the need arises' :))
Sounds like Microsoft? :D Are we in need of a defragmenter?
Post by sgheeren
Post by Daniel Smedegaard Buus
I was under the impression that using partitions had the same effect?
That is, screwing up ZFS's ability to reliably use caches. How does
this apply to md arrays, btw? I don't even know if caches are in play
here.
It's not about caching. It's about guarantees that certain writes have
hit the disk, and in the desired order.
If there is no sync capability on the underlying device you will see a
warning at pool activation in syslog
Now that I have your attention apparently, and we're already on this
(off-)topic, I was wondering about this actually. Quite a while back I
was doing some tests with USB sticks and ZFS-FUSE to see how well it
managed to cope with missing devices. I was trying to find a good way
to distribute my boatload of drives on all my crappy PCI32 controllers
so that if a *controller* died rather than a harddisk, it wouldn't be
able to take down the array. The concern arose because I read some
(older) Solaris posts where a power failure or similar had completely
destroyed pools. I tried unplugging sticks and managed to fudge up my
pools.

Looking back, I'm pretty sure it had to do with Linux caching a lot of
data (should've actively flushed), because at that time I managed to
kill the pool completely, but running on actual disks now, I've never
experienced any issues, even after having to pull the power cord on my
box a number of times.

Long blah blah introduction over, question right here: If my
horrendous setup of layer-upon-layer should somehow fudge the journal
or whatever the ZFS equivalent is on a power outage, will I still be
able to return to the state of an old snapshot? Provided, of course,
that my hardware is still untouched?
Post by sgheeren
By the way, I think you are actually referring to ZFS not being the 'bus
master' so to say if using partitions. That depends on what other
activity goes to same disk. I think that if left alone, there will be
little/no difference.
Funny, I always thought that giving access to the entire device would
also give direct access to SMART stuff and other hardware
thingamajigs. Thinking twice, it makes sense that "sda" would be no
less of a block device than "sda1", and that access to hardware on the
disk itself would require accessing completely detached kernel
whatchamacallems.
Post by sgheeren
Do check cat /sys/block/sda/queue/scheduler, though to prevent the
kernel from interfering (too much). Use 'noop' of 'cfq' for performance.
Thanks! Says "noop deadline [cfq]" already, guess the performance-
oriented setup is default now? Read about this tip just recently on
this list, was it about Samba performance? Something's wrong with that
guy's setup, btw, I get +70 MBps on my Samba shares on this setup...
Maybe I should revisit that thread...
Post by sgheeren
Post by Daniel Smedegaard Buus
LVM tools on Solaris are
Hmmm... I'm still interested in a link, if you can find it :)
Me, too... Stop listening to me. I think I pinpointed my brain
haemorrhage: http://en.wikipedia.org/wiki/Logical_volume_management#Notes
- note 7. BSD, not Solaris. I'm a ranting type of guy, you'll have to
excuse me ;)
sgheeren
2010-12-13 22:07:05 UTC
Post by Daniel Smedegaard Buus
The headlines I remember from that study was that drives that would
fail tended to do so quite early (I'm looking at you, 4-month old WD
2TB drive with SMART warnings about 3 relocated bad sectors, hrmph!),
and that the difference between high-end SCSI (and equivalent) drives
and consumer-level drives weren't as pronounced as everyone's saying.
And then something about temps not having to be particularly low, but
that at a particular threshold @ really high values you'd be looking
at death raining quickly from above... But my memory is flaky at best,
so...
I had roughly the same recollection; alas nothing on

* many smaller disks being more durable than some larger and stuff like that
* combining disks linearly (JBOD style) to level the wear

Haha, we need to run our own investigation there [1]
Post by Daniel Smedegaard Buus
Sounds like Microsoft? :D Are we in need of a defragmenter?
Nope, Microsoft would not have told you. They'd have the marketeers
draft up adverts stating how the block allocation algorithm is so
optimized that fragmentation can never have a measurable effect [and you
can always reinstall from scratch]

Anyways, 'somewhat simplistic' is evidently nowhere near as simplistic as
early nilfs2/btrfs implementations. I think last time I tested nilfs,
everything went reaaaaally fast until the disk filled up. Then it was
toast as in: end of data.
Post by Daniel Smedegaard Buus
If my
horrendous setup of layer-upon-layer should somehow fudge the journal
or whatever the ZFS equivalent is on a power outage, will I still be
able to return to the state of an old snapshot? Provided, of course,
that my hardware is still untouched?
First, this is right On-Topic. ... For the zfs-discuss list ...
Second: this is okay, there used to be issues with that (importing would
fail if last uberblock corrupted or log not replayable). These issues
have been fixed (if my memory serves me well, the last of them around
b121). There are now 'finetune' options (-f, -F and the undocumented -V
(verbatim) and -X (extreme rewind) options. Note also you can use -n
(dryrun) with that to see what would be lost. If you really don't like
it, you can always try the even less-documented zdb tool to salvage some
things before you run the pool back to a sane point in the past.
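The shape of it (pool name invented):

# zpool import -nF mypool    <- dry run, shows what a rewind would discard
# zpool import -F mypool     <- roll back to the last consistent txg
# zpool import -FX mypool    <- extreme rewind, last resort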
Post by Daniel Smedegaard Buus
Me, too... Stop listening to me. I think I pinpointed my brain
haemorrhage: http://en.wikipedia.org/wiki/Logical_volume_management#Notes
- note 7. BSD, not Solaris. I'm a ranting type of guy, you'll have to
excuse me ;)
As long as the noise threshold keeps below 1 dB, I'm ok for now!


Cheers,
Seth

[1] not
Daniel Smedegaard Buus
2010-12-14 15:50:57 UTC
Post by sgheeren
Post by Daniel Smedegaard Buus
The headlines I remember from that study was that drives that would
[...]
at death raining quickly from above... But my memory is flaky at best,
so...
I had roughly the same recollection; alas nothing on
* many smaller disks being more durable than some larger and stuff like that
* combining disks linearly (JBOD style) to level the wear
Haha, we need to run our own investigation there [1]
Hehe, one guinea pig reporting for service! ;) (And from the hurt part
in me: that would have to be from a different study anyway, wouldn't
it?)
Post by Daniel Smedegaard Buus
Sounds like Microsoft? :D Are we in need of a defragmenter?
Nope, Microsoft would not have told you.
And you'd have a million advocates shutting down any complaints with
"don't blame MS for bad drivers". Sorry... Don't wanna start that ;)
[and you can always reinstall from scratch]
(TM)
Post by Daniel Smedegaard Buus
If my
horrendous setup of layer-upon-layer should somehow fudge the journal
or whatever the ZFS equivalent is on a power outage, will I still be
able to return to the state of an old snapshot? Provided, of course,
that my hardware is still untouched?
First, this is right On-Topic. ... For the zfs-discuss list ...
Second: this is okay, there used to be issues with that (importing would
fail if last uberblock corrupted or log not replayable). These issues
have been fixed (if my memory serves me well, the last of them around
b121). There are now 'finetune' options (-f, -F and the undocumented -V
(verbatim) and -X (extreme rewind) options. Note also you can use -n
(dryrun) with that to see what would be lost. If you really don't like
it, you can always try the even less-documented zdb tool to salvage some
things before you run the pool back to a sane point in the past.
Definitely seems like this issue has been addressed :O ! I hope I'll never
have to use any of those commands, but it's bookmarked now, thanks :)

Thanks,
Daniel :)