LMDB crash consistency, again

Discussion:

Howard Chu

2014-10-20 10:44:37 UTC

This paper
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
describes a potential crash vulnerability in LMDB due to its use of fdatasync
instead of fsync when syncing writes to the data file. The vulnerability
exists because fdatasync omits syncs of the file metadata; if the data file
needed to grow as a result of any writes then this requires a metadata update.

This is a well-understood issue in LMDB; we briefly touched on it in this
earlier email thread
http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's
been a topic of discussion on IRC ever since the first multi-FS
microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/

It's worth noting that this vulnerability doesn't exist on Windows, MacOSX,
Android, or *BSD, because none of these OSs have a function equivalent to
fdatasync in the first place - they always use fsync (or the Windows
equivalent). (Android is an oddball; the underlying Linux kernel of course
supports fdatasync, but the C library, bionic, does not.)

We have a few possible approaches for Linux:
1) provide an option to preallocate the file, using fallocate().
Unfortunately this doesn't completely eliminate metadata updates - filesystem
drivers tend to try to be "smart" and make fallocate cheap; they allocate the
space in the FS metadata but they also mark it as "unseen." The first time a
process accesses an unseen page, it gets zeroed out. Up until that point,
whatever old contents the disk page held are still present. The act of changing
a page from "unseen" to "seen" requires a metadata update of its own.

We had a discussion of this FS mis-feature a while ago, but it was fruitless.
https://lkml.org/lkml/2012/12/7/396

2) preallocate the file by explicitly writing zeros to it. This has a
couple other disadvantages:
a) on SSDs, doing such a write needlessly contributes to wearout of the
flash.
b) Windows detects all-zero writes and compresses them out, creating a
sparse file, thus defeating the attempt at preallocation.

3) track the allocated size of the file, and toggle between fsync and
fdatasync depending on whether the allocated size actually grows or not. This
is the approach I'm currently taking in a development branch. Whether we add
this to a new 0.9.x release, or just in 1.0, I haven't yet decided.
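
Approach 1 above can be sketched roughly as follows. This is an illustration only, not LMDB code: preallocate_file() is a hypothetical helper, and it calls posix_fallocate(), which glibc implements with fallocate() where the filesystem supports it. As noted above, the reserved space may still be flagged as not yet written, so the first write to each page can still cost a metadata update.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of LMDB: reserve `size` bytes for the
 * file at `path` so that later writes inside that range do not have to
 * extend it. Returns 0 on success or an errno value on failure. */
static int preallocate_file(const char *path, off_t size)
{
    assert(size > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return errno;

    /* posix_fallocate() returns the error number directly; it does not
     * set errno. */
    int rc = posix_fallocate(fd, 0, size);

    /* One full fsync() so that the new size itself is durable. */
    if (rc == 0 && fsync(fd) != 0)
        rc = errno;

    if (close(fd) != 0 && rc == 0)
        rc = errno;
    return rc;
}
```

A caller would run this once when creating the data file, for example preallocate_file("data.mdb", 1 << 20), before any transaction writes to it.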

As another footnote, I plan to add support for LMDB on a raw partition in 1.x.
Naturally, fsync vs fdatasync will be irrelevant in that case.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard Breien Furuseth

2014-11-15 00:10:38 UTC

Catching up with old mail...

Post by Howard Chu
This paper
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
describes a potential crash vulnerability in LMDB due to its use of
fdatasync instead of fsync when syncing writes to the data file. The
vulnerability exists because fdatasync omits syncs of the file metadata;
if the data file needed to grow as a result of any writes then this
requires a metadata update.

Looks like an OS bug. fdatasync() should not break data integrity; it
may only skip metadata that is not needed to retrieve the data, so
size changes must still be synced. So say the POSIX spec and the Linux
man page.

--
Hallvard

Howard Chu

2014-11-15 01:57:38 UTC

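Post by Hallvard Breien Furuseth
Looks like an OS bug. fdatasync() should not break data integrity, it
may only skip metadata which are unneeded for retrieving the data. So
size changes are synced. So say the Posix spec and the Linux manpage.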

Ah good point. If you check out their slides, #103 of 106 asks the
question; the only failure they found in LMDB occurred on ext3 (and not
on XFS) so we may just chalk this up to a flaw in ext3 instead.

Given that ext3 has already been superseded by ext4, this result of
theirs may not be all that useful in the real world. We already have
disrecommended ext3 for performance reasons, perhaps we should just note
this and move on.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard Breien Furuseth

2014-11-15 10:05:42 UTC

Post by Howard Chu
Ah good point. If you check out their slides, #103 of 106 asks the
question; the only failure they found in LMDB occurred on ext3 (and not
on XFS) so we may just chalk this up to a flaw in ext3 instead.
Given that ext3 has already been superseded by ext4, this result of
theirs may not be all that useful in the real world. We already have
disrecommended ext3 for performance reasons, perhaps we should just note
this and move on.

No, ext4 breaks too. Their paper's page 459, 5.1.2 LightningDB:

The fact that the journal commit block (op#402) is
flushed with the next pwrite64 in the same thread
means fdatasync on ext3 does not wait for the completion
of journaling (similar behavior has been observed
on ext4).

I guess O_DSYNC and fdatasync() should not be the lmdb
defaults yet, at least not on Linux:-(

We need a bug report to Linux or to the distro they were
using, noting that the power faults are simulated. And is
this O_DSYNC, fdatasync or both? The problem might be with
only one of them. I'm not reading the paper in detail yet.

--
Hallvard

Howard Chu

2015-01-05 11:50:43 UTC

Post by Hallvard Breien Furuseth

The fact that the journal commit block (op#402) is
flushed with the next pwrite64 in the same thread
means fdatasync on ext3 does not wait for the completion
of journaling (similar behavior has been observed
on ext4).

I guess O_DSYNC and fdatasync() should not be the lmdb
defaults yet, at least not on Linux:-(

We need a bug report to Linux or to the distro they were
using, noting that the power faults are simulated. And is
this O_DSYNC, fdatasync or both? The problem might be with
only one of them. I'm not reading the paper in detail yet.

This appears to be quite old news.

https://lkml.org/lkml/2012/9/3/83

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

2015-01-05 11:50:59 UTC

This appears to be quite old news.

https://lkml.org/lkml/2012/9/3/83

It has references going back to at least 2008.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

2015-01-05 11:58:41 UTC

Post by Howard Chu

This appears to be quite old news.
https://lkml.org/lkml/2012/9/3/83
It has references going back to at least 2008.

The LKML thread indicates that this bug was already fixed. The Zheng et al.
paper says they used RHEL6, which shipped with kernel 2.6.32, so it
apparently was too old to have the fix.

All in all a bunch of bogus reporting; claiming that all DBs are broken
when in fact LMDB is perfectly correct.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard Breien Furuseth

2015-01-05 13:29:32 UTC

Post by Howard Chu
The LKML thread indicates that this bug was already fixed. The zheng mai
paper says they used RHEL6, which shipped with kernel 2.6.32 so it apparently
was too old to have the fix.
All in all a bunch of bogus reporting; claiming that all DBs are broken when
in fact LMDB is perfectly correct.

True - but often uninteresting from the user's perspective. So I do
think LMDB on Linux should default to fsync for some years - at least
when the file may have grown. The Makefile can explain the problem and
provide a variable to always use fdatasync, if the admin knows the
kernel is OK.
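
One way to read that suggestion is a compile-time switch that a Makefile variable can flip. The sketch below is hypothetical: SYNC_TRUST_FDATASYNC and data_sync() are names made up for illustration, not existing LMDB options.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Hypothetical build knob: compile with -DSYNC_TRUST_FDATASYNC=1 (for
 * example through a Makefile variable) only when the admin knows the
 * kernel's fdatasync() flushes size changes correctly. */
#ifndef SYNC_TRUST_FDATASYNC
#define SYNC_TRUST_FDATASYNC 0
#endif

/* Flush file data to stable storage; the default is the conservative
 * fsync(). Returns 0 on success, -1 with errno set on failure. */
static int data_sync(int fd)
{
#if SYNC_TRUST_FDATASYNC
    return fdatasync(fd);
#else
    return fsync(fd);
#endif
}
```

The Makefile side would then be a single line such as CPPFLAGS += -DSYNC_TRUST_FDATASYNC=1, next to a comment explaining the kernel issue.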

As for how to know the synced size, if you want to do more than always
use fsync on an OS where fdatasync is unreliable:

I drafted some code to get around it, but it got messy. If we
use more code for this than just '#define MDB_FDATASYNC fsync',
I suggest handling it all in mdb_env_sync(), which can fstat():

    struct MDB_env:
        size_t me_size;  /**< file size known to be synced, or 0 */

    mdb_env_sync() {
        ...;
    #if MDB_BUGGY_FDATASYNC
        size_t sz = 0;
        /* Size unknown, or changed since the last known sync: use fsync */
        if (mdb_fsize(env->me_fd, &sz) != MDB_SUCCESS || sz != env->me_size) {
            if (fsync(env->me_fd))
                rc = ErrCode();
            else if (sz)
                env->me_size = sz;  /* this size is now known to be synced */
        } else
    #endif
            ...normal sync...;
    }

mdb_env_open() does not know if the current filesize has been
synced, so drop setting me_size there.
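
For readers without the LMDB sources at hand, the same idea can be shown as a standalone sketch. It is not LMDB code: struct sync_state, sync_tracking_size() and demo_sync_paths() are hypothetical names, and it compares st_size from fstat() where the draft above uses mdb_fsize().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state kept per open data file. */
struct sync_state {
    int   fd;
    off_t synced_size;  /* file size known to be synced, or 0 if unknown */
};

/* Sync the file, using fsync() whenever the size is unknown or differs
 * from the last size known to be durable, and fdatasync() otherwise.
 * Returns 0 on success, -1 with errno set on failure. *used_fsync
 * reports which path was taken, purely for illustration. */
static int sync_tracking_size(struct sync_state *s, int *used_fsync)
{
    struct stat st;
    assert(s != NULL && used_fsync != NULL);

    int have_size = (fstat(s->fd, &st) == 0);
    if (!have_size || st.st_size != s->synced_size) {
        *used_fsync = 1;
        if (fsync(s->fd) != 0)
            return -1;
        if (have_size)
            s->synced_size = st.st_size;  /* now known to be synced */
        return 0;
    }
    *used_fsync = 0;
    return fdatasync(s->fd);
}

/* Usage sketch: the first sync after growth takes the fsync() path, a
 * later in-place overwrite takes the fdatasync() path. Returns 0 if
 * both paths were taken as expected. */
static int demo_sync_paths(const char *path)
{
    struct sync_state s = {
        .fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644),
        .synced_size = 0
    };
    int used_fsync = 0, ok = 0;

    if (s.fd < 0)
        return -1;
    if (pwrite(s.fd, "abcd", 4, 0) == 4 &&               /* grows the file */
        sync_tracking_size(&s, &used_fsync) == 0 && used_fsync == 1 &&
        pwrite(s.fd, "wxyz", 4, 0) == 4 &&               /* same size */
        sync_tracking_size(&s, &used_fsync) == 0 && used_fsync == 0)
        ok = 1;
    close(s.fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

As in the draft, the tracked size starts at 0 because a freshly opened file's current size is not known to be synced, so the first sync always takes the fsync() path.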

--
Hallvard

Hallvard Breien Furuseth

2015-01-05 22:25:16 UTC

Another sync issue:

mdb_env_sync() syncs the wrong way (msync vs fdatasync) if it
runs in a process with a different MDB_WRITEMAP setting than
one which committed with MDB_NOSYNC or MDB_NOMETASYNC.

I.e. this statement in lmdb.h is too weak:

* Processes with and without MDB_WRITEMAP on the same environment do
* not cooperate well.

I think I added it after an IRC chat. But it should either say
that it can break ACID, or env_sync() called explicitly should
sync more aggressively - at least if the MDB_env did not commit
all transactions since last known sync.
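
A minimal sketch of what "sync more aggressively" could look like, assuming the caller cannot tell whether earlier unsynced commits went through the memory map or through write(): flush both. sync_both_ways() and demo_sync_both() are hypothetical helpers for illustration, not what mdb_env_sync() does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: flush dirty pages of a shared mapping and then
 * the file descriptor's own pending writes. Returns 0 on success, -1
 * with errno set on failure. */
static int sync_both_ways(void *map, size_t map_len, int fd)
{
    assert(map != NULL && map_len > 0);

    /* Pages dirtied through the map (the MDB_WRITEMAP style of commit). */
    if (msync(map, map_len, MS_SYNC) != 0)
        return -1;

    /* Data written with write()/pwrite() (the non-WRITEMAP style). */
    return fsync(fd);
}

/* Usage sketch: dirty one byte through the map and one through
 * pwrite(), then flush both. Returns 0 on success. */
static int demo_sync_both(const char *path)
{
    const size_t len = 8192;
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) == 0) {
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            map[0] = 'm';                          /* via the map */
            if (pwrite(fd, "w", 1, 4096) == 1)     /* via the fd */
                rc = sync_both_ways(map, len, fd);
            munmap(map, len);
        }
    }
    close(fd);
    unlink(path);
    return rc;
}
```

Doing both costs an extra system call per sync, which is presumably why it would only be worth it when the environment did not itself commit every transaction since the last known sync.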

--
Hallvard

Hallvard Breien Furuseth

2015-01-05 22:31:03 UTC

Post by Hallvard Breien Furuseth
I think I added it after an IRC chat. But it should either say
that it can break ACID, or env_sync() called explicitly should
sync more aggressively - at least if the MDB_env did not commit
all transactions since last known sync.

Never mind the "called explicitly". The same thing applies to the sync in
txn_commit if the last txn was committed with a different WRITEMAP setting.

--
Hallvard