ext3/ext4 fsync hack (was: openldap.git branch mdb.master updated. 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe)

Discussion:

Hallvard Breien Furuseth

2014-12-24 12:04:59 UTC

commit 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe
Date: Thu Dec 18 04:38:53 2014 +0000
Hack for potential ext3/ext4 corruption issue
Use regular fsync() if we think this commit grew the DB file.

This does not catch all cases:

If the new pages below mt_next_pgno were freed instead of
written, me_size becomes too big. Later when the file does
grow, me_size may be >= actual filesize so it fdatasync()s.
Similar to b09e46904c1c059bd5086243e3915b6be510e57d
"ITS#7886 fix mdb_copy write size".
We can fix me_size, grow the file anyway (ftruncate), or
give the pages back to mt_next_pgno in mdb_freelist_save().

Another issue: After an MDB_NOSYNC commit, mdb_env_sync()
only fdatasync()s. It does not know when the file grew.
The planned "group commits" may get the same problem if
the user checkpoints with mdb_env_sync().
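
The first miss described above can be traced with a small model. This is only a sketch under assumptions: model_env, model_commit and their fields are hypothetical stand-ins for me_size and mt_next_pgno, and the growth check is a guess at the shape of the heuristic in the commit, not LMDB's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not LMDB's code: the size the environment believes
 * the file has (cf. me_size) next to the size the file really has. */
typedef struct {
    size_t page_size;
    size_t believed_size; /* plays the role of me_size */
    size_t actual_size;   /* what fstat() would report */
} model_env;

typedef enum { SYNC_FDATASYNC, SYNC_FSYNC } sync_kind;

/* One commit. next_pgno is the allocation high-water mark (cf. mt_next_pgno);
 * last_written_pgno is the highest page actually written, or -1 if none. */
static sync_kind model_commit(model_env *env, size_t next_pgno,
                              long last_written_pgno)
{
    sync_kind kind = SYNC_FDATASYNC;
    size_t wanted;

    assert(env->page_size > 0);
    wanted = next_pgno * env->page_size;

    /* Assumed heuristic: treat the file as grown if the high-water mark
     * passed the believed size, and remember the new believed size. */
    if (wanted > env->believed_size) {
        kind = SYNC_FSYNC;
        env->believed_size = wanted;
    }

    /* The real file only grows when a page past its end is written. */
    if (last_written_pgno >= 0) {
        size_t end = ((size_t)last_written_pgno + 1) * env->page_size;
        if (end > env->actual_size)
            env->actual_size = end;
    }
    return kind;
}
```

Starting from a 2-page file, a commit that allocates pages 2 and 3 but frees them without writing makes the model pick fsync() and raise the believed size to 4 pages while the file stays at 2. A later commit that writes those pages grows the file, yet the model picks fdatasync().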

Howard Chu

2015-01-06 14:18:39 UTC

Post by Hallvard Breien Furuseth

commit 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe
Date: Thu Dec 18 04:38:53 2014 +0000
Hack for potential ext3/ext4 corruption issue
Use regular fsync() if we think this commit grew the DB file.

If the new pages below mt_next_pgno were freed instead of
written, me_size becomes too big.

Huh? mt_next_pgno definitively tells how many pages have ever been used
in the DB file.

Post by Hallvard Breien Furuseth
Later when the file does
grow, me_size may be >= actual filesize so it fdatasync()s.
Similar to b09e46904c1c059bd5086243e3915b6be510e57d
"ITS#7886 fix mdb_copy write size".
We can fix me_size, grow the file anyway (ftruncate), or
give the pages back to mt_next_pgno in mdb_freelist_save().
Another issue: After an MDB_NOSYNC commit, mdb_env_sync()
only fdatasync()s. It does not know when the file grew.

I suppose we can change the FORCE flag to also cause fsync() to be used.

Post by Hallvard Breien Furuseth
The planned "group commits" may get the same problem if
the user checkpoints with mdb_env_sync().

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard Breien Furuseth

2015-01-06 14:27:12 UTC

Post by Howard Chu

Post by Hallvard Breien Furuseth
(....)
If the new pages below mt_next_pgno were freed instead of
written, me_size becomes too big.

Huh? mt_next_pgno definitively tells how many pages have ever been used in
the DB file.

No, see ITS#7886:

"Allocate an ovpage from mt_next_pgno, mdb_ovpage_free() it
and commit: The datafile may end before MDB_meta.mm_last_pg
since the ovpage was never written. mdb_env_copyfd() & co
break when they read the file to mm_last_pg."
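
As an illustration of the ITS#7886 problem, here is one way a reader of the data file could avoid running past its end. It is a minimal sketch, not the actual fix: copy_length is a hypothetical helper, and last_pg stands in for MDB_meta.mm_last_pg.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes a copy routine may safely read: the range implied by the meta page,
 * clamped to what the file really contains. Hypothetical helper. */
static size_t copy_length(size_t last_pg, size_t page_size, size_t file_size)
{
    size_t implied;

    assert(page_size > 0);
    implied = (last_pg + 1) * page_size; /* pages 0..last_pg inclusive */
    return implied < file_size ? implied : file_size;
}
```

If the meta page says the last page is 3 but only two 4096-byte pages were ever written, this returns 8192 rather than 16384.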

Post by Hallvard Breien Furuseth
Later when the file does
grow, me_size may be >= actual filesize so it fdatasync()s.
Similar to b09e46904c1c059bd5086243e3915b6be510e57d
"ITS#7886 fix mdb_copy write size".
We can fix me_size, grow the file anyway (ftruncate), or
give the pages back to mt_next_pgno in mdb_freelist_save().
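
The "grow the file anyway (ftruncate)" option could look roughly like the following. This is a sketch under assumptions: grow_file_to is a hypothetical helper, and where LMDB would call something like it, and with what size, is not settled in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Extend fd to at least `want` bytes; never shrinks the file.
 * Returns 0 on success, -1 on error with errno set. Hypothetical helper. */
static int grow_file_to(int fd, off_t want)
{
    struct stat st;

    assert(want >= 0);
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size >= want)
        return 0;               /* already large enough, nothing to do */
    return ftruncate(fd, want); /* the new range reads back as zeros */
}
```

Called with mt_next_pgno times the page size at commit, something like this would keep the real file size in step with the tracked size, so the growth check sees every extension.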

--
Hallvard

Hallvard Breien Furuseth

2015-01-06 14:40:32 UTC

Sorry, forgot this one.

Post by Howard Chu

Post by Hallvard Breien Furuseth
Another issue: After an MDB_NOSYNC commit, mdb_env_sync()
only fdatasync()s. It does not know when the file grew.

I suppose we can change the FORCE flag to also cause fsync() to be used.

That is insufficient if the user commits with MDB_NOSYNC (maybe when
creating the DB), then turns off MDB_NOSYNC and does mdb_env_sync(env, 0).
The same goes for another process without MDB_NOSYNC doing
mdb_env_sync(env, 0).

The lockfile could track what has been synced how, though.
Except it won't know at init, so if someone does a lot of
<open env, commit, close env> they'll end up fsync'ing each time.
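
The lockfile idea and its drawback can be sketched as below. Everything here is hypothetical: sync_state is an imagined record in the lock file, and the two functions are illustrative names, not LMDB API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared record, e.g. stored in the lock file. */
typedef struct {
    bool   known;        /* false right after the environment is opened */
    size_t fsynced_size; /* file size covered by the last full fsync() */
} sync_state;

/* A full fsync() is needed if the history is unknown or the file has
 * grown past what the last full fsync() covered. */
static bool needs_full_fsync(const sync_state *s, size_t current_size)
{
    assert(s != NULL);
    return !s->known || current_size > s->fsynced_size;
}

/* Record that a full fsync() completed at the given file size. */
static void note_full_fsync(sync_state *s, size_t current_size)
{
    assert(s != NULL);
    s->known = true;
    s->fsynced_size = current_size;
}
```

A freshly initialized record always answers "yes", which is the open, commit, close pattern mentioned above: each new open starts with no history and pays for a full fsync().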

--
Hallvard