Question regarding MDB_NOLOCK

David Barbour

2015-01-30 21:57:38 UTC

(This is a quick follow-up to an earlier discussion, a note left for anyone
confused by MDB_NOLOCK.)

On Mon, Dec 1, 2014 at 6:55 AM, Hallvard Breien Furuseth wrote:

A write transaction frees pages which its new snapshot cannot see.
A later writer will overwrite them, when no *known* readers can see
them either. But with MDB_NOLOCK, writers don't know about old
readers and might overwrite pages which old readers can see.
Last snapshot is never overwritten. So readers which did begin/renew
after latest commit(write txn) are safe from txn_begin(write txn).
The same with the commit of the write txn before that. I think.
MDB keeps the last two snapshots in the metapages.

I've been reading the MDB source code a bit more to verify this assumption.
It is not valid.

MDB does keep the last two metapages, but may begin to reclaim pages from
the elder of the two if there are no readers for it. With MDB_NOLOCK, it
simply assumes there are no readers for it (cf. mdb_find_oldest). **Only
the most recent snapshot is preserved.**

For my current use case, I believe that I can still achieve a sufficient
level of parallelism even if limited to double-buffering (whereas two
snapshots would give me triple-buffering). I'm not going to press for any
changes at this time.

David Barbour

2015-01-30 22:49:17 UTC

Post by David Barbour
For my current use case, I believe that I can still achieve a sufficient
level of parallelism even if limited to double-buffering (whereas two
snapshots would give me triple-buffering). I'm not going to press for any
changes at this time.

After having examined this further, I've changed my mind.

With triple buffering, I can guarantee that the writer *almost* never waits
on a short-running reader, and that the readers never wait on the writer.
With double buffering, the probability of the writer waiting on even
short-running readers, assuming they are frequent, is nearly 100%. Triple
buffering is thus a huge advantage for users of MDB_NOLOCK.
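
The waiting argument above can be put into a small model. This is only a sketch, not LMDB code. It assumes a single writer running back-to-back transactions of fixed length, readers of fixed length that start continuously on the newest committed snapshot, and a rule that a new write transaction may begin only when no reader remains on a snapshot outside the newest `protected_snapshots`. The function name and its parameters are hypothetical.

```python
def writer_wait(reader_duration, write_duration, protected_snapshots):
    """Time a new write transaction must wait before it may reuse old pages.

    Illustrative model only (not LMDB code). Assumptions: one writer running
    back-to-back transactions of length write_duration; readers of length
    reader_duration start continuously, each on the newest committed snapshot.
    """
    # The newest unprotected snapshot stopped being "the latest" this long ago,
    # so its youngest reader has already been running for that long.
    youngest_reader_age = (protected_snapshots - 1) * write_duration
    return max(0.0, reader_duration - youngest_reader_age)


# Double buffering: only the newest snapshot is protected.
double = writer_wait(reader_duration=1.0, write_duration=5.0, protected_snapshots=1)
# Triple buffering: the two newest snapshots are protected.
triple = writer_wait(reader_duration=1.0, write_duration=5.0, protected_snapshots=2)
print(double, triple)
```

Under these assumptions the double-buffered writer waits a full reader lifetime on every transaction, while the triple-buffered writer waits only when a reader outlives a whole write transaction.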

The update to support this is almost trivial: tweak `mdb_find_oldest` such
that both meta-page snapshots are considered to have active readers. I'm
willing to develop and submit a patch, but only if this change also sounds
good to the main LMDB developers.

Regards,

Dave

Howard Chu

2015-01-31 00:21:56 UTC

Post by David Barbour
For my current use case, I believe that I can still achieve a
sufficient level of parallelism even if limited to double-buffering
(whereas two snapshots would give me triple-buffering). I'm not
going to press for any changes at this time.
After having examined this further, I've changed my mind.
With triple buffering, I can guarantee that the writer *almost* never
waits on a short-running reader, and that the readers never wait on the
writer. With double buffering, the probability of the writer waiting on
even short-running readers, assuming they are frequent, is nearly 100%.
Triple buffering is thus a huge advantage for users of MDB_NOLOCK.
The update to support this is almost trivial: tweak `mdb_find_oldest`
such that both meta-page snapshots are considered to have active
readers. I'm willing to develop and submit a patch, but only if this
change also sounds good to the main LMDB developers.

That is supposed to be its current behavior already. I.e., no page that
either of the two meta pages points to is ever allowed to be reclaimed.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

David Barbour

2015-01-31 03:44:39 UTC

Post by Howard Chu
That is supposed to be its current behavior already. I.e., no page that
either of the two meta pages points to is ever allowed to be reclaimed.

Okay then. On Monday, I'll see if I can write a test demonstrating this
bug. And, if so, a patch to fix it.

Hallvard Breien Furuseth

2015-01-31 04:04:04 UTC

On Mon, Dec 1, 2014 at 6:55 AM, Hallvard Breien Furuseth wrote:
(...)
Last snapshot is never overwritten. So readers which did begin/renew
after latest commit(write txn) are safe from txn_begin(write txn).
The same with the commit of the write txn before that. I think.
MDB keeps the last two snapshots in the metapages.

Post by David Barbour
I've been reading the MDB source code a bit more to verify this
assumption. It is not valid.
MDB does keep the last two metapages, but may begin to reclaim pages from
the elder of the two if there are no readers for it.

Yes, true. The last two snapshots' *data pages* are never
overwritten, and any readers using them will have read the
metapage and do not need it again.

I confused myself because I was thinking of sync issues:
Even if the oldest metapage has been overwritten, that
does not mean it is gone yet: If it has not been synced
to disk, a system crash can bring it back. And with it,
its refs to datapages.

Looking closer, though, that is only relevant with
MDB_NOSYNC, where the previous metapage has not been
synced either.

David Barbour

2015-01-31 06:34:14 UTC

Okay, I think I see what's happening here:

mdb_find_oldest(): will return the most recent snapshot if no readers exist

mdb_page_alloc(): will search FREE_DBI for a transaction `last` that is
less than oldest, and will try to find a contiguous range of pages that
were free'd by said transaction, potentially merging free pages from many
transactions. If nothing is found, will instead grow the database.

Since `last` < `oldest` when we reuse any old pages, and we're only using
the 'freed' pages from last (not the data pages), we know that the data
pages for the two most recent transactions are protected.

Is this right?

My earlier assumption (before reading mdb_page_alloc) was that LMDB would
be aggressive about grabbing pages freed by transactions that are not
actively being read. If we're relying on `last < oldest` to create a two
page discrepancy, this means that when we actually have readers on older
transactions, we're being a little more conservative than necessary. But
it does protect the last two snapshots.
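
The reasoning above can be traced with a toy model. The sketch below is not LMDB source. It assumes, as described in this message, that `mdb_find_oldest` falls back to the most recent committed snapshot when no readers are known, and that the free list record keyed by transaction N holds the pages freed by transaction N (pages last visible in snapshot N - 1). The Python function names and the sample free list are hypothetical.

```python
def find_oldest(write_txnid, reader_txnids=()):
    """Oldest snapshot the writer believes is still in use (model, not LMDB code)."""
    oldest = write_txnid - 1          # the most recently committed snapshot
    for txnid in reader_txnids:       # empty when the writer cannot see readers
        oldest = min(oldest, txnid)
    return oldest


def reusable_records(free_db, oldest):
    """Keys of free-list records the allocator may draw from: last < oldest."""
    return sorted(last for last in free_db if last < oldest)


# Hypothetical free list: txn N's record holds pages that snapshot N no longer
# references, i.e. pages last visible in snapshot N - 1.
free_db = {7: [30, 31], 8: [12], 9: [44, 45], 10: [13]}

oldest = find_oldest(write_txnid=11)            # no readers known, as with MDB_NOLOCK
reusable = reusable_records(free_db, oldest)    # records 7, 8 and 9
newest_snapshot_touched = max(reusable) - 1     # pages of record 9 belonged to snapshot 8
print(oldest, reusable, newest_snapshot_touched)
```

In this model the writer of transaction 11 never touches a page that snapshot 9 or snapshot 10 can reach, which is the two-snapshot protection discussed in the thread.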

...

Idle question: what happens with freeing old pages when the txnid_t wraps
around on a 32-bit system? Do the pages free'd by those transactions just
get stuck?
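
To make the concern concrete, here is a small sketch of why a plain `last < oldest` comparison would strand records after a wrap. It assumes that the transaction ID is a 32-bit unsigned counter on such a system and that it wraps by simple modular arithmetic. How LMDB actually behaves at that point is not established in this thread, so treat this as an illustration of the question only; the helper names are hypothetical.

```python
TXNID_BITS = 32                      # assumed width of txnid_t on a 32-bit build
TXNID_MASK = (1 << TXNID_BITS) - 1


def next_txnid(txnid):
    """Increment with unsigned wraparound, as a fixed-width counter would."""
    return (txnid + 1) & TXNID_MASK


def is_reusable(last, oldest):
    """The reuse test quoted in the thread: a record is usable when last < oldest."""
    return last < oldest


stuck_record = TXNID_MASK - 1        # record written shortly before the wrap
txnid = stuck_record
for _ in range(10):                  # ten more commits carry the counter past zero
    txnid = next_txnid(txnid)

oldest = txnid
print(hex(stuck_record), oldest, is_reusable(stuck_record, oldest))
```

After the wrap, `oldest` is a small number again, so a record keyed just below the maximum no longer satisfies `last < oldest` and would stay unused until the counter climbs back up to it.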

Howard Chu

2015-01-31 06:50:30 UTC

Post by David Barbour
Idle question: what happens with freeing old pages when the txnid_t
wraps around on a 32-bit system? Do the pages free'd by those
transactions just get stuck?

Probably. Again, I really don't see 32-bit systems as being worthy of
consideration.

David Barbour

2015-01-31 07:03:31 UTC

Post by Howard Chu
Probably. Again, I really don't see 32-bit systems as being worthy of
consideration.

That's understandable. There are other embedded databases suitable for
32-bit systems.

Howard Chu

2015-01-31 07:19:52 UTC

Post by David Barbour
That's understandable. There are other embedded databases suitable for
32-bit systems.

If anyone ever runs a 32-bit server fast enough and long enough to
process 4 billion write transactions, they can always do an mdb_copy to
reset the txnIDs.
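
For a sense of scale, the arithmetic behind "4 billion write transactions" can be sketched as follows. The commit rates are hypothetical examples and not measurements of any real deployment; the only fixed input is the 32-bit counter range.

```python
SECONDS_PER_DAY = 86_400
TXNID_LIMIT = 2 ** 32                # "4 billion" write transactions


def days_until_wrap(commits_per_second):
    """Days of sustained writing before a 32-bit transaction ID wraps."""
    return TXNID_LIMIT / commits_per_second / SECONDS_PER_DAY


for rate in (10, 1_000, 100_000):    # hypothetical sustained commit rates
    print(f"{rate:>7} commits/s -> {days_until_wrap(rate):8.1f} days")
```

The result depends entirely on the sustained commit rate, so whether a wrap is a practical concern is a per-deployment question.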

Howard Chu

2015-01-31 06:54:43 UTC

Post by David Barbour
mdb_find_oldest(): will return the most recent snapshot if no readers exist

mdb_page_alloc(): will search FREE_DBI for a transaction `last` that is
less than oldest, and will try to find a contiguous range of pages that
were free'd by said transaction, potentially merging free pages from
many transactions. If nothing is found, will instead grow the database.

Since `last` < `oldest` when we reuse any old pages, and we're only
using the 'freed' pages from last (not the data pages), we know that
the data pages for the two most recent transactions are protected.

Is this right?

Yep.

Post by David Barbour
My earlier assumption (before reading mdb_page_alloc) was that LMDB
would be aggressive about grabbing pages freed by transactions that are
not actively being read. If we're relying on `last < oldest` to create a
two page discrepancy, this means that when we actually have readers on
older transactions, we're being a little more conservative than necessary.

More than necessary? I don't think so.

Post by David Barbour
But it does protect the last two snapshots.

Yes, always.

David Barbour

2015-01-31 07:01:44 UTC

Post by Howard Chu
Post by David Barbour
My earlier assumption (before reading mdb_page_alloc) was that LMDB
would be aggressive about grabbing pages freed by transactions that are
not actively being read. If we're relying on `last < oldest` to create a
two page discrepancy, this means that when we actually have readers on
older transactions, we're being a little more conservative than necessary.
More than necessary? I don't think so.

You'll conserve exactly one more transaction's free pages than necessary in
the case where a reader-lock exists on any transaction older than the most
recent snapshot.
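
The claim can be illustrated with a toy comparison of two reuse rules. This is a sketch of the argument as stated here, not LMDB code, and the reply quoted above disputes that the extra record is unnecessary. It assumes that the free list record keyed by transaction N lists pages that snapshot N no longer references, so a reader on snapshot N cannot see them. The function names and the sample free list are hypothetical.

```python
def lmdb_reusable(free_db, oldest_reader):
    """Records reusable under the rule described in the thread: last < oldest."""
    return sorted(last for last in free_db if last < oldest_reader)


def strictly_safe(free_db, oldest_reader):
    """Records whose pages the oldest reader cannot see: last <= oldest.

    Assumes txn N's record lists pages that snapshot N no longer references.
    """
    return sorted(last for last in free_db if last <= oldest_reader)


free_db = {3: [20], 4: [21, 22], 5: [23], 6: [24], 9: [25]}   # hypothetical
oldest_reader = 5        # a reader still holds snapshot 5; newest snapshot is 10

conserved_extra = set(strictly_safe(free_db, oldest_reader)) - set(
    lmdb_reusable(free_db, oldest_reader)
)
print(conserved_extra)
```

In this model the two rules differ by exactly the record keyed by the oldest reader's own transaction, which is the "one more transaction's free pages" referred to above.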