Discussion: Freeing unused pages between two revisions...
Emmanuel Lécharny
2014-12-10 08:45:42 UTC

Hi guys,

we have had long discussions with Howard on LMDB about the release of
pages that are not anymore in use. This is a critical part of a MVCC
based B-tree, as each update will copy many pages, and we want to re-use
the copied pages.

The problem is that those copied pages may be still in use by older
readers. Let's quickly explain how it all works.

At Ti, we are updating the B-tree, and that creates many pages by
copying pages from previous revisions. We will denote those new pages
by the level of the B-tree they are stored at: P[0, i-1] is the root
page, at level 0, whose previous version was i-1 (as it's the root
page, its previous version is obviously one revision earlier than the
current one). More generally, P[k, a] is a page at level k whose
previous revision is 'a' (where a <= i-1).

One more important point is that a copied page may be split into two
new pages (if the previous page was full), and we may copy the content
of two old pages into one single new page (when the two old pages
contain only N/2 elements and we removed one element from one of
them). Last but not least, we may have to remove one level, if the
root page is removed. All in all, what is important here is to *know*
which pages have been copied during the update operation, because
these are the pages that will be removed at some point.

Now, assuming we keep track of all the pages copied for a revision R,
we can also assume we can remove all those pages when this revision R
is no longer in use (i.e. no reader thread is using R *and* no other
reader thread is using a revision below R). This is what LMDB does,
AFAICT.
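
A minimal sketch of that oldest-reader rule, assuming a shared table
of per-reader txnid slots (names are illustrative, not LMDB's actual
code): pages freed at revision R become reusable once the scan
returns a value above R.

#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

typedef struct reader_slot {
    volatile txnid_t mr_txnid;      /* snapshot being read; 0 = free slot */
} reader_slot;

static txnid_t oldest_reader(const reader_slot *table, size_t nslots,
                             txnid_t current_txn)
{
    txnid_t oldest = current_txn;   /* no readers: everything older is dead */
    for (size_t i = 0; i < nslots; i++) {
        txnid_t t = table[i].mr_txnid;  /* lazy, unlocked read */
        if (t != 0 && t < oldest)
            oldest = t;
    }
    return oldest;
}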

Can we go a bit further, and get rid of copied pages when a reader is
using a revision < R? I believe so.

Let's say that for a revision R, we have copied a page P[k, c], and
let's say we have a reader using a revision P < c (obviously, if a
reader uses a revision > c, we can't free P[k, c], because it's fully
part of the B-tree used by reader P). If c > P, that means the page we
copied was itself created after P's snapshot, so no reader will ever
use this copy and we can now get rid of it.

case 1: P is newer than the copied page

                      P           R
                      |           |
                      v           v
o--------[c]---------[P]---------[n]------> time
          ^           |          /|
          |           |         / +--> we create a new revision, which
          |           |        /       copies page [c] and creates page [n]
          .-----------|-------.
                      |
                      +--------------> Revision P is using page [c]

Here, we can't get rid of page [c], as it's still part of revision P's
B-tree.

case 2: P is older than the copied page

          P           Q           R
          |           |           |
          v           v           v
o--------[P]---------[c]---------[n]------> time
          |           ^          /|
          |           |         / +--> we create a new revision, which
          |           |        /       copies page [c] and creates page [n]
          |           .-------.
          |
          +--------------> Revision P

Here, an update at revision Q has created a page [c] which has been
copied by the update at revision R. We can get rid of page [c] because
no other reader will ever use it.

That is all clear and easy. Now, the real problem is: how do we
determine the pages we can free, based on the pages copied at revision
R? At revision R, we know which pages we have copied, and each page
has a unique revision number. That allows us to know which pages to
free almost immediately: we just free all the pages whose revision is
newer than the most recent reader revision.
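
Concretely (types invented for illustration), that immediate rule is
just a filter over the list of pages copied at R, where
newest_older_reader is the most recent reader revision below R, or 0
when there is none:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

typedef struct copied_page {
    uint64_t pgno;      /* page number of the copied (old) version */
    txnid_t  rev;       /* revision that created that version */
} copied_page;

static size_t collect_free_now(const copied_page *copied, size_t n,
                               txnid_t newest_older_reader,
                               uint64_t *free_now /* out, capacity n */)
{
    size_t nfree = 0;
    for (size_t i = 0; i < n; i++) {
        if (copied[i].rev > newest_older_reader)
            free_now[nfree++] = copied[i].pgno;  /* no reader can see it */
    }
    return nfree;
}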

For instance, we can immediately get rid of all the root pages for
any revision not in use, because the root page is always copied when
we do an update.

That holds for any update being done when we don't have recent
readers (i.e., if we do many updates, this algorithm works very well).

But what if we have more than one pending reader? There is little we
can do here; we have to wait for one of the readers to be released,
because that is when we will be able to free many pages.

Ok, let's see what happens when we free a reader revision now.

We know which pages this reader is holding: it's the list of pages
copied by the update which occurred at the same revision. When we
release this reader, all those copied pages have to be checked to see
if they are still in use.


     M                   P       Q    R
     |                   |       |    |
     v                   v       v    v
o---[M]---[c]----[d]----[P]-----[Q]--[n]------> time
     |     ^      ^      |      /    /|
     |     |      |      |     /    / +--> we create a new revision, which
     |     |      |      |    /    /       copies page [c] and creates page [n]
     |     |      .------|---.    /
     |     |             |       /
     |     .-------------|------.
     |                   |
     |                   +--------------> Revision P is using page [c]
     |
     +---------------> Revision M

Here, we will release the reader P. We should expect that every page
copied between M and P can now be released (in the graphic, revision Q
has copied page [d], which is still held, because revision P is using
it). So when we release P, we know we should free pages [c] and [d].
The difficulty here is to list all the pages that are held by revision
P, and this is a costly process:
- check every revision > P which has copied pages whose revision is
above M and below P.

In order to do that, we must have a data structure that holds all the
readers in use, ordered from oldest to newest, and we must have a data
structure keeping track of all the pages copied by each revision. We
have both, but the second one is not in memory (well, we can't assume
it's in memory, as this data structure is stored in a B-tree itself).

So basically, it's a matter of browsing the copied-pages B-tree,
looking for all the revisions between [M] and [R]; for each of those
revisions, we can free all the copied pages whose revision is > M.
Not exactly inexpensive, but still acceptable, IMO.
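
A sketch of that walk, following the condition given above
(copied_pages_lookup() is a hypothetical accessor over the
copied-pages B-tree, not an existing API):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

typedef struct copied_page { uint64_t pgno; txnid_t rev; } copied_page;

extern int  copied_pages_lookup(txnid_t rev, copied_page **out, size_t *n);
extern void free_page(uint64_t pgno);

/* Release reader P. M = newest remaining reader older than P,
 * R = current revision. Frees the page versions only P could see. */
static void release_reader(txnid_t P, txnid_t M, txnid_t R)
{
    for (txnid_t V = P + 1; V <= R; V++) {
        copied_page *cp;
        size_t n;
        if (copied_pages_lookup(V, &cp, &n) != 0)
            continue;                        /* no update at revision V */
        for (size_t i = 0; i < n; i++) {
            if (cp[i].rev > M && cp[i].rev < P)
                free_page(cp[i].pgno);       /* was visible only to P */
        }
    }
}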

Wdyt?

Emmanuel Lécharny
Howard Chu
2014-12-10 17:07:52 UTC
Post by Emmanuel Lécharny
Hi guys,
we have had long discussions with Howard on LMDB about the release of
pages that are not anymore in use. This is a critical part of a MVCC
based B-tree, as each update will copy many pages, and we want to re-use
the copied pages.
The problem is that those copied pages may be still in use by older
readers. Let's quickly explain how it all works.
This is a good analysis, but I have to state that the investigation down
this path has stalled for a major reason:

The current reader/writer interaction depends on the fact that we only
need to know the oldest reader, and that we can be lazy about
determining what the oldest reader is. Once you start looking at
intervals between ordered ages of readers, that means you require
continuously up to date reader status, which means you require highly
consistent updating and reading of the readers table. That imposes
memory consistency guarantees that we currently don't require, and
adding those guarantees has a significant performance cost.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Emmanuel Lécharny
2014-12-10 18:11:06 UTC
Post by Emmanuel Lécharny
Hi guys,
we have had long discussions with Howard on LMDB about the release of
pages that are not anymore in use. This is a critical part of a MVCC
based B-tree, as each update will copy many pages, and we want to re-use
the copied pages.
The problem is that those copied pages may be still in use by older
readers. Let's quickly explain how it all works.
Post by Howard Chu
This is a good analysis, but I have to state that the investigation
down this path has stalled for a major reason:
The current reader/writer interaction depends on the fact that we only
need to know the oldest reader, and that we can be lazy about
determining what the oldest reader is.
You have to keep all the readers in memory, just to be able to find
the oldest when it leaves. Then, when the oldest leaves, you have to
find the new oldest one way or another, which should be easy. Or you
have to check that the leaving reader is the oldest by checking all
the other ones (and in this case, you have to iterate over the
existing LDAP sessions).

In any case, you should know about the readers, no matter what.
Once you start looking at intervals between ordered ages of readers,
that means you require continuously up to date reader status,
Yes. It's just a set associated with a list (the set lets us
immediately retrieve the leaving reader's position in the list).

How do you manage readers currently?
which means you require highly consistent updating and reading of the
readers table.
Yes and no. You can avoid dealing with consistency; it just depends
on how you manage the readers.
That imposes memory consistency guarantees that we currently don't
require, and adding those guarantees has a significant performance cost.
OK, now let's suppose each thread handling reads (and there is a
limited number of threads for that) has its own list of readers. This
list is clearly not protected against concurrent access, because it's
specific to each thread. Now, when a reader managed by the thread
leaves, you can tell the writer that some cleanup can occur. This
cleanup will be done regardless of what the reader thread does,
because it only involves an already dead reader.
From the writer thread's POV, it's a matter of checking every reader
thread to see if the released revision is still in use. If not, then
it's time to clean up the pages.

This is totally thread safe, with no contention at all.

Of course, I'm assuming that the reader threads each keep a set of
reader information...
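
A minimal sketch of such per-thread slots (illustrative; note that
without memory barriers the writer can observe stale slot values,
which is precisely the consistency cost Howard mentions):

#include <stdint.h>

#define MAX_READ_THREADS  16
#define SLOTS_PER_THREAD  64

typedef uint64_t txnid_t;

typedef struct thread_readers {
    /* written only by the owning reader thread; 0 = unused */
    volatile txnid_t slots[SLOTS_PER_THREAD];
} thread_readers;

static thread_readers g_readers[MAX_READ_THREADS];

/* Writer side: is revision rev still referenced by any reader? */
static int revision_in_use(txnid_t rev)
{
    for (int t = 0; t < MAX_READ_THREADS; t++)
        for (int s = 0; s < SLOTS_PER_THREAD; s++)
            if (g_readers[t].slots[s] == rev)
                return 1;
    return 0;
}
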
David Barbour
2014-12-11 21:39:56 UTC
I wonder if an intermediate variation on this idea might be easier to
implement:

We can describe a subset of MDB_RDONLY transactions as 'long running', e.g.
when copying the database. Normal read-only transactions might then be
called 'ephemeral'. We might compute long-running transactions
heuristically, e.g. ((currentTxnID - rdOnlyTxnId) > 100) and/or we might
allow developers to mark them.

When deciding which transactions to free, we compute three numbers with a
one-pass scan of the reader lock table:

txnO = the oldest active transaction
txnE = the oldest active ephemeral transaction
txnL = the newest active long-running transaction

Normally, LMDB just computes txnO, and conservatively assumes that readers
might be operating anywhere between txnO and the current transaction.
Emmanuel's proposal for precise analysis is somewhat daunting. But I think
we can use this simple three-point analysis to find some sweet spots
between precise and conservative.

We know by definition that txnO <= txnL, and txnO <= txnE.

Under normal conditions, i.e. assuming that long-running transactions are
relatively infrequent, we also know that txnL < txnE will be true more
frequently than not. Conservatively, we must protect the ranges from
[txnO,txnL] and [txnE,currentTxnId]. But in the common case, there will be
a non-empty range of transactions (txnL,txnE) that can be GC'd. And of
course anything below txnO can still be GC'd.
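
A sketch of that one-pass scan (field names invented; the
long-running test here combines an explicit mark with the age
heuristic above):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

typedef struct reader {
    txnid_t txnid;          /* 0 = slot free */
    int     long_running;   /* explicitly marked by the developer */
} reader;

typedef struct gc_bounds {
    txnid_t txnO;   /* oldest active transaction */
    txnid_t txnE;   /* oldest active ephemeral transaction */
    txnid_t txnL;   /* newest active long-running transaction (0 = none) */
} gc_bounds;

static gc_bounds scan_readers(const reader *tab, size_t n, txnid_t cur)
{
    gc_bounds b = { cur, cur, 0 };
    for (size_t i = 0; i < n; i++) {
        txnid_t t = tab[i].txnid;
        if (t == 0)
            continue;
        int lr = tab[i].long_running || (cur - t) > 100;  /* heuristic */
        if (t < b.txnO) b.txnO = t;
        if (lr) { if (t > b.txnL) b.txnL = t; }
        else if (t < b.txnE) b.txnE = t;
    }
    /* Protect [txnO,txnL] and [txnE,cur]; (txnL,txnE) is collectible. */
    return b;
}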

I'm not sure whether or not we can easily identify which pages are used
only within that gap, and may thus be collected. I haven't studied that
aspect of LMDB yet. But supposing that aspect is addressed, this scheme
should greatly mitigate the impact of infrequent long-running read-only
transactions. A lot more space from recent write transactions would be
reusable.
Howard Chu
2014-12-15 21:26:54 UTC
Post by David Barbour
I wonder if an intermediate variation on this idea might be easier to
implement:
We can describe a subset of MDB_RDONLY transactions as 'long running',
e.g. when copying the database. Normal read-only transactions might then
be called 'ephemeral'. We might compute long-running transactions
heuristically, e.g. ((currentTxnID - rdOnlyTxnId) > 100) and/or we might
allow developers to mark them.
Interesting. I think, rather than picking an arbitrary age, we could
simply look for the largest gap between two reader txnIDs and work
from there. As long as the gap is > 2, there should be some pages to
recover.
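
For illustration, finding that gap is just a sort plus a linear scan
over the active reader txnIDs (sketch, invented names):

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t txnid_t;

static int cmp_txnid(const void *a, const void *b)
{
    txnid_t x = *(const txnid_t *)a, y = *(const txnid_t *)b;
    return (x > y) - (x < y);
}

/* Returns the size of the largest gap between consecutive reader
 * txnids; *lo and *hi bracket it. A gap > 2 leaves at least one whole
 * revision that no reader can see. */
static txnid_t largest_gap(txnid_t *ids, size_t n, txnid_t *lo, txnid_t *hi)
{
    txnid_t best = 0;
    qsort(ids, n, sizeof *ids, cmp_txnid);
    for (size_t i = 1; i < n; i++) {
        if (ids[i] - ids[i - 1] > best) {
            best = ids[i] - ids[i - 1];
            *lo = ids[i - 1];
            *hi = ids[i];
        }
    }
    return best;
}
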
Post by David Barbour
When deciding which transactions to free, we compute three numbers
with a one-pass scan of the reader lock table:
txnO = the oldest active transaction
txnE = the oldest active ephemeral transaction
txnL = the newest active long-running transaction
Normally, LMDB just computes txnO, and conservatively assumes that
readers might be operating anywhere between txnO and the current
transaction. Emmanuel's proposal for precise analysis is somewhat
daunting. But I think we can use this simple three-point analysis to
find some sweet spots between precise and conservative.
We know by definition that txnO <= txnL, and txnO <= txnE.
Under normal conditions, i.e. assuming that long-running transactions
are relatively infrequent, we also know that txnL < txnE will be true
more frequently than not. Conservatively, we must protect the ranges
from [txnO,txnL] and [txnE,currentTxnId]. But in the common case, there
will be a non-empty range of transactions (txnL,txnE) that can be GC'd.
And of course anything below txnO can still be GC'd.
I'm not sure whether or not we can easily identify which pages are used
only within that gap, and may thus be collected. I haven't studied that
aspect of LMDB yet. But supposing that aspect is addressed, this scheme
should greatly mitigate the impact of infrequent long-running read-only
transactions. A lot more space from recent write transactions would be
reusable.
Yes, finding that gap is the tricky part now. I'm guessing we'll need a
freeDB format change to have enough info to do it, though maybe not.
Currently we record that in txn X, some list of pageIDs was freed. In
LMDB 1.0 we will also have per-page txnID stamps, to tell us when a page
was written.

Preserving the range from [txnO,txnL] pretty much means we cannot reuse
any pages up to txnL's free list. I.e., the page's stamp must be > txnL,
it must have been written at a point no txn in [txnO,txnL] could have
seen it. In [txnL+1,txnE] we can reuse all the free pages whose stamp
is > txnL. Perhaps we don't need a freeDB change.
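
As a sketch, that reuse test reduces to an interval check
(illustrative; txnO/txnL/txnE as in David's scan, stamp = the
per-page txnID stamp):

#include <stdint.h>

typedef uint64_t txnid_t;

/* A page written at `stamp` and listed in txn `freed_in`'s free list
 * was visible to snapshots [stamp, freed_in-1]. It is reusable iff
 * that range misses both protected ranges [txnO,txnL] and [txnE,cur]. */
static int page_reusable(txnid_t stamp, txnid_t freed_in,
                         txnid_t txnO, txnid_t txnL, txnid_t txnE)
{
    if (freed_in <= txnO)                     /* below every reader */
        return 1;
    return stamp > txnL && freed_in <= txnE;  /* lies inside the gap */
}
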
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Hallvard Breien Furuseth
2014-12-16 08:13:44 UTC
Post by Howard Chu
Yes, finding that gap is the tricky part now. I'm guessing we'll need a
freeDB format change to have enough info to do it, though maybe not.
Currently we record that in txn X, some list of pageIDs was freed. In
LMDB 1.0 we will also have per-page txnID stamps, to tell us when a page
was written.
Preserving the range from [txnO,txnL] pretty much means we cannot reuse
any pages up to txnL's free list. I.e., the page's stamp must be > txnL,
it must have been written at a point no txn in [txnO,txnL] could have
seen it. In [txnL+1,txnE] we can reuse all the free pages whose stamp
is > txnL. Perhaps we don't need a freeDB change.
Sure we do. The freeDB should have a more efficient format anyway,
for overflow pages. Peeking at a lot of old pages and rejecting
them can be a cache massacre. And a file page in the middle of
a multi-page overflow page has no header to peek at anyway.

Also: A transaction which must inspect a page to see if it can be
reused, must not overwrite that info in the page. The txn might
abort or get rolled back by a system crash. Instead, it can move
the page to a "reusable" list and commit. Then we wait one or two
transactions for sync to metapages, and then we can reuse the page.

This may not be needed when the newer txnid in the page header is
sufficient info, but I suggest we don't try to keep that straight
yet: Can't overwrite if the page ends up inside an overflow page.
Also the newer txnid will not always imply it can be reused.


Anyway, consider a page written by transaction ID (revision) W and
freed (no longer seen) by txnid F: Visible to snapshots [W..F-1].
Transaction F writes freeDB record {F: list of pages freed by F}.

To F, oldest snapshot "S" which can see the page = max(W, oldest
current snapshot). Store (F-S) along with the page number; it will
often fit in the unused bits of the page number. If it doesn't, we
can fall back to some expanded format, or peek at the page later, or
return to the old way: wait for all older readers.

Page numbers have 16 unused bits on 64-bit CPUs, log2(page size)
unused bits on 32-bit CPUs. Use at least one bit to indicate a
multi-page overflow page, or that the expanded format is used and
this includes both age and number of file pages. (Overflow pages
are used for nodes bigger than half a page.)

Once we have determined that a page can be reused, it can move to
a freeDB record format which doesn't care about old snapshot IDs.
Unless it does get reused at once, of course.
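
For illustration, a packed freeDB entry could look like this on
64-bit (bit positions invented: one flag bit for overflow/expanded
format plus a 15-bit age in the 16 unused high bits):

#include <stdint.h>

typedef uint64_t txnid_t;

#define AGE_BITS   15
#define AGE_MAX    ((UINT64_C(1) << AGE_BITS) - 1)
#define AGE_SHIFT  48
#define FLAG_SHIFT 63                       /* overflow / expanded format */
#define PGNO_MASK  ((UINT64_C(1) << AGE_SHIFT) - 1)

/* Age stored with the page: F - S, where S = max(W, oldest snapshot). */
static uint64_t entry_age_for(txnid_t F, txnid_t W, txnid_t oldest_snap)
{
    txnid_t S = (W > oldest_snap) ? W : oldest_snap;
    return F - S;
}

static uint64_t pack_entry(uint64_t pgno, uint64_t age, unsigned flag)
{
    if (age > AGE_MAX) {                    /* fall back to expanded form */
        flag = 1;
        age = 0;
    }
    return ((uint64_t)flag << FLAG_SHIFT) | (age << AGE_SHIFT)
         | (pgno & PGNO_MASK);
}

static uint64_t entry_pgno(uint64_t e) { return e & PGNO_MASK; }
static uint64_t entry_age(uint64_t e)  { return (e >> AGE_SHIFT) & AGE_MAX; }
static unsigned entry_flag(uint64_t e) { return (unsigned)(e >> FLAG_SHIFT); }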


About MDB_RDONLY transactions: They just store their snapshot's
transaction ID in the reader table and clear it on abort/commit.

Write transactions scan the reader table and do all the thinking.
When they scan the table to make a list of all readers, they can
save the youngest ones in a bit-vector indexed by transaction ID
and the ones which do not fit there in a list. If most readers
are ephemeral, that list will be short. If we sort the pages in
a freeDB record by (F-S) above, it's easy to check the youngest
freeDB records against the bitmap to look for pages we can use.
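
A sketch of that split summary (sizes and names invented):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

#define RECENT_SPAN 1024        /* how far back the bitmap reaches */
#define MAX_OLD     32          /* long-lived readers kept in a list */

typedef struct reader_summary {
    txnid_t  cur;                          /* current write txnid */
    uint64_t bitmap[RECENT_SPAN / 64];     /* bit i => reader at cur - i */
    txnid_t  old_readers[MAX_OLD];
    size_t   n_old;
} reader_summary;

static void add_reader(reader_summary *s, txnid_t t)
{
    uint64_t back = s->cur - t;
    if (back < RECENT_SPAN)
        s->bitmap[back / 64] |= UINT64_C(1) << (back % 64);
    else if (s->n_old < MAX_OLD)           /* a real version would grow this */
        s->old_readers[s->n_old++] = t;
}

static int has_reader(const reader_summary *s, txnid_t t)
{
    uint64_t back = s->cur - t;
    if (back < RECENT_SPAN)
        return (int)((s->bitmap[back / 64] >> (back % 64)) & 1);
    for (size_t i = 0; i < s->n_old; i++)
        if (s->old_readers[i] == t)
            return 1;
    return 0;
}
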
--
Hallvard
Hallvard Breien Furuseth
2014-12-16 08:50:30 UTC
Post by Hallvard Breien Furuseth
To F, oldest snapshot "S" which can see the page = max(W, oldest
current snapshot).
Sorry, that's the oldest easily computable age. There may
not be a current snapshot with that age, which is OK.
