Discussion:
multiple sequential lmdb readers + spinning media = slow / thrashes?
Matthew Moskewicz
2015-02-26 22:50:30 UTC
warnings: new to list, first post, lmdb noob.

i'm a caffe user:
https://github.com/BVLC/caffe

in one use case, caffe sequentially streams through >100GB lmdbs at a rate
of ~30MB/s in blocks of about 40MB. however, if multiple caffe processes
are reading the same lmdb (opened with MDB_RDONLY), read performance
becomes limiting (i.e. the processes become IO bound), even though the disk
has sufficient read bandwidth (say ~180MB/s). some of the relevant caffe
lmdb code is here:

https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp

however, if i *both*
1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize,
POSIX_MADV_SEQUENTIAL);

then i can get >1 reader to run without being IO limited.

for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
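
to be concrete, the change in (2) boils down to something like the
following (a sketch of the idea only, not the exact patch in that branch;
it has to live inside lmdb itself because env->me_map is an lmdb-internal
field):

/* sketch only (not the exact patch in the branch above): advise the kernel
 * that the lmdb data mapping will be read sequentially. in the patch this
 * is done from inside mdb.c, right after the data file is mmap()'d, since
 * the mapping address (env->me_map) is not exposed through the public API. */
#include <sys/mman.h>
#include <stddef.h>

static void advise_sequential(void *map, size_t mapsize)
{
    /* purely advisory; a nonzero return just means the hint was ignored */
    (void)posix_madvise(map, mapsize, POSIX_MADV_SEQUENTIAL);
}

/* called from lmdb's mapping code roughly as:
 *     advise_sequential(env->me_map, env->me_mapsize);
 */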

similarly, using a sequential read microbenchmark designed to model the
caffe reads from here:
https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc

if i run one reader, i get 180MB/s bandwidth.
with two readers, but neither (1) nor (2) above, each gets ~30MB/s
bandwidth.
with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.
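
for reference, the core of that microbenchmark is just a cursor walk over
the db in key order. a minimal sketch (not the linked lmdbif.cc, just the
idea, with most error handling omitted):

/* minimal sketch of a sequential lmdb read microbenchmark: walk every
 * record in key order with a read-only cursor and touch the value bytes
 * so the data pages actually get faulted in. */
#include <lmdb.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MDB_env *env; MDB_txn *txn; MDB_dbi dbi; MDB_cursor *cur;
    MDB_val key, val;
    unsigned long long bytes = 0, sum = 0;
    int rc;

    if (argc < 2) { fprintf(stderr, "usage: %s <lmdb-dir>\n", argv[0]); return 1; }
    mdb_env_create(&env);
    rc = mdb_env_open(env, argv[1], MDB_RDONLY, 0664);
    if (rc) { fprintf(stderr, "mdb_env_open: %s\n", mdb_strerror(rc)); return 1; }
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);            /* default unnamed db */
    mdb_cursor_open(txn, dbi, &cur);
    while ((rc = mdb_cursor_get(cur, &key, &val, MDB_NEXT)) == 0) {
        const unsigned char *p = val.mv_data;
        for (size_t i = 0; i < val.mv_size; i++)  /* force reads of the value pages */
            sum += p[i];
        bytes += val.mv_size;
    }
    mdb_cursor_close(cur);
    mdb_txn_abort(txn);
    mdb_env_close(env);
    printf("read %llu value bytes (checksum %llu)\n", bytes, sum);
    return 0;
}

(build with something like cc -O2 seq_read.c -llmdb, then run one or two
copies against the same lmdb directory.)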

any advice?

mwm

PS: backstory (skippable):
caffe originally used LevelDB to get better read performance for
sequentially loading sets of ~1M 227x227x3 raw images (~200GB data).
typically processing time is ~2 hours for this data set size, yielding a
read BW need of 30MB/s or so. it's not really clear if/why LevelDB was used
aside from the fact that the caffe author was a google intern at the time
he wrote it, but anecdotally i think the claim is that reading the raw
.jpgs had perf. issues, although it's unclear exactly what or why. i guess
it was the usual story about not getting sequential reads without using
LevelDB. they switched to lmdb a while back.

Howard Chu
2015-02-26 23:46:43 UTC
Post by Matthew Moskewicz
warnings: new to list, first post, lmdb noob.
https://github.com/BVLC/caffe
in one use case, caffe sequentially streams through >100GB lmdbs at a
rate of ~30MB/s in blocks of about 40MB. however, if multiple caffe
processes are reading the same lmdb (opened with MDB_RDONLY), read
performance becomes limiting (i.e. the processes become IO bound), even
though the disk has sufficient read bandwidth (say ~180MB/s). some of
https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp
however, if i *both*
1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize,
POSIX_MADV_SEQUENTIAL);
then i can get >1 reader to run without being IO limited.
This is quite timing-dependent - if you start your multiple readers at
exactly the same time and they run at exactly the same speed, then they
will all be using the same cached pages and all of the readers can run
at the full bandwidth of the disk. If they're staggered or not running
in lockstep, then you'll only get partial performance.
for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
similarly, using a sequential read microbenchmark designed to model the
https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc
if i run one reader, i get 180MB/s bandwidth.
with two readers, but neither (1) nor (2) above, each gets ~30MB/s
bandwidth.
with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.
The other point to note is that sequential reads in LMDB won't remain
truly sequential (as seen by the storage device) after a few rounds of
inserts/deletes/updates. Once you get any element of seek/random I/O in
here your madvise will be useless.
any advice?
mwm
caffe originally used LevelDB to get better read performance for
sequentially loading sets of ~1M 227x227x3 raw images (~200GB data).
typically processing time is ~2 hours for this data set size, yielding a
read BW need of 30MB/s or so. it's not really clear if/why LevelDB was
used aside from the fact that the caffe author was a google intern at
the time he wrote it, but anecdotally i think the claim is that reading
the raw .jpgs had perf. issues, although it's unclear exactly what or
why. i guess it was the usual story about not getting sequential reads
without using LevelDB. they switched to lmdb a while back.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Matthew Moskewicz
2015-02-27 02:36:46 UTC
Post by Howard Chu
warnings: new to list, first post, lmdb noob.
[snip]
Post by Howard Chu
https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp
however, if i *both*
1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize,
POSIX_MADV_SEQUENTIAL);
then i can get >1 reader to run without being IO limited.
This is quite timing-dependent - if you start your multiple readers at
exactly the same time and they run at exactly the same speed, then they will
all be using the same cached pages and all of the readers can run at the
full bandwidth of the disk. If they're staggered or not running in lockstep,
then you'll only get partial performance.
thanks for the quick reply. to clarify: yes, this is indeed the case.
when/if the readers are reading 'near' each other (within cache size)
there is no issue, but over time they drift out of sync, and this is
the case i'm considering / when i'm having an issue. these are
long-running processes that loop over the entire 200GB lmdb many
times over days, at around 2 hours per epoch (iteration over all
data).

when i say i can get >1 reader to be not IO limited with my changes, i
mean that things continue to work (not be IO limited) even as the
readers go out of sync. the processes happen to output enough
information to deduce when the lmdb offsets they are reading have
drifted apart by more than the amount of system memory. empirically:
without my changes, for a particular 2-reader case, the readers would
reliably drop out of sync within a few hours and slow down by at least
~2X (getting perhaps ~20MB/s bandwidth); with the changes i've had 2
runs going for multiple days without issue.

for my microbenchmarking i simulate the out-of-sync-ness and take care
to ensure i'm not reading cached areas, either by flushing the caches
or by just carefully choosing offsets into a 200GB lmdb on a machine
with only 32GB ram. i'd prefer to 'clear the cache' for all tests, but
that doesn't actually seem possible when there is a running process
that has the entire lmdb mmap()'d. that is, i don't know of any method
to make the kernel drop the clean cached mmap()'d pages out of memory.
but, caveats aside, i'm claiming that:

a) with the patch+readahead i get full read perf, even when the
readers are out of sync / streaming though well-separated (i.e. by
more than the size of system memory) parts of the lmdb.
b) without them i see much reduced read performance (presumably due to
seek thrashing), sufficient to cause the caffe processes in question to
slow down by > 2X.
Post by Howard Chu
for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
similarly, using a sequential read microbenchmark designed to model the
https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc
if i run one reader, i get 180MB/s bandwidth.
with two readers, but neither (1) nor (2) above, each gets ~30MB/s
bandwidth.
with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.
The other point to note is that sequential reads in LMDB won't remain truly
sequential (as seen by the storage device) after a few rounds of
inserts/deletes/updates. Once you get any element of seek/random I/O in here
your madvise will be useless.
yes, makes sense. i should have noted that, in this use model, the
lmdbs are in-order-write-once and then read-only thereafter -- they
are created and used in this manner specifically to allow for
sequential reads. i'd assume this is not actually reliable in general
due to the potential for filesystem-level fragmentation, but i guess
in practice it's okay. often, these lmdbs are being written to
spinners that are 'fresh' and don't have much filesystem level churn.

mwm

Milosz Tanski
2015-02-26 23:50:57 UTC

Matthew,

If you are talking about rotational media, the more readers you add the
worse your aggregate bandwidth is going to be... Since LMDB stores the
data as a btree, the readers end up doing random accesses, which turns
into a lot of seeking. Seek time ends up being amortized as a higher
average time to read a block / page, and your aggregate bandwidth disappears.

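As a rough back-of-the-envelope model of that amortization (all numbers
below are assumed round figures for illustration, not measurements from
this thread: ~10 ms per seek, ~180 MB/s streaming rate, and the disk
alternating between two well-separated readers in chunks of the
readahead size):

/* Toy model of seek amortization for two interleaved sequential readers on
 * one spindle.  Seek time and streaming rate are assumed round numbers, not
 * measurements; the chunk sizes correspond to a 128KB default readahead,
 * ~1MB, and the 32MB readahead set by blockdev --setra 65536. */
#include <stdio.h>

int main(void)
{
    const double seek_s = 0.010;        /* assumed seek + rotational latency, seconds */
    const double stream_mb_s = 180.0;   /* assumed sequential streaming rate, MB/s */
    const double chunk_mb[] = { 0.125, 1.0, 32.0 };

    for (int i = 0; i < 3; i++) {
        /* each switch between readers costs one seek plus one chunk transfer */
        double t = seek_s + chunk_mb[i] / stream_mb_s;
        printf("chunk %7.3f MB -> aggregate %6.1f MB/s (~%5.1f MB/s per reader)\n",
               chunk_mb[i], chunk_mb[i] / t, chunk_mb[i] / t / 2.0);
    }
    return 0;
}

With small chunks the seek time dominates and aggregate bandwidth
collapses; with 32 MB chunks the transfer time dominates and the model
approaches the full streaming rate, which is consistent with the large
readahead helping in the numbers reported above.
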
If you have enough memory to store most of the data, or your working
set is only a small subset of that data, this won't be as visible.

Best,
- Milosz

Post by Matthew Moskewicz
warnings: new to list, first post, lmdb noob.
https://github.com/BVLC/caffe
in one use case, caffe sequentially streams through >100GB lmdbs at a rate of
~30MB/s in blocks of about 40MB. however, if multiple caffe processes are
reading the same lmdb (opened with MDB_RDONLY), read performance becomes
limiting (i.e. the processes become IO bound), even though the disk has
sufficient read bandwidth (say ~180MB/s). some of the relevant caffe lmdb
https://github.com/BVLC/caffe/blob/master/src/caffe/util/db.cpp
however, if i *both*
1) run blockdev --setra 65536 --setfra 65536 /dev/sdwhatever
2) modify lmdb to call posix_madvise(env->me_map, env->me_mapsize,
POSIX_MADV_SEQUENTIAL);
then i can get >1 reader to run without being IO limited.
for (2), see https://github.com/moskewcz/scratch/tree/lmdb_seq_read_opt
similarly, using a sequential read microbenchmark designed to model the
https://github.com/moskewcz/boda/blob/master/src/lmdbif.cc
if i run one reader, i get 180MB/s bandwidth.
with two readers, but neither (1) nor (2) above, each gets ~30MB/s
bandwidth.
with (1) and (2) enabled, and two readers, each gets ~90MB/s bandwidth.
any advice?
mwm
caffe originally used LevelDB to get better read performance for
sequentially loading sets of ~1M 227x227x3 raw images (~200GB data).
typically processing time is ~2 hours for this data set size, yielding a
read BW need of 30MB/s or so. it's not really clear if/why LevelDB was used
aside from the fact that the caffe author was a google intern at the time he
wrote it, but anecdotally i think the claim is that reading the raw .jpgs
had perf. issues, although it's unclear exactly what or why. i guess it was
the usual story about not getting sequential reads without using LevelDB.
they switched to lmdb a while back.
--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: ***@adfin.com