Discussion:
BadRSlotError: mdb_txn_begin: MDB_BAD_RSLOT: Invalid reuse of reader locktable slot
Howard Chu
2014-09-11 09:37:42 UTC
hi all,
the infamous obscure error which people are seeing only very
infrequently is rearing its head at least 2 to 3 times per day in a
test lab where i work. this is however a secure environment so i
cannot post core-dumps or any details of the application.
given the restrictions, what information is needed and what approach
is needed to debug and fix this? luckily it's happening a lot so
there's the possibility of a regular iterative approach.
the operating system(s) have been ubuntu 12.04 and also 14.04, both
have resulted in this obscure bug. bizarrely, this bug occurs in a
*single process*. it's not even multi-processing. however
metasync=False, sync=False, map_async=True, readahead=False and
writemap=True.
Use the Source, Luke.

MDB_BAD_RSLOT is returned in only one place in mdb.c and the situation is very
specific. It means you've tried to begin a new read txn on a thread that
already has a read txn outstanding. The API docs are pretty clear that a
thread may only have one txn at a time.
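
To make the rule above concrete, here is a toy model in Python of "one read txn per thread". It is only a sketch: ToyEnv, ToyReadTxn and this BadRSlotError class are hypothetical stand-ins written for illustration, not LMDB or py-lmdb code. It assumes the reader slot is kept per thread, and it shows that a single-threaded process can still trip the check if a read txn is begun while an earlier one has not been ended.

```python
import threading


class BadRSlotError(Exception):
    """Stand-in for the error raised on MDB_BAD_RSLOT (hypothetical class)."""


class ToyEnv:
    """Toy model of the one-read-txn-per-thread rule; not LMDB's actual code."""

    def __init__(self):
        # Assumption: each thread owns exactly one reader slot.
        self._tls = threading.local()

    def begin_read(self):
        if getattr(self._tls, "active", False):
            # The thread's slot still belongs to an unfinished read txn.
            raise BadRSlotError("Invalid reuse of reader locktable slot")
        self._tls.active = True
        return ToyReadTxn(self)


class ToyReadTxn:
    def __init__(self, env):
        self._env = env

    def abort(self):
        # Ending the txn releases the thread's slot for reuse.
        self._env._tls.active = False


env = ToyEnv()
first = env.begin_read()
try:
    env.begin_read()           # second read txn on the same thread
    outcome = "no error"
except BadRSlotError:
    outcome = "BadRSlotError"

first.abort()
second = env.begin_read()      # fine once the first txn has ended
second.abort()
print(outcome)                 # prints "BadRSlotError"
```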

You need to track down whatever is creating read txns in your code and make
sure they're being properly committed or aborted.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Howard Chu
2014-09-11 10:31:51 UTC
Post by Howard Chu
Use the Source, Luke.
:)
Post by Howard Chu
MDB_BAD_RSLOT is returned in only one place in mdb.c and the situation is very
specific. It means you've tried to begin a new read txn on a thread that
already has a read txn outstanding.
... but there aren't any threads... this is literally only one
process. there are no threads involved at all. the single process is
doing writes in a txn followed by reads in a separate txn.
Technically, a single process is also a single thread.
Post by Howard Chu
The API docs are pretty clear that a
thread may only have one txn at a time.
You need to track down whatever is creating read txns in your code and make
sure they're being properly committed or aborted.
this is from python, and all code is done using "with env.begin .... as txn:"
there are no exceptions occurring within any blocks, and even if they
were the "with" statement calls the __exit__ function which closes the
transaction.
I can't comment on anything python is doing, but it sounds like it's missing a
step...
so, all code is as expected, hence the reason for raising it here
because this is definitely not something that should be happening.
*thinks*... there is only one possible thing that i can think of, and
it's related to using cursors. i am not calling close or del on the
txn.cursor objects within the "with" block. could it be that python's
garbage collection is somehow collecting those txn.cursor objects at
random points, interacting in some way with the current read txn?
No idea. If you're using py-lmdb it sounds like we need David Wilson to chime
in here. In the C API there's no way a cursor could interfere with a txn; I
can't guess what the python code is doing.
Howard Chu
2014-09-11 11:01:55 UTC
garbage collection occurs periodically, at any time the interpreter feels like it
(afaik). so, the scenario under consideration is therefore that a cursor gets
closed at random points, well after its txn has been closed.
however if the c api is reflected in the python layer then correspondingly the
cursor could not interfere with a txn.
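
A small sketch of the timing question, under assumptions: FakeCursor is a hypothetical stand-in, not py-lmdb's cursor, and the behaviour shown is CPython's. An object caught in a reference cycle is finalized only when the cycle collector runs, whereas an explicit close() happens at a known point. Whether py-lmdb cursors ever end up in such a cycle is not established in this thread.

```python
import gc


class FakeCursor:
    """Stand-in for a cursor-like object; not py-lmdb's Cursor."""

    closed_log = []

    def __init__(self, name):
        self.name = name
        self.closed = False
        self.me = self   # reference cycle: refcounting alone cannot free it

    def close(self):
        if not self.closed:
            self.closed = True
            FakeCursor.closed_log.append(self.name)

    def __del__(self):
        self.close()


gc.disable()                      # keep the demo deterministic
leaked = FakeCursor("leaked")
del leaked                        # still alive: the cycle keeps it referenced
after_del = list(FakeCursor.closed_log)

explicit = FakeCursor("explicit")
explicit.close()                  # cleanup at a known point
after_close = list(FakeCursor.closed_log)

gc.collect()                      # only now is the leaked object finalized
after_gc = list(FakeCursor.closed_log)
gc.enable()

print(after_del, after_close, after_gc)
```

Closing cursors explicitly inside the "with" block would be one way to rule this hypothesis in or out.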
the only other possibility under consideration is that the python process has
been killed unexpectedly (kill -9) and has left the database in an abnormal state.
Not possible; the database structure is inherently incorruptible. kill -9
won't do anything to it.
looking at the application logs that i have available (qty one only at the
moment) i do note that the application appears to have just undergone a restart...
You're talking about a database being accessed by only a single process; the
lock table is reinitialized whenever a process opens it and no other processes
already have it open. As such, no locktable elements from a previous crash
will ever affect the running process.
i'll have to have a closer look, but would that be causing any problems (esp
with the arguments to speed up access such as sync=False etc.)?
Nope. sync is irrelevant for application-level crashes, all the data goes into
the OS buffer cache regardless. sync=False will likely leave you a corrupted
database if the OS or hardware crashes, but not if the application process
crashes.