List-Id: OpenLDAP development discussion list <openldap-devel.openldap.org>
On 11/05/15 22:17, Howard Chu wrote:
> Emmanuel Lécharny wrote:
>> Restarting this thread...
>>
>> We had an interesting discussion today that I wanted to share.
>>
>> Hypothesis: one server has been down for a long time, and its contextCSN
>> is older than that of the other servers, forcing a refresh mode that
>> covers more than the content of the AccessLog.
>>
>> Quanah said that on some heavily loaded servers, the only way for the
>> consumer to catch up is to slapcat/slapadd/restart the consumer. I wonder
>> if this could be a way to deal with servers that are too far behind the
>> running server, but as a mechanism included in the refresh phase
>> (i.e., the restarted server would detect that it has to grab the set of
>> entries and load them, as if a human being were doing a
>> slapcat/slapadd/restart).
>>
>> More specifically, is there a way to know how many entries we will have
>> to update, and is there a way to know when it will be faster to be
>> brutal (the Quanah way) than to let the refresh mechanism do its
>> job?
>
> Not a worthwhile direction to pursue. Doing the equivalent of a full
> slapcat/slapadd across the network will use even more bandwidth than
> the current syncrepl. None of this addresses the underlying causes of
> why the consumer is slow, so the original problem will remain.

IMHO, network congestion is not a real problem. Assuming you are running a
1Gb Ethernet network, the time it takes to transmit 1 million 1KB entries
is only about 10 seconds. That will be barely noticeable compared to the
time it will take to load those 1M entries into your consumer. Even with a
100Mb Ethernet network, this is not a big part of the problem.
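
As a rough check of the transfer-time figure above, here is a back-of-the-envelope calculation. It is only a sketch: it takes "1KB" as 1000 bytes, ignores LDAP/TCP overhead entirely, and the function name is made up for illustration. The second link speed is just a ten-times slower link added for comparison.

```python
def transfer_seconds(entries: int, entry_bytes: int, link_bits_per_s: float) -> float:
    """Raw wire time for `entries` entries of `entry_bytes` each (no protocol overhead)."""
    return entries * entry_bytes * 8 / link_bits_per_s


ENTRIES = 1_000_000
ENTRY_BYTES = 1_000  # assumption: "1KB" means 1000 bytes

gigabit = transfer_seconds(ENTRIES, ENTRY_BYTES, 1e9)    # 1 Gbit/s link
slower = transfer_seconds(ENTRIES, ENTRY_BYTES, 100e6)   # 100 Mbit/s link

print(f"1 Gbit/s:   {gigabit:.0f} s")
print(f"100 Mbit/s: {slower:.0f} s")
```

The raw figure on the gigabit link is 8 seconds, so "about 10 seconds" is a fair estimate once some overhead is added.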

> 1) The AVL tree used for presentlist is still extremely inefficient
> in both CPU and memory use.
>
> 2) The consumer does twice as much work for a single modification as
> the provider. I.e., the consumer does a write op to the backend for
> the modification, and then a second write op to update its contextCSN.

Updating the contextCSN is an extra operation on the consumer, but since
you potentially have to update tens of indexes when updating an entry
(on both the consumer and the producer), it's not really twice as much
work. It's an additional operation, but it would not double the time
the modification costs on the producer.
The question would be: how do we update the contextCSN only
periodically, to mitigate this extra cost? It seems you proposed
batching the updates for this reason. By using batches of 500 updates, this
extra cost would be almost unnoticeable, and one would expect the work on
the consumer to be the same as on the producer side, right?
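
To make the batching argument concrete, here is a toy model that only counts backend writes. It is a sketch and not slapd code: the class, its fields, and the CSN strings are all invented for illustration, and the only number taken from the discussion is the batch size of 500.

```python
class BatchedCsnConsumer:
    """Toy consumer that persists its contextCSN once per batch of modifications."""

    def __init__(self, batch_size: int = 500):
        self.batch_size = batch_size
        self.entry_writes = 0    # backend writes for the modifications themselves
        self.csn_writes = 0      # backend writes that persist the contextCSN
        self.pending = 0         # modifications applied since the last CSN write
        self.latest_csn = None   # newest CSN seen, held in memory only
        self.stored_csn = None   # CSN as last persisted

    def apply(self, csn: str) -> None:
        """Apply one replicated modification and remember its CSN in memory."""
        self.entry_writes += 1
        self.latest_csn = csn
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Persist the newest CSN if anything was applied since the last flush."""
        if self.pending:
            self.stored_csn = self.latest_csn
            self.csn_writes += 1
            self.pending = 0


consumer = BatchedCsnConsumer(batch_size=500)
for i in range(10_000):
    consumer.apply(f"csn-{i:06d}")
consumer.flush()
print(consumer.entry_writes, consumer.csn_writes)
```

In this model 10,000 modifications cost 20 extra writes instead of 10,000. The price is that a crash between flushes leaves the stored CSN up to 499 modifications behind, so those would have to be replayed after a restart.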

> The provider only does the original modification, and caches the
> contextCSN update.
>
> If we fix both of these issues, consumer speed should be much faster.
> Nothing else is worth investigating until these two areas are reworked.

Agreed in most cases. However, for use cases where a large
number of updates have occurred while a consumer was offline, another
strategy might work. Today that other strategy is to stop the consumer,
slapcat the producer, slapadd the result and restart the server, all
from the command line. Having it implemented in the server
code instead was what I was suggesting, but this is another story for a
corner case that is not frequent. Plus we don't know at which point this
would be the correct strategy (i.e., for how many updates should we
consider it a better strategy than the current implementation?).

> For (1) I've been considering a stripped-down memory-only version of
> LMDB. There are plenty of existing memory-only Btree implementations
> out there already though; if anyone has a favorite it would probably
> save us some time to use an existing library. The Linux kernel has one
> (lib/btree.c) but it's under GPL so we can't use it directly.

Q: do you need to keep the presentlist in a BTree at all?
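
One way to read this question: if the presentlist is only ever probed for membership ("did the provider report this entryUUID as present?"), ordering is not needed and a plain hash set would do. The sketch below assumes exactly that. It is not how slapd implements the presentlist, the function name is hypothetical, and the UUIDs are synthetic 16-byte values.

```python
import uuid


def stale_entries(local_uuids, present_uuids):
    """Return the local entryUUIDs that the provider did not report as present."""
    present = set(present_uuids)  # unordered; membership tests only
    return [u for u in local_uuids if u not in present]


# Synthetic 16-byte UUIDs: the consumer holds 1..5, the provider reports 1, 3, 5.
local = [uuid.UUID(int=i).bytes for i in range(1, 6)]
present = [uuid.UUID(int=i).bytes for i in (1, 3, 5)]

to_delete = stale_entries(local, present)
print([uuid.UUID(bytes=u).int for u in to_delete])
```

If some other part of the refresh needs the UUIDs in sorted order, this shortcut no longer applies and an ordered structure is still required.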

>> Another point: as soon as the server is restarted, it can receive
>> incoming requests, to which it will send back outdated responses until
>> the refresh is completed (and I'm not talking about updates that could
>> also be applied to an outdated base, with the consequences if some
>> parents are missing). In many cases, that would be a real problem,
>> typically if the LDAP servers are considered part of a shared pool of
>> servers, with a load-balancing mechanism to spread the load. Wouldn't it
>> be more realistic to simply consider the server as not available until
>> the refresh phase is completed?
>
> This was ITS#7616. We tried it and it caused a lot of problems. It has
> been reverted.

The two options were to either send a referral (not ideal, as we have no
control whatsoever over the client API) or return LDAP_BUSY. A third option
would be possible: chaining the request to the server from which the
replication updates are coming. Doing so would guarantee that the
client gets an up-to-date version of the data, as the producer is up to
date. There is still an issue, though, if both servers are replicating
from each other (pretty much the same problem as with referrals). OTOH,
if the other server is also in refresh mode, it should be possible for it
to return LDAP_BUSY if it is capable of detecting that the request comes
from another server, not from a client. Maybe it's far-fetched...
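
The chaining idea, including the mutual-replication case, can be sketched as a small decision function. This is only an illustration of the proposal and not existing slapd behaviour. How a server would recognise that a request comes from another server is left open above, so here it is simply a boolean input, and all names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    SERVE_LOCALLY = "serve locally"
    CHAIN_TO_PROVIDER = "chain to provider"
    RETURN_BUSY = "return LDAP_BUSY"


def handle_request(refreshing: bool, from_server: bool) -> Action:
    """Decide what a server does with an incoming request."""
    if not refreshing:
        return Action.SERVE_LOCALLY
    if from_server:
        # A chained request reaching a refreshing server is refused,
        # which prevents two refreshing servers chaining to each other forever.
        return Action.RETURN_BUSY
    return Action.CHAIN_TO_PROVIDER


def resolve(target: str, refreshing: dict, provider_of: dict) -> str:
    """Follow a client request sent to `target` until some server answers it."""
    server, from_server = target, False
    while True:
        action = handle_request(refreshing[server], from_server)
        if action is not Action.CHAIN_TO_PROVIDER:
            return f"{server}: {action.value}"
        server, from_server = provider_of[server], True


peers = {"A": "B", "B": "A"}  # A and B replicate from each other
print(resolve("A", {"A": True, "B": False}, peers))
print(resolve("A", {"A": True, "B": True}, peers))
```

A request is chained at most once: after the first hop it is marked as coming from a server, so the second server either answers it or returns LDAP_BUSY.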