Relaxed mode for delta-sync MMR

Quanah Gibson-Mount

2015-06-10 20:04:12 UTC

After deploying delta-sync MMR at several customer sites, I have found that
the general handling of conflict resolution in MMR mode is significantly
suboptimal. It routinely causes the MMR nodes to get further out of sync,
making things considerably worse (mainly due to ITS#8125).

The main issues I see are the following:

a) Two masters get different change requests at approximately the same time
to add a value X to an attribute.

b) Two masters get different change requests at approximately the same time
to delete a value X from an attribute.

In these two specific cases, in relaxed mode, rather than falling back and
re-syncing the entire database, I think the conflict should be discarded
(skipped), and logged as such. I.e., there is no actual discrepancy in the
object. It still has X present in the add case, and X gone in the delete
case.
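
As a rough illustration of the skip behavior proposed above, here is a
minimal sketch in Python. It is not OpenLDAP code: the entry is modeled as
a plain dict of attribute name to set of values, and the names replay_mod,
ReplayConflict, and the relaxed flag are all hypothetical. The point is
only that an add of an already-present value, or a delete of an
already-absent one, leaves the entry in the requested state and can be
logged and skipped.

```python
import logging

log = logging.getLogger("relaxed_mmr_sketch")


class ReplayConflict(Exception):
    """Hypothetical signal that a replicated change could not be applied as-is."""


def replay_mod(entry, op, attr, value, relaxed=True):
    """Replay one replicated add/delete of a single value onto a local entry.

    entry is a dict mapping attribute names to sets of values (an assumed
    stand-in for a real directory entry). Returns "applied" or "skipped".
    """
    values = entry.setdefault(attr, set())

    if op == "add":
        # The value is already there: the entry already reflects the change.
        already_done = value in values
    elif op == "delete":
        # The value is already gone: the entry already reflects the change.
        already_done = value not in values
    else:
        raise ValueError("unsupported op: %r" % (op,))

    if already_done:
        if relaxed:
            # Relaxed mode: no real discrepancy, so log it and move on.
            log.info("skipping redundant %s of %s=%s", op, attr, value)
            return "skipped"
        # Strict mode: report a conflict so the caller can fall back.
        raise ReplayConflict("%s of %s=%s conflicts" % (op, attr, value))

    if op == "add":
        values.add(value)
    else:
        values.discard(value)
    return "applied"
```

In this sketch, replaying the same add twice in relaxed mode applies it
once and skips it once, and the entry ends up with X present either way.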

If we are going to fall back at all, then at best we should only resync
the specific entry. The overall behavior I'm seeing from OpenLDAP is that
the masters get into an endless cycle of resyncs, and the more they do so,
the more out of sync they become. Eventually you have to stop all masters,
export all their DBs, sort them, find the missing entries across all sets
of masters, and build a brand new DB with which to reload them, and that
only holds until they get massively out of sync again. I.e., the current
resync strategy is doing no favors to anyone. It may work OK on very small
DBs, where a resync only takes seconds, but on larger DBs, where such
syncs take anywhere from 30+ minutes to hours, it is not a useful
methodology.
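
To make the per-entry fallback idea concrete, here is a small sketch in
Python. It is purely illustrative and does not reflect how syncrepl is
implemented: both databases are modeled as dicts keyed by DN, and the
function name resync_entry is hypothetical. It shows the cost difference
in principle: only the one conflicting entry is touched, not the whole
database.

```python
def resync_entry(local_db, provider_db, dn):
    """Make one local entry match the provider's copy (hypothetical helper).

    Both databases are assumed to be dicts mapping a DN to a dict of
    attribute name -> set of values. Returns "refreshed" or "removed".
    """
    if dn in provider_db:
        # Copy the sets so later local changes do not alias the provider's data.
        local_db[dn] = {attr: set(vals) for attr, vals in provider_db[dn].items()}
        return "refreshed"
    # The provider no longer has the entry, so drop the local copy if present.
    local_db.pop(dn, None)
    return "removed"
```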

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration