Discussion:
LMDB and text encoding
Timur Kristóf
2015-01-27 21:39:33 UTC
Permalink
Hi Everyone,

I've been talking to Howard about this and he suggested posting it to
this mailing list. There are two things that I recently noticed about
how LMDB works with various encodings, and I think they are worth
discussing.

1. Database names

mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on
unixes and ANSI on Windows, which is problematic for cross-platform
applications.

My suggestion is to create a variant of this function that also
accepts a length parameter (or just use MDB_val) so that instead of
treating it as a C string, it would treat it like a series of bytes,
allowing the user to use the encoding of their choice.

2. Path names

Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the
like accept a char* for path names. This is fine on most unixes where
char* is a UTF-8 string, but unfortunately, these functions call the
ANSI variants of the Windows API functions, making it impossible to
use Unicode path names with them.

I think we should switch to the widechar APIs instead, but that would
also mean changing the LMDB API to accept a wchar_t* parameter on
Windows instead of char*.


What do you guys think about all this?


Best regards,
Timur Kristóf
Timur Kristóf
2015-01-29 09:29:52 UTC
Permalink
Post by Timur Kristóf
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on
unixes and ANSI on Windows, which is problematic for cross-platform
applications. [...]
Here is a patch that addresses this concern.
If you like it, I'll move on to the other issue.
Timur Kristóf
2015-01-29 13:42:46 UTC
Permalink
Here is a fixed version of the patch.
Post by Timur Kristóf
Post by Timur Kristóf
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on
unixes and ANSI on Windows, which is problematic for cross-platform
applications. [...]
Here is a patch that addresses this concern.
If you like it, I'll move on to the other issue.
Timur Kristóf
2015-02-01 21:59:58 UTC
Permalink
Hi,

I forgot to add an ENOMEM check. I added it now. I think this patch is
ready for Howard and Hallvard to review. :)

Timur
Post by Timur Kristóf
Here is a fixed version of the patch.
Post by Timur Kristóf
Post by Timur Kristóf
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on
unixes and ANSI on Windows, which is problematic for cross-platform
applications. [...]
Here is a patch that addresses this concern.
If you like it, I'll move on to the other issue.
Howard Chu
2015-02-01 23:40:37 UTC
Permalink
Post by Timur Kristóf
Hi,
I forgot to add an ENOMEM check. I added it now. I think this patch is
ready for Howard and Hallvard to review. :)
It looks OK to me. If no one raises any concerns, I'll commit it in a few hours.
Post by Timur Kristóf
Timur
Post by Timur Kristóf
Here is a fixed version of the patch.
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on
unixes and ANSI on Windows, which is problematic for cross-platform
applications. [...]
Here is a patch that addresses this concern.
If you like it, I'll move on to the other issue.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Hallvard Breien Furuseth
2015-02-02 01:00:31 UTC
Permalink
Post by Howard Chu
It looks OK to me. If no one raises any concerns, I'll commit it in a few hours.
Some sudden last thoughts:

mdb_dump.c also has a check, memchr(key.mv_data, '\0', key.mv_size),
to exclude non-databases, which is no longer valid.

Database names with \0 in them can no longer be spelled as strings, so
everything that gets DB names from the database must use binary blobs,
including mdb_load and mdb_dump. I notice mdb_load uses
strdup() for the "database=" name. Come to think of it, I have no
idea if the dump format supports DB names with \0 in them.
--
Hallvard
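
For reference, the kind of check mentioned above can be sketched as follows. This is an illustration only, not the actual mdb_dump.c code: the function name key_is_cstring_name is hypothetical, and rejecting empty keys is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the key could be passed around as a
 * NUL-terminated database name, i.e. it is non-empty and contains no
 * embedded '\0' byte; returns 0 otherwise. */
static int key_is_cstring_name(const void *data, size_t size)
{
    assert(data != NULL || size == 0);
    return size > 0 && memchr(data, '\0', size) == NULL;
}
```

Once names are arbitrary byte strings, a key with an embedded \0 may be a perfectly valid database name, so a test like this can no longer be used to tell databases apart from ordinary records.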
Hallvard Breien Furuseth
2015-02-02 01:09:49 UTC
Permalink
Post by Hallvard Breien Furuseth
Come to think of it, I have no
idea if the dump format supports DB names with \0 in them.
...and there will now be database names which cannot be spelled
on the command line, like for <mdb_stat/mdb_dump -s subdb>.
I don't think that was quite the point.
Howard Chu
2015-02-02 02:37:38 UTC
Permalink
Post by Hallvard Breien Furuseth
Post by Howard Chu
It looks OK to me. If no one raises any concerns, I'll commit it in a few hours.
mdb_dump.c also has a check, memchr(key.mv_data, '\0', key.mv_size),
to exclude non-databases, which is no longer valid.
Good point. As Timur's patch comment notes, we probably need an API call
"is valid DB" now.
Post by Hallvard Breien Furuseth
Database names with \0 in them can no longer be spelled as strings,
everything which gets DB names from the database must use binary blobs.
Including mdb_load and mdb_dump; I notice mdb_load uses
strdup() for the "database=" name. Come to think of it, I have no
idea if the dump format supports DB names with \0 in them.
No, it doesn't. It's the BDB format, and BDB only accepted C strings.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Timur Kristóf
2015-02-02 10:47:13 UTC
Permalink
Post by Hallvard Breien Furuseth
Post by Howard Chu
It looks OK to me. If no one raises any concerns, I'll commit it in a few hours.
mdb_dump.c also has a check, memchr(key.mv_data, '\0', key.mv_size),
to exclude non-databases, which is no longer valid.
Good point. As Timur's patch comment notes, we probably need an API call "is
valid DB" now.
Post by Hallvard Breien Furuseth
Database names with \0 in them can no longer be spelled as strings,
everything which gets DB names from the database must use binary blobs.
Including mdb_load and mdb_dump; I notice mdb_load uses
strdup() for the "database=" name. Come to think of it, I have no
idea if the dump format supports DB names with \0 in them.
No, it doesn't. It's the BDB format, and BDB only accepted C strings.
(Just noticed that I hit "reply" instead of "reply all". Sorry. Now
reposting to the mailing list.)

I think it is an acceptable limitation of mdb_dump and mdb_load. This
is not the only thing they don't support: they also don't work with
user-defined comparison functions. Although I could think about ways
to solve it.

For example, we could add a command line option that would make
mdb_dump output db names as a string of hexadecimal numbers, and
mdb_load interpret them as such.
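
A rough sketch of the hexadecimal round trip suggested above. The function names name_to_hex and name_from_hex are hypothetical, and nothing here reflects an existing mdb_dump or mdb_load option; it only shows that a name containing \0 or non-ASCII bytes survives being written as plain text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Encode len bytes as lowercase hex. Returns a malloc'd NUL-terminated
 * string, or NULL on allocation failure. */
static char *name_to_hex(const unsigned char *name, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    char *out = malloc(len * 2 + 1);
    size_t i;

    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        out[2 * i]     = digits[name[i] >> 4];
        out[2 * i + 1] = digits[name[i] & 0x0f];
    }
    out[len * 2] = '\0';
    return out;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode a hex string back into bytes. On success stores the byte count
 * in *len and returns a malloc'd buffer; returns NULL on malformed input
 * or allocation failure. */
static unsigned char *name_from_hex(const char *hex, size_t *len)
{
    size_t n, i;
    unsigned char *out;

    assert(hex != NULL && len != NULL);
    n = strlen(hex);
    if (n % 2 != 0)
        return NULL;
    out = malloc(n / 2 + 1);   /* +1 keeps the allocation non-zero */
    if (out == NULL)
        return NULL;
    for (i = 0; i < n / 2; i++) {
        int hi = hex_digit(hex[2 * i]);
        int lo = hex_digit(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) {
            free(out);
            return NULL;
        }
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    *len = n / 2;
    return out;
}
```

The caller owns both returned buffers and releases them with free().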
Hallvard Breien Furuseth
2015-01-29 10:15:22 UTC
Permalink
Post by Timur Kristóf
1. Database names
MDB_val here sounds nice...
Post by Timur Kristóf
2. Path names
Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the
like accept a char* for path names. This is fine on most unixes where
char* is a UTF-8 string, but unfortunately, these functions call the
ANSI variants of the Windows API functions, making it impossible to
use Unicode path names with them.
I think we should switch to the widechar APIs instead, but that would
also mean changing the LMDB API to accept a wchar_t* parameter on
Windows instead of char*.
My initial reaction is that an API with different prototypes for
Windows and Unix sounds bad. Unix must have char* since it does
not even support wchar_t* filenames, and it should be simple to
write a portable OS-unaware LMDB program.

Though I notice Windows #defines CreateFile() & co as CreateFileA
or CreateFileW depending on whether or not UNICODE is #defined (and
some other macros), without even mentioning this in the CreateFile()
doc. I suppose lmdb.h could do something similar - but with doc:-)

What's the norm for Windows libraries? Google found TCHAR stuff
which becomes WCHAR or char depending on defined(UNICODE), and
apparently strong religions about whether it's a good or bad
idea for portable libraries to do the same. #define MDB_TEXT(str)
as str or L##str depending on UNICODE, etc.
--
Hallvard
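
The TCHAR-style approach described above could look roughly like this. Only MDB_TEXT comes from the message; the type name mdb_tchar, the macro mdb_tcslen and the function path_length are hypothetical names made up for this sketch, and lmdb.h defines none of them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#if defined(_WIN32) && defined(UNICODE)
typedef wchar_t mdb_tchar;            /* hypothetical type name */
#define MDB_TEXT(str) L##str          /* wide literal: L"..." */
#define mdb_tcslen(s) wcslen(s)       /* hypothetical helper macro */
#else
typedef char mdb_tchar;
#define MDB_TEXT(str) str             /* plain narrow literal */
#define mdb_tcslen(s) strlen(s)
#endif

/* Hypothetical path-taking function: the same source line compiles
 * against char* on Unix and wchar_t* on a UNICODE Windows build. */
static size_t path_length(const mdb_tchar *path)
{
    assert(path != NULL);
    return mdb_tcslen(path);
}
```

A caller would write path_length(MDB_TEXT("data.mdb")) on every platform; the cost is that the prototype seen by the compiler differs between builds, which is the objection raised at the start of this message.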
Hallvard Breien Furuseth
2015-01-29 10:19:20 UTC
Permalink
Though I notice Windows #defines CreateFile() & co as CreateFileA
or CreateFileW depending on whether or not UNICODE is #defined (and
some other macros), without even mentioning this in the CreateFile()
doc. I suppose lmdb.h could do something similar - but with doc:-)
Whoops, it does document it: Filename parameter is "LPCTSTR".
--
Hallvard
Timur Kristóf
2015-01-29 13:35:23 UTC
Permalink
I've had a brief chat with Hallvard on IRC. We came up with several
possible solutions, although each of them has its drawbacks. Writing
cross-platform code that supports unicode is always a messy business.
I vote for option 4, but would like to hear everyone's opinions before
starting to work on any of them.

1) Separate widechar functions

Make functions such as mdb_env_open_w that would call the widechar
APIs. The drawback of this approach is that it would require a lot of
duplicate code, which is hard to maintain. It would also pollute the
lmdb header file.

2) New flag

Introduce a new flag (such as MDB_USE_WCHAR) that would tell
mdb_env_open to cast the path parameter to wchar_t* under the hood and
call the widechar variant of the Windows API.

Advantage: only the string concatenation code would need to be duplicated
Drawback: it is really-really ugly

3) Require UTF-16 on Windows

Since Microsoft discourages the use of their ANSI apis, we could say
that we require UTF-16 on windows. We can make a type such as
mdb_uchar_t that we would typedef to char on unix and wchar_t on
windows and then we could change the function signatures to use this
type.

Drawback: users that want to write cross-platform code would need to
ifdef their calls to mdb_env_open

4) Require UTF-8 on Windows

Let's say we require the path parameter to be encoded in UTF-8, even
on windows. Then under the hood we can convert it to UTF-16 and call
the widechar APIs. This doesn't lead to loss of performance because
windows itself converts to UTF-16 anyway if you use their ANSI
functions.
This is the least ugly and perhaps the easiest-to-implement solution
we found. It is easy to make UTF-8 (most libraries can produce it, or
the user could use u8"..." from C++11, etc.)

Advantage: this is the easiest to implement; code that worked before
(with ASCII paths) will work without modification, and we don't need
to duplicate any code.
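
To illustrate the conversion step in option 4, here is a portable, hand-rolled UTF-8 to UTF-16 sketch. On Windows itself the system function MultiByteToWideChar with CP_UTF8 performs this conversion, so real code would most likely call that instead; the function name utf8_to_utf16 and its interface are assumptions made for this example, and uint16_t is used in place of wchar_t so the sketch behaves the same on platforms where wchar_t is 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16 code units.
 * Writes at most cap units into dst, including the terminating 0.
 * Returns the number of units written (excluding the terminator),
 * or -1 on malformed input or insufficient space. */
static long utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*s != 0) {
        uint32_t cp;
        int extra, i;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;               /* stray continuation or bad lead byte */
        s++;
        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)  /* also catches a premature NUL */
                return -1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }
        /* Reject overlong forms, surrogates and out-of-range values. */
        if ((extra == 1 && cp < 0x80) || (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (n + 1 >= cap) return -1;   /* keep room for the terminator */
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;                 /* encode as a surrogate pair */
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return -1;
    dst[n] = 0;
    return (long)n;
}
```

Plain ASCII paths pass through unchanged, one code unit per byte, which is why existing callers that only use ASCII paths would keep working under this option.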
Howard Chu
2015-02-02 02:58:57 UTC
Permalink
Post by Timur Kristóf
Hi Everyone,
I've been talking to Howard about this and he suggested posting it to
this mailing list. There are two things that I recently noticed about
how LMDB works with various encodings, and I think they are worth
discussing.
2. Path names
Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the
like accept a char* for path names. This is fine on most unixes where
char* is a UTF-8 string, but unfortunately, these functions call the
ANSI variants of the Windows API functions, making it impossible to
use Unicode path names with them.
I think we should switch to the widechar APIs instead, but that would
also mean changing the LMDB API to accept a wchar_t* parameter on
Windows instead of char*.
What do you guys think about all this?
I just had a look at how BDB handled this. As you can see they used a
TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.

https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/os_windows/os_open.c

https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/dbinc/win_db.h#L136

(And a FROM_TSTRING for the reverse, as well.)
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Timur Kristóf
2015-02-02 10:52:04 UTC
Permalink
Post by Howard Chu
I just had a look at how BDB handled this. As you can see they used a
TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.
https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/os_windows/os_open.c
https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/dbinc/win_db.h#L136
(And a FROM_TSTRING for the reverse, as well.)
(Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry.
Now reposting to the mailing list.)

Since we only need to do this on Windows, we could use
MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.)
I do not think we would ever need to do any such conversion on UNIX.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx

I'm not sure if we can just copy-paste BDB's code. Probably not, that
would lead to licensing issues, wouldn't it?
Howard Chu
2015-02-02 10:56:11 UTC
Permalink
Post by Timur Kristóf
Post by Howard Chu
I just had a look at how BDB handled this. As you can see they used a
TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.
https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/os_windows/os_open.c
https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/dbinc/win_db.h#L136
(And a FROM_TSTRING for the reverse, as well.)
(Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry.
Now reposting to the mailing list.)
Since we only need to do this on Windows, we could use
MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.)
I do not think we would ever need to do any such conversion on UNIX.
Correct, these macros only exist in the Windows-specific source files of
BDB. None of this is needed for POSIX.
Post by Timur Kristóf
https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx
I'm not sure if we can just copy-paste BDB's code. Probably not, that
would lead to licensing issues, wouldn't it?
I wasn't suggesting a copy/paste, just using it as an example of how the
problem could be approached.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
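For illustration, here is roughly what the UTF-8 to UTF-16 conversion discussed above produces. On Windows the thread proposes calling MultiByteToWideChar with CP_UTF8 rather than hand-rolling a decoder; the portable function below is only a stand-in so the behaviour can be tried on any platform. The name utf8_to_utf16, its signature and the UTF_ERR constant are assumptions made for this sketch and are not part of LMDB or BDB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF_ERR ((size_t)-1)

/* Hypothetical helper: decode in_len bytes of UTF-8 into UTF-16 code units.
 * Returns the number of code units written, or UTF_ERR on malformed input
 * or when out_cap is too small. No terminating NUL is appended. */
size_t utf8_to_utf16(const unsigned char *in, size_t in_len,
                     uint16_t *out, size_t out_cap)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < in_len) {
        unsigned char b = in[i++];
        uint32_t cp;
        size_t extra, k;

        /* The lead byte tells us how many continuation bytes follow. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return UTF_ERR;          /* stray continuation or invalid byte */

        if (extra > in_len - i) return UTF_ERR;   /* truncated sequence */
        for (k = 0; k < extra; k++) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80) return UTF_ERR;
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return UTF_ERR;

        if (cp < 0x10000) {
            if (n + 1 > out_cap) return UTF_ERR;
            out[n++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP become a surrogate pair. */
            if (n + 2 > out_cap) return UTF_ERR;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return n;
}
```

Unlike MultiByteToWideChar called with a source length of -1, this sketch does not append a terminating NUL; a caller would add one before handing the result to a wide-character Windows API.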
Hallvard Breien Furuseth
2015-02-02 12:37:08 UTC
Permalink
I suggest we wait to deal with DB names until we also have a way to
deal with filenames. And this time test that it works in practice :-)
Hopefully users and programmers will only need one method of handling
non-ASCII LMDB names on Windows, not two.

It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would
reading DB names and filenames from a config file. Yet OS-aware and
OS-specific config files can look rather different. Maybe LMDB must
handle DB names more flexibly than filenames, or maybe we'll end up
recommending that "portable" DB names must be UTF-8. And add a "flag
convert UTF8<->WCHAR if this is Windows".
--
Hallvard
Howard Chu
2015-02-02 13:24:24 UTC
Permalink
I suggest we wait to deal with DB names until we also have a way to
deal with filenames. And this time test that it works in practice :-)
Hopefully users and programmers will only need one method of handling
non-ASCII LMDB names on Windows, not two.
It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would
reading DB names and filenames from a config file. Yet OS-aware and
OS-specific config files can look rather different. Maybe LMDB must
handle DB names more flexibly than filenames, or maybe we'll end up
recommending that "portable" DB names must be UTF-8. And add a "flag
convert UTF8<->WCHAR if this is Windows".
DB names are purely internal to LMDB, so they bear no relation to OS
filenames and none of this discussion matters to them.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Timur Kristóf
2015-02-02 14:20:30 UTC
Permalink
Post by Howard Chu
DB names are purely internal to LMDB, so they bear no relation to OS
filenames and none of this discussion matters to them.
If we let the users treat db names as an MDB_val (essentially, an
arbitrary byte array), then all bets are off: we can't even make the
assumption that a db name is meaningful text in any encoding. We can
make it possible to type such a thing in the console if we represent
it as a string of hexadecimal numbers. For example, mdb_dump could do
something like to_hex_string in this code snippet:
http://pastebin.com/jqnGSS6C (note: you need -std=c11 to compile the
snippet).
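The pastebin snippet is not reproduced here, so the following is only a minimal sketch of the same idea: representing an arbitrary byte-array name as a hexadecimal string that can be printed and typed in a console, and turning it back into bytes. The names name_to_hex, hex_to_name and hex_value are hypothetical, and nothing here implies that mdb_dump has such an option.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Encode len bytes as lowercase hex. Returns a malloc'd, NUL-terminated
 * string that the caller must free, or NULL if allocation fails. */
char *name_to_hex(const void *data, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *p = data;
    char *out;
    size_t i;

    assert(data != NULL || len == 0);
    if (len > ((size_t)-1 - 1) / 2) return NULL;   /* size would overflow */
    out = malloc(len * 2 + 1);
    if (out == NULL) return NULL;
    for (i = 0; i < len; i++) {
        out[2 * i]     = digits[p[i] >> 4];     /* high nibble */
        out[2 * i + 1] = digits[p[i] & 0x0F];   /* low nibble */
    }
    out[2 * len] = '\0';
    return out;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode a hex string (for example one typed by a user) back into bytes.
 * Returns the byte count, or (size_t)-1 on odd length, a non-hex digit,
 * or when out_cap is too small. */
size_t hex_to_name(const char *hex, unsigned char *out, size_t out_cap)
{
    size_t n = strlen(hex), i;

    if (n % 2 != 0 || n / 2 > out_cap) return (size_t)-1;
    for (i = 0; i < n / 2; i++) {
        int hi = hex_value(hex[2 * i]);
        int lo = hex_value(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) return (size_t)-1;
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    return n / 2;
}
```

Hex doubles the length of every name, including plain ASCII ones, which is the price of being able to represent any byte value, NUL included.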
Hallvard Breien Furuseth
2015-02-02 14:48:40 UTC
Permalink
Post by Howard Chu
Post by Hallvard Breien Furuseth
I suggest we wait to deal with DB names until we also have a way to
deal with filenames. And this time test that it works in practice :-)
Hopefully users and programmers will only need one method of handling
non-ASCII LMDB names on Windows, not two.
It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would
reading DB names and filenames from a config file. Yet OS-aware and
OS-specific config files can look rather different. Maybe LMDB must
handle DB names more flexibly than filenames, or maybe we'll end up
recommending that "portable" DB names must be UTF-8. And add a "flag
convert UTF8<->WCHAR if this is Windows".
DB names are purely internal to LMDB, so they bear no relation to OS
filenames and none of this discussion matters to them.
They're exposed to the programmer and the program's users. Either may
want them on command-line arguments, in config files, etc. It will be
inconvenient if LMDB requires different string handling for non-ASCII
filenames and non-ASCII DB names in such cases. The programmer may
choose to use different string handling but let's try to avoid forcing
him to do so.
--
Hallvard
Timur Kristóf
2015-02-02 15:03:48 UTC
Permalink
Post by Hallvard Breien Furuseth
Post by Howard Chu
DB names are purely internal to LMDB, so they bear no relation to OS
filenames and none of this discussion matters to them.
They're exposed to the programmer and the program's users. Either may
want them on command-line arguments, in config files, etc. It will be
inconvenient if LMDB requires different string handling for non-ASCII
filenames and non-ASCII DB names in such cases. The programmer may
choose to use different string handling but let's try to avoid forcing
him to do so.
A path is always a Unicode string, while a DB name can be an arbitrary
binary blob. So I don't think that we can treat them the same way.
Hallvard Breien Furuseth
2015-02-02 15:15:43 UTC
Permalink
Post by Timur Kristóf
A path is always a Unicode string, while a DB name can be an arbitrary
binary blob. So I don't think that we can treat them the same way.
Not the point. A program which uses LMDB can choose to treat its
own DB names in its own LMDB environments as the same kind of
strings as filenames (WCHAR, UTF-8 char, or whatever). Unless we
make that impossible.

As for what LMDB will accept and what it must handle, that's up to
us. DB names are not binary blobs yet, after all.
--
Hallvard
Timur Kristóf
2015-02-02 15:25:57 UTC
Permalink
Post by Hallvard Breien Furuseth
A path is always a Unicode string, while a DB name can be an arbitrary
binary blob. So I don't think that we can treat them the same way.
Not the point. A program which uses LMDB can choose to treat its
own DB names in its own LMDB environments as the same kind of
strings as filenames (WCHAR, UTF-8 char, or whatever). Unless we
make that impossible.
As for what LMDB will accept and what it must handle, that's up to
us. DB names are not binary blobs yet, after all.
Okay. What do you suggest?
Hallvard Breien Furuseth
2015-02-02 16:00:34 UTC
Permalink
Post by Timur Kristóf
Okay. What do you suggest?
I suggest we wait to deal with DB names until we also have a way to
deal with filenames. And this time test that it works in practice.

And then I also suggest trying to make this mess simple to deal
with for programmers and/or users. I guess I should have separated
that from the rest more clearly.
--
Hallvard
Timur Kristóf
2015-02-02 16:11:12 UTC
Permalink
Post by Hallvard Breien Furuseth
I suggest we wait to deal with DB names until we also have a way to
deal with filenames. And this time test that it works in practice.
And then I also suggest trying to make this mess simple to deal
with for programmers and/or users. I guess I should have separated
that from the rest more clearly.
I can write a patch which does the UTF-8 to UTF-16 conversion on
Windows for file paths, but I would hate to restrict db names to UTF-8
text only (or for that matter, any text only). However, not supporting
non-UTF-8 db names in mdb_dump and mdb_load sounds like a reasonable
compromise to me.
Hallvard Breien Furuseth
2015-02-02 16:28:41 UTC
Permalink
Post by Timur Kristóf
Post by Hallvard Breien Furuseth
I suggest we wait to deal with DB names until we also have a way to
deal with filenames. And this time test that it works in practice.
And then I also suggest trying to make this mess simple to deal
with for programmers and/or users. I guess I should have separated
that from the rest more clearly.
I can write a patch which does the UTF-8 to UTF-16 conversion on
Windows for file paths, but I would hate to restrict db names to UTF-8
text only (or for that matter, any text only). However, not supporting
non-UTF-8 db names in mdb_dump and mdb_load sounds like a reasonable
compromise to me.
I suggest we wait to deal with DB names until we also have a way to
deal with filenames.
--
Hallvard
Florian Weimer
2015-02-15 15:52:52 UTC
Permalink
Post by Timur Kristóf
A path is always a Unicode string, while a DB name can be an arbitrary
binary blob.
On many POSIX platforms, a path is a blob which does not contain
'\000'. These systems do not enforce Unicode encoding at all.
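The one constraint named above, no '\000' byte, is also what decides whether an arbitrary blob can be handed to any interface that takes a NUL-terminated char *. A minimal sketch of that check follows; the function name usable_as_c_string is hypothetical, and the check deliberately says nothing about whether the bytes are valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the len bytes at data contain no NUL byte, so that they
 * survive intact when passed (with a terminator appended) to an API that
 * takes a C string. Returns 0 if an embedded NUL would truncate them. */
int usable_as_c_string(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len == 0) return 1;               /* nothing to scan */
    return memchr(data, '\0', len) == NULL;
}
```

A byte sequence such as 0xFF 0xFE 'x' passes this check even though it is not valid UTF-8, which is exactly the situation described for POSIX paths.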
Timur Kristóf
2015-02-15 19:36:43 UTC
Permalink
Post by Florian Weimer
Post by Timur Kristóf
A path is always a Unicode string, while a DB name can be an arbitrary
binary blob.
On many POSIX platforms, a path is a blob which does not contain
'\000'. These systems do not enforce Unicode encoding at all.
My mistake. I was unaware.
On those platforms, how do you type a path name into a terminal?
Timur Kristóf
2015-02-15 20:33:38 UTC
Permalink
Post by Timur Kristóf
Post by Florian Weimer
Post by Timur Kristóf
A path is always a Unicode string, while a DB name can be an arbitrary
binary blob.
On many POSIX platforms, a path is a blob which does not contain
'\000'. These systems do not enforce Unicode encoding at all.
My mistake. I was unaware.
On those platforms, how do you type a path name into a terminal?
There are some files which are not directly nameable. Many programs
support special sequences such as “Ctrl+V 3 7 7” to enter arbitrary
bytes, but that's not universal. Depending on the actual
implementation of the terminal, cut-and-paste of funny file names can
work, too.
Older programs have trouble accessing such files even if the user
chooses them in a file selection dialog, but current versions are
supposed to have been fixed (including OpenJDK, which took a
ridiculously long time). Beyond that, it's not much different from
dealing with file names in an unfamiliar script.
Interesting.
So ultimately, there are always going to be things that you cannot
type into your terminal directly.
