Blazegraph confuses strings with and without RTM (U+200F)

Open smalyshev opened this issue 7 years ago • 11 comments

If Blazegraph indexes a triple containing a string with an RTLM character (U+200F), e.g. "0000 0000 4698 056X\u200F", it treats that string as identical to the one without RTLM, i.e. "0000 0000 4698 056X". For example, if a new triple is then entered with the non-RTLM string, Blazegraph still returns the RTLM string when the triple is read back.

This seems to happen because Blazegraph uses ICU collation keys as string keys, in https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/btree/keys/ICUSortKeyGenerator.java, and by default the ICU collator seems to generate the same collation key for RTLM and non-RTLM strings.

This is particularly annoying because once an RTLM string is indexed, it is impossible to get it out, even if the original triple is deleted. The RTLM string remains in the TERM2ID/ID2TERM dictionaries, so every time the non-RTLM string is used, the result is again the original RTLM string.
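
The behaviour described above can be reproduced in miniature without Blazegraph. The following is a sketch under assumptions, not Blazegraph code: it uses the JDK's `java.text.Collator` as a stand-in for the ICU collator, and `CollationKeyDictionary` is a hypothetical toy class that maps collation keys to the first string stored under them.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;
import java.util.TreeMap;

public class CollationKeyDictionary {
    // Toy stand-in for a term dictionary keyed by collation key.
    private final Collator collator;
    private final TreeMap<CollationKey, String> terms = new TreeMap<>();

    public CollationKeyDictionary(int strength) {
        collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(strength);
    }

    // Returns the term already stored under the same collation key,
    // or stores this term and returns it.
    public String intern(String term) {
        return terms.computeIfAbsent(collator.getCollationKey(term), k -> term);
    }

    public static void main(String[] args) {
        CollationKeyDictionary dict = new CollationKeyDictionary(Collator.TERTIARY);
        String withMark = "0000 0000 4698 056X\u200F";
        String plain = "0000 0000 4698 056X";
        dict.intern(withMark);                 // the RTLM string is indexed first
        String returned = dict.intern(plain);  // the plain string is looked up later
        System.out.println("plain length:    " + plain.length());
        System.out.println("returned length: " + returned.length());
    }
}
```

Because both strings produce the same key at this strength, the second lookup hands back whichever string was stored first, which mirrors the symptom reported here.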

smalyshev avatar Jun 19 '18 21:06 smalyshev

This is a matter of the collation strength. If it is set to IDENTICAL then all differences are respected.
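
The effect of collation strength can be checked in isolation. This is a minimal sketch that assumes the JDK's `java.text.Collator` in place of ICU4J (the constants have the same names, but the two libraries are not identical); `StrengthDemo` and `sameAtStrength` are hypothetical names used only for illustration.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    // Returns true if the collator treats the two strings as equal
    // at the given strength.
    public static boolean sameAtStrength(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String withMark = "0000 0000 4698 056X\u200F";
        String plain = "0000 0000 4698 056X";
        System.out.println("TERTIARY equal:  "
                + sameAtStrength(withMark, plain, Collator.TERTIARY));
        System.out.println("IDENTICAL equal: "
                + sameAtStrength(withMark, plain, Collator.IDENTICAL));
    }
}
```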

Bryan

thompsonbry avatar Jun 19 '18 21:06 thompsonbry

There seem to be two issues here:

  1. The default strength is not Identical (I think it may be Secondary, not sure), even though arbitrary Unicode is allowed in the general case.
  2. I don't see any config value that allows changing it. Maybe I am missing some options, but I could not find any path in the code that leads to setting these. Is it documented somewhere that I missed? I put a breakpoint on DefaultKeyBuilderFactory(final Properties properties) and it never seems to be called with non-empty properties.

smalyshev avatar Jun 19 '18 21:06 smalyshev

https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/btree/keys/DefaultKeyBuilderFactory.java

Especially

https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/btree/keys/DefaultKeyBuilderFactory.java#L345

The Options interface declares the supported configuration properties.

See https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/btree/keys/KeyBuilder.java#L1639

Bryan

thompsonbry avatar Jun 19 '18 21:06 thompsonbry

Yes, I found this code, but I do not see how it can ever be reached, since DefaultKeyBuilderFactory is always called with either a null argument or new Properties() in the upstream code. E.g. https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/btree/DefaultTupleSerializer.java#L73

smalyshev avatar Jun 19 '18 21:06 smalyshev

https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/btree/keys/DefaultKeyBuilderFactory.java#L168

You can use -DNAME=VALUE. See below.

    /**
     * Return the property if found in properties. If properties
     * is null or if the value is not found in properties,
     * then return the property if found using
     * {@link System#getProperty(String)}.
     *
     * @param properties
     *            The properties.
     * @param key
     *            The name of the desired property.
     * @param def
     *            The default (MAY be null).
     * @return The value -or- def if no value was found.
     */
    static private String getProperty(final Properties properties,
            final String key, final String def) {
        String val = null;
        if (properties != null) {
            val = properties.getProperty(key);//, def);
        }
        if (val == null) {
            val = System.getProperty(key, def);
        }
        if(log.isDebugEnabled()) {
            log.debug("name=" + key + ",val=" + val);
        }
        return val;
    }
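
To see why an empty `Properties` object still picks up a `-D` setting, here is a small standalone sketch. `PropertyFallback` is a hypothetical class that copies the lookup order of the quoted method (minus logging); the property name is the one used later in this thread, and `System.setProperty` stands in for passing `-DNAME=VALUE` on the command line.

```java
import java.util.Properties;

public class PropertyFallback {
    public static final String STRENGTH =
            "com.bigdata.btree.keys.KeyBuilder.collator.strength";

    // Same lookup order as the quoted method: explicit properties first,
    // then JVM system properties, then the supplied default.
    public static String getProperty(Properties properties, String key, String def) {
        String val = null;
        if (properties != null) {
            val = properties.getProperty(key);
        }
        if (val == null) {
            val = System.getProperty(key, def);
        }
        return val;
    }

    public static void main(String[] args) {
        // Stand-in for -Dcom.bigdata.btree.keys.KeyBuilder.collator.strength=Identical
        System.setProperty(STRENGTH, "Identical");
        System.out.println(getProperty(new Properties(), STRENGTH, null));
        System.out.println(getProperty(null, STRENGTH, null));
    }
}
```

Both calls resolve through the system property, which is the case that matters when the caller passes null or new Properties().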

thompsonbry avatar Jun 19 '18 21:06 thompsonbry

Ah, I did not notice it uses System properties too. I'll try that, thanks.

smalyshev avatar Jun 19 '18 21:06 smalyshev

No problem!

Bryan

thompsonbry avatar Jun 19 '18 22:06 thompsonbry

Yes, it looks like setting -Dcom.bigdata.btree.keys.KeyBuilder.collator.strength=Identical fixes the issue, but keys become much longer. It also seems to require reindexing the whole store from scratch.

Will setting it to Identical have any performance impact?
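
The growth in key size can be observed outside Blazegraph. A rough sketch, assuming the JDK's `java.text.Collator` as a stand-in for ICU4J, so the absolute sizes will not match ICU sort keys; `KeyLengthDemo` and `keyLength` are hypothetical names.

```java
import java.text.Collator;
import java.util.Locale;

public class KeyLengthDemo {
    // Length in bytes of the collation key for s at the given strength.
    public static int keyLength(String s, int strength) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(strength);
        return collator.getCollationKey(s).toByteArray().length;
    }

    public static void main(String[] args) {
        String s = "0000 0000 4698 056X";
        System.out.println("PRIMARY:   " + keyLength(s, Collator.PRIMARY));
        System.out.println("TERTIARY:  " + keyLength(s, Collator.TERTIARY));
        System.out.println("IDENTICAL: " + keyLength(s, Collator.IDENTICAL));
    }
}
```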

smalyshev avatar Jun 19 '18 22:06 smalyshev

This will mainly hit dictionary operations. The statement indices will be unchanged, hence query performance will be essentially unchanged. Yes, you would need to reload since the Unicode sort keys stored in the forward dictionary (Term2Id) would be different.

I think that it is a fair question as to what is the most appropriate level of distinction. This specific case of an RTM (Registered Trade Mark, right?) could also be handled by adding an assertion about an entity having that value as a label.

Would you want to have all variations of case stored as distinct Literals?

Do you want distinctions in a sequence of code points for the same glyph to be stored as different Literals?

Unicode is always tricky.

Bryan

thompsonbry avatar Jun 19 '18 22:06 thompsonbry

No, RTLM is the right-to-left mark. It is a non-displayed character that influences text direction, so it's easy to miss, and low collation levels rightfully skip it since its presence does not really change the text. The problem is that once a string with RTLM is in the index, it's not possible to get it out.

Side question: is it possible to get a value out of the term2id index manually? Can term2id values be deleted at all?

smalyshev avatar Jun 19 '18 22:06 smalyshev

So, you can easily change what is in ID2TERM, which has the actual Unicode string that is returned when a query result is materialized. This would let you strip out some things that you might not want to have in there. You could then filter the data on insert (externally).
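
External filtering on insert could look something like the sketch below. It is an illustration only: `StripDirectionMarks` is a hypothetical helper that is not part of Blazegraph, and it removes only U+200F. Which other characters to strip is a policy decision for the data loader.

```java
public class StripDirectionMarks {
    // Removes every right-to-left mark (U+200F) from the value
    // before it is handed to the store.
    public static String strip(String value) {
        return value.replace("\u200F", "");
    }

    public static void main(String[] args) {
        String raw = "0000 0000 4698 056X\u200F";
        String clean = strip(raw);
        System.out.println(raw.length() + " -> " + clean.length());
    }
}
```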

You could change the TERM2ID mapping also. I am not sure what would happen if you tried to have multiple keys mapped to the same ID in that index. It might be Ok. Or something might have a fit.

Just make sure that the server is not answering queries when making these sorts of changes. Otherwise you might well hit asserts or odd behavior if the ID <=> TERM mapping is not fully consistent (including in the TermCache maintained by the LexiconRelation, in the BigdataLiteral => IV => IVCache => BigdataLiteral relationships, etc.).

Bryan

thompsonbry avatar Jun 19 '18 22:06 thompsonbry