
bug: queries don't seem to support non-ASCII chars

Open bnjbvr opened this issue 1 year ago • 4 comments

If I tag a page with "élément" (French for "element"), it seems I can't use it in queries because the parser fails on accented characters. This query is marked as incorrect (it's not clear to me whether it fails at parsing or at interpretation):

```query
élément
select name
```

The same happens if I use any accented attribute name in a page: it can't be used in the where clauses of queries later.

bnjbvr avatar Jul 18 '24 06:07 bnjbvr

I took a crack at this, and found what seems to be a plausible solution, only it still didn't work (and I lost the code between other PRs). I'll write what I found out, for my future reference or anyone else trying:

The query syntax is defined here, and we see that it starts with a TagIdentifier, because a query starts with the tag to look for: https://github.com/silverbulletmd/silverbullet/blob/1635c417c3d925ff0766756eb5b063d8233878f4/common/markdown_parser/query.grammar#L23

What is allowed in a tag identifier is defined further down. I think that's what's rejecting letters with diacritics: https://github.com/silverbulletmd/silverbullet/blob/1635c417c3d925ff0766756eb5b063d8233878f4/common/markdown_parser/query.grammar#L128

I checked the Lezer docs and there isn't anything like @unicodeLetters (the way regexes have \p{L}). The next best thing I found is defining ranges of code points, like they do here: https://github.com/lezer-parser/lezer-grammar/blob/64e55bd774a17e47fb600983b1f5390a11025562/src/lezer.grammar#L152 However, the \u{a1}-\u{10ffff} range cannot be copied directly, because it includes other whitespace characters and breaks the grammar parsing.
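To make the whitespace problem concrete, here is a minimal sketch using plain TypeScript regexes rather than Lezer itself. All three patterns and the `matchesTag` helper are hypothetical and not copied from query.grammar; they only contrast an ASCII-only identifier, a `\p{L}`-based one, and one built on the wide `\u{a1}-\u{10ffff}` range.

```typescript
// Hypothetical tag-name patterns for illustration; none is copied from query.grammar.
// Built with `new RegExp` and the "u" flag so \p{...} and \u{...} escapes are understood.

// ASCII-only identifier, roughly the behaviour described in this issue.
const asciiOnly = new RegExp("^[A-Za-z_][A-Za-z0-9_]*$", "u");

// What a Unicode-aware regex can express directly with property escapes.
const unicodeLetters = new RegExp("^[\\p{L}_][\\p{L}\\p{N}_]*$", "u");

// The wide code-point range, in the spirit of the Lezer grammar linked above.
const wideRange = new RegExp(
  "^[A-Za-z_\\u{a1}-\\u{10ffff}][A-Za-z0-9_\\u{a1}-\\u{10ffff}]*$",
  "u",
);

function matchesTag(pattern: RegExp, candidate: string): boolean {
  return pattern.test(candidate);
}

const accented = "\u00e9l\u00e9ment"; // "élément", written with escapes to pin the composed form
const withIdeographicSpace = "a\u3000b"; // U+3000 IDEOGRAPHIC SPACE between two letters

console.log(matchesTag(asciiOnly, accented)); // false
console.log(matchesTag(unicodeLetters, accented)); // true
console.log(matchesTag(wideRange, accented)); // true
console.log(matchesTag(wideRange, withIdeographicSpace)); // true: whitespace slips into the "identifier"
console.log(matchesTag(unicodeLetters, withIdeographicSpace)); // false
```

The wide range accepts U+3000 (and other spaces such as U+2003) as part of a name, which is presumably why it clashes with the grammar's own whitespace handling. A narrower set of ranges that skips those code points would avoid that, at the cost of listing them by hand.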

I tried changing this grammar, regenerating the files with scripts/generate.sh, and rebuilding the server, but I still keep seeing "Parse error". Is there a better way to debug than this? The Lezer forum only agrees that it's hard.

Maarrk avatar Jul 22 '24 11:07 Maarrk

Now I remember: this change does work, but not when the first letter is also non-ASCII. The grammar after the patch should allow it, so probably there's some regex somewhere?
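One way such a symptom could arise, purely as an illustration and not a diagnosis of the actual code: if some check keeps an ASCII-only class for the first character while the continuation class is widened, non-ASCII letters are accepted everywhere except at the start. The pattern and the `acceptsTag` helper below are hypothetical.

```typescript
// Hypothetical pattern: the start class is still ASCII-only,
// while the continuation class already accepts any Unicode letter or number.
const mixedPattern = new RegExp("^[A-Za-z_][\\p{L}\\p{N}_]*$", "u");

function acceptsTag(candidate: string): boolean {
  return mixedPattern.test(candidate);
}

console.log(acceptsTag("el\u00e9ment")); // true: "elément", non-ASCII only after the first letter
console.log(acceptsTag("\u00e9l\u00e9ment")); // false: "élément", non-ASCII first letter
```

If something like this is in play, the place to look would be any pattern that treats the first character of a tag separately from the rest.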

Maarrk avatar Jul 22 '24 11:07 Maarrk

Also, this works without any patches:

```query
page where tags = "élément"
```

Maarrk avatar Jul 22 '24 11:07 Maarrk

I think the same issue applies to attributes:

Image

Non-ASCII attributes:
- All ASCII [works: true]
- Other letters [działa: false]

Maarrk avatar Sep 03 '24 10:09 Maarrk

The whole system SHOULD support Unicode scalar values everywhere where it is applicable (anywhere where users may input something), IMHO.

mjf avatar Dec 17 '24 12:12 mjf

This got fixed as a side effect of Lua Integrated Query.

Image

The screenshot was done from this page:

#elément

${query[[
from index.tag "elément"
select {
  Nazwa = name,
  ["Długość"] = size}
]]}

The table columns whose names contain non-ASCII characters are written using the general form of the table constructor (`["Długość"] = size`), but this is standard Lua. It's probably worth including in the documentation, but I'm not sure what the best place would be.

Maarrk avatar Mar 27 '25 17:03 Maarrk