org-parser

non-ascii tag is not parsed

Open yqu212 opened this issue 3 years ago • 8 comments

Describe the bug Non-ascii tag is not parsed.

To Reproduce Steps to reproduce the behavior:

(read-str "* headline :标签:")
{:headlines
 [{:headline
   {:level 1,
    :title [[:text-normal "headline :标签:"]],
    :planning [],
    :tags []}}]}

Expected behavior

{:headlines
 [{:headline
   {:level 1,
    :title [[:text-normal "headline"]],
    :planning [],
    :tags ["标签"]}}]}


Additional context [org-parser "0.1.24"]

yqu212 avatar Jun 22 '21 07:06 yqu212

Thanks for the report.

TAGS is made of words containing any alpha-numeric character, underscore, at sign, hash sign or percent sign, and separated with colons.

  • https://orgmode.org/worg/dev/org-syntax.html

The regex for tag names is currently [a-zA-Z0-9_@#%] (see function extract-tags).
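To illustrate the limitation, here is a minimal JavaScript sketch that applies the same character class to an ASCII tag and to the tag from the report. The variable names are illustrative and not taken from org-parser.

```javascript
// Same character class as quoted above; the names below are hypothetical.
const asciiTag = /[a-zA-Z0-9_@#%]+/g;

// An ASCII tag is matched as expected.
const asciiMatch = ":work:".match(asciiTag);

// The tag from the report contains no character from the class,
// so match() finds nothing and returns null.
const nonAsciiMatch = ":标签:".match(asciiTag);

console.log(asciiMatch);    // [ 'work' ]
console.log(nonAsciiMatch); // null
```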

It must also include Unicode characters, but JavaScript regexes cannot do that. Only Java has such a character class.

If we invert the regex like [^ \t-.…] we would have to exclude too many characters.

Other ideas? Add unicode ranges next to a-zA-Z? That will get messy, too :/

PS: It would be interesting to know how Org mode does this. Maybe they have a special character class for Unicode chars.

schoettl avatar Jun 22 '21 07:06 schoettl

Yes. Elisp has [:multibyte:].

Chinese is not parsed by orgajs either, another parser implemented in JavaScript. https://github.com/orgapp/orgajs/blob/eac72e62b902b79289cfacd97e9bdf5e09bc9030/packages/orga/src/tokenize/headline.ts#L61

Maybe we can make org-parser support java only for now?

p.s. Is this one useful? https://stackoverflow.com/questions/21109011/javascript-unicode-string-chinese-character-but-no-punctuation

yqu212 avatar Jun 22 '21 08:06 yqu212

Maybe we can make org-parser support java only for now?

No. This would be very much against https://github.com/200ok-ch/org-parser/#what-does-this-project-do and https://github.com/200ok-ch/org-parser/#why-is-this-project-useful--rationale.

Having said that, JavaScript has "Unicode property escapes". Maybe we can use them for the a-zA-Z part of the regexp:

> ":标签:".match(/\p{Letter}+/gu)
[ '标签' ]

munen avatar Jun 22 '21 08:06 munen

Looks like this also works as part of a 'regular' regular expression (pardon the pun):

> ":标签:".match(/[\p{Letter}0-9_@#%]+/gu)
[ '标签' ]
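As a rough sketch of how that character class could be used to split a headline into title and tags, something like the following might work. `splitHeadline` and `TAG_LIST` are hypothetical names, not the actual `extract-tags` function from org-parser, and the trailing-tag-list pattern is an assumption about how headlines are shaped.

```javascript
// Hypothetical helper, NOT the real extract-tags from org-parser.
// Character class from the thread: Unicode letters plus 0-9 _ @ # %.
// Assumes the tag list sits at the end of the headline, after whitespace.
const TAG_LIST = /\s+:((?:[\p{Letter}0-9_@#%]+:)+)\s*$/u;

function splitHeadline(text) {
  const match = text.match(TAG_LIST);
  if (match === null) {
    // No trailing tag list: the whole text is the title.
    return { title: text, tags: [] };
  }
  return {
    // Everything before the whitespace that precedes the tag list.
    title: text.slice(0, match.index),
    // match[1] looks like "tag1:tag2:"; drop the empty trailing segment.
    tags: match[1].split(":").filter((tag) => tag.length > 0),
  };
}

console.log(splitHeadline("headline :标签:"));
```

Two caveats with this sketch: `\p{Letter}` does not cover combining marks (general category Mark), so scripts that rely on them may need `\p{Mark}` as well, and `0-9` still only covers ASCII digits. For the CLJ side, `java.util.regex` accepts `\p{L}` for the same letter category.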

munen avatar Jun 22 '21 09:06 munen

@yqu212 Do you want to make your first PR and include Chinese characters by employing the above regexp for CLJS and the equivalent for CLJ?

munen avatar Jun 22 '21 09:06 munen

It's a good idea. However, it will take some time to write the test since I am not familiar with CLJS.

yqu212 avatar Jun 22 '21 09:06 yqu212

Looks like it pays off that we're doing tag extraction in the transformation, not in the EBNF ^^

schoettl avatar Jun 22 '21 09:06 schoettl

@yqu212

It's a good idea. However, it will take some time to write the test since I am not familiar with CLJS.

No worries, take your time!

Good luck and enjoy :pray:

@schoettl

Looks like it pays off that we're doing tag extraction in the transformation, not in the EBNF ^^

:+1:

munen avatar Jun 22 '21 09:06 munen