SoMaJo

Other issue with Markdown style links.

Open PhilipMay opened this issue 1 year ago • 1 comment

Markdown links wrapped in emphasis, such as "*[Neubau](https://www.some-link.com)*", are tokenized incorrectly: the closing ")" and "*" are absorbed into the URL token.

Code:

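from somajo import SoMaJo

# Assumption: German model; the original snippet does not show how `somajo` was created.
somajo = SoMaJo("de_CMC")

text = "*[Neubau](https://www.some-link.com)*"
sentences = somajo.tokenize_text([text])
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")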

Returns:

*	symbol	SpaceAfter=No
[	symbol	SpaceAfter=No
Neubau	regular	SpaceAfter=No
]	symbol	SpaceAfter=No
(	symbol	SpaceAfter=No
https://www.some-link.com)*	URL	

Should return something like this:

*	symbol	SpaceAfter=No
[	symbol	SpaceAfter=No
Neubau	regular	SpaceAfter=No
]	symbol	SpaceAfter=No
(	symbol	SpaceAfter=No
https://www.some-link.com	URL
)	symbol	SpaceAfter=No
*	symbol	SpaceAfter=No

Full code: https://colab.research.google.com/drive/16-CKdzp20Gin02emrLVeHfFFir2veK8M?usp=sharing

PhilipMay avatar Feb 13 '24 21:02 PhilipMay

I’ve decided to explicitly add markdown links, so this should be fixed now, with the caveat that it will fail if the link description contains square brackets or if the URL contains parentheses.

tsproisl avatar Feb 19 '24 12:02 tsproisl
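
The caveat in the reply above can be illustrated with a minimal sketch. The pattern below is hypothetical and is not taken from SoMaJo's source; it only assumes a link description without square brackets and a URL without parentheses, which is the kind of restriction the reply describes. The name find_markdown_links is made up for this illustration.

```python
import re

# Hypothetical pattern, NOT SoMaJo's actual implementation:
# the description may not contain square brackets,
# the URL may not contain parentheses or whitespace.
MARKDOWN_LINK = re.compile(r"\[([^\[\]]*)\]\(([^()\s]*)\)")


def find_markdown_links(text):
    """Return (description, url) pairs for every simple Markdown link in text."""
    return [(m.group(1), m.group(2)) for m in MARKDOWN_LINK.finditer(text)]


# The link from this issue is recognized as a whole.
print(find_markdown_links("*[Neubau](https://www.some-link.com)*"))
# [('Neubau', 'https://www.some-link.com')]

# Square brackets in the description: no match.
print(find_markdown_links("[a [b] c](https://www.some-link.com)"))
# []

# Parentheses in the URL: no match.
print(find_markdown_links("[wiki](https://example.org/Foo_(bar))"))
# []
```

Under these assumptions, a link that violates either restriction is not recognized as a Markdown link at all, so a tokenizer built on such a pattern would presumably fall back to its generic URL handling for it.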