SoMaJo
Another issue with Markdown-style links.
Links in this format: "*[Neubau](https://www.some-link.com)*" are not tokenized correctly. The closing parenthesis and the trailing asterisk end up inside the URL token.
Code:
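from somajo import SoMaJo

# Assumption: German model, since the example word is German
somajo = SoMaJo("de_CMC")
text = "*[Neubau](https://www.some-link.com)*"
sentences = somajo.tokenize_text([text])
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")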
Returns:
* symbol SpaceAfter=No
[ symbol SpaceAfter=No
Neubau regular SpaceAfter=No
] symbol SpaceAfter=No
( symbol SpaceAfter=No
https://www.some-link.com)* URL
Should return something like this:
* symbol SpaceAfter=No
[ symbol SpaceAfter=No
Neubau regular SpaceAfter=No
] symbol SpaceAfter=No
( symbol SpaceAfter=No
https://www.some-link.com URL SpaceAfter=No
) symbol SpaceAfter=No
* symbol
Full code: https://colab.research.google.com/drive/16-CKdzp20Gin02emrLVeHfFFir2veK8M?usp=sharing
I’ve decided to add explicit handling for Markdown links, so this should be fixed now. One caveat: it will still fail if the link description contains square brackets or if the URL contains parentheses.
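To illustrate where that caveat comes from, here is a minimal sketch of a regex-based approach to splitting Markdown links. It is not SoMaJo's actual implementation: the pattern `MD_LINK` and the helper `split_markdown_link` are hypothetical names made up for this example. The pattern deliberately excludes square brackets from the description and parentheses from the URL, which is why it breaks in exactly the two cases mentioned above.

```python
import re

# Hypothetical pattern for illustration only, not taken from SoMaJo.
# Description: anything except square brackets.
# URL: anything except parentheses and whitespace.
MD_LINK = re.compile(r"\[([^\[\]]+)\]\(([^()\s]+)\)")


def split_markdown_link(text):
    """Return (description, url) for the first simple Markdown link, or None."""
    match = MD_LINK.search(text)
    return match.groups() if match else None


# Simple link: description and URL are separated cleanly.
print(split_markdown_link("*[Neubau](https://www.some-link.com)*"))
# ('Neubau', 'https://www.some-link.com')

# Square brackets in the description: no match.
print(split_markdown_link("[a [b] c](https://example.com)"))
# None

# Parentheses in the URL: no match.
print(split_markdown_link("[wiki](https://en.wikipedia.org/wiki/Foo_(bar))"))
# None
```

Excluding brackets and parentheses keeps the pattern unambiguous about where the link ends; supporting nested brackets or balanced parentheses would need something beyond a plain regular expression of this kind.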