stanza
[QUESTION] Is there a way to add exceptions to the tokenizers?
For example, I would like to provide a list of abbreviations and tell the model not to split a token if it appears in that list. This would be very useful for improving sentence splitting and tokenization for domain-specific vocabularies...
That feature is not currently implemented. Technically, you could probably do it for Chinese (ZH) or Vietnamese (VI), since those tokenizer models incorporate dictionaries. Other languages do not use the dictionary features yet, though.
In reply to José Angel Daza's post of Thu, Jun 23, 2022, 2:10 AM (https://github.com/stanfordnlp/stanza/issues/1055).
There's a solution that might help, although it requires some coding on your end. You can now pass tokenize_postprocessor=(callable) when creating the Pipeline. The callable receives the tokenizer's output, so you can redo the tokens on your end to fix up whichever exceptions you found.
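As a rough illustration of that idea, here is a minimal sketch of a callable that re-joins abbreviations the tokenizer has split. It assumes the postprocessor receives a list of sentences, each a list of plain token strings, and returns the same shape; check the Stanza tokenization docs for the exact contract in your version. The names make_abbreviation_merger and max_pieces are hypothetical helpers made up for this example, not part of Stanza. It also only merges within a sentence, so it does not undo a sentence break that was wrongly placed after an abbreviation.

```python
from typing import Callable, Iterable, List

Sentences = List[List[str]]


def make_abbreviation_merger(
    abbreviations: Iterable[str], max_pieces: int = 4
) -> Callable[[Sentences], Sentences]:
    """Build a postprocessor that merges adjacent tokens spelling an abbreviation.

    Hypothetical helper: assumes tokens are plain strings and that the pieces
    of an abbreviation were adjacent (no whitespace between them) in the text.
    """
    abbrevs = set(abbreviations)

    def postprocess(sentences: Sentences) -> Sentences:
        fixed = []
        for tokens in sentences:
            merged = []
            i = 0
            while i < len(tokens):
                # Try the longest window first so "e", ".", "g", "." wins over "e", "."
                for k in range(min(max_pieces, len(tokens) - i), 1, -1):
                    candidate = "".join(tokens[i:i + k])
                    if candidate in abbrevs:
                        merged.append(candidate)
                        i += k
                        break
                else:
                    # No window matched: keep the token unchanged
                    merged.append(tokens[i])
                    i += 1
            fixed.append(merged)
        return fixed

    return postprocess


merge = make_abbreviation_merger(["e.g.", "approx."])
example = [["Use", "a", "list", ",", "e.g", ".", "approx", ".", "ten", "items", "."]]
print(merge(example))

# Assumed usage with Stanza (not executed here):
# import stanza
# nlp = stanza.Pipeline(lang="en", processors="tokenize", tokenize_postprocessor=merge)
```

Because the sketch only concatenates existing tokens, the merged output still covers the same characters as the original tokenization; anything that adds or drops characters would need more care.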