[QUESTION] Is there a way to add exceptions to the tokenizers?

Open angel-daza opened this issue 3 years ago • 2 comments

For example, it would help to be able to provide a list of abbreviations and tell the model not to split a token if it belongs to the list. This would be very useful for improving sentence splitting and tokenization for domain-specific vocabularies...

angel-daza avatar Jun 23 '22 09:06 angel-daza

That feature is currently not implemented. Technically, you could probably do it for ZH or VI, since those models incorporate dictionaries. Other languages do not use the dictionary features yet, though.

AngledLuffa avatar Jun 23 '22 16:06 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 31 '22 11:08 stale[bot]

There's a solution which might help, although it requires some coding on your end. You can now pass tokenize_postprocessor=(callable) when creating the Pipeline, then redo the tokens on your end to fix up whichever exceptions you found.
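A minimal sketch of what such a callable might look like, under a few assumptions that are not confirmed in this thread: that the postprocessor receives the tokenized document as a list of sentences, each a list of token strings, and must return the same shape. The names merge_abbreviations, ABBREVIATIONS, fix_tokens, and build_pipeline are hypothetical helpers made up for illustration. Check the Stanza tokenizer documentation for the exact input and output format before relying on this.

```python
from typing import Iterable, List

# Hypothetical, domain-specific list of tokens that should never be split.
ABBREVIATIONS = {"Dr.", "e.g.", "Fig."}


def merge_abbreviations(
    sentences: List[List[str]],
    abbreviations: Iterable[str],
    max_parts: int = 4,
) -> List[List[str]]:
    """Rejoin adjacent tokens whose concatenation is a known abbreviation.

    Tokens are joined without whitespace, so ["Dr", "."] becomes ["Dr."].
    The longest matching span (up to max_parts tokens) wins.
    """
    known = set(abbreviations)
    fixed = []
    for sentence in sentences:
        merged = []
        i = 0
        while i < len(sentence):
            # Try the widest candidate span first, down to two tokens.
            for width in range(min(max_parts, len(sentence) - i), 1, -1):
                candidate = "".join(sentence[i:i + width])
                if candidate in known:
                    merged.append(candidate)
                    i += width
                    break
            else:
                # No abbreviation starts here; keep the token unchanged.
                merged.append(sentence[i])
                i += 1
        fixed.append(merged)
    return fixed


def fix_tokens(sentences: List[List[str]]) -> List[List[str]]:
    """Callable intended for tokenize_postprocessor (assumed signature)."""
    return merge_abbreviations(sentences, ABBREVIATIONS)


def build_pipeline():
    """Not called here; requires stanza and a downloaded English model."""
    import stanza

    # tokenize_postprocessor is the option mentioned above; the shape of the
    # data passed to fix_tokens is an assumption of this sketch.
    return stanza.Pipeline(
        lang="en", processors="tokenize", tokenize_postprocessor=fix_tokens
    )


example = fix_tokens(
    [["Dr", ".", "Smith", "arrived", "."],
     ["See", "e", ".", "g", ".", "Fig", ".", "2", "."]]
)
print(example)
```

This only re-merges tokens inside a sentence. If the tokenizer also ended a sentence right after an abbreviation, a similar pass that joins that sentence with the following one would be needed. It also cannot tell "Dr." from "Dr ." in the original text, since it only sees the token strings.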

AngledLuffa avatar Oct 03 '23 06:10 AngledLuffa