codeprep Create PreppedTokenSequence class to encapsulate getting full tokens from subtokens

Open hlibbabii opened this issue 4 years ago • 0 comments

The new PreppedTokenSequence class should encapsulate getting full tokens from subtokens (currently done by the FullTokenIterator class) while still providing transparent access to the subtokens.

Motivation:

To get the full tokens, the user won't have to know about FullTokenIterator; PreppedTokenSequence can provide this functionality directly.
The ModelContext class is not really needed anymore.

Provisional API:

>>> prepped_tokens = api.bpe("getName(", "5k")
>>> prepped_tokens
['get', 'Name', '</t>', '(']
>>> prepped_tokens.metadata.token_types
[SplitContainer, OpeningBracket]
>>> prepped_tokens.metadata.n_subtokens_per_token
[3, 1]
>>> prepped_tokens.full_tokens()
[['get', 'Name', '</t>'], ['(']]
>>> prepped_tokens.full_tokens(formatter=lambda s: ''.join(s))
['getName</t>', '(']
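
To make the intended behaviour concrete, here is a minimal sketch of how full tokens could be rebuilt from the flat subtoken list using n_subtokens_per_token. The names PreppedTokenSequenceSketch and MetadataSketch are hypothetical stand-ins and are not part of codeprep; token types are plain strings here purely for illustration, and the real class may well be designed differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class MetadataSketch:
    # Hypothetical stand-in for the metadata object in the provisional API.
    n_subtokens_per_token: List[int]
    token_types: List[str]  # strings for illustration only


class PreppedTokenSequenceSketch:
    """Illustrative only: a flat subtoken list plus grouping metadata."""

    def __init__(self, subtokens: List[str], metadata: MetadataSketch):
        # The group sizes must account for every subtoken exactly once.
        if sum(metadata.n_subtokens_per_token) != len(subtokens):
            raise ValueError("n_subtokens_per_token does not match subtokens")
        self.subtokens = list(subtokens)
        self.metadata = metadata

    # Transparent access to the subtokens: behaves like the underlying list.
    def __iter__(self):
        return iter(self.subtokens)

    def __len__(self):
        return len(self.subtokens)

    def __getitem__(self, index):
        return self.subtokens[index]

    def __repr__(self):
        return repr(self.subtokens)

    def full_tokens(
        self, formatter: Callable[[List[str]], Any] = lambda s: s
    ) -> List[Any]:
        # Walk the flat list, cutting off one group of subtokens per full token.
        result = []
        start = 0
        for n in self.metadata.n_subtokens_per_token:
            result.append(formatter(self.subtokens[start:start + n]))
            start += n
        return result


# Mirrors the "getName(" example from the provisional API.
seq = PreppedTokenSequenceSketch(
    ["get", "Name", "</t>", "("],
    MetadataSketch([3, 1], ["SplitContainer", "OpeningBracket"]),
)
grouped = seq.full_tokens()
joined = seq.full_tokens(formatter=lambda s: "".join(s))
```

With this shape the caller never touches FullTokenIterator: iterating over seq yields subtokens, while full_tokens() yields the grouped view.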

Feb 28 '20 17:02 hlibbabii
