codeprep
Create PreppedTokenSequence class to encapsulate getting full tokens from subtokens
The new PreppedTokenSequence class has two tasks: to encapsulate getting full tokens from subtokens (currently done by the FullTokenIterator class) and, at the same time, to provide transparent access to the subtokens.
Motivation:
- To get the full tokens, the user won't have to know about FullTokenIterator. This functionality can be provided by PreppedTokenSequence directly.
- The ModelContext class is not really needed anymore.
Provisional API:
>>> prepped_tokens = api.bpe("getName(", "5k")
>>> prepped_tokens
['get', 'Name', '</t>', '(']
>>> prepped_tokens.metadata.token_types
[SplitContainer, OpeningBracket]
>>> prepped_tokens.metadata.n_subtokens_per_token
[3, 1]
>>> prepped_tokens.full_tokens()
[['get', 'Name', '</t>'], ['(']]
>>> prepped_tokens.full_tokens(formatter=lambda s: ''.join(s))
['getName</t>', '(']