Generate sentencepiece model from iterable
🚀 Feature
Motivation
The current method of training a sentencepiece model requires a file to be passed. It would be nice if this was not required.
Pitch
Like other data-related functions in torchtext, there should be a _from_iterator function to train a sentencepiece model. It seems like sentencepiece actually supports this out of the box.
Alternatives
For some datasets, it may be prohibitively expensive to continually replenish the iterator on each pass. A middle-of-the-road approach would be to accept an iterator, write its contents to a temp file, call the "classic" file-based function on that temp file, and then delete the temp file. This could be added as a kwarg if necessary.
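A minimal sketch of the temp-file approach, to make the idea concrete. The names train_from_iterator and train_fn are hypothetical and not part of torchtext or sentencepiece; train_fn stands in for whatever existing file-based training function takes a path, and the sketch assumes each item yielded by the iterator is a single sentence.

```python
import os
import tempfile
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def train_from_iterator(sentences: Iterable[str], train_fn: Callable[[str], T]) -> T:
    """Write `sentences` to a temp file, call `train_fn(path)`, then delete the file.

    `train_fn` is a hypothetical stand-in for the existing file-based
    training function; it is assumed to take a file path as its only
    positional argument.
    """
    # mkstemp returns an open file descriptor and a path that we own.
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    try:
        # The iterator is consumed exactly once, so it never needs replenishing.
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for sentence in sentences:
                # One sentence per line, without doubling existing newlines.
                tmp.write(sentence.rstrip("\n") + "\n")
        # The file is closed (and flushed) before the trainer reads it.
        return train_fn(path)
    finally:
        # Clean up even if writing or training raises.
        os.remove(path)
```

Extra training options could be bound ahead of time with functools.partial, so the wrapper itself only needs to know about the path argument.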
Additional context
N/A
It seems like one way of doing this would require implementing this class and making the analog to this function accept a PyObject *iter (which requires inclusion of <Python.h> -- acceptable?). From there it should be straightforward to add the binding.