Generate sentencepiece model from iterable

Open: erip opened this issue 3 years ago • 1 comment

🚀 Feature

Motivation

The current method of training a sentencepiece model requires a file to be passed. It would be nice if the training data did not have to live in a file on disk.

Pitch

Like other data-related functions in torchtext, there should be a _from_iterator function to train a sentencepiece model. It seems that sentencepiece actually supports this out of the box.

Alternatives

For some datasets, it may be prohibitively expensive to continually replenish the iterator on each pass. A middle-of-the-road approach would be to allow passing an iterator, write the contents to a temp file, and call the "classic" function using the temp file which is then cleaned up. This could be added as a kwarg if necessary.
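The temp-file fallback described above could look roughly like the sketch below. It is not torchtext code: the helper name train_from_iterator and its train_fn parameter are hypothetical, and train_fn stands in for whatever file-based training function is used, assuming only that it takes a path as its first argument.

```python
import os
import tempfile
from typing import Any, Callable, Iterable


def train_from_iterator(
    sentences: Iterable[str],
    train_fn: Callable[..., Any],
    **train_kwargs: Any,
) -> Any:
    """Spool `sentences` to a temp file, train from that file, then delete it.

    `train_fn` is a hypothetical stand-in for a file-based trainer; it is
    called as `train_fn(path, **train_kwargs)`.
    """
    # delete=False so the trainer can reopen the file by name on any platform.
    tmp = tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8", suffix=".txt", delete=False
    )
    try:
        with tmp:
            # The iterator is consumed exactly once, one sentence per line.
            for sentence in sentences:
                tmp.write(sentence.rstrip("\n") + "\n")
        return train_fn(tmp.name, **train_kwargs)
    finally:
        # Clean up whether or not training succeeded.
        os.remove(tmp.name)
```

Because the iterator is read only once, this avoids replenishing it on each pass, at the cost of disk space for the spooled copy. The existing file-based function (generate_sp_model in torchtext.data.functional, which takes a filename) might be passed as train_fn, though that pairing is an assumption and untested here.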

Additional context

N/A

erip, Dec 22 '21 12:12

It seems like one way of doing this would require implementing this class and making the analog of this function accept a PyObject *iter (which requires including <Python.h>; is that acceptable?). From there, adding the binding should be straightforward.

erip, Dec 22 '21 13:12