
spaCy's nlp.max_length

Open D0cRandom opened this issue 5 years ago • 0 comments

With the CrazyTokenizer (excellent results, btw, thanks!) I am running into an issue with spaCy's maximum character length: "[E088] Text of length 3029371 exceeds maximum of 1000000." You can change nlp.max_length, but for that you have to load spaCy itself. Is there a way to set nlp.max_length when loading the CrazyTokenizer? (I know I could simply cut the file in three, but I'd rather avoid that, as I'd have to manually stitch the resulting token sets back together, and I'd have to do this for various files.)
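For reference, the cut-and-stitch fallback mentioned above could be automated roughly like this. It is only a sketch of the workaround, not a way of setting nlp.max_length through the CrazyTokenizer. The helper names `split_on_whitespace` and `tokenize_in_chunks` and the 900,000-character default are made up for illustration, and `tokenize` stands for any callable that turns a string into a list of tokens (presumably `CrazyTokenizer().tokenize`, which is an assumption about how it would be plugged in).

```python
from typing import Callable, Iterator, List


def split_on_whitespace(text: str, max_chars: int = 900_000) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `max_chars` long.

    Cuts are made just after a space or newline where one exists in the
    window, so words are not split. Falls back to a hard cut otherwise.
    """
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:
            # Look for the last space or newline inside the current window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        yield text[start:end]
        start = end


def tokenize_in_chunks(
    text: str,
    tokenize: Callable[[str], List[str]],
    max_chars: int = 900_000,
) -> List[str]:
    """Tokenize `text` piece by piece and concatenate the token lists."""
    tokens: List[str] = []
    for piece in split_on_whitespace(text, max_chars):
        tokens.extend(tokenize(piece))
    return tokens


# Hypothetical usage (assumes redditscore is installed):
# from redditscore.tokenizer import CrazyTokenizer
# tokens = tokenize_in_chunks(open("big_file.txt").read(), CrazyTokenizer().tokenize)
```

One caveat with this approach: any tokenizer option that looks across whitespace (multi-word expressions, for instance) may behave differently right at a cut point, so the result is not guaranteed to be identical to tokenizing the whole text in one pass.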

D0cRandom, Jan 23 '20 23:01