universal-data-tool

Tokenization of two ints separated by a space

Open jmn319 opened this issue 4 years ago • 0 comments

Full disclosure: I have only spent a handful of hours with the tool, so my apologies if there is an easy fix for this.

I started with a data set in which two ints often appear next to each other, separated by one or more spaces. When I went to the labeling UI, I noticed that the two ints are treated as a single token. They are tokenized as one token even when they are separated by a comma. Screenshots for a full repro are below.

Any thoughts on how I can get these to be separate tokens? I'm hoping there are some simple settings I can change.
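To make the expected behavior concrete, here is a rough sketch in plain Python of the tokenization I had in mind. This is not the tool's code, and I don't know how the tool tokenizes internally; `tokenize` and `TOKEN_RE` are hypothetical names made up for illustration. The idea is simply that whitespace (one space or many) and punctuation such as commas should end a token.

```python
import re

# Hypothetical helper for illustration only; not part of universal-data-tool.
# A token is one of:
#   - a run of digits            (\d+)
#   - a run of letters/underscores ([^\W\d]+)
#   - a single punctuation mark  ([^\w\s])
# Whitespace never matches, so it only acts as a separator.
TOKEN_RE = re.compile(r"\d+|[^\W\d]+|[^\w\s]")


def tokenize(text):
    """Return a list of (token, start_offset) pairs for the given text."""
    return [(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for sample in ["12 34", "12    34", "12,34"]:
        print(repr(sample), tokenize(sample))
```

With this sketch, `"12 34"` and `"12    34"` each yield two int tokens, and `"12,34"` yields the two ints with the comma as its own token in between. The start offsets are kept so that labels could still be mapped back to positions in the original text.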

udt-token1

udt-token2

udt-token3

jmn319, Apr 20 '21 00:04