prose
Start, end of tokens in sanitized text
This update keeps track of original locations in the provided text. However, I couldn't figure out a better way to deal with the sanitizing (clean-text) step inside Tokenize without breaking the API here:
clean, white := t.sanitizer.Replace(text), false
length := len(clean)
Obviously, clean becomes the actual source for Start and End, not the original text.
Possible solution: leave sanitizing up to the caller, so that they can have both the original string and the locations in it.
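A rough sketch of that caller-side flow, assuming a hypothetical tokenizer that records byte offsets into whatever string it is handed (the span type and tokenize function below are illustrative, not the library's real API). Since the caller sanitizes first, Start and End index into a string the caller actually holds:

```go
package main

import (
	"fmt"
	"strings"
)

// span records byte offsets of one token in the tokenized string.
type span struct{ Start, End int }

// tokenize is a stand-in for a Tokenize method that does no
// sanitizing of its own: it simply splits on spaces and records
// offsets into the exact string it was given.
func tokenize(text string) []span {
	var spans []span
	for i := 0; i < len(text); {
		for i < len(text) && text[i] == ' ' { // skip spaces
			i++
		}
		if i >= len(text) {
			break
		}
		start := i
		for i < len(text) && text[i] != ' ' { // consume the token
			i++
		}
		spans = append(spans, span{start, i})
	}
	return spans
}

func main() {
	// The caller owns the sanitizing step...
	sanitizer := strings.NewReplacer("\u00a0", " ")
	original := "one\u00a0two"
	clean := sanitizer.Replace(original)

	// ...so the returned offsets are valid in clean, a string the
	// caller can keep alongside the original.
	for _, s := range tokenize(clean) {
		fmt.Println(clean[s.Start:s.End])
	}
}
```

The trade-off is exactly the one discussed below: callers who relied on Tokenize sanitizing for them would now have to do it themselves.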
I did not understand how this change is supposed to break the API. Could you please explain in more detail? Your changes seem promising!
Thank you for reviewing.
See, what is expected is that Start and End refer to the original source string. However, the text is sanitized right inside Tokenize, so all offsets are computed against a possibly modified source, and that expectation no longer holds. (That's why I commented them with a reference to the sanitized text, as a reminder.) The breaking change would be to remove the sanitizing and leave it up to the caller to choose; the caller would then hold a reference to the correct source string.
The fear, though, is that the library depends on this sanitizing step.