prose Start, end of tokens in sanitized text

Open zzwx opened this issue 3 years ago • 2 comments

This update keeps track of the original locations of tokens in the provided text. However, I couldn't figure out a better way to deal with the sanitizing step inside Tokenize, which produces the sanitized (clean) text, without breaking the API here:

	clean, white := t.sanitizer.Replace(text), false
	length := len(clean)

As a result, clean becomes the actual source for Start and End, not the original text.

Possible solution: Leave sanitizing up to caller so that they can have both the original string & the locations in it.
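To make the mismatch concrete, here is a minimal sketch using only the Go standard library. It assumes the sanitizer behaves like a strings.Replacer; the replacement pairs, the span type, and the locate helper are hypothetical stand-ins, not the library's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize stands in for the sanitizing step inside Tokenize.
// The replacement pairs here are an assumption for illustration only.
func sanitize(text string) string {
	return strings.NewReplacer("\u201c", `"`, "\u201d", `"`, "&rsquo;", "'").Replace(text)
}

// span is a hypothetical pair of byte offsets, like Start and End on a token.
type span struct{ Start, End int }

// locate returns the byte offsets of the first occurrence of word in s,
// or {-1, -1} if word does not occur.
func locate(s, word string) span {
	i := strings.Index(s, word)
	if i < 0 {
		return span{-1, -1}
	}
	return span{i, i + len(word)}
}

func main() {
	original := "\u201cHi\u201d there" // curly quotes take 3 bytes each in UTF-8
	clean := sanitize(original)        // straight quotes take 1 byte each

	sp := locate(clean, "there")

	// The offsets are valid for the sanitized text...
	fmt.Printf("%q\n", clean[sp.Start:sp.End])
	// ...but point somewhere else when applied to the original.
	fmt.Printf("%q\n", original[sp.Start:sp.End])
}
```

Any replacement that changes the byte length shifts every offset after it, so the caller cannot map Start and End back to the original unless they hold the sanitized string themselves.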

Nov 30 '20 13:11 zzwx

I don't understand how this change is supposed to break the API. Could you please explain in more detail? Your changes seem promising!

Jan 15 '21 12:01 nicolasassi

Thank you for reviewing.

See, what is expected is that Start and End refer to the original source string. However, the text is sanitized right inside Tokenize, so all calculations are based on a possibly modified source and that expectation no longer holds. (That's why I commented both fields with a reference to the sanitized text, as a reminder.) The breaking change would be to remove sanitizing from Tokenize and leave it up to the caller. The caller would then hold a reference to the exact string that the offsets refer to.
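A rough sketch of what that could look like from the caller's side. The token type and the whitespace-only tokenize function below are hypothetical and much simpler than the library's tokenizer; the point is only that, without an internal sanitizing step, Start and End always index into the string the caller passed in.

```go
package main

import (
	"fmt"
	"unicode"
)

// token is a hypothetical token carrying byte offsets into the input.
type token struct {
	Text       string
	Start, End int
}

// tokenize splits on whitespace and does no sanitizing of its own,
// so every offset refers to the exact string the caller supplied.
func tokenize(text string) []token {
	var out []token
	start := -1 // -1 means we are not inside a token
	for i, r := range text {
		if unicode.IsSpace(r) {
			if start >= 0 {
				out = append(out, token{text[start:i], start, i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		out = append(out, token{text[start:], start, len(text)})
	}
	return out
}

func main() {
	original := "\u201cHi\u201d there"

	// The caller decides whether to sanitize first. Either way, the offsets
	// match whichever string was passed to tokenize.
	for _, tok := range tokenize(original) {
		fmt.Println(tok.Start, tok.End, tok.Text == original[tok.Start:tok.End])
	}
}
```

A caller who still wants sanitized input can sanitize first and keep that string around, which is the trade-off described above: the offsets are correct, but only relative to whatever the caller chose to pass in.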

The fear though is that the library depends on this sanitizing step.

Jan 15 '21 12:01 zzwx
