textract
Is removing 1 space, but 2+ spaces to 1 space, deliberate?
What is the intention of lines 65,66 here? https://github.com/dbashford/textract/blob/master/lib/extract.js#L66
// multiple spaces, tabs, vertical tabs, non-breaking space]
text = text.replace( / (?! )/g, '' )
  .replace( /[ \t\v\u00A0]{2,}/g, ' ' );
The first replace removes single space characters, the second says 2+ whitespace get replaced with a single space.
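To make that concrete, here is a small sketch of the two calls run in isolation on made-up ASCII input. `collapse` is a hypothetical helper name used only for this illustration; it is not part of textract.

```javascript
// Hypothetical helper wrapping the two replace() calls quoted above.
function collapse(text) {
  return text
    .replace(/ (?! )/g, '')               // drop any space not followed by another space
    .replace(/[ \t\v\u00A0]{2,}/g, ' ');  // squeeze runs of 2+ whitespace chars into one space
}

console.log(JSON.stringify(collapse('a b')));   // "ab"  (one space in, none out)
console.log(JSON.stringify(collapse('a  b')));  // "a b" (two spaces in, one out)
console.log(JSON.stringify(collapse('a   b'))); // "a b" (three spaces in, one out)
```

On raw input like this, a lone space disappears while a run of two or more survives as a single space.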
In one test document (a .doc) I was getting an extra space imported. If I change the first replace() to remove multiple spaces, it works. But I'm wondering if I'm breaking something else?
This very likely ties in with #145.
As a detailed example, after the regex replace I have:
74 0020
75 677E 松
76 0020
77 0020
78 962A 阪
79 0020
80 5E02 市
81 0020
(In my debug data, the first column is string index.)
And then after the replace at lines 65,66 I get:
37 677E 松
38 0020
39 962A 阪
40 5E02 市
But if I use the code below (adding a + after the space in the first regex, and removing the space from the 2nd regex), I get the desired result:
37 677E 松
38 962A 阪
39 5E02 市
text = text.replace( / +(?! )/g, '' )
  .replace( /[\t\v\u00A0]{2,}/g, ' ' );
The first replacement replaces the 2nd space of a 2 space group, not single spaces.
if ( options.preserveLineBreaks ) {
  text = text.replace( WHITELIST_PRESERVE_LINEBREAKS, ' ' );
} else {
  text = text.replace( WHITELIST_STRIP_LINEBREAKS, ' ' );
}
That code introduces superfluous spaces to the text. The first regex just undoes any possible cases where a 2nd space has been introduced where previously there was one.
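As a rough illustration of how a whitelist replace can introduce spaces that then need undoing, here is a sketch with a deliberately tiny stand-in pattern. `TOY_WHITELIST` and `cleanse` are assumptions made up for this example, not textract's actual regex or function. The key assumption is that the pattern ends in `*`, so it also matches the empty string between allowed characters.

```javascript
// Assumed stand-in for the whitelist regex: anything that is NOT a-z or a space.
// The trailing * lets it match the empty string at every position.
const TOY_WHITELIST = /[^a-z ]*/g;

// Hypothetical helper: whitelist replace followed by the two cleanup regexes.
function cleanse(text) {
  const spaced = text.replace(TOY_WHITELIST, ' ');
  const cleaned = spaced
    .replace(/ (?! )/g, '')               // remove the last space of every run
    .replace(/[ \t\v\u00A0]{2,}/g, ' ');  // collapse what is left of longer runs
  return { spaced, cleaned };
}

for (const input of ['ab', 'a b', 'a#b']) {
  const { spaced, cleaned } = cleanse(input);
  console.log(JSON.stringify(input), JSON.stringify(spaced), JSON.stringify(cleaned));
}
// "ab"  " a b "   "ab"
// "a b" " a   b " "a b"
// "a#b" " a  b "  "a b"
```

Under these assumptions, a genuine word space is padded to three spaces and comes back as one, while a character that is not on the whitelist (`#` here) also ends up as a single space in the output.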
replace( / +(?! )/g, '' ) will eliminate all spaces from the output, including those between words. That is not the goal. We want the text to be readable.
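A quick check on a made-up English sentence shows the effect of the greedy `+` (the sample string is only an illustration):

```javascript
const sample = 'the quick  brown fox';

// Variant with "+": the greedy " +" grabs each whole run of spaces,
// and the lookahead then succeeds, so every run is removed entirely.
const proposed = sample.replace(/ +(?! )/g, '');

// Pattern without "+": only the last space of each run is removed.
const existing = sample.replace(/ (?! )/g, '');

console.log(proposed); // thequickbrownfox
console.log(existing); // thequick brownfox
```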
I'd suggest updating the whitelist regexes to include the missing characters reported in #145. I'll get to it soon, I hope.
Thanks for the reply.
I am wondering why the previous line replaces those characters with a space, only to then remove them. Is the space being used as a kind of magic value, so that it could be any Unicode character?
Because I've only tried textract so far on a language that doesn't use spaces, I'm probably seeing problems with cleanseText() that are not noticed in other languages. I wonder if we could document it (a mix of comments and unit tests?) to describe what problems it is fixing?
FWIW, if you've got a case where you need all the spaces removed regardless, it's a simple regex/replace after textract has done its job. I'm sure that is why it's been around so long without anyone bringing that up. Folks just strip the text after the fact themselves.
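A minimal sketch of that kind of post-processing, assuming the extracted text is already in hand as a string. `stripSpaces` is a hypothetical helper name, and the character set simply mirrors the one in the cleanup regex quoted earlier in the thread.

```javascript
// Hypothetical post-processing helper, applied to text that has already been extracted.
function stripSpaces(text) {
  // Remove spaces, tabs, vertical tabs and non-breaking spaces; line breaks are left alone.
  return text.replace(/[ \t\v\u00A0]+/g, '');
}

console.log(stripSpaces('松 阪 市')); // 松阪市
```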