Is removing 1 space, but 2+ spaces to 1 space, deliberate?

Open DarrenCook opened this issue 6 years ago • 3 comments

What is the intention of lines 65,66 here? https://github.com/dbashford/textract/blob/master/lib/extract.js#L66

 // multiple spaces, tabs, vertical tabs, non-breaking space]
 text = text.replace( / (?! )/g, '' )
   .replace( /[ \t\v\u00A0]{2,}/g, ' ' );

The first replace removes single space characters; the second replaces any run of 2+ whitespace characters with a single space.
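To see what the two calls do in isolation, here is a minimal sketch that applies them to short strings with one, two and three spaces. The helper name `collapseSpaces` is made up for this example; the two regexes are copied from the lines quoted above.

```javascript
// Hypothetical helper wrapping the two replacements quoted from extract.js.
function collapseSpaces(text) {
  return text
    .replace(/ (?! )/g, '')              // drop a space that is not followed by another space
    .replace(/[ \t\v\u00A0]{2,}/g, ' '); // squeeze any remaining run of 2+ into one space
}

const single = collapseSpaces('a b');   // one space between the letters
const double = collapseSpaces('a  b');  // two spaces
const triple = collapseSpaces('a   b'); // three spaces

console.log(JSON.stringify([single, double, triple]));
// ["ab","a b","a b"]
```

On their own, then, the two calls delete a lone space and reduce any longer run to one space.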

In one test document (a .doc) I was getting an extra space imported. If I change the first replace() to remove multiple spaces it works. But I'm wondering if I'm breaking something else?

This very likely ties in with #145.

As a detailed example, after the preceding regex replace (the one just before lines 65,66) I have:

74 0020
75 677E 松
76 0020
77 0020
78 962A 阪
79 0020
80 5E02 市
81 0020

(In my debug data, the first column is string index.)

And then after the replace at lines 65,66 I get:

37 677E 松
38 0020
39 962A 阪
40 5E02 市

But if I use the command below (adding a + after the space in the first regex, and removing the space from the 2nd regex) I get the desired result:

37 677E 松
38 962A 阪
39 5E02 市

  text = text.replace( / +(?! )/g, '' )
    .replace( /[\t\v\u00A0]{2,}/g, ' ' );

DarrenCook, Mar 23 '18 13:03

The first replacement is there to remove the 2nd space of a 2-space group, not the single spaces of the original text. Those groups come from this code, which runs just before it:

  if ( options.preserveLineBreaks ) {
    text = text.replace( WHITELIST_PRESERVE_LINEBREAKS, ' ' );
  } else {
    text = text.replace( WHITELIST_STRIP_LINEBREAKS, ' ' );
  }

That code introduces superfluous spaces to the text. The first regex just undoes any possible cases where a 2nd space has been introduced where previously there was one.

replace( / +(?! )/g, '' ) will eliminate all spaces from the output, including those between words. That is not the goal. We want the text to be readable.
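A small comparison of the two variants, as a sketch only. The input assumes word gaps have already been widened to two spaces by the earlier step; the sample string is invented for illustration.

```javascript
// Invented sample: word gaps are two spaces wide, as assumed after the whitelist step.
const sample = 'the  quick  brown';

// Existing regex: removes only the last space of each run.
const existing = sample.replace(/ (?! )/g, '');

// Proposed regex: " +" swallows the whole run, so no gap is left at all.
const proposed = sample.replace(/ +(?! )/g, '');

console.log(existing); // the quick brown
console.log(proposed); // thequickbrown
```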

I would suggest the whitelist regexes be updated to include the missing characters reported in #145. I'll get to it soon, I hope.

dbashford, Mar 23 '18 15:03

Thanks for the reply.

I am wondering why the previous line replaces those characters with a space, only to then remove them. Is the space being used as a magic number, so that it could be any Unicode character?

Because I've only tried textract so far on a language that doesn't use spaces, I'm probably seeing problems with cleanseText() that are not noticed in other languages. I wonder if we could document it (a mix of comments and unit tests?) to describe what problems it is fixing?

DarrenCook, Mar 23 '18 17:03

FWIW, if you've got a case where you need all the spaces removed regardless, it's a simple regex/replace after textract has done its job. I'm sure that is why it's been around so long without anyone bringing that up. Folks just strip the text after the fact themselves.
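A minimal sketch of that kind of post-processing, for text in a language that does not separate words with spaces. The function name `stripAllWhitespace` is hypothetical, and the input string stands in for whatever text textract returned.

```javascript
// Hypothetical post-processing step, applied to text already extracted.
function stripAllWhitespace(text) {
  // \s covers spaces, tabs, line breaks, U+00A0 and the ideographic space U+3000.
  return text.replace(/\s+/g, '');
}

console.log(stripAllWhitespace('松 阪市')); // 松阪市
```

This also removes line breaks, so it only suits cases where the layout of the extracted text does not matter.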

dbashford, Mar 23 '18 19:03