Problems with garbled characters in docx files

Open uptown opened this issue 5 years ago • 1 comments

Hi,

While extracting text from a docx file, I found that some text is not extracted as expected.

[AS-IS] image

I think the problem is caused by the conversion from byte buffer to string in this file:

function getTextFromZipFile( zipfile, entry, cb ) {
  zipfile.openReadStream( entry, function( err, readStream ) {
    var text = ''
      , error = ''
      ;

    if ( err ) {
      cb( err, null );
      return;
    }

    readStream.on( 'data', function( chunk ) {
      text += chunk; // HERE !! 
    });
    readStream.on( 'end', function() {
      if ( error.length > 0 ) {
        cb( error, null );
      } else {
        cb( null, text );
      }
    });
    readStream.on( 'error', function( _err ) {
      error += _err;
    });
  });
}

In this function, the line text += chunk; converts each chunk to a string on its own. If a multi-byte character is split across two chunks, it is decoded incorrectly, so text can end up containing wrong characters.
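To illustrate the failure mode, here is a minimal standalone sketch, assuming the docx XML is UTF-8 encoded. It does not use textract or the zip stream at all; the chunk boundary is forced by hand, and the variable names are only for illustration.

```javascript
// '€' is a three-byte UTF-8 character (bytes e2 82 ac).
var whole = Buffer.from( '€', 'utf8' );

// Simulate a stream that delivers the character split across two chunks.
var chunkA = whole.subarray( 0, 2 );
var chunkB = whole.subarray( 2 );

// Per-chunk conversion: each piece is decoded separately, so the
// incomplete byte sequences become U+FFFD replacement characters.
var garbled = '';
garbled += chunkA;
garbled += chunkB;

// Joining the raw bytes first and decoding once keeps the character intact.
var intact = Buffer.concat( [ chunkA, chunkB ] ).toString( 'utf8' );

console.log( garbled === '€' ); // false
console.log( intact === '€' );  // true
```

Whether this actually happens with a given docx depends on where the chunk boundaries fall, which would explain why only some of the text comes out garbled.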

So I changed the function slightly, changing the type of text from string to Buffer:

function getTextFromZipFile( zipfile, entry, cb ) {
  zipfile.openReadStream( entry, function( err, readStream ) {
    var text = Buffer.alloc( 0 ) // new Buffer() is deprecated in current Node.js
      , error = ''
      ;

    if ( err ) {
      cb( err, null );
      return;
    }

    readStream.on( 'data', function( chunk ) {
      text = Buffer.concat([text, chunk]);
    });
    readStream.on( 'end', function() {
      if ( error.length > 0 ) {
        cb( error, null );
      } else {
        cb( null, "" + text );
      }
    });
    readStream.on( 'error', function( _err ) {
      error += _err;
    });
  });
}

And I finally get the right output.

[TO-BE] image

Is there any problem with this approach? Thank you.
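For reference, one possible variant of the change above, as a sketch only: collect the chunks in an array and concatenate once at the end, instead of calling Buffer.concat on every chunk. readStreamToString is a hypothetical helper name, not something that exists in textract, and it assumes the stream emits Buffer chunks and that the entry is UTF-8 encoded.

```javascript
// Hypothetical helper (not part of textract): gathers a stream's raw
// chunks and decodes them a single time when the stream ends.
function readStreamToString( readStream, cb ) {
  var chunks = []
    , failed = false
    ;

  readStream.on( 'data', function( chunk ) {
    chunks.push( chunk ); // keep raw bytes, no per-chunk decoding
  });
  readStream.on( 'error', function( err ) {
    failed = true;
    cb( err, null );
  });
  readStream.on( 'end', function() {
    if ( !failed ) {
      // Assumption: the entry is UTF-8 encoded.
      cb( null, Buffer.concat( chunks ).toString( 'utf8' ) );
    }
  });
}
```

Inside getTextFromZipFile this could be called as readStreamToString( readStream, cb ) once the err check has passed. The behaviour should match the version above; the difference is only that the bytes are copied once rather than on every chunk.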

uptown, Nov 03 '18 03:11

Doesn't seem to be anything up with that approach! Please do give the tests a go and submit a PR. I tend to wait 3 months or so while issues and PRs pile up and then go through them and release. Should be doing another soon.

Thanks!

dbashford, Nov 05 '18 15:11