gulp-concat doesn't seem to support UTF-16

Open billrawlinson opened this issue 8 years ago • 16 comments

When concatenating files that are UTF-16 Little Endian (Unicode), every other file gets munged a bit.

The same thing happens when concatenating files that are UTF-16 Big Endian.

If you alternate files so that the first is UTF-16LE and the second is UTF-16BE, then just the very end of the second file gets munged.

I have set up a demo project that illustrates this and has a bunch of notes explaining why I even tried these things. I don't know for certain that the problem is in gulp-concat (it could be in gulp itself, in gulp.src()).

https://github.com/finalcut/gulp-concat-bug

billrawlinson avatar Jul 09 '15 21:07 billrawlinson

Is this with concat or gulp itself?

yocontra avatar Jul 09 '15 21:07 yocontra

It seems like it is concat to me, considering that the munged characters are interleaved (every other file). I figured I'd post the problem here first and see if you guys could see it and, possibly, confirm or reject that it is with gulp-concat.

billrawlinson avatar Jul 10 '15 14:07 billrawlinson

@billrawlinson Can you try just piping src to dest a bunch of times and see if that causes the issue as well?

yocontra avatar Jul 10 '15 19:07 yocontra

Sure, I'll try Monday. I'm on the road now. If anyone else wants to know sooner, they can pull the demo project and try.

I figure the problem is either in the file read or in concat, since the problem manifests in the middle of the concatenated result, which should rule out the write operation.

billrawlinson avatar Jul 10 '15 22:07 billrawlinson

So I ran the tests where I just pipe the files to dest, and nothing funky happens to the files in the process.

I've updated the test demo project so that it does both.

If you want to run the tests to see the results, just pull the project and give it a run. Each test now puts its results in a folder titled "results#", where # is the number of the test being run.

https://github.com/finalcut/gulp-concat-bug

billrawlinson avatar Jul 13 '15 12:07 billrawlinson

I'm guessing it has something to do with buffer conversions in concat-with-sourcemaps:

  • https://github.com/floridoo/concat-with-sourcemaps/blob/master/index.js#L109
  • https://github.com/floridoo/concat-with-sourcemaps/blob/master/index.js#L43-L46
  • https://github.com/floridoo/concat-with-sourcemaps/blob/master/index.js#L15-L18

Probably mixing a bunch of encodings together using Node's Buffer module is causing unexpected results.

yocontra avatar Jul 13 '15 22:07 yocontra

In test examples 2 (utf16le) and 3 (utf16be), the encodings are all the same. Tests 1 and 4, with mixed encodings, end up with better results (though still broken). Test 5, utf8, is the only one that has the correct results.

billrawlinson avatar Jul 13 '15 23:07 billrawlinson

@billrawlinson I mean that the separator is treated as UTF-8, so combining that with some UTF-16 buffers might be yielding weird results
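
A minimal sketch of that effect, using plain Node buffers rather than gulp-concat's actual code path. The file contents and variable names are made up for illustration, and on older Node versions `new Buffer(...)` would stand in for `Buffer.from(...)`. A one-byte UTF-8 newline between UTF-16LE buffers shifts the two-byte alignment, which would explain why every other file is munged:

```javascript
// Three stand-in "files" holding UTF-16LE text (2 bytes per character).
var fileA = Buffer.from('ab', 'utf16le'); // bytes: 61 00 62 00
var fileB = Buffer.from('cd', 'utf16le'); // bytes: 63 00 64 00
var fileC = Buffer.from('ef', 'utf16le'); // bytes: 65 00 66 00

// A newline encoded as UTF-8 is a single byte (0a).
var utf8Separator = Buffer.from('\n', 'utf8');

// Joining with the 1-byte separator puts fileB at an odd byte offset,
// so its bytes are paired up wrongly when read back as UTF-16LE.
// The second separator restores even alignment, so fileC is intact.
var joined = Buffer.concat([fileA, utf8Separator, fileB, utf8Separator, fileC]);
var decoded = joined.toString('utf16le');
console.log(JSON.stringify(decoded));

// For comparison: a newline encoded as UTF-16LE is two bytes (0a 00),
// which keeps every file aligned.
var utf16Separator = Buffer.from('\n', 'utf16le');
var fixed = Buffer.concat([fileA, utf16Separator, fileB, utf16Separator, fileC])
  .toString('utf16le');
console.log(JSON.stringify(fixed));
```

In this sketch the first and third files decode correctly and the second does not, which matches the "every other file" pattern reported above.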

yocontra avatar Jul 14 '15 06:07 yocontra

ah, that makes perfect sense.

billrawlinson avatar Jul 14 '15 13:07 billrawlinson

I assume, due to the nature of gulp pipes, that concat has no way of knowing the encoding of the various buffers coming into it from src?

billrawlinson avatar Jul 14 '15 13:07 billrawlinson

You are correct; it is the separator character that is causing the problem. I set up the test as follows:

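var gulp = require('gulp');
var concat = require('gulp-concat');

// Concatenate d.sources into d.outfile with an empty separator,
// so no newline bytes are inserted between the files.
function runConcatTest(d) {
  var testResults = gulp.src(d.sources)
    .pipe(concat(d.outfile, { newLine: '' }))
    .pipe(gulp.dest(d.outpath));
  // printToConsole is a helper that is not shown in this snippet.
  testResults.on('data', printToConsole);
}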

Here I basically blanked out the newLine separator, and tests 2 and 3 both work perfectly, while tests 1 and 4 are all mucked up. If I don't override newLine, it is broken as before.

Maybe, as a temporary solution, the readme could just be updated to let people know that if they are joining UTF-16 files, they should put their own newline at the end of each file and then override the join character to be nothing.

UPDATE: I updated the demo project to show the working scenario, with test 2 using an empty string as the separator.
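
At the byte level, the workaround amounts to something like the following sketch. It uses plain Node buffers with made-up file contents, not the demo project's files, and it assumes every input is UTF-16LE and already ends with its own newline:

```javascript
// Each stand-in "file" carries its own trailing newline, encoded as UTF-16LE.
var first = Buffer.from('first\n', 'utf16le');
var second = Buffer.from('second\n', 'utf16le');

// An empty separator contributes zero bytes, so the 2-byte alignment
// of the UTF-16 data is never disturbed.
var emptySeparator = Buffer.from('');

var combined = Buffer.concat([first, emptySeparator, second]);
var text = combined.toString('utf16le');
console.log(JSON.stringify(text));
```

This only helps when all inputs share one encoding. Mixed UTF-16LE and UTF-16BE inputs (tests 1 and 4) stay broken, because no single decoding fits both halves of the output.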

billrawlinson avatar Jul 14 '15 17:07 billrawlinson

@billrawlinson Hmm, trying to think up a solution here. Going to dig into the buffer docs and see if I can figure something out.

yocontra avatar Jul 14 '15 19:07 yocontra

https://nodejs.org/api/buffer.html#buffer_class_method_buffer_isencoding_encoding

Could emit a warning if the user mixes encodings (assuming we can't figure out a way to make it work).
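
One possible shape for such a warning, as a rough sketch only. `Buffer.isEncoding` checks whether a name is an encoding Node recognizes; it does not inspect file contents, so this sketch instead guesses from a leading byte order mark (BOM). The helper names `sniffBomEncoding` and `findEncodings` are hypothetical and are not part of gulp-concat or Node, and files without a BOM cannot be classified this way:

```javascript
// Hypothetical helper: guess an encoding label from a leading BOM.
// Returns null when no known BOM is present. The labels are just strings
// for reporting, not necessarily names that Node's Buffer accepts.
function sniffBomEncoding(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return 'utf8';
  }
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) {
    return 'utf16le';
  }
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) {
    return 'utf16be';
  }
  return null;
}

// Hypothetical helper: collect the distinct encoding labels seen in a
// list of buffers, using 'unknown' for buffers without a BOM.
function findEncodings(buffers) {
  var seen = [];
  buffers.forEach(function (buf) {
    var label = sniffBomEncoding(buf) || 'unknown';
    if (seen.indexOf(label) === -1) {
      seen.push(label);
    }
  });
  return seen;
}

// Made-up inputs: one UTF-16LE file with a BOM, one UTF-8 file without.
var withBom = Buffer.from('\ufeffone', 'utf16le');
var withoutBom = Buffer.from('two', 'utf8');

var encodings = findEncodings([withBom, withoutBom]);
if (encodings.length > 1) {
  console.warn('Possibly mixed encodings: ' + encodings.join(', '));
}
```

BOM sniffing is a heuristic. It says nothing about files that omit the BOM, so a real check would probably need to report "unknown" rather than assume UTF-8.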

yocontra avatar Jul 14 '15 19:07 yocontra

I played around with this for a bit and it stumped me. @billrawlinson, did you figure anything out?

yocontra avatar Dec 10 '15 04:12 yocontra

I did not. I just resorted to not using UTF-16 :-1:

billrawlinson avatar Dec 10 '15 14:12 billrawlinson

I have run into the same issue, and it turns out the files that end up munged are UTF-16.

troydemonbreun avatar Aug 19 '16 16:08 troydemonbreun