Mangled file content when multipart-POSTing a file with a "text/*" content type
Reading a UTF-8 CSV and attempting to upload it with needle via a multipart POST can cause non-ASCII characters inside the CSV data to be replaced by other characters.
Steps to Reproduce:
# node --version
v14.18.1
'use strict';
const childproc = require('child_process');
const needle = require('needle');
const net = require('net');
// POSTs `buffer` as one multipart file part to a throwaway local TCP server and
// pipes the captured request bytes through `hexdump -C` for inspection.
async function postAndCapture(buffer, content_type) {
  const port = 49152 + Math.floor(Math.random() * 16384);
  const proxy = net.createServer();
  await new Promise((res, rej) => {
    proxy.on('error', rej);
    proxy.listen(port, res);
  });
  proxy.on('connection', sock => {
    const buffs = [];
    let pending;
    sock.on('data', x => {
      buffs.push(x);
      if (!pending)
        pending = setImmediate(() => sock.end('HTTP/1.1 200 OK\r\n\r\n'));
    });
    sock.on('close', () => {
      const proc = childproc.spawn('hexdump', ['-C'], {
        stdio: ['pipe', 'inherit', 'inherit']
      });
      proc.stdin.end(Buffer.concat(buffs));
    });
  });
  await needle('post', `http://localhost:${port}/`, {
    file: {
      buffer,
      content_type,
      filename: new Date().getTime() + '.csv'
    }
  }, {
    multipart: true,
  });
  proxy.close();
}
(async () => {
  // File content as a buffer, as it would be read directly from a file.
  const csvFile = Buffer.from('77u/Ikh5cGhlbiIsIkVtIERhc2giDQoiLSIsIuKAlCINCg==', 'base64');

  // Send with CSV type, as some web API might require.
  // -> needle heuristically detects a "text" type and re-encodes the CSV
  //    payload, replacing the UTF-8 em dash with an ASCII control character
  //    and corrupting the payload.
  await postAndCapture(csvFile, 'text/csv');

  // Sending with application/octet-stream works around the issue, but prevents
  // us from sending the correct Content-Type, which might not work for all use-cases.
  await postAndCapture(csvFile, 'application/octet-stream');
})();
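For reference, here is a quick standalone check that prints the raw CSV bytes the dumps below should contain (it just decodes the same base64 string used in the repro):

// Prints the raw CSV bytes: a UTF-8 BOM, a header row, and a data row whose
// second field is a single em dash. Compare against the hex dumps below.
const expected = Buffer.from('77u/Ikh5cGhlbiIsIkVtIERhc2giDQoiLSIsIuKAlCINCg==', 'base64');
console.log(expected);  // ends with ... e2 80 94 22 0d 0a (em dash, quote, CRLF)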
Expected
- Changing the Content-Type of the uploaded file should not affect the file content.
Observed
With application/octet-stream...
...the relevant section of the hex dump shows:
00000110 54 50 43 4c 49 45 4e 54 0d 0a 43 6f 6e 74 65 6e |TPCLIENT..Conten|
00000120 74 2d 44 69 73 70 6f 73 69 74 69 6f 6e 3a 20 66 |t-Disposition: f|
00000130 6f 72 6d 2d 64 61 74 61 3b 20 6e 61 6d 65 3d 22 |orm-data; name="|
00000140 66 69 6c 65 22 3b 20 66 69 6c 65 6e 61 6d 65 3d |file"; filename=|
00000150 22 31 36 35 31 39 33 37 34 32 38 39 38 38 2e 63 |"1651937428988.c|
00000160 73 76 22 0d 0a 43 6f 6e 74 65 6e 74 2d 54 72 61 |sv"..Content-Tra|
00000170 6e 73 66 65 72 2d 45 6e 63 6f 64 69 6e 67 3a 20 |nsfer-Encoding: |
00000180 62 69 6e 61 72 79 0d 0a 43 6f 6e 74 65 6e 74 2d |binary..Content-|
00000190 54 79 70 65 3a 20 61 70 70 6c 69 63 61 74 69 6f |Type: applicatio|
000001a0 6e 2f 6f 63 74 65 74 2d 73 74 72 65 61 6d 0d 0a |n/octet-stream..|
000001b0 0d 0a ef bb bf 22 48 79 70 68 65 6e 22 2c 22 45 |....."Hyphen","E|
000001c0 6d 20 44 61 73 68 22 0d 0a 22 2d 22 2c 22 e2 80 |m Dash".."-","..|
000001d0 94 22 0d 0a 0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |."....----------|
The em dash is encoded as 0xe2 0x80 0x94 (split across the last two lines shown), which is the correct UTF-8 byte sequence for U+2014, and the UTF-8 BOM (0xef 0xbb 0xbf) is intact.
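As a quick sanity check, independent of needle, Node encodes the character to exactly these bytes:

// U+2014 EM DASH encodes to the same three UTF-8 bytes seen in the dump above.
console.log(Buffer.from('\u2014'));  // <Buffer e2 80 94>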
With text/csv...
...the relevant section of the hex dump shows:
00000110 54 50 43 4c 49 45 4e 54 0d 0a 43 6f 6e 74 65 6e |TPCLIENT..Conten|
00000120 74 2d 44 69 73 70 6f 73 69 74 69 6f 6e 3a 20 66 |t-Disposition: f|
00000130 6f 72 6d 2d 64 61 74 61 3b 20 6e 61 6d 65 3d 22 |orm-data; name="|
00000140 66 69 6c 65 22 3b 20 66 69 6c 65 6e 61 6d 65 3d |file"; filename=|
00000150 22 31 36 35 31 39 33 37 34 32 38 39 36 36 2e 63 |"1651937428966.c|
00000160 73 76 22 0d 0a 43 6f 6e 74 65 6e 74 2d 54 79 70 |sv"..Content-Typ|
00000170 65 3a 20 74 65 78 74 2f 63 73 76 0d 0a 0d 0a ff |e: text/csv.....|
00000180 22 48 79 70 68 65 6e 22 2c 22 45 6d 20 44 61 73 |"Hyphen","Em Das|
00000190 68 22 0d 0a 22 2d 22 2c 22 14 22 0d 0a 0d 0a 2d |h".."-","."....-|
The em dash has been replaced by the single byte 0x14, an obscure ASCII control character (DC4), which apparently chokes some CSV parsers; note also that the UTF-8 BOM has been collapsed into a single 0xff byte.
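One plausible explanation for both substitutions (an assumption on my part, not a verified trace of needle's internals) is a UTF-8 to 'binary'/latin1 round trip, which truncates every code point to its low byte and reproduces the observed bytes exactly:

// Decode the payload as UTF-8, then re-encode it with the 'binary' (latin1) encoding:
// U+FEFF (the BOM) becomes 0xff and U+2014 (the em dash) becomes 0x14,
// matching the text/csv dump above.
const original = Buffer.from('77u/Ikh5cGhlbiIsIkVtIERhc2giDQoiLSIsIuKAlCINCg==', 'base64');
console.log(Buffer.from(original.toString('utf8'), 'binary'));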
Uploading the CSV file as application/octet-stream works around the issue for some APIs, but not for all of them: a provider may accept multiple formats and rely on the Content-Type header to decide which parser to apply.
According to https://github.com/tomas/needle/blob/master/lib/multipart.js#L45, needle uses the part's content-type to heuristically decide whether to process and re-encode the payload data; there is apparently no way to instruct it to skip this re-encoding and send the data exactly as-is while still using a text content-type.
I see. So what should we do in this case? Allow passing something like multipart: 'raw' to skip re-encoding (sketched after this message), or add some other way of preventing these weird conversions from happening?
Thanks for the very detailed bug report, by the way!
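For illustration only, a hypothetical multipart: 'raw' mode (it does not exist in needle today) would presumably be used from the caller's side roughly like this:

const needle = require('needle');

// Hypothetical caller-side usage of the suggested 'raw' multipart mode; the
// option name and behavior are only the shape of the proposal above.
async function uploadRaw(url, csvFile) {
  return needle('post', url, {
    file: { buffer: csvFile, content_type: 'text/csv', filename: 'data.csv' }
  }, { multipart: 'raw' });
}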
multipart: 'raw' could be technically workable, but feels very unsatisfying.
If you receive a binary Buffer for that part (or a file, which readFile then turns into a Buffer), wouldn't it make sense to just always send that as binary without re-encoding? In that case, I might just do something like this:
(code example moved to PR #405)
That seemed to work for the test case in this issue, at least: I get the content-type I want, a binary transfer-encoding, and the correct UTF-8 bytes, in both cases.