needle icon indicating copy to clipboard operation
needle copied to clipboard

Mangled file content when multipart-POSTing a file with a "text/*" content type

Open mciasuen opened this issue 3 years ago • 2 comments

Reading a UTF-8 CSV and attempting to upload it with needle via a multipart POST can cause non-ASCII characters inside the CSV data to be replaced by other characters.

Steps to Reproduce:

# node --version
v14.18.1
'use strict';

const childproc = require('child_process');
const needle = require('needle');
const net = require('net');

async function postAndCapture(buffer, content_type) {
	const port = 49152 + Math.floor(Math.random() * 16384);
	const proxy = net.createServer();
	await new Promise((res, rej) => {
		proxy.on('error', rej);
		proxy.listen(port, res);
	});
	proxy.on('connection', sock => {
		const buffs = [];
		let pending;
		sock.on('data', x => {
			buffs.push(x);
			if (!pending)
				pending = setImmediate(() => sock.end('HTTP/1.1 200 OK\r\n\r\n'));
		});
		sock.on('close', () => {
			const proc = childproc.spawn('hexdump', ['-C'], {
				stdio: ['pipe', 'inherit', 'inherit']
			});
			proc.stdin.end(Buffer.concat(buffs));
		});
	});
	await needle('post', `http://localhost:${port}/`, {
		file: {
			buffer,
			content_type,
			filename: new Date().getTime() + '.csv'
		}
	}, {
		multipart: true,
	});
	proxy.close();
}

(async () => {
	// File content as a buffer, as it would be read directly from a file.
	const csvFile = Buffer.from('77u/Ikh5cGhlbiIsIkVtIERhc2giDQoiLSIsIuKAlCINCg==', 'base64');

	// Send with CSV type, as some web API might require.
	// -> needle heuristically detects a "text" type, tries to re-encode the
	// CSV payload, causing the UTF-8 em dash to be replaced by an ASCII
	// control character, and corrupting the payload.
	await postAndCapture(csvFile, 'text/csv');

	// Sending with application/octet-stream works around the issue, but prevents
	// us from sending the correct Content-Type, which might not work for all use-cases.
	await postAndCapture(csvFile, 'application/octet-stream');
})();

Expected

  • Changing the Content-Type of the uploaded file should not affect the file content.

Observed

With application/octet-stream...

...the relevant section of the hex dump shows:

00000110  54 50 43 4c 49 45 4e 54  0d 0a 43 6f 6e 74 65 6e  |TPCLIENT..Conten|
00000120  74 2d 44 69 73 70 6f 73  69 74 69 6f 6e 3a 20 66  |t-Disposition: f|
00000130  6f 72 6d 2d 64 61 74 61  3b 20 6e 61 6d 65 3d 22  |orm-data; name="|
00000140  66 69 6c 65 22 3b 20 66  69 6c 65 6e 61 6d 65 3d  |file"; filename=|
00000150  22 31 36 35 31 39 33 37  34 32 38 39 38 38 2e 63  |"1651937428988.c|
00000160  73 76 22 0d 0a 43 6f 6e  74 65 6e 74 2d 54 72 61  |sv"..Content-Tra|
00000170  6e 73 66 65 72 2d 45 6e  63 6f 64 69 6e 67 3a 20  |nsfer-Encoding: |
00000180  62 69 6e 61 72 79 0d 0a  43 6f 6e 74 65 6e 74 2d  |binary..Content-|
00000190  54 79 70 65 3a 20 61 70  70 6c 69 63 61 74 69 6f  |Type: applicatio|
000001a0  6e 2f 6f 63 74 65 74 2d  73 74 72 65 61 6d 0d 0a  |n/octet-stream..|
000001b0  0d 0a ef bb bf 22 48 79  70 68 65 6e 22 2c 22 45  |.."Hyphen","E|
000001c0  6d 20 44 61 73 68 22 0d  0a 22 2d 22 2c 22 e2 80  |m Dash".."-",".|
000001d0  94 22 0d 0a 0d 0a 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |."....----------|

The em dash is encoded as 0x80 0x94, a valid UTF-8 code sequence.

With text/csv...

...the relevant section of the hex dump shows:

00000110  54 50 43 4c 49 45 4e 54  0d 0a 43 6f 6e 74 65 6e  |TPCLIENT..Conten|
00000120  74 2d 44 69 73 70 6f 73  69 74 69 6f 6e 3a 20 66  |t-Disposition: f|
00000130  6f 72 6d 2d 64 61 74 61  3b 20 6e 61 6d 65 3d 22  |orm-data; name="|
00000140  66 69 6c 65 22 3b 20 66  69 6c 65 6e 61 6d 65 3d  |file"; filename=|
00000150  22 31 36 35 31 39 33 37  34 32 38 39 36 36 2e 63  |"1651937428966.c|
00000160  73 76 22 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |sv"..Content-Typ|
00000170  65 3a 20 74 65 78 74 2f  63 73 76 0d 0a 0d 0a ff  |e: text/csv....|
00000180  22 48 79 70 68 65 6e 22  2c 22 45 6d 20 44 61 73  |"Hyphen","Em Das|
00000190  68 22 0d 0a 22 2d 22 2c  22 14 22 0d 0a 0d 0a 2d  |h".."-","."....-|

The em dash is encoded as 0x14, an obscure ASCII control character, which apparently chokes some CSV parsers.


Uploading the CSV file as application/octet-stream may work as a workaround for some APIs but may not work in all cases, e.g. where a provider accepts multiple formats and uses the Content-Type header to actually differentiate which parser to use.

According to https://github.com/tomas/needle/blob/master/lib/multipart.js#L45, needle tries to heuristically determine if it should process and re-encode the payload data based on the content-type; there is apparently no way to instruct it to skip this re-encoding and send the data exactly as-is while still using a text content-type.

mciasuen avatar May 07 '22 15:05 mciasuen

I see. So what should we do in this case? Allow passing something like multipart: 'raw' to skip re-encoding, or include some other method of preventing these weird conversions to happen?

Thanks for the very detailed bug report, by the way!

tomas avatar May 07 '22 15:05 tomas

multipart: 'raw' could be technically workable, but feels very unsatisfying.

If you receive a binary Buffer for that part (or a file, which readFile then turns into a Buffer), wouldn't it make sense to just always send that as binary without re-encoding? In that case, I might just do something like this:

(code example moved to PR #405)

That seemed to work for the test case in this issue, at least: I get the content-type I want, a binary transfer-encoding, and the correct UTF-8 bytes, in both cases.

mciasuen avatar May 07 '22 16:05 mciasuen