fs.cpSync / fs.cp / fs.promises.cp fails when src/dest args are Buffer/URL
Version
All
Platform
All
Subsystem
fs
What steps will reproduce the bug?
const { cpSync, cp, promises } = require('node:fs');
cpSync(Buffer.from('dirA'), Buffer.from('dirB'), { recursive: true, filter(...args) { console.log(...args); return true; } }); // throws
cp(Buffer.from('dirA'), Buffer.from('dirB'), { recursive: true, filter(...args) { console.log(...args); return true; } }, (err) => {
console.log(err); // errors
});
promises.cp(Buffer.from('dirA'), Buffer.from('dirB'), { recursive: true, filter(...args) { console.log(...args); return true; } }); // rejects
How often does it reproduce? Is there a required condition?
Always
What is the expected behavior? Why is that the expected behavior?
It should work without throwing. The filter function should be called with the paths, which may be a string, Buffer, or URL.
What do you see instead?
It throws.
Additional information
The issue here is that the underlying implementation assumes that the paths are strings and tries to use the path.join function to concatenate them, which obviously does not work for Buffer and URL. The implementation has to be able to support Buffer and URL paths in order to properly support arbitrary text encodings in file names, etc. Unfortunately, because of the way the callback is designed, it's not clear that there is a non-breaking fix readily available.
@dario-piotrowicz @cjihrig @anonrig
The immediately clear problem is this block in the copyDir function in the cp-sync.js file, specifically the two calls to path.join(...). The src and dest here might be Buffer or URL instances. Passing a URL just happens to work since a URL can be coerced to a string but it does so entirely by accident and can produce an invalid URL result since the name is not correctly appended to the URL base.
while ((dirent = dir.readSync()) !== null) {
const { name } = dirent;
const srcItem = join(src, name);
const destItem = join(dest, name);
let shouldCopy = true;
if (opts.filter) {
shouldCopy = opts.filter(srcItem, destItem);
if (isPromise(shouldCopy)) {
throw new ERR_INVALID_RETURN_VALUE('boolean', 'filter', shouldCopy);
}
}
if (shouldCopy) {
getStats(srcItem, destItem, opts);
}
}
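For reference, a minimal illustration of the Buffer half of the problem: path.join only accepts strings, so a Buffer src or dest reaching the join calls above throws.
const path = require('node:path');

path.join(Buffer.from('dirA'), 'file.txt');
// TypeError [ERR_INVALID_ARG_TYPE]: The "path" argument must be of type string.
// Received an instance of Buffer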
Unfortunately, however, changing this is likely going to be breaking since it would necessarily change the way the filter callback is invoked. I'm not sure there's a reasonable way to fix this with the current callback design.
The only reasonable way to implement this in a non-breaky way I think would be to always convert everything to strings, basically keeping the exact same (incorrect but also fully documented) functionality we currently have but making sure that it works with Buffer objects as well (as a stop gap and then fix it in a breaky way in a major release).
Alternatively we could have an additional optional flag like "respectInputTypes" or something like that to signal whether inputs should or not be converted to strings like they currently are. However it'd seem pretty strange to me to have a flag to enable the correct behavior of an API... so I would avoid this and rather implement this in the way I mentioned above, or directly go with a breaky solution and release it in a major release (also since this has, as far as I can tell, been broken for years without huge consequences, I think that it's probably not something very time sensitive).
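To illustrate the stop-gap described above, here is a rough sketch of what the up-front string normalization could look like (toStringPath is a hypothetical helper, not anything that exists in core):
const { Buffer } = require('node:buffer');
const { fileURLToPath } = require('node:url');

// Normalize Buffer/URL inputs to strings so the existing string-based copy logic
// keeps working. Caveat (raised below): Buffer#toString() assumes utf8 and can
// mangle names that use a different encoding.
function toStringPath(p) {
  if (Buffer.isBuffer(p)) return p.toString('utf8'); // lossy for non-utf8 names
  if (p instanceof URL) return fileURLToPath(p);     // reuse the existing URL handling
  return p;                                          // already a string
}
cpSync/cp/promises.cp would then call into the current implementation with toStringPath(src) and toStringPath(dest).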
Putting the breaking aspect aside for a moment, can we simply implement URL/Buffer logic in the same way the current implementation works but:
- without converting the paths to strings
- replacing the join calls with custom URL/Buffer concatenation logic?
Are there any particular nuances to this? Or is it just a matter of taking what currently works on strings and implementing equivalent logic based on the different types?
The only reasonable way to implement this in a non-breaky way I think would be to always convert everything to strings, basically keeping the exact same (incorrect but also fully documented) functionality we currently have but making sure that it works with Buffer objects as well (as a stop gap and then fix it in a breaky way in a major release).
Unfortunately converting everything to strings means likely breaking the paths when we don't know what character encoding they are using, so I don't think that works as a solution here.
There are two fundamental problems here:
- The src and dest can be either a string, Buffer, or URL. The current code assumes it is always a string. When a URL is passed it ends up being coerced to a string but still concatenated incorrectly. When a Buffer is used, path.join(...) blows up because the string coercion with a Buffer does not work as transparently as it does with URL.
- The second issue is that the individual names reported for directory entries are treated unconditionally as utf8 at the native layer, which is likely incorrect.
We then take both the src and dest and try combining them with the name using path.join(...), which blows up.
I don't think there's a way of fixing this without a substantive breaking change. This gets quite tricky in the recursive case when Buffer or URL are used as the src or dest. I'm working through a proposed fix and will hopefully have a PR later today.
To illustrate the problematic scenario here... most POSIX OS's treat file names as opaque byte strings with very few actual limitations. Generally, the only bytes that aren't allowed in a filename are known separators like ASCII / or ASCII \. So I can have a directory structure like:
/
|- <directory name encoded as shift-jis>
|- <directory name encoded as utf16le>
|- <directory name encoded as latin1>
|- <filename encoded as utf8>
Now, if the node:fs implementation just ends up blindly converting these all into UTF8 strings it's going to fail spectacularly just trying to handle that first top level.
If we handle everything as Buffer, then we can certainly concatenate the buffers using / as a delimiter, e.g. Buffer.concat([dir1, sep, dir2, sep, dir3, sep, file]) where sep is equal to Buffer.from(path.sep), and we'd be able to work with that. However, if the path is given as a URL that gets a bit tricky, as we currently don't have a URL-to-Buffer conversion for file paths... we probably need one.
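A small sketch of that Buffer-based concatenation (joinBuffers is a hypothetical helper, just to illustrate the idea):
const path = require('node:path');

const sep = Buffer.from(path.sep);

// Join path segments given as Buffers using the platform separator, without ever
// decoding the bytes, so names in any encoding survive untouched.
function joinBuffers(...segments) {
  const parts = [];
  for (const segment of segments) {
    if (parts.length > 0) parts.push(sep);
    parts.push(segment);
  }
  return Buffer.concat(parts);
}

console.log(joinBuffers(Buffer.from('dirA'), Buffer.from([0x83, 0x66]), Buffer.from('file.txt')));
// raw bytes joined with path.sep, no utf8 assumption about the middle segment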
And, fwiw, this is not a hypothetical case. This is exactly the kind of issue I was helping a customer solve back when I added the ability to represent file paths as Buffer in the first place.
I believe the only way to adequately fix this will be to come up with an entirely new signature for the filter method. Essentially, runtime deprecate options.filter and add a new alternative like options.predicate that uses a signature that adequately handles this type of case. I'm still working out what the signature of that options.predicate should be.
Another challenge in this... our current file URL to path logic does not work when the file URL has to percent encode the file name in a different encoding.. that is, the conversion is assuming that the percent encoding is of utf8 bytes... When it is, for instance, shift-jis bytes, it blows up... In the example screenshotted below, the percent encodings align with the actual file name on disk as shift-jis.
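To make the decoding problem concrete, here is a rough sketch of what a Buffer-returning conversion could look like (fileURLToPathBuffer is hypothetical, ignores Windows drive-letter handling, and is not the API from the PR linked below):
function fileURLToPathBuffer(url) {
  if (url.protocol !== 'file:') throw new TypeError('Expected a file: URL');
  const { pathname } = url; // still percent-encoded, ASCII-only after WHATWG parsing
  const bytes = [];
  for (let i = 0; i < pathname.length; i++) {
    if (pathname[i] === '%') {
      // Decode the percent escape into the raw byte instead of assuming utf8.
      bytes.push(Number.parseInt(pathname.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      bytes.push(pathname.charCodeAt(i));
    }
  }
  return Buffer.from(bytes);
}

console.log(fileURLToPathBuffer(new URL('file:///tmp/caf%E9')));
// <Buffer 2f 74 6d 70 2f 63 61 66 e9> ('/tmp/caf' plus a single latin1 0xE9 byte)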
Ok, here's a proposed set of fixes:
- When the path is specified as a URL, we will convert that into a Buffer instead of a string. This will require a new alternative implementation of the url.fileURLToPath() method. https://github.com/nodejs/node/pull/58700
- When the path is specified as a string, we will convert that into a Buffer assuming utf8 encoding.
- We have the readdir operation always return the child paths as Buffer instances (see the note after this list). We can then join those using Buffer.concat rather than path.join.
- The new options.predicate would receive Buffer instances for the src and dest instead of strings, with the existing options.filter being runtime deprecated.
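For context on the readdir point above, fs.readdir/fs.readdirSync can already return entry names as raw bytes via the encoding option, which is the building block that step would lean on:
const fs = require('node:fs');

// 'dirA' is just a placeholder directory; with encoding: 'buffer' the names come back
// as Buffer instances with no utf8 decoding applied.
const names = fs.readdirSync('dirA', { encoding: 'buffer' });
console.log(names); // [ <Buffer ...>, <Buffer ...>, ... ]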
Unfortunately converting everything to strings means likely breaking the paths when we don't know what character encoding they are using, so I don't think that works as a solution here.
Yes agreed, I was only suggesting converting everything to strings as a temporary non-breaky solution 🙂
Thanks a lot for all the explanations and examples, all of them are very very informative, much appreciated! :heart_hands:
Ok, here's a proposed set of fixes:
mh... I am not completely sold on it...
You're basically proposing to convert everything to Buffers and just use those for the function's logic and as the filtering function, right?
I personally feel that this does make sense but provides a worse UX for the function's consumers when they will be forced, if they want to implement the filtering logic, to work with Buffers regardless of the types of src and dest, is that right?
I imagine that it would be more complex but could we respect the input types? meaning that if the user passed strings that's the type of argument that their filtering function gets? (potentially, if that helps, with the constraint that the type of src and dest need to be the same?)
Additionally I feel like predicate is a very awkward name (what about exclude maybe? 🤔 )... but names aside, could we not just make this a breaking change? do you think that that would be too disruptive? 😓
The only way to keep the change from being semver-major is to add an option that essentially makes the option.filter callback polymorphic, which I absolutely don't want to do.
Unfortunately it's not about respecting the input types, as that is only half of the issue here. The current function does not handle URL and Buffer input types correctly as it is, and the fact that the directory listing filenames also aren't handled correctly makes things worse. For instance, if I pass in a URL object as one of the path arguments, that gets coerced to a string in the callback, which is not respecting the input type.
with the constraint that the type of src and dest need to be the same?
This would also be a breaking change since that's currently not enforced.
So there are multiple layers of issues here that make any non-breaking fix problematic.
I personally feel that this does make sense but provides a worse UX for the function's consumers when they will be forced, if they want to implement the filtering logic, to work with Buffers regardless of the types of src and dest, is that right?
What we could do here, which can still be problematic, is allow an additional option.encoding option that controls the encoding of the arguments passed into the predicate function. If you're not a fan of predicate as a name, then maybe include is better (exclude would flip the semantics of the return boolean).
// By default, the src, dest, and name would be Buffer instances...
// The src and dest would *not* have the name appended, instead the name would be a separate arg
fs.cpSync(Buffer.from('a'), Buffer.from('b'), {
include(src, dest, name) {
console.log(src); // Buffer
console.log(dest); // Buffer
console.log(name); // Buffer
}
});
// If the encoding option is set, however, the src, dest, and name are essentially passed as
// buf.toString(encoding) ... which can still yield invalid results but that's at the user's
// discretion. The default behavior would be more difficult to use but more correct.
fs.cpSync(Buffer.from('a'), Buffer.from('b'), {
encoding: 'utf8',
include(src, dest, name) {
console.log(src); // string
console.log(dest); // string
console.log(name); // string
}
});
The only way to keep the change from being semver-major is to add an option that essentially makes the option.filter callback polymorphic, which I absolutely don't want to do.
I agree, I mentioned something like that in one of my comments above, but as I also mentioned there I also think that this would be quite awkward/ugly.
Anyways what I am advocating for is making this a semver-major/breaking change instead of coming up with a sub-optimal solution which we might get stuck with for a while 🤔
(It looks like you're trying very hard not to make this a breaking change, I am not really that familiar with node's policies/preferences on breaking changes, maybe I am just being too loose/naive advocating for that route)
If you're not a fan of predicate as a name, then maybe include is better (exclude would flip the semantics of the return boolean).
yeah sorry my mistake about exclude, yeah I would find include better than predicate 🙂 (although again my preference would be not to introduce a new filtering option if we can help it)
I've found another instance of this issue: readdir (both the callback and promises versions, since they share the same code).
Because of this join:
https://github.com/nodejs/node/blob/139c2e1e297939f677f4198fe8c416e9b29a4cf9/lib/fs.js#L1434
After that there are also some other joins that would kick in as well:
- https://github.com/nodejs/node/blob/139c2e1e297939f677f4198fe8c416e9b29a4cf9/lib/internal/fs/promises.js#L897
- https://github.com/nodejs/node/blob/139c2e1e297939f677f4198fe8c416e9b29a4cf9/lib/internal/fs/promises.js#L918
Note: the docs for both clearly say that URL and Buffer are accepted args.
Note: the same issue happens with readdirSync since that also uses the same processReaddirResult utility the other two do.
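A minimal repro sketch for the readdir case (assuming, per the lines linked above, that the recursive option is what reaches the problematic join):
const fs = require('node:fs');

fs.readdirSync(Buffer.from('dirA'), { recursive: true }); // throws for the same reason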