file: URL separator deduplication API
I would like to propose that file URLs include separator deduplication in the spec as a rule to match the file system identifiers themselves more closely.
The reason for this is that Node.js, Deno, and other platforms, build tools and tooling use file: URLs as their key in the module registry when loading ES modules. Whenever there are duplicate separators, this causes a double entry in the registry which can lead to duplicate module execution and a lack of code sharing.
These platforms and build tools are now the primary consumers of file: URLs since browsers no longer encourage the use case at all, therefore it is important that this spec can aim to meet their needs as primary consumers of this aspect of the spec going forward.
URL resolution already converts \ into /. This leaves only casing, percent encoding, query parameters and hashes as the comprehensive list of remaining non-unique components of the file URL representation's associated path representation.
I understand the main consideration for such a change is considering the potentially major change in browsers, but perhaps it would not be so difficult since browsers so carefully discourage file URLs and the issue is rather that until now there hasn't been pressure on this use case from these new file modules platform directions.
This would smooth a number of environment bugs that otherwise need custom code to handle.
In Node.js we effectively add this renormalization step to all module resolution usage of file URLs. Other cases of fileURL resolution would benefit.
References:
- Prev Issue: https://github.com/whatwg/url/issues/405
- Prev PR: https://github.com/whatwg/url/pull/544
- Upstream Issue: https://github.com/nodejs/node/pull/35477
That was #232 which we ended up overturning to some extent with #405. I'd rather not revisit this again.
cc @alwinb
To be fair, It is a bit different, as #232 collapsed a leading sequence of slashes only, somewhat mimicking the ignoring of multiple slashes before the host in other special URLs. This seems to call for collapsing all slashes in the path.
There also were interactions with the host that made things rather complex, especially to find a good trade-off between matching the different browser behaviours.
We recently considered in #606 to add a percent-encoding-equivalence / normalisation to the URL equivalence section. The same could be done for (some necessarily limited form of) file-path normalisation. Then the parser (ao.) would not become more complex.
Whether such a thing should be applied by default, is a question that I don't feel qualified to answer though. Browsers in any case, don't do this.
@karwa has done a lot of research on file paths, posted in #621 (note #618 as well). I think this would be part of that project. It's some amount of work though!
I think the proposal would be to just dedupe pathname in the file: scheme URLs and not any other URL parts (so leaving the Windows hostname etc). @alwinb do you know if there were any remaining issues with that kind of approach offhand? It's a lot of information to digest to pick up it seems....
If you don't consider the hostname or windows drive letters? I don't see how there would be interactions with other parts of the URL. But it would equate more file URLs.
I'd be surprised to find systems that allow empty directory names, but I would want to be careful about normalising too much. File path normalisation is quite a thing on its own.
Certainly agreed, I think for the most part we've accepted the state of percent encoding at this point. This issue was specifically about the / separator deduping only in the pathname.
Given browser behavior I would tend toward this being something you do after parsing if you need it.
I don't think it is an unreasonable proposal. But it requires answering a universal question about file-paths and that makes it a difficult issue.
Specifically: does this normalisation agree with all file-paths, for all systems, and are we willing to impose it on future systems as well?
The normalisation itself isn't complicated and quite natural, so if that question can be answered with a convincing "yes" then I wouldn't be opposed, at least.
Related: RFC8089
RFC 8089 lists two cases, which doesn't play well which separator deduplication, see: https://datatracker.ietf.org/doc/html/rfc8089#appendix-E.3.2
E.3.2.
URI with UNC Path ... Non-local files:
- The representation of a non-local file with an empty authority and a complete (transformed) UNC string in the path. For example:
- "file:////host.example.com/path/to/file"
- As above, with an extra slash between the empty authority and the transformed UNC string, as per the syntax defined in [RFC1738]. For example:
- "file://///host.example.com/path/to/file" This representation is notably used by the Firefox web browser. See Bugzilla#107540 [Bug107540].
If I understand this proposal correctly, then separator deduplication will change their meaning, as they become:
file:///host.example.com/path/to/file
Thanks
@rmisev the UNC string in a file URL, as far as I'm aware, is always of the form file://host/pathname and not file:///host/pathname. As a result new URL() treats the UNC component as a hostname in all cases I've seen and Node.js relies on this behaviour as well (//cc @bmeck if I'm missing something). Where did you get the file:////host.example.com example from? I've never seen a case like that...
The quote with these examples is taken from RFC 8089: https://datatracker.ietf.org/doc/html/rfc8089#appendix-E.3.2
For example Firefox uses 5 slashes before host, when converts from UNC path:
\\host.example.com\path\to\file -> file://///host.example.com/path/to/file
@rmisev in Node.js we explicitly create file://host.example.com/... in these cases so I wasn't aware of this more verbose form. That seems like a good counter example.
I guess this isn't a blocker though - it just complicates the story a little:
- Is the file URL a UNC form (
file:////detection being non-ambiguous as the indicator I believe), and if so, only dedupe separators after the UNC prefix. - If not, dedupe separators as part of the pathname
It still seems like it wouldn't be that complex?
Right now, with the exception of drive letters, pathnames are pretty consistent between different schemes. I think it'd be unfortunate to have the slash-collapsing behavior just for file URLs.
Additionally, parsing file:///etc//passwd gives consistent results on a wide variety of parsers across different languages:

I would really want to preserve this level of interop, which is quite rare in file URLs.
Reclassifying this as something we could potentially offer on top of #729, but won't change the parser for. Hope that's acceptable.