Windows file paths in URL parsing

Open mertcanaltin opened this issue 7 months ago • 11 comments

I've discovered an interesting behavior in the WHATWG URL specification: Windows file paths such as C:\path\file.node are considered valid URLs.

The issue

  • Security: Web apps might accidentally treat file paths as URLs
  • UX: Users expect C:\path\file to be a local file, not a URL
  • Consistency: Other OS paths aren't treated as URLs

Current vs Proposed

// Now
URL.canParse("C:\\path\\file.node") // true

// Should be
URL.canParse("C:\\path\\file.node") // false
URL.canParse("file:///C:/path/file.node") // true
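
To make the "Now" case concrete, the following minimal sketch inspects how the parser splits the Windows path. It assumes a runtime that provides the global `URL` class and `URL.canParse` (recent Node.js or browsers); the output comments reflect the WHATWG parsing rules for a single-letter, non-special scheme with an opaque path.

```javascript
// Inspect how the current WHATWG parser splits a Windows path.
const url = new URL("C:\\path\\file.node");

console.log(url.protocol); // "c:" (the drive letter becomes the scheme)
console.log(url.pathname); // "\path\file.node" (kept as an opaque path)
console.log(URL.canParse("C:\\path\\file.node")); // true
```

The drive letter is lowercased and treated as the scheme, and everything after the colon is kept as an opaque path, which is why the input is accepted today.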

Related Issues

  • Node.js PR #58578: "node-api: preserve URL filenames without conversion" (https://github.com/nodejs/node/pull/58578)
  • ada-url PR #957: "fix: reject Windows file paths in can_parse" (https://github.com/ada-url/ada/pull/957)

mertcanaltin avatar Jun 13 '25 20:06 mertcanaltin

Cc @annevk

anonrig avatar Jun 13 '25 22:06 anonrig

I don't think we want to reject single-letter-scheme URLs. As discussed in #271 we might be able to do something special if the scheme is followed by a backslash, though this issue seems to ask for the opposite of what that issue is asking for.

annevk avatar Jun 15 '25 07:06 annevk

@annevk You're right. I'm actually aiming at the same outcome as #271, just with a different approach:

  • #271: D:\foo → auto-convert to file:///D:/foo
  • My proposal: C:\foo → reject in canParse(), force explicit file:///C:/foo

Both prevent C:\foo from being parsed as scheme:C, path:\foo. Your Dec 2024 comment about treating [a-zA-Z]:\ specially is exactly what I'm suggesting.

Would rejecting [a-zA-Z]:\ in canParse() work as a first step toward the #271 solution?

mertcanaltin avatar Jun 15 '25 16:06 mertcanaltin
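
For readers who want this behavior today, the [a-zA-Z]:\ check discussed above could be approximated in application code, without changing the parser. This is only a sketch: `startsWithDriveLetterBackslash` and `canParseRejectingDrivePaths` are hypothetical names, not part of any specification or library.

```javascript
// Hypothetical helper: true when the input starts with a drive letter,
// a colon, and a backslash (for example "C:\").
function startsWithDriveLetterBackslash(input) {
  return /^[a-zA-Z]:\\/.test(input);
}

// Hypothetical stricter variant of URL.canParse for application code.
// Inputs such as "C:/path" are deliberately left to the regular parser.
function canParseRejectingDrivePaths(input) {
  return !startsWithDriveLetterBackslash(input) && URL.canParse(input);
}

console.log(canParseRejectingDrivePaths("C:\\path\\file.node"));        // false
console.log(canParseRejectingDrivePaths("file:///C:/path/file.node")); // true
```

The sketch does not account for leading whitespace or control characters, which the URL parser strips before parsing, so a production check would need to normalize the input first.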

I think that would be quite risky as it means the URL constructor starts throwing as well. It's also not generally how we attempt to solve issues. We'd like to solve them once.

annevk avatar Jun 18 '25 11:06 annevk
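
The coupling mentioned above can be observed directly: `URL.canParse` and the `URL` constructor run the same parser, so a change that makes one reject an input makes the other throw. A minimal sketch, where `constructorAccepts` is a hypothetical helper introduced only for this comparison:

```javascript
// Hypothetical helper: reports whether `new URL(input)` succeeds.
function constructorAccepts(input) {
  try {
    new URL(input);
    return true;
  } catch {
    return false;
  }
}

const inputs = ["C:\\path\\file.node", "file:///C:/path/file.node", "not a url"];
for (const input of inputs) {
  // The two columns always agree, because both use the same parser.
  console.log(input, URL.canParse(input), constructorAccepts(input));
}
```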

@annevk Thanks for the explanation. Now I understand that we can apply the #271 approach.

I am happy to help you implement this if you want to move forward.

mertcanaltin avatar Jun 19 '25 16:06 mertcanaltin

If you're willing to work out the necessary changes to the specification and supporting tests that would be a huge help. It will likely still take quite a while before it's all accepted, but a concrete set of proposed changes seems like a good next step here.

annevk avatar Jun 27 '25 14:06 annevk

@annevk I understand now. Based on your December 2024 comment in #271 and our discussion here, I'll implement the approach where:

  • [a-zA-Z]:\ → convert to file:///[drive]/[path] (backslash is invalid anyway)
  • [a-zA-Z]:/ → preserve current behavior (too risky to change)

This solves both #271 and #873 together as you mentioned.

I'll work on:

  1. Specification changes for the URL parser algorithm
  2. Comprehensive tests covering both backslash (convert) and forward slash (preserve) cases
  3. Edge case handling

I'll start with the specification changes first, then the tests. Should I create a draft PR for the spec changes, or would you prefer to see the proposed changes in this issue first?

mertcanaltin avatar Jun 28 '25 20:06 mertcanaltin
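
The two rules in the plan above can be approximated in user land as follows. This is a rough sketch of the intended behavior, not the specification text or the code from the pull requests; `parseWithDriveLetterConversion` is a hypothetical name, and the sketch does not escape characters such as "#" or "%" that have special meaning in URLs.

```javascript
// Hypothetical user-land approximation of the planned behavior.
function parseWithDriveLetterConversion(input) {
  if (/^[a-zA-Z]:\\/.test(input)) {
    // Backslash form: rewrite as a file URL with forward slashes.
    return new URL("file:///" + input.replaceAll("\\", "/"));
  }
  // Everything else, including "C:/...", keeps the existing behavior.
  return new URL(input);
}

console.log(parseWithDriveLetterConversion("C:\\path\\file.node").href);
// "file:///C:/path/file.node"
console.log(parseWithDriveLetterConversion("C:/path/file.node").protocol);
// "c:"
```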

I hope I did it right; I created two pull requests for the first small steps.

mertcanaltin avatar Jun 28 '25 21:06 mertcanaltin

@mertcanaltin, if you need a reference on how Windows handles path-to-URL conversion, there's UrlCreateFromPath, introduced in 2006 and explained well by Dave Risney in "File URIs in Windows". The most interesting part is this:

A large set of invalid file URIs come from the common but incorrect notion that it’s acceptable to place a Windows file path after the text ‘file://’ and call it a file URI. This is bad because Windows file paths, as mentioned earlier, may contain characters that aren’t allowed in URIs or that are important to the parsing of URIs. For instance, if a ‘#’ is in a Windows file path and that Windows file path is simply appended to the text ‘file://’ then we can’t know if the ‘#’ is supposed to be part of the path or if its supposed to delimit the fragment as it would in an actual URI. Similarly, if the path contains a ‘%’ then we can’t determine whether the ‘%’ identifies a percent-encoded octet, or if it is just a plain percent character in the Windows file path. Zeke Odins-Lucas wrote an informative and entertaining blog post on this topic.

The link to Zeke's post has been updated to point to its copy on archive.org.

YohDeadfall avatar Jun 29 '25 23:06 YohDeadfall
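
The "#" ambiguity described in the quoted passage is easy to reproduce. The sketch below first shows naive concatenation, then a hypothetical helper, `windowsPathToFileUrl`, that percent-encodes each path segment before building the URL. It assumes a plain drive-letter path such as C:\dir\file and does not handle UNC paths or other Windows path forms.

```javascript
// Naive concatenation: "#" is read as the start of the fragment.
const naive = new URL("file:///C:/docs/report#1.txt");
console.log(naive.pathname); // "/C:/docs/report"
console.log(naive.hash);     // "#1.txt"

// Hypothetical helper: percent-encode each segment after the drive letter.
function windowsPathToFileUrl(windowsPath) {
  const [drive, ...segments] = windowsPath.split("\\");
  const encoded = segments.map((segment) => encodeURIComponent(segment)).join("/");
  return new URL(`file:///${drive}/${encoded}`);
}

const safe = windowsPathToFileUrl("C:\\docs\\report#1.txt");
console.log(safe.pathname); // "/C:/docs/report%231.txt"
console.log(safe.hash);     // ""
```

Encoding per segment keeps "#" and "%" as literal file-name characters, so the resulting URL has no fragment and no accidental percent-escapes.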

@YohDeadfall Thanks for the info, I will read it today.

mertcanaltin avatar Jul 03 '25 09:07 mertcanaltin

@annevk Implementation ready!

PR #874 + https://github.com/web-platform-tests/wpt/pull/53459 now implement exactly your suggested approach:

  • [a-zA-Z]:\ → convert to file:///
  • [a-zA-Z]:/ → preserve existing behavior

Comprehensive test coverage included.

mertcanaltin avatar Jul 08 '25 19:07 mertcanaltin