rclone copy with http remote fails with HTTP Error: 404 Not Found when URL contains question mark

Open whoschek opened this issue 1 year ago • 0 comments

The associated forum post URL from https://forum.rclone.org

n/a

What is the problem you are having with rclone?

rclone copy, when downloading from an http remote, fails with HTTP Error: 404 Not Found when the HTTP URL contains a question mark. Example:

% rclone copy --dump bodies --http-no-head -vv --http-url 'https://www.google.com' ':http:search?q=rclone' /tmp

As the debug log below shows, rclone copy (with an http remote) URL-encodes the question mark, which per the URL spec separates the URL path from the URL query, as '%3F' and sends that '%3F' to the HTTP server. This violates the semantics of the URL spec: it effectively directs the HTTP server to treat the query portion as part of the URL path, hence 404 Not Found. In other words, rclone sends this incorrect HTTP GET request (and a similar HEAD request), per the debug log below:

GET /search%3Fq=rclone HTTP/1.1

Instead, the correct GET request to issue here would be:

GET /search?q=rclone HTTP/1.1
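
The difference between the two request lines can be reproduced with any spec-following URL parser. The sketch below uses Python's standard urllib.parse purely as an illustration (rclone is written in Go, and this is not rclone's code): a literal '?' starts the query component, whereas '%3F' stays inside the path and only becomes a '?' character after the path has already been isolated.

```python
from urllib.parse import urlsplit, unquote

# The URL curl requests: '?' separates the path from the query.
good = urlsplit("https://www.google.com/search?q=rclone")

# The URL seen in the debug log: '?' has been percent-encoded as %3F.
bad = urlsplit("https://www.google.com/search%3Fq=rclone")

print(good.path, good.query)      # /search q=rclone
print(bad.path, repr(bad.query))  # /search%3Fq=rclone ''

# Decoding the path afterwards yields a resource literally named
# "search?q=rclone", which is why the server answers 404.
print(unquote(bad.path))          # /search?q=rclone
```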

For reference, curl and wget complete the same request correctly. For example, this 'curl' command completes correctly as expected:

% curl 'https://www.google.com/search?q=rclone'

As another reference, 'rclone copyurl' works correctly, as expected:

% rclone copyurl --http-no-head -vv 'https://www.google.com/search?q=rclone' /tmp/foo

Also, 'rclone copy' (with an http remote) works correctly with URLs that do not contain a question mark (i.e. that have no query portion). The problem comes down to URL-encoding the question mark. I believe 'rclone copy' (with an http remote) should not URL-encode the URL it sends to the HTTP server. Rather, it should pass the user-provided URL (including the question mark) as-is into the HTTP GET and HEAD requests.
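
To make the contrast concrete, here is a minimal Python sketch of the two behaviours. Both function names are hypothetical and exist only for this illustration; neither reflects rclone's actual Go implementation. The first reproduces the request target seen in the debug log, the second is the pass-through proposed here.

```python
from urllib.parse import quote


def encode_whole_path(remote_path: str) -> str:
    """Mimic the observed behaviour: treat the whole string as a path.

    '/' and '=' are kept literal (an assumption chosen to match the log),
    so '?' is the character that gets percent-encoded.
    """
    return "/" + quote(remote_path, safe="/=")


def pass_through(remote_path: str) -> str:
    """Proposed behaviour (hypothetical helper): send the string untouched."""
    return "/" + remote_path


entry = "search?q=rclone"
print("GET", encode_whole_path(entry), "HTTP/1.1")  # GET /search%3Fq=rclone HTTP/1.1
print("GET", pass_through(entry), "HTTP/1.1")       # GET /search?q=rclone HTTP/1.1
```

For paths without a question mark, both helpers produce the same request target, which matches the observation that such URLs already work.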

FYI, I know that in this simple 'google search' example one could work around the problem by switching from 'rclone copy' to 'rclone copyurl', but 'copyurl' supports neither the --files-from feature nor parallel downloads, and it also doesn't seem to rename files atomically from .partial to the final name on success, etc.

In real life, my use case requires continuously and efficiently downloading from vanilla HTTP servers via lists of millions of URLs, using multiple threads, which makes 'rclone copy --files-from' mandatory. (And no, a bash download script using copyurl with the Unix 'parallel' command wouldn't meet the requirements; for example, it would fork many processes per URL and thus be inefficient.)
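
For readers unfamiliar with these requirements, the following Python sketch shows the general shape of what is being asked for: one process, a shared pool of worker threads, and a .partial file renamed into place only on success. It is not how rclone works internally; the function names, the output naming scheme, and the stand-in fetcher are all assumptions made for this illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def save_atomically(dest: str, data: bytes) -> None:
    """Write to '<dest>.partial', then rename so no half-written file is visible."""
    partial = dest + ".partial"
    with open(partial, "wb") as handle:
        handle.write(data)
    os.replace(partial, dest)  # atomic when both names are on the same filesystem


def download_all(urls: Iterable[str], fetch: Callable[[str], bytes],
                 dest_dir: str, workers: int = 4) -> List[str]:
    """Fetch every URL on a shared thread pool inside a single process."""
    def one(item):
        index, url = item
        dest = os.path.join(dest_dir, f"{index:08d}.bin")  # hypothetical naming scheme
        save_atomically(dest, fetch(url))
        return dest

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, enumerate(urls)))


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs offline; a real one would issue HTTP GETs.
    fake_fetch = lambda url: url.encode()
    with tempfile.TemporaryDirectory() as tmp:
        download_all(["a?x=1", "b?y=2"], fake_fetch, tmp, workers=2)
        print(sorted(os.listdir(tmp)))
```

Threads share one process and one connection pool, which is the efficiency argument against forking a process per URL.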

FYI, the --local-encoding flag does not seem to provide a workaround: the question mark '?' is always sent to the HTTP server as %3F, no matter what.

TLDR: The proposed fix is for 'rclone copy' (with an http remote) to pass the user-provided URL (including the question mark) as-is into the HTTP GET and HEAD requests, including for URLs passed in via the --files-from feature.

What is your rclone version (output from rclone version)

% rclone --version
rclone v1.65.2
- os/version: darwin 13.6.4 (64 bit)
- os/kernel: 22.6.0 (arm64)
- os/type: darwin
- os/arch: arm64 (ARMv8 compatible)
- go/version: go1.21.6
- go/linking: dynamic
- go/tags: none

Which OS you are using and how many bits (e.g. Windows 7, 64 bit)

OSX

Which cloud storage system are you using? (e.g. Google Drive)

http remote to vanilla HTTP server

The command you were trying to run (e.g. rclone copy /tmp remote:tmp)

rclone copy --dump bodies --http-no-head -vv --http-url 'https://www.google.com' ':http:search?q=rclone' /tmp

A log from the command with the -vv flag (e.g. output from rclone -vv copy /tmp remote:tmp)

2024/02/18 22:39:37 DEBUG : rclone: Version "v1.65.2" starting with parameters ["rclone" "copy" "--dump" "bodies" "--http-no-head" "-vv" "--http-url" "https://www.google.com" ":http:search?q=rclone" "/tmp/rclone/"]
2024/02/18 22:39:37 DEBUG : Creating backend with remote ":http:search?q=rclone"
2024/02/18 22:39:37 DEBUG : Using config file from "/Users/<xxxxx>/.config/rclone/rclone.conf"
2024/02/18 22:39:37 DEBUG : :http: detected overridden config - adding "{s0EIw}" suffix to name
2024/02/18 22:39:37 DEBUG : You have specified to dump information. Please be noted that the Accept-Encoding as shown may not be correct in the request and the response may not show Content-Encoding if the go standard libraries auto gzip encoding was in effect. In this case the body of the request will be gunzipped before showing it.
2024/02/18 22:39:37 DEBUG : Assuming path is a file as --http-no-head is set
2024/02/18 22:39:37 DEBUG : If path is a directory you must add a trailing '/'
2024/02/18 22:39:37 DEBUG : Root: https://www.google.com/
2024/02/18 22:39:37 DEBUG : fs cache: adding new entry for parent of ":http:search?q=rclone", ":http{s0EIw}:search?q=rclone"
2024/02/18 22:39:37 DEBUG : Creating backend with remote "/tmp/rclone/"
2024/02/18 22:39:37 DEBUG : fs cache: renaming cache item "/tmp/rclone/" to be canonical "/tmp/rclone"
2024/02/18 22:39:37 DEBUG : search?q=rclone: Need to transfer - File not found at Destination
2024/02/18 22:39:37 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2024/02/18 22:39:37 DEBUG : HTTP REQUEST (req 0x140007aa600)
2024/02/18 22:39:37 DEBUG : GET /search%3Fq=rclone HTTP/1.1
Host: www.google.com
User-Agent: rclone/v1.65.2
Accept-Encoding: gzip

2024/02/18 22:39:37 DEBUG : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2024/02/18 22:39:37 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2024/02/18 22:39:37 DEBUG : HTTP RESPONSE (req 0x140007aa600)
2024/02/18 22:39:37 DEBUG : HTTP/2.0 404 Not Found
Content-Length: 1578
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Content-Type: text/html; charset=UTF-8
Date: Sun, 18 Feb 2024 21:39:37 GMT
Referrer-Policy: no-referrer

<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 404 (Not Found)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>404.</b> <ins>That’s an error.</ins>
  <p>The requested URL <code>/search%3Fq=rclone</code> was not found on this server.  <ins>That’s all we know.</ins>
2024/02/18 22:39:37 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2024/02/18 22:39:37 ERROR : search?q=rclone: Failed to copy: failed to open source object: Open failed: HTTP Error: 404 Not Found
2024/02/18 22:39:37 ERROR : Attempt 1/3 failed with 1 errors and: failed to open source object: Open failed: HTTP Error: 404 Not Found
2024/02/18 22:39:37 DEBUG : search?q=rclone: Need to transfer - File not found at Destination
...
2024/02/18 22:39:37 DEBUG : <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2024/02/18 22:39:37 ERROR : search?q=rclone: Failed to copy: failed to open source object: Open failed: HTTP Error: 404 Not Found
2024/02/18 22:39:37 ERROR : Attempt 3/3 failed with 1 errors and: failed to open source object: Open failed: HTTP Error: 404 Not Found
2024/02/18 22:39:37 INFO  : 
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Errors:                 1 (retrying may help)
Elapsed time:         0.5s

2024/02/18 22:39:37 DEBUG : 8 go routines active
2024/02/18 22:39:37 Failed to copy: failed to open source object: Open failed: HTTP Error: 404 Not Found

whoschek, Feb 18 '24 22:02