aria2
aria2 copied to clipboard
Check the header “Content-Disposition” to determine the filename to store
For example, the link http://audeering.com/download/1318/ (linked from http://audeering.com/technology/opensmile/#download) provides a Gzip-compressed tarball, but aria2 stored the tarball as the name index.html
by the command aria2c -x4 http://audeering.com/download/1318/
.
This link returns the header Content-Disposition
as attachment; filename="opensmile-2.3.0.tar.gz";
, so aria2 must interpret it and store the tarball as the name opensmile-2.3.0.tar.gz
.
+1
aria2 supports content-disposition header field, and the value which server sent is syntactically incorrect. The trailing ";" must not be there. The server is broken.
A simple workaround for this:
diff --git a/src/util.cc b/src/util.cc
index be73a6e4..ffa4236b 100644
--- a/src/util.cc
+++ b/src/util.cc
@@ -1555,6 +1555,12 @@ ssize_t parse_content_disposition(char* dest, size_t destlen,
case CD_AFTER_VALUE:
case CD_TOKEN:
return destlen - dlen;
+ case CD_BEFORE_DISPOSITION_PARM_NAME:
+ if ((flags & CD_FILENAME_FOUND) == 0 &&
+ (flags & CD_EXT_FILENAME_FOUND) == 0) {
+ return -1;
+ }
+ return destlen - dlen;
case CD_VALUE_CHARS:
if (charset == CD_ENC_UTF8 && dfa_state != UTF8_ACCEPT) {
return -1;
Is the workaround ok?
well server broken or not it's something happens in the wild and should be handled. wget can recognize all content-disposition replies pretty good but aria2 fails many times
_ttps://drscdn.500px.org/photo/1022280281/q%3D80_m%3D1000/v2?sig=3234f9ae309e3e6a7a31f95ace6754e45c5171a8be579de3ae30a464f239beae
content-disposition: filename=stock-photo-1022280281.jpg
aria2c v1.35 doesn't use the remote name when saving the file, curl does.
@eddiezato @tats-u @tatsuhiro-t @ahmedtds @Baneeishaque @kwkam Does aria still can't autorename file from content-disposition?
@tatsuhiro-t I have been quiet about your opinion, but I found an evidence that trailing semicolons should be permitted. https://datatracker.ietf.org/doc/html/rfc2183#section-3
One of examples of RFC2183 is:
Content-Disposition: attachment; filename=genome.jpeg;
This shows that trailing semicolons are officially allowed.
It is unfortunate that the above links are dead, though.
@tats-u Looks like according to RFC 6266 trailing semicolon must be omitted.
RFC 2616 defines the Content-Disposition response header field, but points out that it is not part of the HTTP/1.1 Standard.
I can't download models from https://huggingface.co due to this bug 😞
I can't download models from https://huggingface.co due to this bug disappointed
you can just wget directly from huggingface. change the /blob/ into /resolve/
you can just wget directly from huggingface. change the /blob/ into /resolve/
Surely I can download using wget or curl, but I want to use aria2c for multipart download.
There's no reason to reject a response from a server just because of its trivial and easily-recoverable syntax "error". As you know servers SHALL generate correct response though.
@popugasik said:
@tats-u Looks like according to RFC 6266 trailing semicolon must be omitted.
It says quite the opposite, actually. The grammar is defined as:
content-disposition = "Content-Disposition" ":"
disposition-type *( ";" disposition-parm )
disposition-type = "inline" | "attachment" | disp-ext-type
; case-insensitive
disp-ext-type = token
disposition-parm = filename-parm | disp-ext-parm
Specifically, the content-disposition
field's final element is *( ";" disposition-parm )
.
Checking the grammar rules in RFC 2616 section 2.1:
(rule1 rule2)
Elements enclosed in parentheses are treated as a single element.
Thus, "(elem (foo | bar) elem)" allows the token sequences "elem
foo elem" and "elem bar elem".
*rule
The character "*" preceding an element indicates repetition. The
full form is "<n>*<m>element" indicating at least <n> and at most
<m> occurrences of element. Default values are 0 and infinity so
that "*(element)" allows any number, including zero; "1*element"
requires at least one; and "1*2element" allows one or two.
we see that *( ";" disposition-parm )
translates to "Zero or more instances of disposition-parm
OR ;
".
Absolutely nothing prohibits having a trailing semicolon, it's explicitly allowed. In fact you could have a hundred of them and still be compliant.
Even if RFC6266 did forbid a trailing semicolon - which it doesn't - rejecting the header makes aria2 functionally unusable on a huge percentage of websites, and is a horrible user experience.
HuggingFace is serving these files from AWS S3 via AWS CloudFront. CloudFront/S3 are responsible for these headers. A significant portion of websites serve large files from S3 via CloudFront. Rejecting this RFC-spec-compliant header breaks all downloads from these sites.
It also (at last check) breaks downloads from recent nginx versions using the default out-of-the-box header configuration.
curl
gets this right. wget
gets this right. python requests
gets this right.
Yet this issue has been open for over five years and aria2 is still getting it wrong...
n.b. on top of the above, section 4.3 of RFC 6266 states:
o Recipients SHOULD strip or replace character sequences that are
known to cause confusion both in user interfaces and in filenames,
such as control characters and leading and trailing whitespace.
I'd argue semicolons fall into that category, though that's probably beside the point.
Aria2 seems to be kind of dead anyway, is there an active aria2 fork out there that doesn't do this kind of nonsense? If not, maybe it's time to fork it now.
nobody willing to start an active fork and fix issues like this?
@tatsuhiro-t you still resist to fix this?
Do you still think curl is wrong, wget is wrong, python is wrong, Amazon is wrong, nginx and several other most popular web servers are wrong but only you are correct?
@tatsuhiro-t while you're active nowadays, can you finally solve this issue before a new version?
@tatsuhiro-t you still resist to fix this?
Do you still think curl is wrong, wget is wrong, python is wrong, Amazon is wrong, nginx and several other most popular web servers are wrong but only you are correct?
Agree, Even it is wrong, Others supported as they can.
nobody willing to start an active fork and fix issues like this?
you really want this to be fixed huh? isn't this a fix? just compile it afterwards
@popugasik said: (...) It says quite the opposite, actually. The grammar is defined as:
content-disposition = "Content-Disposition" ":" disposition-type *( ";" disposition-parm )
(...) we see that
*( ";" disposition-parm )
translates to "Zero or more instances ofdisposition-parm
OR;
".
It translates to zero or more instances of ";"disposition-parm
, there's no "or"
Anyway, I agree aria2 should handle this situation. Maybe something like this at the beginning of ssize_t parse_content_disposition
function (with the hint about wrong syntax):
if (in[len-1]==';'){
A2_LOG_NOTICE(fmt("Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing"));
--len;
}
Output:
>aria2c-x86_64 "https://huggingface.co/microsoft/Orca-2-13b/resolve/main/pytorch_model-00001-of-00006.bin"
11/25 17:57:26 [NOTICE] Downloading 1 item(s)
11/25 17:57:26 [NOTICE] CUID#7 - Redirecting to https://cdn-lfs-us-1.huggingface.co/repos/a2/6b/a26b589f0cd574f46956be719d27409ed2fde99088cd84b2a8e0403d0de10ab4/48b50d9d6ef77522b210c62ec214dd01295fcf88a3160acce32cae23d033ce1d?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model-00001-of-00006.bin%3B+filename%3D%22pytorch_model-00001-of-00006.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1701189679&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwMTE4OTY3OX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2EyLzZiL2EyNmI1ODlmMGNkNTc0ZjQ2OTU2YmU3MTlkMjc0MDllZDJmZGU5OTA4OGNkODRiMmE4ZTA0MDNkMGRlMTBhYjQvNDhiNTBkOWQ2ZWY3NzUyMmIyMTBjNjJlYzIxNGRkMDEyOTVmY2Y4OGEzMTYwYWNjZTMyY2FlMjNkMDMzY2UxZD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=U4omCd1sHHmXoPvqWRCLAUwgfaUsF1gVWJtHDAhnMkgWDT-5NzYycBsVNCLVARFy8k%7ER6RIyWC-0ylvrGb0jW5BWwCIoi7Y007TlHzNX382JVtdWrtSc8yZPVWVwrWiBjrVm5piIT4487rheKUQ8bXxbCnChuBD5ZJIKQ%7E4oSvdVEHgwuhhLfGz9MiaLHC%7E-W85FdVQX3JPBWBgu71yWtOz1qSjNCNRUHGHWTwPcredMfT7jGbdkFiHqfTj4mQFQUXK9-XNRMWiksNvQGrcmB9%7EVf7uisIngiGwqsHwlpX4OmkiyUyiGXlJ%7E9szHI7-6JXi0EXVj2gMqhMbYXhlW8g__&Key-Pair-Id=KCD77M1F0VK2B
11/25 17:57:27 [NOTICE] Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing
11/25 17:57:27 [NOTICE] Allocating disk space. Use --file-allocation=none to disable it. See --file-allocation option in man page for more details.
*** Download Progress Summary as of Sat Nov 25 17:57:37 2023 ***
====================================================================================================================================================
[#3aa206 0B/9.2GiB(0%) CN:1 DL:0B]
FILE: F:/pytorch_model-00001-of-00006.bin
nobody willing to start an active fork and fix issues like this?
you really want this to be fixed huh? isn't this a fix? just compile it afterwards
I don't know how to compile it but I'm not the only one who needs it, as you can see it affects a lot of people, so a fix in the release would be better.
Anyway, I agree aria2 should handle this situation. Maybe something like this at the beginning of
ssize_t parse_content_disposition
function (with the hint about wrong syntax):if (in[len-1]==';'){ A2_LOG_NOTICE(fmt("Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing")); --len; }
I don't think it needs to be logged. It's not something the user needs/wants to know and considering all the other major software and web server behaviour, it's most likely not right to call it a "corruption that was fixed"
@fenopa, similar situation is with 7-Zip. You can find discussions there. Basically, in many cases, Igor refused to open/handle files that doesn't follow specification. Similar to aria2 situation was with WordPress plugin BackWPup generated tar files - GNU tar (and other programs) silently ignored the error, Igor decided to still report error but also unpack all files ( link ).
Here my workflow script file. And here is github-actions release.
Maintainers did not explicitly close this issue or stated they would not accept a pull request, so I suppose anyone could send a PR.
@mgrinzPlayer thanks a lot for that build.
@reschke can you put an end to this? who is right?
The grammar does not allow trailing semicolons.
But that doesn't necessarily mean it would be a problem to allow them.
In any case, I would report site bugs when I'd encounter those.
The grammar does not allow trailing semicolons.
But that doesn't necessarily mean it would be a problem to allow them.
In any case, I would report site bugs when I'd encounter those.
You and @mgrinzPlayer are right, I misread the grammar (think my annoyed-at-the-time brain inserted a |
that wasn't there), but that doesn't change the fact that CloudFront injects a trailing semicolon, therefore a significant chunk of the internet does, and nothing any of us can do will change that.
W.R.T. 7-Zip, Igor's persistent refusal to accept the reality of formats that people are using and want to use has caused just as much strife (if not more) and no doubt had a negative effect on uptake/usage of 7-Zip. It's also not entirely comparable, since in the 7-Zip case we're talking about new, relatively-unknown formats that are not supported by much else in the way of software save for the existing 7-Zip forks.
While in this case, we're talking about something that every other library/client comparable to aria2 handles gracefully, that's widely deployed and accepted across the internet as a whole.
~~I'd be happy to send a PR similar to the above if I had any confidence that it'd be accepted and not just closed with "fix your noncompliant server" (since as discussed above "fixing cloudfront" is entirely beyond what any of us can do)~~ [Edit:] Eh, money where my mouth is. Will open a draft PR using something like what's suggested above shortly...
If cloudfront doesn't follow simple "Content-Disposition" ":" disposition-type *( ";" disposition-parm )
... what other things they messed.