aria2 icon indicating copy to clipboard operation
aria2 copied to clipboard

Check the header “Content-Disposition” to determine the filename to store

Open tats-u opened this issue 7 years ago • 30 comments

For example, the link http://audeering.com/download/1318/ (linked from http://audeering.com/technology/opensmile/#download) provides a Gzip-compressed tarball, but aria2 stored the tarball as the name index.html by the command aria2c -x4 http://audeering.com/download/1318/. This link returns the header Content-Disposition as attachment; filename="opensmile-2.3.0.tar.gz";, so aria2 must interpret it and store the tarball as the name opensmile-2.3.0.tar.gz.

tats-u avatar Jan 22 '18 05:01 tats-u

+1

Baneeishaque avatar Jan 23 '18 19:01 Baneeishaque

aria2 supports content-disposition header field, and the value which server sent is syntactically incorrect. The trailing ";" must not be there. The server is broken.

tatsuhiro-t avatar Feb 12 '18 09:02 tatsuhiro-t

A simple workaround for this:

diff --git a/src/util.cc b/src/util.cc
index be73a6e4..ffa4236b 100644
--- a/src/util.cc
+++ b/src/util.cc
@@ -1555,6 +1555,12 @@ ssize_t parse_content_disposition(char* dest, size_t destlen,
   case CD_AFTER_VALUE:
   case CD_TOKEN:
     return destlen - dlen;
+  case CD_BEFORE_DISPOSITION_PARM_NAME:
+    if ((flags & CD_FILENAME_FOUND) == 0 &&
+        (flags & CD_EXT_FILENAME_FOUND) == 0) {
+      return -1;
+    }
+    return destlen - dlen;
   case CD_VALUE_CHARS:
     if (charset == CD_ENC_UTF8 && dfa_state != UTF8_ACCEPT) {
       return -1;

kwkam avatar Feb 12 '18 10:02 kwkam

Is the workaround ok?

Baneeishaque avatar Feb 12 '18 13:02 Baneeishaque

well server broken or not it's something happens in the wild and should be handled. wget can recognize all content-disposition replies pretty good but aria2 fails many times

ahmedtds avatar Oct 08 '19 01:10 ahmedtds

_ttps://drscdn.500px.org/photo/1022280281/q%3D80_m%3D1000/v2?sig=3234f9ae309e3e6a7a31f95ace6754e45c5171a8be579de3ae30a464f239beae

content-disposition: filename=stock-photo-1022280281.jpg

aria2c v1.35 doesn't use the remote name when saving the file, curl does.

eddiezato avatar Sep 24 '20 04:09 eddiezato

@eddiezato @tats-u @tatsuhiro-t @ahmedtds @Baneeishaque @kwkam Does aria still can't autorename file from content-disposition?

shadowzoom avatar Jan 14 '21 07:01 shadowzoom

@tatsuhiro-t I have been quiet about your opinion, but I found an evidence that trailing semicolons should be permitted. https://datatracker.ietf.org/doc/html/rfc2183#section-3

One of examples of RFC2183 is:

Content-Disposition: attachment; filename=genome.jpeg;

This shows that trailing semicolons are officially allowed.

It is unfortunate that the above links are dead, though.

tats-u avatar May 16 '21 12:05 tats-u

@tats-u Looks like according to RFC 6266 trailing semicolon must be omitted.

RFC 2616 defines the Content-Disposition response header field, but points out that it is not part of the HTTP/1.1 Standard.

popugasik avatar Oct 13 '22 15:10 popugasik

I can't download models from https://huggingface.co due to this bug 😞

nikicat avatar Mar 02 '23 19:03 nikicat

I can't download models from https://huggingface.co due to this bug disappointed

you can just wget directly from huggingface. change the /blob/ into /resolve/

shirooo39 avatar Mar 19 '23 16:03 shirooo39

you can just wget directly from huggingface. change the /blob/ into /resolve/

Surely I can download using wget or curl, but I want to use aria2c for multipart download.

nikicat avatar Mar 19 '23 20:03 nikicat

There's no reason to reject a response from a server just because of its trivial and easily-recoverable syntax "error". As you know servers SHALL generate correct response though.

tats-u avatar Mar 20 '23 09:03 tats-u

@popugasik said:

@tats-u Looks like according to RFC 6266 trailing semicolon must be omitted.

It says quite the opposite, actually. The grammar is defined as:

     content-disposition = "Content-Disposition" ":"
                            disposition-type *( ";" disposition-parm )

     disposition-type    = "inline" | "attachment" | disp-ext-type
                         ; case-insensitive
     disp-ext-type       = token

     disposition-parm    = filename-parm | disp-ext-parm

Specifically, the content-disposition field's final element is *( ";" disposition-parm ).
Checking the grammar rules in RFC 2616 section 2.1:

   (rule1 rule2)
      Elements enclosed in parentheses are treated as a single element.
      Thus, "(elem (foo | bar) elem)" allows the token sequences "elem
      foo elem" and "elem bar elem".

   *rule
      The character "*" preceding an element indicates repetition. The
      full form is "<n>*<m>element" indicating at least <n> and at most
      <m> occurrences of element. Default values are 0 and infinity so
      that "*(element)" allows any number, including zero; "1*element"
      requires at least one; and "1*2element" allows one or two.

we see that *( ";" disposition-parm ) translates to "Zero or more instances of disposition-parm OR ;".

Absolutely nothing prohibits having a trailing semicolon, it's explicitly allowed. In fact you could have a hundred of them and still be compliant.

Even if RFC6266 did forbid a trailing semicolon - which it doesn't - rejecting the header makes aria2 functionally unusable on a huge percentage of websites, and is a horrible user experience.

HuggingFace is serving these files from AWS S3 via AWS CloudFront. CloudFront/S3 are responsible for these headers. A significant portion of websites serve large files from S3 via CloudFront. Rejecting this RFC-spec-compliant header breaks all downloads from these sites.

It also (at last check) breaks downloads from recent nginx versions using the default out-of-the-box header configuration.

curl gets this right. wget gets this right. python requests gets this right.

Yet this issue has been open for over five years and aria2 is still getting it wrong...


n.b. on top of the above, section 4.3 of RFC 6266 states:

   o  Recipients SHOULD strip or replace character sequences that are
      known to cause confusion both in user interfaces and in filenames,
      such as control characters and leading and trailing whitespace.

I'd argue semicolons fall into that category, though that's probably beside the point.

neggles avatar Jul 16 '23 04:07 neggles

Aria2 seems to be kind of dead anyway, is there an active aria2 fork out there that doesn't do this kind of nonsense? If not, maybe it's time to fork it now.

fenopa avatar Jul 19 '23 19:07 fenopa

nobody willing to start an active fork and fix issues like this?

fenopa avatar Aug 15 '23 23:08 fenopa

@tatsuhiro-t you still resist to fix this?

Do you still think curl is wrong, wget is wrong, python is wrong, Amazon is wrong, nginx and several other most popular web servers are wrong but only you are correct?

fenopa avatar Nov 03 '23 18:11 fenopa

@tatsuhiro-t while you're active nowadays, can you finally solve this issue before a new version?

fenopa avatar Nov 14 '23 19:11 fenopa

@tatsuhiro-t you still resist to fix this?

Do you still think curl is wrong, wget is wrong, python is wrong, Amazon is wrong, nginx and several other most popular web servers are wrong but only you are correct?

Agree, Even it is wrong, Others supported as they can.

jyxjjj avatar Nov 24 '23 05:11 jyxjjj

nobody willing to start an active fork and fix issues like this?

you really want this to be fixed huh? isn't this a fix? just compile it afterwards

grandpa1946 avatar Nov 24 '23 22:11 grandpa1946

@popugasik said: (...) It says quite the opposite, actually. The grammar is defined as:

     content-disposition = "Content-Disposition" ":"
                            disposition-type *( ";" disposition-parm )

(...) we see that *( ";" disposition-parm ) translates to "Zero or more instances of disposition-parm OR ;".

It translates to zero or more instances of ";"disposition-parm, there's no "or"

Anyway, I agree aria2 should handle this situation. Maybe something like this at the beginning of ssize_t parse_content_disposition function (with the hint about wrong syntax):

  if (in[len-1]==';'){
    A2_LOG_NOTICE(fmt("Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing"));
    --len;
  }

Output:

>aria2c-x86_64 "https://huggingface.co/microsoft/Orca-2-13b/resolve/main/pytorch_model-00001-of-00006.bin"

11/25 17:57:26 [NOTICE] Downloading 1 item(s)

11/25 17:57:26 [NOTICE] CUID#7 - Redirecting to https://cdn-lfs-us-1.huggingface.co/repos/a2/6b/a26b589f0cd574f46956be719d27409ed2fde99088cd84b2a8e0403d0de10ab4/48b50d9d6ef77522b210c62ec214dd01295fcf88a3160acce32cae23d033ce1d?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model-00001-of-00006.bin%3B+filename%3D%22pytorch_model-00001-of-00006.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1701189679&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwMTE4OTY3OX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2EyLzZiL2EyNmI1ODlmMGNkNTc0ZjQ2OTU2YmU3MTlkMjc0MDllZDJmZGU5OTA4OGNkODRiMmE4ZTA0MDNkMGRlMTBhYjQvNDhiNTBkOWQ2ZWY3NzUyMmIyMTBjNjJlYzIxNGRkMDEyOTVmY2Y4OGEzMTYwYWNjZTMyY2FlMjNkMDMzY2UxZD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=U4omCd1sHHmXoPvqWRCLAUwgfaUsF1gVWJtHDAhnMkgWDT-5NzYycBsVNCLVARFy8k%7ER6RIyWC-0ylvrGb0jW5BWwCIoi7Y007TlHzNX382JVtdWrtSc8yZPVWVwrWiBjrVm5piIT4487rheKUQ8bXxbCnChuBD5ZJIKQ%7E4oSvdVEHgwuhhLfGz9MiaLHC%7E-W85FdVQX3JPBWBgu71yWtOz1qSjNCNRUHGHWTwPcredMfT7jGbdkFiHqfTj4mQFQUXK9-XNRMWiksNvQGrcmB9%7EVf7uisIngiGwqsHwlpX4OmkiyUyiGXlJ%7E9szHI7-6JXi0EXVj2gMqhMbYXhlW8g__&Key-Pair-Id=KCD77M1F0VK2B

11/25 17:57:27 [NOTICE] Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing

11/25 17:57:27 [NOTICE] Allocating disk space. Use --file-allocation=none to disable it. See --file-allocation option in man page for more details.
 *** Download Progress Summary as of Sat Nov 25 17:57:37 2023 ***
====================================================================================================================================================
[#3aa206 0B/9.2GiB(0%) CN:1 DL:0B]
FILE: F:/pytorch_model-00001-of-00006.bin

mgrinzPlayer avatar Nov 25 '23 17:11 mgrinzPlayer

nobody willing to start an active fork and fix issues like this?

you really want this to be fixed huh? isn't this a fix? just compile it afterwards

I don't know how to compile it but I'm not the only one who needs it, as you can see it affects a lot of people, so a fix in the release would be better.

fenopa avatar Nov 27 '23 20:11 fenopa

Anyway, I agree aria2 should handle this situation. Maybe something like this at the beginning of ssize_t parse_content_disposition function (with the hint about wrong syntax):

  if (in[len-1]==';'){
    A2_LOG_NOTICE(fmt("Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing"));
    --len;
  }

I don't think it needs to be logged. It's not something the user needs/wants to know and considering all the other major software and web server behaviour, it's most likely not right to call it a "corruption that was fixed"

fenopa avatar Nov 27 '23 20:11 fenopa

@fenopa, similar situation is with 7-Zip. You can find discussions there. Basically, in many cases, Igor refused to open/handle files that doesn't follow specification. Similar to aria2 situation was with WordPress plugin BackWPup generated tar files - GNU tar (and other programs) silently ignored the error, Igor decided to still report error but also unpack all files ( link ).

Here my workflow script file. And here is github-actions release.

mgrinzPlayer avatar Nov 28 '23 00:11 mgrinzPlayer

Maintainers did not explicitly close this issue or stated they would not accept a pull request, so I suppose anyone could send a PR.

pawamoy avatar Nov 28 '23 17:11 pawamoy

@mgrinzPlayer thanks a lot for that build.

fenopa avatar Nov 28 '23 18:11 fenopa

@reschke can you put an end to this? who is right?

fenopa avatar Dec 08 '23 01:12 fenopa

The grammar does not allow trailing semicolons.

But that doesn't necessarily mean it would be a problem to allow them.

In any case, I would report site bugs when I'd encounter those.

reschke avatar Dec 08 '23 06:12 reschke

The grammar does not allow trailing semicolons.

But that doesn't necessarily mean it would be a problem to allow them.

In any case, I would report site bugs when I'd encounter those.

You and @mgrinzPlayer are right, I misread the grammar (think my annoyed-at-the-time brain inserted a | that wasn't there), but that doesn't change the fact that CloudFront injects a trailing semicolon, therefore a significant chunk of the internet does, and nothing any of us can do will change that.

W.R.T. 7-Zip, Igor's persistent refusal to accept the reality of formats that people are using and want to use has caused just as much strife (if not more) and no doubt had a negative effect on uptake/usage of 7-Zip. It's also not entirely comparable, since in the 7-Zip case we're talking about new, relatively-unknown formats that are not supported by much else in the way of software save for the existing 7-Zip forks.

While in this case, we're talking about something that every other library/client comparable to aria2 handles gracefully, that's widely deployed and accepted across the internet as a whole.

~~I'd be happy to send a PR similar to the above if I had any confidence that it'd be accepted and not just closed with "fix your noncompliant server" (since as discussed above "fixing cloudfront" is entirely beyond what any of us can do)~~ [Edit:] Eh, money where my mouth is. Will open a draft PR using something like what's suggested above shortly...

neggles avatar Dec 10 '23 10:12 neggles

If cloudfront doesn't follow simple "Content-Disposition" ":" disposition-type *( ";" disposition-parm ) ... what other things they messed.

mgrinzPlayer avatar Dec 10 '23 11:12 mgrinzPlayer