OpenDirectoryDownloader

Output URLs are not correctly encoded

Open maaaaz opened this issue 1 year ago • 7 comments

Hello there,

I observe that even the latest version of ODD (v3.1.0.1) does not properly encode URLs in the output file.

Let me detail the case:

  1. First, let's run ODD against a website (found at random on the internet) whose paths contain special characters:
$ ./OpenDirectoryDownloader -u "https://gregoirelorieux.net/paysagescomposes/villes/Melle/" --output-file test
[...]
Finshed indexing
[...]
Saving URL list to file..
Saved URL list to file: /tmp/test.txt
  2. Then let's look at the first results in the output file:
$ head test.txt
https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
[...]
  3. If we try to download the first file with wget (or with other download managers), it fails because the URL contains unencoded characters: "#" and spaces.
$ wget -v "https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
--2024-10-29 23:22:12--  https://gregoirelorieux.net/paysagescomposes/villes/Melle/
Resolving gregoirelorieux.net (gregoirelorieux.net)... 213.186.33.87
Connecting to gregoirelorieux.net (gregoirelorieux.net)|213.186.33.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 844 [text/html]
Saving to: ‘index.html’

index.html                              100%[===============================================================================>]     844  --.-KB/s    in 0s

2024-10-29 23:22:13 (550 MB/s) - ‘index.html’ saved [844/844]

Here, the downloaded file:

  • is not the requested one: https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
  • but the one behind this truncated link: https://gregoirelorieux.net/paysagescomposes/villes/Melle/ (wget drops everything from the first "#" onward, because "#" starts the URL fragment, which is never sent to the server)
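
To illustrate why the request ends at the directory, here is a minimal sketch using Python's standard URL parser. It only demonstrates the generic URL rule that "#" starts the fragment; it is not wget's own code, and wget's internals are assumed to follow the same rule based on the log above.

```python
from urllib.parse import urlsplit

raw = (
    "https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
    "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
)
parts = urlsplit(raw)

# The path stops at the first '#'; everything after it is parsed as the fragment.
print(parts.path)      # /paysagescomposes/villes/Melle/
print(parts.fragment)  # 3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
```

Only the scheme, host, and path go into the HTTP request, which matches the index.html that wget saved.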

The correctly encoded link in the ODD output file should be: https://gregoirelorieux.net/paysagescomposes/villes/Melle/%233%2021%20jan/Melle/contrebasse-echantillons/cb-arco-1.aif

Instead of:
https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif

Can you fix it?

The encodeURIComponent function should help.
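
As a rough illustration of the per-component encoding suggested here, the sketch below uses Python's urllib.parse.quote, which is similar to (but not identical to) JavaScript's encodeURIComponent. The helper name encode_segments and the split into a base URL and a relative path are assumptions made for this example; this is not how ODD builds its URLs.

```python
from urllib.parse import quote


def encode_segments(base: str, relative_path: str) -> str:
    """Percent-encode each path segment on its own, keeping '/' as the separator."""
    # safe="" makes quote() encode every reserved character inside a segment,
    # including '#' and ' '.
    encoded = "/".join(quote(segment, safe="") for segment in relative_path.split("/"))
    return base + encoded


base = "https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
path = "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
print(encode_segments(base, path))
```

For the file from this report, the printed URL is the encoded form given above, with "#" as %23 and each space as %20.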

Cheers!

maaaaz avatar Oct 29 '24 22:10 maaaaz

Hi, thanks for letting me know. I'll try to look at it ASAP 😅

KoalaBear84 avatar Oct 31 '24 05:10 KoalaBear84

I tried to build a new version with a partial fix, which may turn out to be the definitive fix for now, but GitHub won't let me because it has deprecated/disabled the older build actions. I will continue another time.

KoalaBear84 avatar Oct 31 '24 06:10 KoalaBear84

I think this should be optional (but maybe the default). I've encountered servers in the past that didn't treat encoded URLs the same as the raw URL, seemingly because they didn't decode them (or didn't decode them correctly). Improving the parsing in the downloader itself, or manually passing a quoted URL to it, should work even with URLs that aren't encoded.
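
A minimal sketch of what "optional, but on by default" could look like when post-processing the URL list in Python. The function name format_url and its encode parameter are hypothetical and exist only in this example; they do not correspond to any ODD option.

```python
from urllib.parse import quote


def format_url(raw_url: str, encode: bool = True) -> str:
    """Return raw_url percent-encoded by default, or untouched when encode is False."""
    if not encode:
        return raw_url
    # ':' and '/' stay literal so the scheme and path separators survive.
    return quote(raw_url, safe=":/")


raw = (
    "https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
    "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
)
print(format_url(raw))                # encoded form, for download managers
print(format_url(raw, encode=False))  # raw form, for servers that mishandle encoding
```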

Chaphasilor avatar Nov 01 '24 09:11 Chaphasilor

I think this should be the default, as download managers do not support unencoded URLs.

In the meantime, here is a Python solution to properly encode the ODD output file:

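$ cat script.py
#!/usr/bin/python3

import sys
import urllib.parse

# Read one URL per line from stdin and print it percent-encoded.
for line in sys.stdin:
    # Keep ':' and '/' literal so the scheme and path separators survive;
    # other reserved characters (including '#', ' ' and '?') are encoded.
    print(urllib.parse.quote(line.strip(), safe=':/'))

$ cat oddresult | python3 script.py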

maaaaz avatar Nov 01 '24 13:11 maaaaz

Thank you for working on this issue! This just burned me in something and I am glad it was already addressed.

josephroosen avatar Nov 28 '24 06:11 josephroosen

Hmm, after some time I've fixed the GitHub Actions issue. Please test if it works now and confirm, or reopen.

KoalaBear84 avatar Feb 15 '25 12:02 KoalaBear84

I just checked the latest version and the output URLs are still not encoded. Could you reopen this issue, please?

maaaaz avatar Apr 11 '25 12:04 maaaaz