mygpo icon indicating copy to clipboard operation
mygpo copied to clipboard

Entity %23 is removed from feed URL

Open quinncomendant opened this issue 6 years ago • 4 comments

Gpodder appears to remove # characters from podcast feed URLs.

I have a podcast with the encoded entity %23 (which is a #) in the URL, which is required because it is a query parameter containing an access code (abc#):

http://example.com/feed.xml?c=abc%23

The feed is able to subscribe in gpodder. However, internally to gpodder, the %23 is stripped off. The feed never updates, and if I copy the URL from the feed RSS icon here:

screen shot 2018-04-16 at 17 21 57

It shows this is the URL in gpodder:

    http://example.com/feed.xml?c=abc

It seems gpodder is removing the # character for some reason (access codes containing other characters are retained fine).

Sorry, I can't share the podcast URL I'm using because it cannot be shared publicly. Email me if you'd like to test it.

quinncomendant avatar Apr 16 '18 09:04 quinncomendant

Hi @quinncomendant :smile: I'd triage this issue. Could you please send an email containing the URL to the podcast you added? My email is [email protected]

SiqingYu avatar Feb 01 '20 09:02 SiqingYu

A bug results in podcast URLs in the database like this: http://www.npr.org/rss/podcast.php?id=510300http:%2525252F%25252Fwww.npr.org%2525252Frss%252Fpodcast.php%2525253Fid=510300 The 25 is the result of encoded % (percent sign), and the 2F is the result of / (slash). It seems that % is encoded using urllib.parse.quote every time a podcast is updated and then not appropriately decoded.

SiqingYu avatar Mar 26 '20 15:03 SiqingYu

From urllib.parse — Parse URLs into components — Python 3.8.2 documentation

The URL quoting functions focus on taking program data and making it safe for use as URL components by quoting special characters and appropriately encoding non-ASCII text.

URL My opinion is that we shouldn't use urllib.parse.quote to deal with user-submitted URLs; otherwise, we have to handle the error-prone encoding and decoding process. Instead, we should urllib.parse.quote only for URLs constructed in code.

Another thing worth noting is that Django provides django.utils.http.urlencode, a wrapper utility for urllib.parse.quote.

An example in the wild is Sentry. I found that Sentry doesn't use urllib.parse.quote/unquote in their code at all. They only use django.utils.http.urlencode for their auto-generated URLs.

SiqingYu avatar Mar 27 '20 04:03 SiqingYu

I wanted to help with this one, albeit note I'm not really familiar with the whole codebase. After testing this locally, I see (at least) two issues here:

The first issue is that mygpo-feedservice calls unquote on the feed url and therefore can't handle URLs with %23 unless quoted twice. This can be observed live by typing http://example.com/test?key=abc%23def here. The service fetches http://example.com/test?key=abc instead. This is what makes me unable to add the podcast via the "Missing podcast" feature because the correct feed is not fetched.

The second issue I see is that normalize_feed_url is not idempotent when it comes to the query part. I'm not sure I got this right, but the way I understand it is that the urllib.parse.quote function used there isn't meant to encode the input, but rather to quote only what should be quoted but currently isn't. This currently works for the path but not the query. So for example this:

http://example.com/a%3Ab:c?query=a%3Abc

gets normalized into

http://example.com/a%3Ab%3Ac?query=a%253Abc

while IMO it should be

http://example.com/a%3Ab%3Ac?query=a%3Abc

So when normalize_feed_url is invoked multiple times, we can end up with URLs like

http://example.com/a%3Ab%3Ac?query=a%252525253Abc

If these issues are relevant, I can sumbit PRs to fix both, but there's a problem because in some cases these issues cancel each other. So those'd need to be fixed together. But fixing the problem with feedservice might break some URLs quoted twice that already exist in the production database. Fixing those 'd probably require a data migration.

BTW, I wasn't able to reproduce the original case where # is not even saved to the database. When I create and update the podcast via the API, it seems to work for me because of those 2 bugs (the url contains %2523, feedservice unquotes it).

zb3 avatar May 01 '20 21:05 zb3