URL not properly formed with diacritics/accents not encoded

Open wrobelda opened this issue 11 months ago • 13 comments

Describe the bug If any of the feed query parameters contains diacritic (accented) characters, they are left as is and not percent-encoded, which causes some clients to fail to add the RSS feed with a "URL invalid" error. See: https://stackoverflow.com/questions/33211310/convert-french-accent-to-specific-encoding-in-php

To Reproduce Steps to reproduce the behavior:

  1. For any bridge that takes a text parameter, use a string containing diacritic/accented characters (or copy and paste this: ąśćż)
  2. Generate feed URL
  3. Copy that feed URL to an RSS client of choice (it fails here with TT-RSS at least)
  4. See error

Expected behavior Diacritics/accents should be properly encoded

wrobelda avatar Feb 26 '24 22:02 wrobelda

i think this is a bug in TT-RSS or your browser. im not sure

dvikan avatar Feb 26 '24 23:02 dvikan

> i think this is a bug in TT-RSS or your browser. im not sure

Sorry, what bug? Per RFC 3986, section 2.3, a URL should consist only of a specific character set, which does not contain non-ASCII characters, period. Any other characters need to be converted to UTF-8 and percent-encoded, per RFC 3987.

Meanwhile, RSS-Bridge allows those characters to make it into the URL. Sure, modern browsers and some clients will automatically percent-encode such a query before they send it to a web server, but RSS-Bridge should not rely on that and should instead generate a feed URL that conforms to the standards.

See also: https://www.w3schools.com/tags/ref_urlencode.ASP
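For illustration, here is a minimal sketch of the encoding being asked for, using Python's standard library rather than RSS-Bridge's own PHP code. The helper name `percent_encode` is hypothetical; it converts the value to UTF-8 and percent-encodes every byte outside the unreserved set.

```python
from urllib.parse import quote


def percent_encode(value: str) -> str:
    """Percent-encode a single query value as UTF-8 (hypothetical helper)."""
    # safe="" means reserved characters such as "/" and "&" are escaped too;
    # letters, digits and "-._~" are always left untouched by quote().
    return quote(value, safe="", encoding="utf-8")


encoded = percent_encode("ąśćż")
print(encoded)  # %C4%85%C5%9B%C4%87%C5%BC
```

The result contains only ASCII characters, so a client that validates URLs strictly should accept it.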

wrobelda avatar Feb 26 '24 23:02 wrobelda

are you copy pasting url from browser?

are you talking about those urls that are produced inside <link> tags?

i was unable to reproduce. using firefox.

dvikan avatar Feb 27 '24 00:02 dvikan

Reproducible.

Search and result on Reddit with a German umlaut "ä". Similar problem to the accented French characters. [image]

RSS-Bridge config: [image]

Result on dvikan's public instance: [image]

Bockiii avatar Feb 27 '24 12:02 Bockiii

okay i get it.

it happens when parameters are used in http requests without url encoding them.

in the particular case of RedditBridge a solution is to manually url encode the user input parts.
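A rough sketch of what encoding the user input parts can look like when a bridge builds its upstream request URL. This is Python for illustration only, not the actual RedditBridge code; the function name, base URL and parameter names are assumptions.

```python
from urllib.parse import urlencode


def build_search_url(base: str, user_query: str) -> str:
    """Build an upstream request URL from user input (hypothetical helper)."""
    # urlencode() percent-encodes every value as UTF-8, so raw user input
    # never reaches the HTTP client unescaped.
    return base + "?" + urlencode({"q": user_query, "sort": "new"})


url = build_search_url("https://example.com/search.json", "käse")
print(url)  # https://example.com/search.json?q=k%C3%A4se&sort=new
```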

related: https://github.com/RSS-Bridge/rss-bridge/issues/3091

dvikan avatar Feb 27 '24 17:02 dvikan

> in the particular case of RedditBridge a solution is to manually url encode the user input parts.

That means each and every bridge has to handle encoding themselves for each of their arbitrary string inputs, whereas RSS-Bridge could do this itself once by encoding the complete feed URL it generated. There's no harm here: any characters needing encoding will get encoded, otherwise it will be left as is.

Not to mention the bridge code should not be concerned with things like that: its scope is to prepare articles and their content in UTF-8, not to handle the intricacies of HTTP communication between the RSS-Bridge server and an RSS client.
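As a sketch of the "encode the complete feed URL once" idea, assuming Python's standard library stands in for whatever RSS-Bridge would use: the function name `normalize_feed_url`, the `SAFE` set and the example host are all hypothetical. Delimiters and existing `%XX` escapes are left alone, and only characters that are not allowed in a URL get encoded.

```python
from urllib.parse import quote

# Characters kept as-is: URL delimiters, plus "%" so existing escapes survive.
SAFE = "%/:=&?~#+!$,;'@()*[]"


def normalize_feed_url(url: str) -> str:
    """Percent-encode only what needs encoding in a complete URL (sketch)."""
    return quote(url, safe=SAFE)


raw = "https://rss-bridge.example/?action=display&bridge=FilterBridge&filter=ąśćż&format=Html"
fixed = normalize_feed_url(raw)
print(fixed)
# https://rss-bridge.example/?action=display&bridge=FilterBridge&filter=%C4%85%C5%9B%C4%87%C5%BC&format=Html
```

Running the function on its own output changes nothing, because `%` is treated as safe. The trade-off is that a literal `%` typed by the user would not be escaped, so encoding each value when the URL is built is still the more robust option.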

No offense, but I think you downplay the seriousness of this issue for any non-ASCII languages.

wrobelda avatar Mar 12 '24 12:03 wrobelda

I like your arguments. Okay let me dwell a bit on it.

dvikan avatar Mar 12 '24 19:03 dvikan

@Bockiii fixed for reddit in https://github.com/RSS-Bridge/rss-bridge/pull/4010

dvikan avatar Mar 12 '24 23:03 dvikan

i have discovered that curl will automatically escape the url if needed.

but if curl detects an already escaped url, it will NOT escape.

so this particular error only happens if a url is already partially escaped (as was the case with RedditBridge).
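A toy model of the behaviour described above, to show why a partially escaped URL slips through. This is not curl's actual implementation; `escape_if_needed` and the example URLs are assumptions made for illustration.

```python
import re
from urllib.parse import quote

# Matches an existing percent-escape such as "%20".
ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def escape_if_needed(url: str) -> str:
    """Escape a URL unless it already looks escaped (toy heuristic)."""
    if ESCAPE.search(url):
        # Any existing %XX is taken as proof that the whole URL is escaped.
        return url
    return quote(url, safe="/:=&?#")


fully_raw = "https://example.com/search?q=ä b"
partially_escaped = "https://example.com/search?q=ä%20b"

print(escape_if_needed(fully_raw))          # https://example.com/search?q=%C3%A4%20b
print(escape_if_needed(partially_escaped))  # unchanged, still contains a raw "ä"
```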

dvikan avatar Mar 31 '24 01:03 dvikan

> i have discovered that curl will automatically escape the url if needed.
>
> but if curl detects an already escaped url, it will NOT escape.
>
> so this particular error only happens if a url is already partially escaped (as was the case with RedditBridge).

The problem here is not with how RSS-Bridge handles that internally (i.e. the curl lib that it uses), but on the outside, i.e. with the RSS clients that you pass the unescaped RSS-Bridge URL to.

In other words, we need to make sure that the URL generated and returned to the user (opened in a new browser tab) by RSS-Bridge after you click "Generate Feed" is properly formed.

wrobelda avatar Apr 01 '24 22:04 wrobelda

im confused now. can you give an example?

dvikan avatar Apr 01 '24 23:04 dvikan

for the record i did some changes related to this issue in https://github.com/RSS-Bridge/rss-bridge/commit/545dc969d35bc8c94a8c15875562690ee2fd6605 but they are a refactor (should not be externally visible changes)

dvikan avatar Apr 04 '24 17:04 dvikan

here is a URL (manually copied from firefox url bar).

its HTML has the URLs properly encoded (as you requested)

it has always been like this as far as I can tell.

https://rss-bridge.org/bridge01/?action=display&bridge=FilterBridge&url=https%3A%2F%2Florem-rss.herokuapp.com%2Ffeed%3Funit%3Dday&filter=%C4%85%C5%9B%C4%87%C5%BC&filter_type=permit&target_title=on&length_limit=-1&format=Html
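One way to check that the URL above is fully encoded is to parse it and decode its query parameters. The sketch below uses Python's standard library; only the URL itself comes from this thread.

```python
from urllib.parse import parse_qs, urlsplit

feed_url = (
    "https://rss-bridge.org/bridge01/?action=display&bridge=FilterBridge"
    "&url=https%3A%2F%2Florem-rss.herokuapp.com%2Ffeed%3Funit%3Dday"
    "&filter=%C4%85%C5%9B%C4%87%C5%BC&filter_type=permit&target_title=on"
    "&length_limit=-1&format=Html"
)

# parse_qs() splits the query string and decodes each %XX escape as UTF-8.
params = parse_qs(urlsplit(feed_url).query)

print(feed_url.isascii())   # True: no raw non-ASCII characters remain
print(params["filter"][0])  # ąśćż
print(params["url"][0])     # https://lorem-rss.herokuapp.com/feed?unit=day
```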

pls give example of a non-encoded url being produced

@Mynacol pls give feedback on this issue.

dvikan avatar Jun 18 '24 19:06 dvikan