rss-bridge IRI (Internationalized Resource Identifier) passed in a call to getSimpleHTMLDOM() should be properly URL-encoded

IRI (Internationalized Resource Identifier) passed in a call to getSimpleHTMLDOM() should be properly URL-encoded

Open wrobelda opened this issue 2 years ago • 4 comments

Describe the bug Consider the following URL: https://allegro.pl/listing?string=14%20MacBook%20Pro%20wyświetlacz This is handled perfectly fine by any browser — it was copied from Safari directly after searching that website, in fact — but passed to getSimpleHTMLDOM() and further down to libcurl will cause 404 Bad Request error.

This is because this is an IRI (Internationalized Resource Identifier), i.e. an URL that has its diacritics url-encoded. However, IRIs are not supported by libcurl, which requires a fully URL-encoded URI (https://curl.se/libcurl/c/CURLOPT_URL.html)

What needs to be done, then, is to have the URL passed to _http_request broken down and the query part re-encoded again to conform to libcurl requirements.

Oct 06 '22 14:10 wrobelda

When I in firefox right-click-copy that url I get this which differs a bit:

https://allegro.pl/listing?string=14%20MacBook%20Pro%20wy%C5%9Bwietlacz

As far as I can tell the url you posted is only half-urlencoded correctly? Not sure.

This issue reminds me of https://github.com/RSS-Bridge/rss-bridge/pull/3058

Oct 06 '22 19:10 dvikan

Some browsers will automatically encode it, but not all – Safari, for instance, doesn't. Chrome and Firefox do.

Also, no, in example above encoded are only the spaces (%20). These naturally have to be, nothing half-done about it.

Oct 06 '22 20:10 wrobelda

Yes so this is a bug in rssbridge. We do not currently url-encode the path or the query string, before passing them to curl.

In addition there is also an additional bug because we don't convert domain name to IDNA ASCII form. E.g. https://teßt.com

We must detect if urls are already encoded. I think these issues requires the use of a proper url component. E.g. spatie/url or league/uri. We could also roll it ourselves.

Oct 06 '22 22:10 dvikan

FYI, curl supports IDNA if compiled with support for it (with libidn as dependency, I think). I'd imagine modern curl distributions do that already?

I also don't think you need to do any detection magic. Just split the URL, decode and re-encode the query, and build the URL again. Do it for all the URLs and it will just work.

Oct 06 '22 23:10 wrobelda

rss-bridge rss-bridge copied to clipboard

IRI (Internationalized Resource Identifier) passed in a call to getSimpleHTMLDOM() should be properly URL-encoded

rss-bridge
rss-bridge copied to clipboard