rss-bridge
rss-bridge copied to clipboard
IRI (Internationalized Resource Identifier) passed in a call to getSimpleHTMLDOM() should be properly URL-encoded
Describe the bug
Consider the following URL: https://allegro.pl/listing?string=14%20MacBook%20Pro%20wyświetlacz
This is handled perfectly fine by any browser — it was copied from Safari directly after searching that website, in fact — but passed to getSimpleHTMLDOM()
and further down to libcurl will cause 404 Bad Request error.
This is because this is an IRI (Internationalized Resource Identifier), i.e. an URL that has its diacritics url-encoded. However, IRIs are not supported by libcurl, which requires a fully URL-encoded URI (https://curl.se/libcurl/c/CURLOPT_URL.html)
What needs to be done, then, is to have the URL passed to _http_request
broken down and the query part re-encoded again to conform to libcurl requirements.
When I in firefox right-click-copy that url I get this which differs a bit:
https://allegro.pl/listing?string=14%20MacBook%20Pro%20wy%C5%9Bwietlacz
As far as I can tell the url you posted is only half-urlencoded correctly? Not sure.
This issue reminds me of https://github.com/RSS-Bridge/rss-bridge/pull/3058
Some browsers will automatically encode it, but not all – Safari, for instance, doesn't. Chrome and Firefox do.
Also, no, in example above encoded are only the spaces (%20
). These naturally have to be, nothing half-done about it.
Yes so this is a bug in rssbridge. We do not currently url-encode the path or the query string, before passing them to curl.
In addition there is also an additional bug because we don't convert domain name to IDNA ASCII form. E.g. https://teßt.com
We must detect if urls are already encoded. I think these issues requires the use of a proper url component. E.g. spatie/url
or league/uri
. We could also roll it ourselves.
FYI, curl supports IDNA if compiled with support for it (with libidn as dependency, I think). I'd imagine modern curl distributions do that already?
I also don't think you need to do any detection magic. Just split the URL, decode and re-encode the query, and build the URL again. Do it for all the URLs and it will just work.