Redirect to a URL with Cyrillic characters is handled incorrectly
I've found that Jsoup fails when a request is redirected to a URL that contains Cyrillic characters. For example, https://dic.academic.ru/dic.nsf/ushakov/780090 redirects to https://academic2.ru/ГЛУШЬ_17885736, which should be percent-encoded as https://academic2.ru/%D0%93%D0%9B%D0%A3%D0%A8%D0%AC_17885736. Instead, the conversion produces https://academic2.ru/?????_17885736. The cause appears to be an incorrect(?) encoding when the header value is converted to bytes in HttpConnection.java:
private static String fixHeaderEncoding(String val) {
    try {
        byte[] bytes = val.getBytes("ISO-8859-1");
I commented out the "ISO-8859-1" argument, and with that change it works properly. Maybe I don't fully understand why a predefined encoding is used here.
Note that it fails only on the Android platform; on a PC it works properly.
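To illustrate the two results described above, here is a minimal Java sketch. It assumes the redirect target is already held as a Unicode string. The class and method names are hypothetical and are not part of Jsoup. It percent-encodes the path segment as UTF-8 and, separately, shows how converting the same string to ISO-8859-1 bytes replaces each Cyrillic character with '?'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jsoup.
class RedirectEncodingDemo {
    // Path segment from the redirect in this report, written with Unicode
    // escapes so the file compiles under any source encoding.
    static final String SEGMENT = "\u0413\u041B\u0423\u0428\u042C_17885736";

    // Percent-encodes a single path segment as UTF-8.
    // URLEncoder applies form encoding (a space would become '+'),
    // which is acceptable here because the segment contains no spaces.
    static String percentEncodeUtf8(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Converts to ISO-8859-1 bytes and back. Characters that ISO-8859-1
    // cannot represent are replaced with '?' by getBytes.
    static String roundTripLatin1(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("https://academic2.ru/" + percentEncodeUtf8(SEGMENT));
        System.out.println("https://academic2.ru/" + roundTripLatin1(SEGMENT));
    }
}
```

The second line printed has the same `?????_17885736` shape as the broken URL in this report, which is consistent with the characters being lost in the getBytes call.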
I think the problem resides in line 395. From what I understand, it checks whether the string is UTF-8 encoded; if it is, it simply returns the string, otherwise it returns a new String object decoded with the UTF-8 charset.
So shouldn't it be like this instead?
...
if (looksLikeUtf8(bytes))
...
That is, without the logical NOT operator (!).
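For reference, here is a hedged sketch of what the negated check is for. It assumes the header value was decoded as ISO-8859-1 by the HTTP layer, so each char holds one raw byte. The helper decodesAsUtf8 is a hypothetical stand-in for looksLikeUtf8 (a strict UTF-8 decode), not Jsoup's actual implementation. With the negation in place, a value whose bytes are not valid UTF-8 is returned unchanged, and only a value whose bytes are valid UTF-8 is re-decoded.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not Jsoup's actual code.
class HeaderEncodingSketch {
    // Stand-in for looksLikeUtf8: true when the bytes decode as strict UTF-8.
    static boolean decodesAsUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Mirrors the shape of the check discussed above, including the negation.
    static String fixHeaderEncoding(String val) {
        // Recover the raw header bytes (valid only if val was Latin-1 decoded).
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        if (!decodesAsUtf8(bytes)) {
            return val; // genuine Latin-1 text: leave it alone
        }
        return new String(bytes, StandardCharsets.UTF_8); // UTF-8 bytes: re-decode
    }

    public static void main(String[] args) {
        // UTF-8 bytes of two Cyrillic letters, as seen after Latin-1 decoding.
        String misdecoded = "\u00D0\u0093\u00D0\u009B";
        System.out.println(fixHeaderEncoding(misdecoded).equals("\u0413\u041B"));
    }
}
```

Dropping the negation would invert both branches: valid UTF-8 bytes would be left mis-decoded, and genuine Latin-1 text would be decoded as UTF-8 and corrupted.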
I think the problem resides in line 395.
No. The problem is exactly where I described. The getBytes call encodes incorrectly, so the characters are already broken before looksLikeUtf8 is reached. As I wrote, it works properly after my change. I also want to emphasize that this happens only on an Android phone; on a PC it works correctly.
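If the value really does arrive already decoded to Unicode on Android (an assumption inferred from the symptoms, not confirmed in this thread), one possible guard is to skip the ISO-8859-1 conversion whenever the string contains a char above U+00FF. The sketch below is hypothetical and is not Jsoup's code or an agreed fix; the names fitsInLatin1 and fixHeaderEncodingGuarded are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical workaround sketch; not part of Jsoup.
class Latin1Guard {
    // True when every char fits in a single ISO-8859-1 byte.
    static boolean fitsInLatin1(String val) {
        for (int i = 0; i < val.length(); i++) {
            if (val.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    static String fixHeaderEncodingGuarded(String val) {
        if (!fitsInLatin1(val)) {
            // Already contains real Unicode; getBytes would turn it into '?'.
            return val;
        }
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A new decoder reports malformed input by default, so this
            // succeeds only when the bytes are valid UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return val; // genuine Latin-1 text: leave it unchanged
        }
    }

    public static void main(String[] args) {
        String alreadyDecoded = "\u0413\u041B\u0423\u0428\u042C_17885736";
        System.out.println(fixHeaderEncodingGuarded(alreadyDecoded).equals(alreadyDecoded));
    }
}
```

With this guard, a Latin-1 decoded header is still repaired, while an already decoded value passes through untouched.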
Are you still impacted by this? I haven't seen other reports of it. That you saw different results on Android vs PC makes me wonder if some Android middleware was causing the issue. I've seen multiple instances where Android implementers have made changes to the URL / network stack.
If it happened in the base Android implementation, perhaps we could find a workaround.
If this issue is still impacting you and you are able to provide the requested information, please feel free to re-open this bug.