Incorrect working with redirect to site with cyrillic symbols in name

Open mirage93i opened this issue 5 years ago • 2 comments

I've found that Jsoup fails when a request is redirected to a URL whose path contains Cyrillic characters. For example, https://dic.academic.ru/dic.nsf/ushakov/780090 redirects to https://academic2.ru/ГЛУШЬ_17885736, which should be percent-encoded as https://academic2.ru/%D0%93%D0%9B%D0%A3%D0%A8%D0%AC_17885736. Instead, the conversion produces https://academic2.ru/?????_17885736. The cause appears to be the incorrect(?) encoding used when the header value is converted to bytes in HttpConnection.java:

    private static String fixHeaderEncoding(String val) {
        try {
            byte[] bytes = val.getBytes("ISO-8859-1");

I commented out the "ISO-8859-1" argument and it works properly. Maybe I don't fully understand why a predefined encoding is used here.

In general, this fails on the Android platform but works properly on a PC.

mirage93i avatar Apr 14 '20 04:04 mirage93i
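The following is a minimal sketch of the behavior described in the report, not jsoup's actual code. The class and method names (HeaderEncodingDemo, roundTrip, asRawHeader, percentEncode) are hypothetical. It assumes the difference between platforms comes down to whether the header value reaches jsoup as raw UTF-8 bytes exposed one byte per char, or as already-decoded Unicode text; that assumption is not confirmed anywhere in the thread. Cyrillic characters have no ISO-8859-1 mapping, so getBytes substitutes '?' for each one.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class HeaderEncodingDemo {
    // Encode as ISO-8859-1, then decode as UTF-8: the round trip that the
    // quoted snippet starts with getBytes("ISO-8859-1").
    static String roundTrip(String val) {
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Assumed "PC" case: the UTF-8 bytes of the header, one byte per char.
    static String asRawHeader(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Percent-encode a single path segment as UTF-8.
    static String percentEncode(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // The Cyrillic path segment from the report, written with Unicode
        // escapes so the source file encoding does not matter.
        String decoded = "\u0413\u041B\u0423\u0428\u042C_17885736";

        // Already-decoded text: every Cyrillic char becomes '?'.
        System.out.println(roundTrip(decoded)); // ?????_17885736

        // Raw bytes-as-chars: the round trip restores the original text.
        System.out.println(roundTrip(asRawHeader(decoded)).equals(decoded)); // true

        // The encoding the redirect target should end up with.
        System.out.println(percentEncode(decoded)); // %D0%93%D0%9B%D0%A3%D0%A8%D0%AC_17885736
    }
}
```

If that assumption holds, the round trip is only safe when the header value has not already been decoded, which would be consistent with the result differing between Android and a PC.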

I think the problem resides in line 395. From what I understand, it checks whether the string is UTF-8 encoded: if it is, it simply returns the string; otherwise it returns a new String object with the UTF-8 charset.

So shouldn't it be like this instead?

...
if (looksLikeUtf8(bytes))
...

without the logical NOT operator (!)

MootezSaaD avatar Apr 14 '20 09:04 MootezSaaD

I think the problem resides in line 395.

No, it is exactly as I described. getBytes encodes the value incorrectly, so the characters are already broken before looksLikeUtf8 is called. As I wrote, it works properly after my change. I also want to emphasize that the problem appears only on an Android phone; on a PC it works correctly.

mirage93i avatar Apr 14 '20 10:04 mirage93i
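To make the disagreement about the negation concrete, here is a hedged sketch of the control flow under discussion. It is not jsoup's implementation: FixHeaderSketch, decodesAsUtf8, and fixHeader are hypothetical names, and decodesAsUtf8 is a strict-decoder stand-in for looksLikeUtf8, whose real logic may differ. The sketch suggests the negated check is the sensible form (a genuine ISO-8859-1 value is returned untouched), and that an already-decoded Cyrillic value is reduced to '?' bytes before the check ever runs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class FixHeaderSketch {
    // Hypothetical stand-in for looksLikeUtf8: true if the bytes decode
    // as UTF-8 without any malformed sequence.
    static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Same shape as the flow discussed above: keep the value when its bytes
    // are not UTF-8, otherwise re-decode those bytes as UTF-8.
    static String fixHeader(String val) {
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        if (!decodesAsUtf8(bytes)) {
            return val;
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A real ISO-8859-1 value: byte 0xE9 alone is not valid UTF-8,
        // so the value is returned unchanged.
        System.out.println(fixHeader("caf\u00E9").equals("caf\u00E9")); // true

        // UTF-8 bytes exposed one byte per char are re-decoded correctly.
        String cyrillic = "\u0413\u041B\u0423\u0428\u042C";
        String raw = new String(cyrillic.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(fixHeader(raw).equals(cyrillic)); // true

        // Already-decoded Cyrillic: getBytes has produced '?' bytes before
        // the check runs, so the check cannot repair the value.
        System.out.println(fixHeader(cyrillic)); // ?????
    }
}
```

Removing the negation in this sketch would re-decode the "café" bytes as UTF-8 and corrupt the final character, while the Cyrillic case would still come out as question marks.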

Are you still impacted by this? I haven't seen other reports of this. That you saw different results on Android vs PC makes me wonder if some Android middleware was causing the issue. I've seen multiple instances where various Android implementers have made changes to the URL / network stack.

If it happened in the base Android impl, perhaps we could find a workaround.

jhy avatar Jan 06 '23 22:01 jhy

If this issue is still impacting you and you are able to provide the requested information, please feel free to re-open this bug.

jhy avatar Jan 24 '23 09:01 jhy