jsoup

Incorrect header encoding conversion

Open 821938089 opened this issue 2 years ago • 11 comments

        private static String fixHeaderEncoding(String val) {
            byte[] bytes = val.getBytes(ISO_8859_1);
            if (!looksLikeUtf8(bytes))
                return val;
            return new String(bytes, UTF_8);
        }

This encoding conversion is wrong: you cannot restore the original bytes from a string without knowing which encoding it was decoded with. The conversion causes some characters to be lost.

image

Related references: https://stackoverflow.com/a/39308860

By the way: when will the next version be released?

821938089 avatar Oct 15 '23 13:10 821938089

Can you provide this as code rather than screenshots so that I can review it?

jhy avatar Oct 18 '23 01:10 jhy

Is this it? search.php?search=我的

821938089 avatar Oct 18 '23 06:10 821938089

OK, I've moved the re-encoding fix-up so that it applies only to response headers. It's in place to fix #706, where the header was treated as ISO-8859-1 but actually held UTF-8 bytes. Browsers seem to do this fix-up too, so the solution seems necessary. We do need better tests for this: I wasn't able to get Jetty to emit the header incorrectly, so I can't directly add a test case.
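To illustrate the case the fix-up targets, here is a minimal sketch, assuming the server sends UTF-8 bytes and the HTTP client decodes them as ISO-8859-1. The class and the `reinterpretAsUtf8` helper are hypothetical names for this example; the helper repeats the round trip from `fixHeaderEncoding` without the `looksLikeUtf8` check.

```java
import java.nio.charset.StandardCharsets;

public class HeaderRepairDemo {
    // Hypothetical helper: the same byte round trip as fixHeaderEncoding,
    // minus the looksLikeUtf8 guard.
    static String reinterpretAsUtf8(String val) {
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u6211\u7684" is the search term from this issue.
        String original = "search.php?search=\u6211\u7684";

        // Bytes as the server would put them on the wire (assumed UTF-8).
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // A client that decodes header bytes as ISO-8859-1 yields mojibake,
        // but every byte maps to exactly one char, so nothing is lost yet.
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);

        // Round-tripping through ISO-8859-1 recovers the original bytes.
        String repaired = reinterpretAsUtf8(misdecoded);
        System.out.println(repaired.equals(original));
    }
}
```

The repair only works because ISO-8859-1 decoding is lossless; it says nothing about headers that were decoded with any other charset.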

For request headers, the value set by the user is now retained directly. When making the request, Java will encode the header as UTF-8. Servers will probably expect ISO-8859-1, so this may or may not work. Per spec, the content should either be limited to ISO-8859-1 characters or encoded with RFC 2047. We don't attempt to do that automatically (and some servers will be OK). A bit of a grey area here. Happy for other suggestions.
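For callers who want to stay within the spec themselves, the following is a rough sketch of an RFC 2047 "B" (Base64) encoded-word applied only when a value does not fit in ISO-8859-1. `HeaderValueEncoder` and `encodeIfNeeded` are hypothetical names, not part of jsoup. The sketch ignores the 75-character limit on encoded-words, and whether a given server decodes such values is not guaranteed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class HeaderValueEncoder {
    // Hypothetical helper: leaves ISO-8859-1-safe values untouched and wraps
    // anything else in a single RFC 2047 encoded-word (=?charset?B?data?=).
    static String encodeIfNeeded(String value) {
        if (StandardCharsets.ISO_8859_1.newEncoder().canEncode(value)) {
            return value;
        }
        String b64 = Base64.getEncoder()
                .encodeToString(value.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + b64 + "?=";
    }

    public static void main(String[] args) {
        System.out.println(encodeIfNeeded("plain-value"));
        System.out.println(encodeIfNeeded("\u6211\u7684"));
    }
}
```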

jhy avatar Oct 20 '23 00:10 jhy

image This header comes from the server's response. Is there a way to fix it?

821938089 avatar Oct 20 '23 03:10 821938089

Can you give me a sample URL or code so that I can actually review the server's response properly?

jhy avatar Oct 20 '23 06:10 jhy

https://www.zhenshezw.com/ image Enter "我的" and click on the search icon.

821938089 avatar Oct 20 '23 06:10 821938089

Hi, if you can't reproduce the issue could you add a configuration option to skip the fixHeaderEncoding?

821938089 avatar Nov 01 '23 08:11 821938089

I get caught in bot detections when I try this. Can you provide sample code so that I can try and repro?

I won't add a configuration option unless I can validate it. You could always fork the code yourself, of course.

jhy avatar Nov 10 '23 01:11 jhy

package org.example;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Main {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<>();
    // Copy these header values from the browser devtools
    headers.put("User-Agent", "");
    headers.put("Cookie", "");

    try {
      Connection.Response response = Jsoup.connect("https://www.zhenshezw.com/gut.php")
              .followRedirects(false)
              .requestBody("search=%E6%88%91%E7%9A%84")
              .headers(headers)
              .method(Connection.Method.POST)
              .execute();

      System.out.println(response.header("Location"));
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

821938089 avatar Nov 10 '23 03:11 821938089

I found that this issue is platform-related: on desktop Java it works fine, but on Android it fails. After some research, I found that Java and Android use different HttpURLConnection implementations, which handle headers differently. Android's HttpURLConnection implementation already decodes the headers correctly, so the problem arises when jsoup tries to fix the headers a second time.

Java: image

Android: image
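The double decoding can be reproduced without a network call. The sketch below is one possible guard, not jsoup's actual fix: a string produced by ISO-8859-1 decoding can only contain chars up to U+00FF, so any higher char means the header was already decoded with another charset and should be left alone. All names here are hypothetical, and the strict UTF-8 decoder is a stand-in for `looksLikeUtf8`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DoubleDecodeDemo {
    // The bare round trip: chars outside ISO-8859-1 become '?' bytes here.
    static String unguardedFix(String val) {
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // ISO-8859-1 decoding never produces a char above U+00FF.
    static boolean couldBeLatin1Decoded(String val) {
        for (int i = 0; i < val.length(); i++) {
            if (val.charAt(i) > 0xFF) return false;
        }
        return true;
    }

    // Hypothetical guarded variant: skip values that were already decoded,
    // and keep the value if its bytes are not valid UTF-8.
    static String guardedFix(String val) {
        if (!couldBeLatin1Decoded(val)) return val;
        byte[] bytes = val.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return val;
        }
    }

    public static void main(String[] args) {
        // A header the platform has already decoded correctly (Android case).
        String alreadyDecoded = "search.php?search=\u6211\u7684";
        System.out.println(unguardedFix(alreadyDecoded)); // characters are lost
        System.out.println(guardedFix(alreadyDecoded).equals(alreadyDecoded));
    }
}
```

The guard cannot tell apart every case (a value made only of chars up to U+00FF is still ambiguous), so it narrows the problem rather than settling it.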

821938089 avatar Nov 10 '23 03:11 821938089

Thanks, that's good sleuthing! Need to think of a good way to detect and handle this situation...

jhy avatar Nov 16 '23 00:11 jhy