crawl-anywhere
Wrong character encoding
If an HTML page is returned with character encoding ISO-8859-1 and the pipeline runs with system encoding UTF-8, DocTextExtractor produces invalid characters in doExtract.
In line 346:
if (input == null && rawData != null) input = new ByteArrayInputStream(rawData.getBytes());
rawData.getBytes() returns the byte representation of the data string in the system encoding (UTF-8), and TikaWrapper then seems to process those bytes as ISO-8859-1 when streaming them to the filesystem. ISO-8859-1 is the original encoding of the content returned by the web server. In that case, TikaWrapper should use the system encoding (UTF-8) to handle the bytes.
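The suspected mismatch can be reproduced without Tika. The following is only a minimal sketch, not the actual DocTextExtractor or TikaWrapper code: the class name EncodingMismatchDemo, the helper roundTrip and the sample string are hypothetical and chosen for illustration. It encodes a string with one charset and decodes the bytes with another, which is what appears to happen between rawData.getBytes() and TikaWrapper.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of crawl-anywhere.
class EncodingMismatchDemo {

    // Encode text with one charset, then decode the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Sample content with a non-ASCII character ("café"); assumed for illustration.
        String rawData = "caf\u00e9";

        // Bytes written as UTF-8 but read back as ISO-8859-1: the two-byte
        // UTF-8 sequence for 'é' is decoded as two separate characters.
        String garbled = roundTrip(rawData, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // Same charset on both sides: the text survives unchanged.
        String intact = roundTrip(rawData, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("mismatched charsets: " + garbled.length() + " chars");
        System.out.println("matching charsets:   " + intact.length() + " chars");
        System.out.println("intact equals input: " + intact.equals(rawData));
    }
}
```

If this is indeed the cause, passing an explicit charset to getBytes (for example rawData.getBytes(StandardCharsets.UTF_8)) and having the consumer decode with the same charset would avoid depending on the platform default.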
Please provide a sample URL