
Wrong character encoding

Open torhar opened this issue 10 years ago • 1 comments

If an HTML page is returned with character encoding ISO-8859-1 and the pipeline runs with system encoding UTF-8, DocTextExtractor produces invalid characters in doExtract.

In line

346 if (input==null && rawData!=null) input = new ByteArrayInputStream(rawData.getBytes());

rawData.getBytes() returns the bytes of the data string in the system encoding (UTF-8).

After that, TikaWrapper seems to process those bytes as ISO-8859-1 when streaming them to the filesystem. ISO-8859-1 is the original encoding of the content returned by the web server, but the bytes are now UTF-8, so in this case TikaWrapper should use the system encoding (UTF-8) to handle them.
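To illustrate the mismatch I mean, here is a minimal standalone sketch. It is not crawl-anywhere code: the class EncodingMismatchDemo and the helpers toStream and decode are hypothetical names, and it only assumes that the string is encoded with one charset and later decoded with another, which is what seems to happen between doExtract and TikaWrapper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of crawl-anywhere.
public class EncodingMismatchDemo {

    // Same idea as line 346, but with the charset passed explicitly
    // instead of relying on the platform default used by getBytes().
    static InputStream toStream(String rawData, Charset charset) {
        return new ByteArrayInputStream(rawData.getBytes(charset));
    }

    // Stand-in for the consumer of the stream: reads all bytes and
    // decodes them with the charset the consumer believes is correct.
    static String decode(InputStream input, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            int b;
            while ((b = input.read()) != -1) {
                buffer.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String rawData = "K\u00e4se"; // "Kaese" written with a-umlaut

        // Encoded as UTF-8 but decoded as ISO-8859-1: the two UTF-8 bytes
        // of the umlaut are read as two separate Latin-1 characters.
        String broken = decode(toStream(rawData, StandardCharsets.UTF_8),
                               StandardCharsets.ISO_8859_1);

        // Encoded and decoded with the same charset: the text round-trips.
        String intact = decode(toStream(rawData, StandardCharsets.UTF_8),
                               StandardCharsets.UTF_8);

        System.out.println(broken.equals(rawData)); // false
        System.out.println(intact.equals(rawData)); // true
    }
}
```

Whichever charset is chosen, the point is that the side creating the bytes and the side reading them have to agree on it.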

torhar avatar Jul 18 '14 11:07 torhar

Please provide a sample URL

bejean avatar Sep 22 '14 14:09 bejean