
html2story UTF-8 issue

Open petewarden opened this issue 14 years ago • 4 comments

From email:

I just tried out the html2story API and sent in a UTF-8 encoded HTML string. The results were great, except that all the accented characters (e.g. [áéíóöőúüű], all the Hungarian vowels) were sent back as "??".

petewarden • Mar 25 '11 22:03

Here's a page that reproduces the problem:

http://nol.hu/belfold/20110326-kontur_pal__a_telt_haz

petewarden • Mar 26 '11 19:03

[cc-ed from email to reporter]

I wasn't able to reproduce it in the first test I tried, so I must be doing different steps. I wondered if I could get some more details from you? Here's what I'm trying:

Running OS X 10.6.6, in Terminal.app:

curl "http://nol.hu/belfold/20110326-kontur_pal__a_telt_haz" > tests/data/hungarian.html
html2story tests/data/hungarian.html

I see results like:

tasika | 2011. március 26. | 19:57:52 KOORMI001. MILYEN LÓRÓL BESZÉLSZ ? ÉN BÍZOK BENNE , HOGY LÓ ÉS SZAMÁR KEVERÉK ! ...

Which operating system and steps are you using?

petewarden • Apr 04 '11 22:04

I've found what the difference was. I was running a local server on my OS X machine, but when I use the main http://www.datasciencetoolkit.org server, I see the ??'s.

petewarden • Apr 04 '11 22:04

It looks like the problem was related to the default file encoding assumed by Java. I added a switch to the command line that runs boilerpipe so that it assumes UTF-8, and it now seems to work.
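For anyone hitting the same symptom, here is a minimal Java sketch of how a wrong default encoding can turn each accented character into "??". It is not the code from this project: the class name `EncodingDemo` and the helper `roundTrip` are hypothetical, and US-ASCII is only an assumed stand-in for whatever default the server's JVM actually used. The thread does not say which switch was added; the standard JVM property `-Dfile.encoding=UTF-8` is one common way to change the default, mentioned here as an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: decode UTF-8 input bytes with an assumed charset,
    // then re-encode and re-read with that same charset, roughly what a tool
    // relying on a single platform default for reading and writing would do.
    static String roundTrip(byte[] utf8Bytes, Charset assumed) {
        String decoded = new String(utf8Bytes, assumed);
        byte[] written = decoded.getBytes(assumed);
        return new String(written, assumed);
    }

    public static void main(String[] args) {
        // "\u00e1" is the letter a with acute accent; written as an escape so
        // the source file itself does not depend on the compiler's encoding.
        byte[] input = "\u00e1".getBytes(StandardCharsets.UTF_8); // two bytes: C3 A1

        // What this JVM would use if no charset were specified.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Assumed bad default: each of the two bytes is invalid ASCII, so each
        // becomes a replacement character, which is then written out as '?'.
        System.out.println(roundTrip(input, StandardCharsets.US_ASCII)); // ??

        // With UTF-8 stated explicitly the character survives unchanged.
        System.out.println(roundTrip(input, StandardCharsets.UTF_8).equals("\u00e1")); // true
    }
}
```

Under these assumptions, a two-byte UTF-8 character comes back as two question marks, which matches the "??" in the original report. Passing the charset explicitly, as in the last call, avoids depending on the machine's default at all.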

For version 0.40

petewarden • Apr 04 '11 23:04