
UTF-8 not correctly encoded

Open conradlee opened this issue 14 years ago • 3 comments

Is this scraper set up to properly encode UTF-8? Somewhere in my use of the scraper, the character encoding is getting messed up. Here's a simple example.

Let's assume we want to scrape the id 105058448954104555632

If you go to www.google.com/profiles/105058448954104555632, you will see that the name can't be represented in ASCII. If I use your scraper to get the data from this page, it returns

105058448954104555632 {"name":"Cirlésio Cunha","location":"Blumenau","mentions":["105058448954104555632","105058448954104555632"]}

Note that the name is garbled.

conradlee avatar Mar 26 '10 13:03 conradlee

Hmm, well, when it's displayed in this issue form, it is displayed correctly. However, when I open the output file in a text editor that is set to use UTF-8 as its default encoding, the name appears as "Cirl & eacute ; sio Cunha" [I inserted the spaces so GitHub doesn't correct it again].

conradlee avatar Mar 26 '10 13:03 conradlee

You're right, it's actually pulling the raw HTML from the page, so you end up with entity encoding for non-ASCII characters. As a temporary workaround I actually decode these using html_entity_decode() when I'm doing further processing on the output data, but I need to put that into the crawler itself. I'll add that into the script and check it in once I've tested it.

Incidentally, my very first bug, I'm stoked! :) Thanks for reporting, it's good to know people are actually using it.

petewarden avatar Mar 27 '10 23:03 petewarden

Ok, thanks for that answer. Knowing that, I know how to correctly decode the text in Python. Maybe this isn't a bug after all, more like a missing feature (UTF-8 encoding).
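
For anyone else hitting this, here is a minimal sketch of that decoding step in Python 3. It assumes each output line is the profile id, some whitespace, and then a JSON object, as in the example above. The helper names `unescape_values` and `decode_profile_line` are made up for illustration and are not part of the crawler.

```python
import html
import json


def unescape_values(value):
    """Recursively replace HTML entities (e.g. &eacute;) in every string value."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_values(item) for item in value]
    if isinstance(value, dict):
        return {key: unescape_values(item) for key, item in value.items()}
    return value


def decode_profile_line(line):
    """Split one crawler output line into (profile_id, decoded_record).

    Assumes the format 'id<whitespace>{json}'; this is inferred from the
    sample output in this issue, not from the crawler's source.
    """
    profile_id, payload = line.split(None, 1)
    # Parse the JSON first, then unescape, so an entity such as &quot;
    # cannot break the JSON syntax before it is parsed.
    record = json.loads(payload)
    return profile_id, unescape_values(record)


line = (
    '105058448954104555632 {"name":"Cirl&eacute;sio Cunha",'
    '"location":"Blumenau",'
    '"mentions":["105058448954104555632","105058448954104555632"]}'
)
profile_id, record = decode_profile_line(line)
print(record["name"])
```

To keep the decoded name intact when writing the result back out, open the output file with `encoding="utf-8"` and pass `ensure_ascii=False` to `json.dumps`.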

conradlee avatar Mar 30 '10 14:03 conradlee