
BeautifulStoneSoup in select_form messes up utf-8

Open cvogt opened this issue 12 years ago • 1 comment

In UTF-8, the character Ü is encoded as two bytes (0xC3 0x9C). The second byte, 0x9C, appears as a key in mechanize._beautifulsoup.BeautifulStoneSoup.MS_CHARS.

In Browser.open, a subclass of BeautifulStoneSoup called MechanizeBs is used. It overrides BeautifulStoneSoup.PARSER_MASSAGE so that MS_CHARS is ignored.

In Browser.select_form, however, mechanize._form.RobustFormParser is used. It uses BeautifulStoneSoup directly, so the MS_CHARS replacements are applied. One of the two UTF-8 bytes of Ü is replaced, which destroys the character. As a consequence, controls whose labels contain Ü can no longer be found by label. For example, browser.click( label='Übernehmen' ) fails with ControlNotFoundError: no control matching kind 'clickable', label 'Übernehmen'.
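To illustrate the mechanism, here is a minimal, self-contained sketch in plain Python 3. It does not import mechanize: MS_CHARS_SKETCH and massage are hypothetical stand-ins for the real table and the real massage step, and the single table entry is an assumption chosen for illustration only.

```python
import re

# Hypothetical, reduced stand-in for the MS_CHARS table: it maps single
# bytes in the 0x80-0x9f range to replacement entities. The one entry
# below is an assumption made for this sketch.
MS_CHARS_SKETCH = {b"\x9c": b"&oelig;"}


def massage(raw: bytes) -> bytes:
    """Replace bytes in 0x80-0x9f that are table keys, ignoring the encoding."""
    return re.sub(
        rb"[\x80-\x9f]",
        lambda m: MS_CHARS_SKETCH.get(m.group(0), m.group(0)),
        raw,
    )


# "Ü" is U+00DC, which UTF-8 encodes as the two bytes 0xC3 0x9C.
label = "Übernehmen".encode("utf-8")

# The byte-wise replacement hits the 0x9C continuation byte.
mangled = massage(label)

# Check whether the result is still valid UTF-8.
try:
    mangled.decode("utf-8")
    still_decodes = True
except UnicodeDecodeError:
    still_decodes = False
```

Once the 0x9C byte has been replaced, the remaining bytes are no longer valid UTF-8, so the parsed label cannot compare equal to 'Übernehmen'.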

I am currently working around this with a monkey patch:

import mechanize
mechanize._form.RobustFormParser.PARSER_MASSAGE = mechanize._html.MechanizeBs.PARSER_MASSAGE

A real fix would be appreciated :). Thx!

cvogt, Jan 16 '12 14:01

I have been trying to figure out why non-ASCII UTF-8 characters are being mangled when using BeautifulSoup and just came to the same conclusion as cvogt. I was especially perplexed because I'm using Twill, which wraps Mechanize, and it used to work correctly. The copy of Mechanize included with the last release of Twill also contains some monkeypatching to avoid the MS_CHARS mangling.

However, I have modified Twill, removing most of the bundled libraries such as Mechanize and making it work with the current versions of those libraries. My fork of Twill mangles the UTF-8 sequences, and I now know the blame lies with the current version of Mechanize. I haven't yet determined whether the copy of Mechanize bundled with Twill was modified by Twill to deal with this issue, or whether I have discovered a regression in Mechanize proper.

JonathanRRogers, Mar 13 '12 05:03