Cannot process caches with unescaped `&#` in the cache name
When I try to process GC25WQJ, name "How Do I Solve All These &#$@! Puzzle Caches?", I get an error:
    self.name = cache_details.find(id="ctl00_ContentBody_CacheName").text
                ^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find'
Is it because the cache name has many punctuation characters in it? These GC codes also fail: GC8AKHK, GCA9PAE, GC6PJNF, GC1FJJT (archived)
This simple program shows the error:
import pycaching
geocaching = pycaching.login()
cache = geocaching.get_cache("GC25WQJ")
print(cache.name)
geocaching.logout()
It is difficult to search for additional caches for testing because the geocaching.com search filter "Geocache name contains" seems to really mean "Geocache name starts with".
The easiest solution I found was to use the lxml parser instead of html.parser.
The working version can be found in my fork :)
I'm not going to open a PR yet, as the parser change is quite groundbreaking and I'd like to hear the maintainer's opinion :)
> I'm not going to open a PR yet, as the parser change is quite groundbreaking and I'd like to hear the maintainer's opinion :)
Do you have some more details how much this actually affects pycaching?
Apart from this, while using the `lxml` backend might be a solution, I would argue that this is a Groundspeak bug due to insufficient sanitization/escaping of user input: `&#` should normally be followed by an integer and end with a semicolon, and Firefox complains about this as well.
<h1 class="visually-hidden">How Do I Solve All These &#$@! Puzzle Caches? Rätsel-Geocaches</h1>
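To illustrate the point about escaping, here is a minimal sketch of what sanitizing such input could look like. It is not the fix used in the fork (which switches the parser to lxml), and `escape_stray_charrefs` is a hypothetical helper name, not part of pycaching. It assumes a numeric character reference is only valid when `&#` is followed by decimal digits, or `x` plus hex digits, and then a semicolon.

```python
import html
import re

# Matches "&#" that is NOT followed by a well-formed numeric reference body,
# i.e. decimal digits or "x" + hex digits, terminated by ";".
_STRAY_CHARREF = re.compile(r"&#(?!(?:[0-9]+|[xX][0-9a-fA-F]+);)")


def escape_stray_charrefs(markup: str) -> str:
    """Hypothetical helper: escape the ampersand of malformed '&#' sequences."""
    return _STRAY_CHARREF.sub("&amp;#", markup)


# The cache name from this issue, plus one valid reference (&#228; is "ä").
raw = "How Do I Solve All These &#$@! Puzzle Caches? R&#228;tsel"
fixed = escape_stray_charrefs(raw)

print(fixed)                 # stray "&#" becomes "&amp;#", "&#228;" is untouched
print(html.unescape(fixed))  # decodes back to the intended visible text
```

Valid references such as `&#228;` or `&#xE4;` pass through unchanged, so only the malformed sequence is rewritten.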
> Do you have some more details how much this actually affects pycaching?
The tests in CI passed, so I assume the impact of the change is minimal or none. I've also tested it manually and everything seems to be working fine. The biggest change is a new dependency (lxml parser).
> Apart from this, while using the `lxml` backend might be a solution, I would argue that this is a Groundspeak bug due to insufficient sanitization/escaping of user input: `&#` should usually prefix some integer and end with a semicolon, which Firefox complains about as well.
> <h1 class="visually-hidden">How Do I Solve All These &#$@! Puzzle Caches? Rätsel-Geocaches</h1>
Yup, this is definitely a Groundspeak bug, but I don't think they would fix it just because of some library.
Here's a similar problem, but in the geocache description instead of the name: GCR0EF. The cache is archived, so it's not really much of an issue. @BelKed, does your lxml change allow this cache to be processed?
Yeah, the cache is processed without any errors :) I've added it to the tests (https://github.com/BelKed/pycaching/commit/09ed15763838f3fc4319b1281021ebce46894c29).