
Cannot process caches with unescaped `&#` in the cache name

Open GeoTime61 opened this issue 1 year ago • 5 comments

When I try to process GC25WQJ, name "How Do I Solve All These &#$@! Puzzle Caches?", I get an error:

self.name = cache_details.find(id="ctl00_ContentBody_CacheName").text
                ^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find'

Is it because the cache name has many punctuation characters in it? These GC codes also fail: GC8AKHK, GCA9PAE, GC6PJNF, GC1FJJT (archived)

This simple program shows the error:

import pycaching
geocaching = pycaching.login()
cache = geocaching.get_cache("GC25WQJ")
print(cache.name)
geocaching.logout()

It is difficult to search for additional caches for testing because the geocaching.com search filter "Geocache name contains" seems to really mean "Geocache name starts with".
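Since finding more affected caches is hard, an offline probe that needs no login may help with testing. The following is only a sketch: the `IdCollector` class, the `seen_ids` helper and the shortened `page` string are made up for illustration, and `convert_charrefs=False` is an assumption about how BeautifulSoup drives `html.parser`, not something confirmed in this thread. Whether the `ctl00_ContentBody_CacheName` id is still reported after the stray `&#` may depend on the Python version, so no expected output is given.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Record the id attribute of every start tag the parser reports."""

    def __init__(self):
        # Assumption: convert_charrefs=False mirrors how BeautifulSoup
        # configures html.parser; this is not taken from the thread.
        super().__init__(convert_charrefs=False)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key == "id":
                self.ids.append(value)


def seen_ids(markup):
    """Return the ids of all start tags found while parsing markup."""
    collector = IdCollector()
    collector.feed(markup)
    collector.close()
    return collector.ids


# Hypothetical, heavily shortened stand-in for the real cache page.
page = (
    '<h1 class="visually-hidden">How Do I Solve All These &#$@! Puzzle Caches?</h1>'
    '<span id="ctl00_ContentBody_CacheName">'
    "How Do I Solve All These &#$@! Puzzle Caches?</span>"
)

# If the list printed here lacks the span's id, the parser lost the element.
print(seen_ids(page))
```

Feeding the same helper a page whose title contains only well-formed references (for example `&#228;`) gives a baseline to compare against.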

GeoTime61, Feb 12 '24 22:02

The easiest solution I found was to use the lxml parser instead of html.parser. The working version can be found in my fork :)


I'm not going to open a PR yet, as the parser change is quite groundbreaking and I'd like to hear the maintainer's opinion :)

BelKed, Feb 12 '24 23:02

> I'm not going to open a PR yet, as the parser change is quite groundbreaking and I'd like to hear the maintainer's opinion :)

Do you have some more details how much this actually affects pycaching?

Apart from this, while using the lxml backend might be a solution, I would argue that this is a Groundspeak bug due to insufficient sanitization/escaping of user input: &# should usually prefix some integer and end with a semicolon, which Firefox complains about as well.

<h1 class="visually-hidden">How Do I Solve All These &#$@! Puzzle Caches? Rätsel-Geocaches</h1>
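To make the "integer plus semicolon" rule concrete, here is a minimal sketch of a pre-processing step that escapes the ampersand of any `&#` that does not start a well-formed numeric character reference. This is not the fix discussed in this thread (that is the parser swap) and not part of pycaching; the function name `escape_malformed_charrefs` and the strict regular expression are assumptions made for illustration only.

```python
import re

# Well-formed numeric character references look like &#228; or &#xE4;.
# Every other "&#" is treated as malformed in this deliberately strict sketch.
_MALFORMED_CHARREF = re.compile(r"&#(?!(?:[0-9]+|[xX][0-9a-fA-F]+);)")


def escape_malformed_charrefs(markup: str) -> str:
    """Replace the '&' of each malformed '&#' sequence with '&amp;'."""
    return _MALFORMED_CHARREF.sub("&amp;#", markup)


snippet = (
    '<h1 class="visually-hidden">'
    "How Do I Solve All These &#$@! Puzzle Caches? R&#228;tsel-Geocaches</h1>"
)
fixed = escape_malformed_charrefs(snippet)
print(fixed)
```

In this snippet the stray `&#$@!` becomes `&amp;#$@!`, while the valid `&#228;` is left alone, so a parser would see only well-formed references.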

FriedrichFroebel, Feb 13 '24 16:02

> Do you have some more details how much this actually affects pycaching?

The tests in CI passed, so I assume the impact of the change is minimal or none. I've also tested it manually and everything seems to be working fine. The biggest change is a new dependency (lxml parser).

> Apart from this, while using the lxml backend might be a solution, I would argue that this is a Groundspeak bug due to insufficient sanitization/escaping of user input: &# should usually prefix some integer and end with a semicolon, which Firefox complains about as well.
>
> <h1 class="visually-hidden">How Do I Solve All These &#$@! Puzzle Caches? Rätsel-Geocaches</h1>

Yup, this is definitely a Groundspeak bug, but I don't think they would fix it just because of some library.

BelKed, Feb 14 '24 04:02

Here's a similar problem, but in the geocache description instead of the name: GCR0EF. The cache is archived, so it's not really much of an issue. @BelKed - does your lxml change allow this cache to be processed?

GeoTime61, Feb 14 '24 16:02

Yeah, the cache is processed without any errors :) I've added it to the tests (https://github.com/BelKed/pycaching/commit/09ed15763838f3fc4319b1281021ebce46894c29).

BelKed, Feb 15 '24 16:02