python-xextract

Allow Elements to be passed to parse_*()

Open • levic opened this issue 1 year ago • 1 comment

Addresses #10

@Mimino666 There's no documentation here yet; I wasn't going to add it until you're happy with what I've done.

Handling parse() turned out to be an unexpected quirk: if all we have is an Element, there doesn't seem to be a way to tell whether the document was parsed as HTML or as XML, so we don't know whether to use an XML or an HTML extractor.

We could guess based on whether the Element has a namespace, but XML snippets can be parsed without one, so the guess could still lead to unexpected results. Guessing also has the side effect of casting the Element back to a string as part of the XML header snooping, which is what we were trying to avoid in the first place (although a check for this could be added).
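To illustrate why the namespace guess is unreliable, here is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than this project's code. The helper `guess_is_xml` is hypothetical and only relies on the fact that a namespaced element's tag is stored as `{uri}name`.

```python
import xml.etree.ElementTree as ET


def guess_is_xml(element):
    """Hypothetical heuristic: treat a namespaced Element as XML."""
    # ElementTree stores a namespaced tag as "{namespace-uri}localname".
    return isinstance(element.tag, str) and element.tag.startswith("{")


# Both inputs are XML, but only the first one carries a namespace.
namespaced = ET.fromstring('<feed xmlns="http://example.com/ns"><entry/></feed>')
plain = ET.fromstring("<feed><entry/></feed>")

print(guess_is_xml(namespaced))
print(guess_is_xml(plain))
```

The second call reports "not XML" for a document that is XML, which is the kind of unexpected result described above.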

I've opted to force the caller to be explicit: parse() does not accept an Element, so if you want to pass one you must call parse_html() or parse_xml() instead.
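A rough sketch of what "explicit" could look like, not the code in this PR. `SketchExtractor`, the `TypeError`, and the `<?xml` sniffing rule are all assumptions made for illustration, and `xml.etree.ElementTree` stands in for the real parser (so the HTML input must be well-formed here).

```python
import xml.etree.ElementTree as ET


class SketchExtractor:
    """Hypothetical stand-in for an extractor; not xextract's real class."""

    def parse(self, source):
        # Auto-detection needs the raw text, so an Element is rejected here.
        if isinstance(source, ET.Element):
            raise TypeError(
                "parse() cannot tell HTML from XML for an Element; "
                "call parse_html() or parse_xml() instead"
            )
        # Assumed sniffing rule: a leading XML declaration means XML.
        if source.lstrip().startswith("<?xml"):
            return self.parse_xml(source)
        return self.parse_html(source)

    def parse_html(self, source):
        return ("html", self._to_element(source))

    def parse_xml(self, source):
        return ("xml", self._to_element(source))

    @staticmethod
    def _to_element(source):
        # An Element is used as-is, with no round trip through a string.
        if isinstance(source, ET.Element):
            return source
        return ET.fromstring(source)
```

With this shape, `SketchExtractor().parse_xml(element)` reuses the Element it is given, while `SketchExtractor().parse(element)` fails immediately instead of guessing.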

levic • Feb 06 '23 16:02