Example data processing warning using Google Colab

Open amscosta opened this issue 1 year ago • 4 comments

Hello,

The following warning is issued when processing one of the .xml files from the example data:

Processing: paperetl/file/data/0.xml
/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py:35: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
  soup = BeautifulSoup(stream, "lxml")

Any clue how to avoid/correct that? Thanks a lot.

amscosta avatar Feb 26 '24 14:02 amscosta

I am using the Colab notebook.

amscosta avatar Feb 26 '24 18:02 amscosta

You can ignore it like this:

import warnings
from bs4 import XMLParsedAsHTMLWarning

# Silence only this warning category; other warnings are still shown
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
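
If you would rather not change the warning filters for the whole notebook session, the same idea can be scoped to a single block with the standard library's `warnings.catch_warnings`. This is only a rough sketch: `ignore_warning` is a hypothetical helper name, not something paperetl or bs4 provides, and the commented usage assumes bs4 is installed.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def ignore_warning(category):
    """Ignore warnings of `category` only inside the `with` block.

    Hypothetical helper for illustration; not part of paperetl or bs4.
    """
    # catch_warnings saves the current filters and restores them on exit
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=category)
        yield


# Intended use in the notebook (assumes bs4 is installed):
#   from bs4 import XMLParsedAsHTMLWarning
#   with ignore_warning(XMLParsedAsHTMLWarning):
#       ...  # run the paperetl processing step here
```

Outside the `with` block the previous filters apply again, so the warning would still be visible elsewhere if it matters there.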

davidmezzetti avatar Feb 28 '24 14:02 davidmezzetti

Thanks, but the message says "using an XML parser will be more reliable".

amscosta2022 avatar Mar 01 '24 14:03 amscosta2022

Feel free to fork this project and try the XML parser. It doesn't work in the tests I've run.

davidmezzetti avatar Mar 02 '24 15:03 davidmezzetti

Closing this issue as the question has been addressed.

davidmezzetti avatar Dec 28 '24 17:12 davidmezzetti