XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document

Open mrahmadt opened this issue 10 months ago • 4 comments

Hello

Not sure why I'm getting the warning below. Could someone help me, please?

# python3 test.py 
/root/secedgar/lib/python3.12/site-packages/secedgar/client.py:218: XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document.

Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

If you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:

    from bs4 import XMLParsedAsHTMLWarning
    import warnings

    warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

  return BeautifulSoup(self.get_response(path, params, **kwargs).text,

# cat /etc/issue
Ubuntu 24.04.2 LTS \n \l
# python3 -V
Python 3.12.3
# cat test.py 
from secedgar import CompanyFilings, FilingType

my_filings = CompanyFilings(cik_lookup=['aapl'],
                            filing_type=FilingType.FILING_4,
                            user_agent='Name (email@gmail)')

my_filings.save('/root/tempdir')

I tried both pip install git+https://github.com/sec-edgar/sec-edgar.git and pip install sec-edgar, and got the same warning each time.

mrahmadt avatar May 19 '25 20:05 mrahmadt

I have the same issue.

eshwar-parthiban avatar May 23 '25 13:05 eshwar-parthiban

Me too.

KarryRen avatar May 28 '25 09:05 KarryRen

To parse this document as XML, make sure you have the Python package 'lxml' installed,

Did you pip install lxml?
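
One way to check this, as a minimal sketch: ask the interpreter that runs test.py whether lxml is importable. The helper name has_module is made up for illustration; it only wraps the standard library's importlib.util.find_spec. Note that this only confirms lxml is present in that environment. It does not tell you which parser secedgar passes to BeautifulSoup.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found (hypothetical helper)."""
    # find_spec returns None when a top-level module is not installed
    return importlib.util.find_spec(name) is not None


if has_module("lxml"):
    print("lxml is available in this environment")
else:
    print("lxml is missing; try: pip install lxml")
```

Run it with the same python3 (and the same virtualenv) that you use for test.py, since a package installed for a different interpreter will not be found.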

jackmoody11 avatar May 30 '25 22:05 jackmoody11

You will probably need to parse all the downloaded files as XML yourself, as I had to. The library has probably used an HTML parser, which leaves your downloaded files unintelligible. Try this (it needs lxml installed):

import os
from bs4 import BeautifulSoup

# Directory where the filings were saved (placeholder path)
save_directory = '/your/directory/'

for filename in os.listdir(save_directory):
    if filename.endswith('.txt'):
        file_path = os.path.join(save_directory, filename)

        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                file_content = f.read()

            # features="xml" selects the lxml-based XML parser
            soup = BeautifulSoup(file_content, features="xml")
            print(f"Successfully parsed {filename} as XML.")

        except Exception as e:
            print(f"Failed to parse {filename} as XML. Error: {e}")

Joshwaa90 avatar Jul 23 '25 21:07 Joshwaa90
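
If the goal is only to silence the warning, the filter quoted in the warning text can also be limited to a single call instead of the whole process. The following is a rough sketch, not something confirmed against secedgar: suppress_category is a hypothetical helper built on the standard warnings module, and the commented usage assumes bs4 and the my_filings object from test.py.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def suppress_category(category):
    """Ignore warnings of `category` only inside the with-block (hypothetical helper)."""
    with warnings.catch_warnings():
        # catch_warnings() restores the previous filter list when the block exits
        warnings.filterwarnings("ignore", category=category)
        yield


# Assumed usage with the script from the original post:
#
#   from bs4 import XMLParsedAsHTMLWarning
#   with suppress_category(XMLParsedAsHTMLWarning):
#       my_filings.save('/root/tempdir')
```

This only hides the message. The document is still parsed with whatever parser the library chose, so it does not address the concern about parsing quality raised above.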