
Develop tooling for caching and accessing SEC 10k filings during experimental work

Open zschira opened this issue 11 months ago • 0 comments

Background

We have all of the SEC filings available in GCS, along with a metadata DB. To aid exploratory extraction from SEC filings, we need tooling to work with these documents. Getting generic company data out of filings is fairly straightforward because they follow a standard structure, but Exhibit 21s (which contain info on the subsidiaries of each company) are much less standardized and will require more complex models to extract this data.

Scope

  • [x] Develop tools to cache filings locally, so that test/training sets can be built with low-latency access
  • [x] Add the ability to create images of Exhibit 21s from filings, which can be used in extraction models
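
As a rough illustration of the first scope item, here is a minimal sketch of a read-through local cache. It is not the tooling that was actually built for this issue: the `FilingCache` class, its `fetch` callable, and the filename scheme are all hypothetical, and the remote download (e.g. from GCS) is left to whatever function the caller passes in.

```python
from pathlib import Path
from typing import Callable


class FilingCache:
    """Hypothetical read-through cache for filings, keyed by filing name."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]):
        # `fetch` is an assumed callable that downloads one filing's bytes
        # from remote storage; it is only called on a cache miss.
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self._fetch = fetch

    def _local_path(self, filing_name: str) -> Path:
        # Simplification: flatten path separators so each filing maps to a
        # single file directly inside cache_dir.
        return self.cache_dir / filing_name.replace("/", "_")

    def get(self, filing_name: str) -> bytes:
        path = self._local_path(filing_name)
        if path.exists():
            # Cache hit: read from local disk, no remote call.
            return path.read_bytes()
        # Cache miss: download once, store locally, then return.
        data = self._fetch(filing_name)
        path.write_bytes(data)
        return data
```

Repeated calls to `get` for the same filing hit local disk after the first download, which is what makes iterating over a fixed test/training set cheap.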

zschira, Mar 25 '24 19:03