juriscraper

1651 add clean extracted text method

Open Luis-manzur opened this issue 4 months ago • 1 comment

This pull request introduces a new post-extraction text cleanup mechanism to the codebase. The main addition is a cleanup_extracted_text method, which allows for sanitizing plain text after it has been extracted from source documents. This enables removal of extraction artifacts and unwanted content, improving the quality of processed text. The method is implemented as a no-op in the abstract base class and is overridden with custom logic in a subclass for SCOTUS slip opinions. Additionally, the sample caller is updated to use the cleaned text for further processing.

This PR addresses #1651.

Luis-manzur, Nov 05 '25 20:11
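
The mechanism described in the pull request can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: only the method name `cleanup_extracted_text` comes from the description, while the class names `BaseSite` and `SlipOpinionSite`, the `process` helper, and the "Cite as:" header pattern are assumptions made up for the example.

```python
import re


class BaseSite:
    """Hypothetical stand-in for the abstract base class."""

    def cleanup_extracted_text(self, text: str) -> str:
        # No-op by default: scrapers with no known artifacts
        # return the extracted text unchanged.
        return text


class SlipOpinionSite(BaseSite):
    """Hypothetical stand-in for the SCOTUS slip opinion scraper."""

    # Assumed artifact: a running page header such as
    # "Cite as: 600 U. S. ____ (2023)" repeated in the extracted text.
    header_pattern = re.compile(r"^[ \t]*Cite as:.*$\n?", flags=re.MULTILINE)

    def cleanup_extracted_text(self, text: str) -> str:
        # Drop every line that matches the assumed header pattern.
        return self.header_pattern.sub("", text)


def process(site: BaseSite, raw_text: str) -> str:
    # Hypothetical caller: clean first, then hand the cleaned text
    # to whatever processing comes next.
    return site.cleanup_extracted_text(raw_text)
```

Keeping the base implementation a no-op means existing scrapers behave as before, and only scrapers that override the hook change their output.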

@Luis-manzur this needs a test. Since it's a new feature, it should probably be a new test. Please try to keep it minimal, in the same vein as the extract from text tests.

flooie, Nov 13 '25 20:11
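
A minimal test of the kind requested might look like the sketch below. It is an assumption about shape only: `FakeSite` is a hypothetical stub rather than a real juriscraper scraper, the form-feed artifact is invented for illustration, and the table-of-cases layout is one possible way to keep the test small.

```python
import unittest


class FakeSite:
    """Hypothetical scraper stub; not a real juriscraper class."""

    def cleanup_extracted_text(self, text: str) -> str:
        # Assumed artifact: form-feed characters left by PDF extraction.
        return text.replace("\f", "")


class CleanupExtractedTextTest(unittest.TestCase):
    def test_cleanup_extracted_text(self):
        # Each entry pairs a site with (raw_text, expected_clean_text) examples.
        cases = [
            (
                FakeSite(),
                [
                    ("page one\fpage two", "page onepage two"),
                    ("no artifacts", "no artifacts"),
                ],
            ),
        ]
        for site, examples in cases:
            for raw, expected in examples:
                with self.subTest(raw=raw):
                    self.assertEqual(
                        site.cleanup_extracted_text(raw), expected
                    )
```

A table of input and expected-output pairs keeps each new scraper's cleanup cases to a few lines.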