texag is buggy and creating duplicates
The scraper is picking the HTML "detail" page link instead of the PDF link, so the XPath should be updated. As of now we have a number of HTML pages and duplicates on CL.
This wrong link selection, and the resulting duplication, goes back to opinions from 2019.
`self.expected_content_types = ["text/html"]` should also be changed to expect a PDF content type, i.e. `["application/pdf"]`.
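A minimal sketch of the intended fix, assuming each listing row exposes both an HTML detail link and a PDF link. The sample markup and the helper below are illustrative only, not the actual texag page or scraper code:

```python
import xml.etree.ElementTree as ET

# Hypothetical listing row: the detail link and the PDF link side by side.
SAMPLE_ROW = """
<tr>
  <td><a href="index.php?id=KP-0241">KP-0241</a></td>
  <td><a href="pdf/KP-0241.pdf">PDF</a></td>
</tr>
"""

row = ET.fromstring(SAMPLE_ROW)
# Select only hrefs ending in .pdf, skipping the HTML detail page.
pdf_hrefs = [a.get("href") for a in row.iter("a")
             if a.get("href", "").endswith(".pdf")]
print(pdf_hrefs)  # ['pdf/KP-0241.pdf']
```

In XPath terms, this corresponds to something like `//a[contains(@href, '.pdf')]/@href` instead of matching the first anchor in the row.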
To completely solve this issue, we will need to manually delete all corrupted objects from the DB; on the dev DB these start on March 3rd, 2019, and run up to the date the fix is merged.
Then we will need to run a backscraper over that period. The backscraper will also help cover gaps, which are easy to identify in this source since opinions are numbered incrementally. For example, from a visual scan I can see we are missing GA-0808, GA-0809, GA-0810, and GA-0811 from 2010.
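Since the opinions are numbered incrementally, gap detection can be automated rather than done by visual scan. A hedged sketch (the `find_gaps` helper and sample data are hypothetical, not existing code):

```python
def find_gaps(opinion_ids, prefix="GA"):
    """Return opinion identifiers missing from the observed numeric range."""
    nums = sorted(int(oid.split("-")[1]) for oid in opinion_ids)
    seen = set(nums)
    missing = [n for n in range(nums[0], nums[-1] + 1) if n not in seen]
    return [f"{prefix}-{n:04d}" for n in missing]

# Illustrative sample matching the gap described above.
scraped = ["GA-0806", "GA-0807", "GA-0812", "GA-0813"]
print(find_gaps(scraped))  # ['GA-0808', 'GA-0809', 'GA-0810', 'GA-0811']
```

Running something like this over the scraped IDs per prefix/year would give the backscraper a concrete target list.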