texag is buggy and creating duplicates
The scraper is picking the HTML "detail" page link instead of the PDF link, so the XPath should be updated. As of now we have a number of HTML pages and duplicates on CL.
This wrong link selection, and the resulting duplication, goes back to opinions from 2019.
`self.expected_content_types = ["text/html"]` should also be changed to expect a PDF content type, i.e. `["application/pdf"]`.
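A minimal sketch of the intended fix, assuming each listing row exposes both an HTML detail link and a PDF link. The sample markup and the helper below are illustrative only, not the actual texag page or scraper code:

```python
import xml.etree.ElementTree as ET

# Hypothetical listing row: the detail link and the PDF link side by side.
SAMPLE_ROW = """
<tr>
  <td><a href="index.php?id=KP-0241">KP-0241</a></td>
  <td><a href="pdf/KP-0241.pdf">PDF</a></td>
</tr>
"""

row = ET.fromstring(SAMPLE_ROW)
# Select only hrefs ending in .pdf, skipping the HTML detail page.
pdf_hrefs = [a.get("href") for a in row.iter("a")
             if a.get("href", "").endswith(".pdf")]
print(pdf_hrefs)  # ['pdf/KP-0241.pdf']
```

In XPath terms, this corresponds to something like `//a[contains(@href, '.pdf')]/@href` instead of matching the first anchor in the row.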
To completely solve this issue, we will need to manually delete all corrupted objects from the DB; on the dev DB these start on March 3rd, 2019, and run up to the date the fix is merged.
Then we will need to run a backscraper over that period. The backscraper will also help cover gaps, which are easy to identify in this source since opinions are numbered incrementally. For example, from a visual scan I can see we are missing GA-0808, GA-0809, GA-0810, and GA-0811 from 2010.
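Since the opinions are numbered incrementally, gap detection can be automated rather than done by visual scan. A hedged sketch (the `find_gaps` helper and sample data are hypothetical, not existing code):

```python
def find_gaps(opinion_ids, prefix="GA"):
    """Return opinion identifiers missing from the observed numeric range."""
    nums = sorted(int(oid.split("-")[1]) for oid in opinion_ids)
    seen = set(nums)
    missing = [n for n in range(nums[0], nums[-1] + 1) if n not in seen]
    return [f"{prefix}-{n:04d}" for n in missing]

# Illustrative sample matching the gap described above.
scraped = ["GA-0806", "GA-0807", "GA-0812", "GA-0813"]
print(find_gaps(scraped))  # ['GA-0808', 'GA-0809', 'GA-0810', 'GA-0811']
```

Running something like this over the scraped IDs per prefix/year would give the backscraper a concrete target list.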