New `colo` and `coloctapp` problems with opinion content rendering
We recently changed the colo scraper, and some newly scraped documents are rendering as raw HTML. I think it's because doctor is tagging them as txt files. I am not sure if this can be fixed on the scraper side.
They look like this, and the opinion backup is in the txt folder.
On the other hand, some that have been identified properly as HTML look strange, which may be fixed using Site.cleanup_content.
No PDF option for these new courts? I suppose huh.
It was more difficult to get the opinions as PDFs before; now it seems to take a single request. I haven't checked whether the PDFs have time-related tags or something, though. I have implemented cleanup_content for coloctapp for now.
Now doctor is interpreting the content as "txt", which makes the rendering look bad
@grossir can you push a fix for this? The problem stems from doctor being unable to identify HTML if the HTML tag is not present. We should just wrap the final extracted content inside an HTML tag, and that should fix this issue.
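A minimal sketch of the wrapping idea, assuming the extracted content is available as bytes at the point where the scraper cleans it up. The helper name `wrap_in_html` is hypothetical and not part of juriscraper or doctor; it only illustrates adding the missing tag so that type detection has something to match on.

```python
def wrap_in_html(content: bytes) -> bytes:
    """Wrap an HTML fragment so that it carries an explicit <html> tag.

    Hypothetical helper for illustration. Content that already contains
    an <html tag (in any letter case) is returned unchanged.
    """
    if b"<html" in content.lower():
        return content
    return b"<html><body>" + content + b"</body></html>"
```

Something like this could be called at the end of a scraper's `cleanup_content`, after the opinion body has been extracted. Checking for an existing tag first keeps the wrapper from nesting a full document inside another one.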
The current content type detection, which uses the python-magic library, returns text/plain when an `<html>` tag isn't found. To fix this, we'll add a manual check for HTML tags:
import magic

mime = magic.from_buffer(content, mime=True)
# If the content contains HTML tags, override the detected mime type to text/html
lowered = content.lower()
if b"<html" in lowered or b"<div" in lowered:
    mime = "text/html"
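The same override can be sketched as a small standalone function that is easy to unit test without python-magic installed. This is a variation under stated assumptions, not the code in the PR: `resolve_mime` and `HTML_MARKERS` are hypothetical names, the detected type is passed in rather than computed, and the override is limited to `text/plain` so that binary formats which happen to contain `<div` bytes keep their detected type.

```python
# Byte markers treated as evidence of HTML; taken from the check above.
HTML_MARKERS = (b"<html", b"<div")


def resolve_mime(content: bytes, detected: str) -> str:
    """Return text/html when plain-text content contains HTML markers.

    `detected` is the mime type reported by the detection library.
    Hypothetical helper: only a text/plain result is overridden.
    """
    if detected != "text/plain":
        return detected
    lowered = content.lower()
    if any(marker in lowered for marker in HTML_MARKERS):
        return "text/html"
    return detected
```

For example, `resolve_mime(b"<div>Opinion</div>", "text/plain")` yields `"text/html"`, while a PDF detected as `application/pdf` is left alone.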
I've opened a PR for the sub-issue. Next up: fixing the affected data.