juriscraper

new `colo` and `coloctapp` problems with opinion content display rendering

Open grossir opened this issue 1 year ago • 6 comments

We recently changed the `colo` scraper, and some newly scraped documents are rendering as raw HTML. I think it's because doctor is tagging them as txt files. I am not sure if this can be fixed on the scraper side.

They look like this, and the opinion backup is in the txt folder:

image


On the other hand, some that have been properly identified as HTML look strange, which may be fixed using `Site.cleanup_content`:

image

grossir avatar Jul 30 '24 01:07 grossir

No PDF option for these new courts? I suppose huh.

flooie avatar Aug 07 '24 19:08 flooie

It was more difficult to get the opinions as PDFs before; now it seems like a single request. I haven't checked whether the PDFs have time-related tags or something similar, though. I have implemented `cleanup_content` for `coloctapp`, for now.

grossir avatar Sep 13 '24 04:09 grossir

Now doctor is interpreting the content as "txt", which makes the rendering look bad:

image

grossir avatar Oct 09 '24 13:10 grossir

@grossir can you push a fix for this? The problem stems from doctor being unable to identify HTML if the `<html>` tag is not present. We should simply wrap the final extracted content inside an `<html>` tag, and that should fix this issue.
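
A minimal sketch of the wrapping idea suggested here, assuming the extracted content is available as bytes. The helper name `wrap_in_html` is hypothetical and not part of juriscraper; where it would be called (for example from a scraper's `cleanup_content`) is also an assumption.

```python
def wrap_in_html(content: bytes) -> bytes:
    """Hypothetical helper: wrap an HTML fragment in <html><body> tags.

    Content that already contains an <html> tag (in any letter case)
    is returned unchanged, so applying the helper twice is harmless.
    """
    if b"<html" in content.lower():
        return content
    return b"<html><body>" + content + b"</body></html>"


fragment = b"<div class='opinion'>Opinion text</div>"
wrapped = wrap_in_html(fragment)
```

With the wrapper in place, a detector that looks for an `<html>` tag has something to match even when the court serves only a fragment.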

flooie avatar Oct 22 '24 20:10 flooie

The current content type detection, which uses the python-magic library, returns text/plain when an `<html>` tag isn't found. To fix this, we'll add a manual check for `<div>` tags within the content. If `<div>` tags are present, we'll treat the content as HTML, even if the `<html>` tag is missing. This ensures proper rendering for HTML content that might not include a full wrapper.

```python
import magic

mime = magic.from_buffer(content, mime=True)

# If the file content contains HTML tags, override the detected mime type to text/html
lowered = content.lower()
if b"<html" in lowered or b"<div" in lowered:
    mime = "text/html"
```
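
The check above can also be sketched as a standalone function that is easy to test without python-magic installed. This is a variant under stated assumptions, not the code in the PR: the function name `override_mime` is hypothetical, and restricting the override to `text/plain` results is an added design choice so that other detected types (such as PDFs) are left alone.

```python
def override_mime(detected_mime: str, content: bytes) -> str:
    """Hypothetical helper: promote text/plain to text/html when HTML tags appear.

    `detected_mime` is whatever the upstream detector returned, for example
    the result of magic.from_buffer(content, mime=True).
    """
    # Assumption: only second-guess the detector when it reported plain text.
    if detected_mime != "text/plain":
        return detected_mime
    lowered = content.lower()
    if b"<html" in lowered or b"<div" in lowered:
        return "text/html"
    return detected_mime


result = override_mime("text/plain", b"<div><p>Opinion text</p></div>")
```

One caveat: a substring test like this would also match plain text that merely mentions `<div`, so it is a heuristic rather than a parser.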

Luis-manzur avatar May 26 '25 21:05 Luis-manzur

I've opened a PR for the sub-issue. Next up: fixing the affected data.

Luis-manzur avatar May 26 '25 22:05 Luis-manzur