
Prevent duplicates by cleaning up HTML tags with timestamps or tokens

Open grossir opened this issue 1 year ago • 2 comments

We have seen this in coloctapp, which uses a vLex backend (https://github.com/freelawproject/juriscraper/issues/1215). There, some <img> tags carried AWS tokens that changed on every download.

I have a new example in the older files (from before July 2012) for scctapp_u, which contain a timestamped <script> tag from "Incapsula". Open this example twice, a few seconds apart, and the tag will change:

<script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=1672922138" async></script>

So we could define a default cleaning step for HTML files that tries to remove elements that may hold tokens or timestamps.
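A minimal sketch of what such a default cleaning step might look like, using only the standard library. The function name `strip_volatile_scripts` and the marker list are hypothetical and not part of Juriscraper. The two markers are taken from the examples in this issue. A real implementation would probably rely on an HTML parser rather than a regex.

```python
import re

# Matches a whole <script>...</script> element, including its attributes.
SCRIPT_ELEMENT = re.compile(
    r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL
)

# Hypothetical marker list: substrings that identify scripts known to carry
# per-request tokens or timestamps (both taken from examples in this issue).
VOLATILE_MARKERS = ("_Incapsula_Resource", "__CF$cv$params")


def strip_volatile_scripts(html: str) -> str:
    """Drop <script> elements that contain any volatile marker."""

    def _drop_if_volatile(match):
        element = match.group(0)
        if any(marker in element for marker in VOLATILE_MARKERS):
            return ""
        return element

    return SCRIPT_ELEMENT.sub(_drop_if_volatile, html)


# Two downloads that differ only in the "cb" value.
# The second cb value is made up for illustration.
first = '<p>Opinion</p><script src="/_Incapsula_Resource?ns=1&cb=1672922138" async></script>'
second = '<p>Opinion</p><script src="/_Incapsula_Resource?ns=1&cb=1672922140" async></script>'
same_after_cleaning = strip_volatile_scripts(first) == strip_volatile_scripts(second)
```

Scripts that match none of the markers are left untouched, so the step only removes elements already known to cause duplicates.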

Additionally, we could compute the hash in Juriscraper, so that one can check during development whether it changes between downloads.
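A rough sketch of how such a development-time check could work. The helper names `content_hash` and `is_stable` are hypothetical, and SHA-1 is an assumption here; the hash used downstream for deduplication may differ.

```python
import hashlib
from typing import Iterable, Union


def content_hash(content: Union[bytes, str]) -> str:
    """Return the hex SHA-1 digest of the downloaded content."""
    if isinstance(content, str):
        content = content.encode("utf-8")
    return hashlib.sha1(content).hexdigest()


def is_stable(downloads: Iterable[Union[bytes, str]]) -> bool:
    """True if every download of the same document hashes identically."""
    return len({content_hash(d) for d in downloads}) == 1
```

Downloading the same document twice, a few seconds apart, and passing both responses to `is_stable` would flag sources whose content carries per-request tokens.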

grossir, Oct 22 '24 21:10

This is already handled, in a sense, for nev and neb here:

https://github.com/freelawproject/courtlistener/blob/5ed425474fbf46a64a3839d041097febed5d7623/cl/scrapers/management/commands/cl_scrape_opinions.py#L302-L320

grossir, Mar 26 '25 19:03

Another script tag is producing duplicates, this time for nyappdiv. See this query.

This part is problematic and will change every time: r:'93cba1e399dd4c1a',t:'MTc0NjczNTM2MC4wMDAwMDA=' vs r:'93cba314e9d66da1',t:'MTc0NjczNTQwOS4wMDAwMDA='

As downloaded in Python:

<script>(function(){function c(){var b=a.contentDocument||a.contentWindow.document;if(b){var d=b.createElement('script');d.innerHTML="window.__CF$cv$params={r:'93cba1e399dd4c1a',t:'MTc0NjczNTM2MC4wMDAwMDA='};var a=document.createElement('script');a.nonce='';a.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js';document.getElementsByTagName('head')[0].appendChild(a);";b.getElementsByTagName('head')[0].appendChild(d)}}if(document.body){var a=document.createElement('iframe');a.height=1;a.width=1;a.style.position='absolute';a.style.top=0;a.style.left=0;a.style.border='none';a.style.visibility='hidden';document.body.appendChild(a);if('loading'!==document.readyState)c();else if(window.addEventListener)document.addEventListener('DOMContentLoaded',c);else{var e=document.onreadystatechange||function(){};document.onreadystatechange=function(b){e(b);'loading'!==document.readyState&&(document.onreadystatechange=e,c())}}}})();</script></BODY>

As seen in the browser:

window.__CF$cv$params={r:'93cba314e9d66da1',t:'MTc0NjczNTQwOS4wMDAwMDA='};var a=document.createElement('script');a.nonce='';a.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js';document.getElementsByTagName('head')[0].appendChild(a);
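The `t` value in both snippets is a base64-encoded Unix timestamp, which explains why it differs on every request. The sketch below decodes it and shows one possible way to blank out the `r`/`t` pair before hashing. The names `decode_cf_timestamp` and `mask_cf_params` are hypothetical, and the regex is inferred only from the two samples above, so it may not cover every variant of this script.

```python
import base64
import re

# Pattern inferred from the two samples in this thread: a hex "r" value
# followed by a base64 "t" value.
CF_PARAMS = re.compile(r"r:'([0-9a-f]+)',t:'([A-Za-z0-9+/=]+)'")


def decode_cf_timestamp(snippet: str) -> str:
    """Return the decoded text of the 't' value found in the snippet."""
    match = CF_PARAMS.search(snippet)
    if match is None:
        raise ValueError("no r/t parameter pair found")
    return base64.b64decode(match.group(2)).decode("ascii")


def mask_cf_params(html: str) -> str:
    """Replace the volatile r/t values with empty strings."""
    return CF_PARAMS.sub("r:'',t:''", html)


python_download = "window.__CF$cv$params={r:'93cba1e399dd4c1a',t:'MTc0NjczNTM2MC4wMDAwMDA='};"
browser_view = "window.__CF$cv$params={r:'93cba314e9d66da1',t:'MTc0NjczNTQwOS4wMDAwMDA='};"

print(decode_cf_timestamp(python_download))  # 1746735360.000000
print(decode_cf_timestamp(browser_view))     # 1746735409.000000
print(mask_cf_params(python_download) == mask_cf_params(browser_view))  # True
```

Masking keeps the rest of the script in place, which may be preferable when removing whole elements is considered too aggressive.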

grossir, May 08 '25 20:05