Prevent duplicates by cleaning up HTML tags with timestamps or tokens
We have seen this in coloctapp, which uses a vlex backend (https://github.com/freelawproject/juriscraper/issues/1215). There, some <img> tags carried AWS tokens that changed on every download.
I found a new example in the older scctapp_u files (from before July 2012), which contain a timestamped <script> tag from "Incapsula".
Open this example twice, a few seconds apart, and this tag will change:
<script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=1672922138" async></script>
So we could define a default cleaning step for HTML files that tries to remove elements that may hold tokens or timestamps.
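A minimal sketch of what such a default cleaning step could look like, using only the standard library. The helper name `strip_script_tags` is hypothetical, the second `cb` value is made up to simulate a later download, and a regex is a rough stand-in for a real HTML parser:

```python
import re

# Hypothetical helper, not an existing Juriscraper function.
# Matches a whole <script ...>...</script> element, across newlines.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_script_tags(html: str) -> str:
    """Return the HTML with every <script> element removed."""
    return SCRIPT_RE.sub("", html)


first = (
    "<p>Opinion text</p>"
    '<script type="text/javascript" '
    'src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=1672922138" '
    "async></script>"
)
# Simulated second download: only the `cb` timestamp differs (value is invented).
second = first.replace("cb=1672922138", "cb=1672922150")

print(first == second)                                        # False
print(strip_script_tags(first) == strip_script_tags(second))  # True
```

Dropping all <script> elements is a blunt choice. It should be safe for opinion text, but a per-court override would probably still be needed for cases like the <img> tokens in coloctapp.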
Additionally, we could compute the content hash in Juriscraper, so that one can check during development whether it changes between downloads.
This is already handled, in a sense, for nev and neb here:
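As a rough illustration of the development-time check, something like the following could flag unstable content. The function names are hypothetical, and SHA-1 is an assumption made for the sketch; the real check should use whatever hash the deduplication step relies on:

```python
import hashlib


def content_hash(content: bytes) -> str:
    """Hex digest of the raw downloaded bytes (SHA-1 is assumed here)."""
    return hashlib.sha1(content).hexdigest()


def is_stable(downloads: list[bytes]) -> bool:
    """True if every download of the same document hashes to the same value."""
    return len({content_hash(d) for d in downloads}) == 1


# Two downloads of the "same" page that differ only in a timestamp.
page_a = b"<p>Opinion text</p><script src='/r?cb=1'></script>"
page_b = b"<p>Opinion text</p><script src='/r?cb=2'></script>"

print(is_stable([page_a, page_a]))  # True
print(is_stable([page_a, page_b]))  # False
```

In development, downloading the same opinion twice a few seconds apart and running it through a check like this would show right away whether the scraper needs a cleaning step.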
https://github.com/freelawproject/courtlistener/blob/5ed425474fbf46a64a3839d041097febed5d7623/cl/scrapers/management/commands/cl_scrape_opinions.py#L302-L320
Another script tag producing duplicates, this time for nyappdiv.
See this query
This part is problematic and will change every time:
r:'93cba1e399dd4c1a',t:'MTc0NjczNTM2MC4wMDAwMDA='
vs
r:'93cba314e9d66da1',t:'MTc0NjczNTQwOS4wMDAwMDA='
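For what it's worth, the `t` values look like base64-encoded Unix timestamps, which would explain why they change on every request. A quick check, where the helper name is hypothetical and the first value is taken in full (with its trailing `=`) from the downloaded HTML:

```python
import base64
from datetime import datetime, timezone


def decode_cf_timestamp(token: str) -> datetime:
    """Decode a base64 `t` value into a UTC datetime (hypothetical helper)."""
    raw = base64.b64decode(token).decode("ascii")  # e.g. "1746735360.000000"
    return datetime.fromtimestamp(float(raw), tz=timezone.utc)


first = decode_cf_timestamp("MTc0NjczNTM2MC4wMDAwMDA=")
second = decode_cf_timestamp("MTc0NjczNTQwOS4wMDAwMDA=")

print(base64.b64decode("MTc0NjczNTM2MC4wMDAwMDA=").decode("ascii"))  # 1746735360.000000
print((second - first).total_seconds())  # 49.0
```

The two downloads are 49 seconds apart, so any hash computed over the raw HTML will differ between them.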
As downloaded in Python:
<script>(function(){function c(){var b=a.contentDocument||a.contentWindow.document;if(b){var d=b.createElement('script');d.innerHTML="window.__CF$cv$params={r:'93cba1e399dd4c1a',t:'MTc0NjczNTM2MC4wMDAwMDA='};var a=document.createElement('script');a.nonce='';a.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js';document.getElementsByTagName('head')[0].appendChild(a);";b.getElementsByTagName('head')[0].appendChild(d)}}if(document.body){var a=document.createElement('iframe');a.height=1;a.width=1;a.style.position='absolute';a.style.top=0;a.style.left=0;a.style.border='none';a.style.visibility='hidden';document.body.appendChild(a);if('loading'!==document.readyState)c();else if(window.addEventListener)document.addEventListener('DOMContentLoaded',c);else{var e=document.onreadystatechange||function(){};document.onreadystatechange=function(b){e(b);'loading'!==document.readyState&&(document.onreadystatechange=e,c())}}}})();</script></BODY>
As seen in the browser:
window.__CF$cv$params={r:'93cba314e9d66da1',t:'MTc0NjczNTQwOS4wMDAwMDA='};var a=document.createElement('script');a.nonce='';a.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js';document.getElementsByTagName('head')[0].appendChild(a);
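If removing the whole <script> element is too aggressive for this court, a narrower option might be to blank out only the volatile parameters before hashing. This is a sketch under that assumption; `mask_cf_params` is a hypothetical name and the pattern is based only on the two samples above:

```python
import re

# Matches the object literal assigned to window.__CF$cv$params and keeps
# only the surrounding braces, dropping the changing `r` and `t` values.
CF_PARAMS_RE = re.compile(r"(window\.__CF\$cv\$params=\{)[^}]*(\})")


def mask_cf_params(html: str) -> str:
    """Replace the contents of window.__CF$cv$params with an empty object."""
    return CF_PARAMS_RE.sub(r"\1\2", html)


python_download = (
    "<script>window.__CF$cv$params="
    "{r:'93cba1e399dd4c1a',t:'MTc0NjczNTM2MC4wMDAwMDA='};</script>"
)
browser_download = (
    "<script>window.__CF$cv$params="
    "{r:'93cba314e9d66da1',t:'MTc0NjczNTQwOS4wMDAwMDA='};</script>"
)

print(mask_cf_params(python_download))
# <script>window.__CF$cv$params={};</script>
print(mask_cf_params(python_download) == mask_cf_params(browser_download))  # True
```

The trade-off is that a targeted mask breaks silently if the injected script changes shape, while removing every <script> element does not depend on its contents.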