
Breaking

Open O4FDev opened this issue 2 years ago • 0 comments

Built the script below with ChatGPT and Copilot. It probably isn't the most efficient thing at scale, and I'm sure the checks could be tightened so as not to false-positive, but overall I don't think Garlic would've stopped me back in my scraping-madness days.

My alternative to the script below (and my preferred method) is to simply use the requests library to throw all of the HTML into a queue pool, have off-loading worker nodes check for JS that modifies the DOM in some way, and attempt to run that JS in a closed sandbox environment. That would have prevented Garlic from slowing me down at all, at a (rather minor, unfortunately) resource-tradeoff cost.
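A rough sketch of that queue approach, under some loud assumptions: the `looks_dom_modifying` heuristic here is just a `document.body` substring check (the same one the full script below uses), the `crawl` / `default_fetch` names are mine, and the sandboxed-JS execution step is left out entirely.

```python
import threading
from queue import Queue

def looks_dom_modifying(html: str) -> bool:
    """Naive heuristic: any reference to document.body suggests the page
    rewrites itself client-side and needs a JS runtime to render."""
    return "document.body" in html

def default_fetch(url: str) -> str:
    # Deferred import so the sketch stays runnable without network access.
    import requests
    return requests.get(url, timeout=10).text

def crawl(urls, fetch=default_fetch, workers=4):
    """Push URLs through a Queue; worker threads fetch each page and tag it
    'needs-js' (hand off to a browser/sandbox) or 'static' (keep the raw HTML)."""
    q, results = Queue(), {}

    def worker():
        while True:
            url = q.get()
            if url is None:  # poison pill: shut this worker down
                return
            html = fetch(url)
            results[url] = (
                "needs-js" if looks_dom_modifying(html) else "static",
                html,
            )
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for u in urls:
        q.put(u)
    q.join()  # wait until every URL has been processed
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return results
```

In a real pipeline the "needs-js" pages would be handed to the sandboxed JS environment instead of being stored raw; the point is that the cheap requests path handles everything that never touches the DOM.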

Feel free to reach out (anyone in general) on Discord (@youareexisting) and I'd be happy to hop on a call to discuss this. For context, I did A LOT of web scraping (legally) to build my first startup, and I regularly thought about the rather annoying ways sites could slow me down or cost me a bit more cash.

You're on the right track, so don't be discouraged, but there's definitely a lot more that can go into this than encoding the content on the client side.

Result:

<html><head><script is:global="">
   function decodeBase64(root) {
       for (let i = 0; i < root.childNodes.length; i++) {
           const child = root.childNodes[i];
           if (child.childNodes.length === 1 && child.childNodes[0].nodeType === 3) {
               try {
               const decoded = atob(child.childNodes[0].nodeValue).split("_yummy_")[0]
               child.innerHTML = decoded;
               console.log(child);
               } catch (e) {
                   console.log(e);
               }
           }
           else if (child.nodeType === 3) {
               try {
               const decoded = atob(child.nodeValue).split("_yummy_")[0]
               child.nodeValue = decoded;
               console.log(child);
               } catch (e) {
                   console.log(e);
               }
           }
           else {
               decodeBase64(child);
           }
       }
   }
   document.addEventListener("DOMContentLoaded", function(event) {

       decodeBase64(document.body);
   });
</script>

      <link rel="stylesheet" href="/_astro/index.12c5fa68.css">
<link rel="stylesheet" href="/_astro/index.017e34d0.css"></head><body><div><main class="astro-J7PV25F6"><div class="App astro-J7PV25F6"><h1 id="garlic" class="astro-J7PV25F6">Garlic</h1><p class="astro-J7PV25F6">Garlic is a simple, fast and secure way to protect your website from being scraped by bots.</p><p class="astro-J7PV25F6">You write your code and text as you would any other day, just let garlic protect your content from scraping.</p><a href="/more/about" class="astro-J7PV25F6">About</a></div></main></div></body></html>
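For reference, the decoder in the inline script above (`atob(...).split("_yummy_")[0]`) can be replicated server-side without a browser at all. A minimal sketch; the `decode_garlic` name is mine, but the `_yummy_` separator is taken straight from the site's own decoder shown in the Result:

```python
import base64

def decode_garlic(encoded: str) -> str:
    """Mirror the site's client-side decoder: base64-decode the text node,
    then keep only what comes before the "_yummy_" separator."""
    return base64.b64decode(encoded).decode("utf-8").split("_yummy_")[0]

# Round-trip: encode a payload the way the site would, then decode it back.
encoded = base64.b64encode("Garlic_yummy_junk".encode()).decode()
print(decode_garlic(encoded))  # -> Garlic
```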

Script:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from urllib.parse import urljoin

def has_body_modifying_scripts(html_content, base_url):
    """Return True if any inline or external script references document.body."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Inline scripts: check their text directly.
    for script in soup.find_all('script', src=False):
        if script.string and 'document.body' in script.string:
            return True

    # External scripts: resolve relative src values (e.g. /app.js) against
    # the page URL before fetching, otherwise requests.get() fails on them.
    for script_file in soup.find_all('script', src=True):
        script_url = urljoin(base_url, script_file['src'])
        response = requests.get(script_url)
        if response.status_code == 200 and 'document.body' in response.text:
            return True

    return False

def get_html_with_selenium(url):
    options = Options()
    # Options.headless is deprecated in Selenium 4; pass the flag instead.
    options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

def fetch_url_content(url):
    response = requests.get(url)
    if response.status_code != 200:
        print("Failed to fetch the URL.")
        return None
    html_content = response.text
    if has_body_modifying_scripts(html_content, url):
        print("Scripts found that modify document.body, using Selenium to get the rendered page.")
        return get_html_with_selenium(url)
    print("No scripts found that modify document.body, using requests output.")
    return html_content

# Example usage
url = "https://garlic-astro.netlify.app/"  # Replace with the URL you want to scrape
html_content = fetch_url_content(url)
print(html_content)

O4FDev · Dec 17 '23 07:12