
403 Forbidden and JSON Data Error

Open sunnyseera opened this issue 1 year ago • 8 comments

Hey man, if I try to run this locally I get the following:

collecting data for https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801

Error pulling data for https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801, with error: 403 Client Error: Forbidden for url: https://www.psacard.com/Pop/GetSetItems

Traceback (most recent call last):
  File "/home/USERNAME/psa-scrape/pop_report/original_to_github.py", line 112, in <module>
    ppr.scrape()
  File "/home/USERNAME/psa-scrape/pop_report/original_to_github.py", line 43, in scrape
    cards = json_data["data"]
UnboundLocalError: local variable 'json_data' referenced before assignment

So when I try the URL (I changed it to a Pokemon one), first I get the 403 Forbidden, and then there's a json_data error.

Is there any way to resolve this? I have been trying to fix it locally but got nowhere.

I am running it with Python 3 on Ubuntu 22.04.3 LTS.

I can resolve the json_data error by doing the following:

Fix 1 of 2: changing this:

    try:
        json_data = self.post_to_url(sess, form_data)
    except Exception as err:
        print("Error pulling data for {}, with error: {}".format(self.set_name, err))
    cards = json_data["data"]  # This line causes UnboundLocalError if the try block fails

To this:

    try:
        json_data = self.post_to_url(sess, form_data)
    except Exception as err:
        print("Error pulling data for {}, with error: {}".format(self.set_name, err))
        return  # Early exit if an error occurs

    # Ensure json_data is valid before proceeding
    if not json_data or "data" not in json_data:
        print("No valid data found for set: {}".format(self.set_name))
        return  # Exit if there's no data

    cards = json_data["data"]  # Now safe to access since we checked for validity

Fix 2 of 2: changing this:

    json_data = self.post_to_url(sess, form_data)
    cards += json_data["data"]  # Assumes json_data is valid

To this:

    try:
        json_data = self.post_to_url(sess, form_data)
        if not json_data or "data" not in json_data:
            print("No valid data found for additional page: {}".format(curr_page))
            break  # Exit loop if there's no more data
        cards += json_data["data"]
    except Exception as err:
        print("Error pulling additional data for set {}, page {}: {}".format(self.set_name, curr_page, err))
        break  # Exit loop on error
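For reference, the two fixes combined turn the paginated scrape into a loop that exits cleanly instead of raising. This is only a sketch of how that loop might look; `post_to_url`, the form-data field names (`start`, `length`), and the page size are assumptions standing in for whatever the real scraper uses:

```python
def scrape_all_pages(post_to_url, set_name, page_size=250):
    """Fetch every page of a pop report, stopping cleanly on errors or
    missing data instead of raising UnboundLocalError.

    `post_to_url` is assumed to take a form-data dict and return parsed JSON.
    """
    cards = []
    curr_page = 0
    while True:
        # Field names are a guess at DataTables-style paging parameters.
        form_data = {"start": curr_page * page_size, "length": page_size}
        try:
            json_data = post_to_url(form_data)
        except Exception as err:
            print("Error pulling data for {}, page {}: {}".format(
                set_name, curr_page, err))
            break
        if not json_data or "data" not in json_data or not json_data["data"]:
            break  # no more pages
        cards += json_data["data"]
        curr_page += 1
    return cards
```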

If I put those changes in, I am still left with the 403 Error: Error pulling data for https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801, with error: 403 Client Error: Forbidden for url: https://www.psacard.com/Pop/GetSetItems

Sorry for the long message but I wanted to give as much context as possible!

Hopefully you can help resolve this!

I think the 403 might come from Cloudflare blocking the request.

sunnyseera avatar Oct 11 '24 14:10 sunnyseera

Hello! Thanks for reporting this. Yeah, the 403 is definitely a result of PSA using Cloudflare. The requests are being blocked because they're missing some required cookies in the request headers. When I navigate to a pop report page (I'm using https://www.psacard.com/pop/baseball-cards/2018/topps-update/161401 locally as an example), I can see the GetSetItems XHR call in the Network tab of Chrome's DevTools. I can right-click that call, Copy as cURL (which contains all the required cookie headers AND a legit User-Agent), run the curl command in a terminal, and get back the pop report JSON, just like the Python program used to be able to do.
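For anyone wanting to replay that cURL trick from Python, here is a minimal sketch of a requests.Session that carries the Cloudflare cookies and a browser User-Agent. The cookie names (`cf_clearance`, `__cf_bm`) are ones Cloudflare commonly sets; the values are placeholders you would paste from the copied cURL command, and this only works while that browser session's cookies stay valid:

```python
import requests

def make_psa_session(cf_clearance: str, cf_bm: str) -> requests.Session:
    """Build a requests.Session carrying Cloudflare cookies and a browser
    User-Agent, both copied from a real Chrome session via DevTools."""
    sess = requests.Session()
    sess.headers["User-Agent"] = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
    )
    # Cookie values are placeholders -- paste the real ones from DevTools.
    sess.cookies.set("cf_clearance", cf_clearance, domain="www.psacard.com")
    sess.cookies.set("__cf_bm", cf_bm, domain="www.psacard.com")
    return sess

sess = make_psa_session("PASTE_CF_CLEARANCE", "PASTE_CF_BM")
# json_data = sess.post("https://www.psacard.com/Pop/GetSetItems",
#                       data=form_data).json()
```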

This project used to use Selenium and a WebDriver to scrape PSA data; if I were to go back to that, I'm fairly certain this would work again. With that approach, the web driver should have all the required headers to get around Cloudflare.

This is what the project/code looked like with Selenium.
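A sketch of how that Selenium route could be wired up: let a real Chrome (via ChromeDriver) pass the Cloudflare challenge, then copy its cookies into a plain requests.Session for the JSON calls. This is a hypothetical outline, not the project's actual old code:

```python
import requests

def cookies_to_session(driver_cookies: list) -> requests.Session:
    """Copy cookies from Selenium's driver.get_cookies() list-of-dicts
    format into a requests.Session usable for the GetSetItems calls."""
    sess = requests.Session()
    for c in driver_cookies:
        sess.cookies.set(c["name"], c["value"], domain=c.get("domain") or "")
    return sess

# Usage (requires selenium and a matching ChromeDriver install):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://www.psacard.com/pop/baseball-cards/2018/topps-update/161401")
# sess = cookies_to_session(driver.get_cookies())
# driver.quit()
```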

I might at some point try to get that working again, but probably won't get to it for a while. I don't have the drive or desire to keep up with the PSA website changes.

ChrisMuir avatar Oct 14 '24 17:10 ChrisMuir

I spent some time today getting the Selenium + WebDriver solution working again on the pop report pages. However, pagination is completely broken under ChromeDriver. I can see the pagination at the bottom of the page, and I can point Selenium at those page elements, but the page will not load anything past the first page. Even when I load the page in ChromeDriver and manually try to click through to page 2, I can interact with the pagination elements, but nothing beyond page 1 will load.

I went to download PhantomJS (I used it years ago for headless driver scraping), but that project was archived in 2018 :(

I don't know if the broken pagination is intentional (Cloudflare detecting a webdriver) or just a bug.

I'm walking away from this for now.

ChrisMuir avatar Oct 20 '24 03:10 ChrisMuir

You only need to solve the Cloudflare challenge in order to crawl properly, without browser automation.


huazz233 avatar Feb 05 '25 03:02 huazz233

Hi @huazz233 , do you have a suggestion on how to get the pop report scraper working again?

ChrisMuir avatar Feb 06 '25 01:02 ChrisMuir

> Hi @huazz233 , do you have a suggestion on how to get the pop report scraper working again?

Yes, I can fix the 403 error and keep requesting data, but it seems the site's structure has changed, so I need to start over.

huazz233 avatar Feb 06 '25 02:02 huazz233

Ok! If you're putting together a fix locally and you get it working, I'd welcome a PR

ChrisMuir avatar Feb 06 '25 02:02 ChrisMuir

Any update, or any hope of avoiding the 403?

notbnull avatar Apr 26 '25 22:04 notbnull

@notbnull Nah, I haven't touched this since February. @huazz233 I still would love to see a PR, if you've gotten this to work locally.

ChrisMuir avatar Apr 27 '25 00:04 ChrisMuir