realestate-com-au-api

Is this project dead?

Open BenGardiner123 opened this issue 1 year ago • 12 comments

Hi guys, I'm pretty sure that REDC has implemented changes that render this project useless, which is a shame.

Does anyone know otherwise?

BenGardiner123 avatar Jan 31 '24 23:01 BenGardiner123

Oh damn, that sucks.

Had a quick look at their current search query and it looks similar. I imagine it won't be too hard to bring this repo up to date.

@BenGardiner123 are you interested in contributing? Happy to help steer you in the right direction!

tomquirk avatar Jan 31 '24 23:01 tomquirk

Hey @tomquirk, I could have a go for sure, but full disclosure: I have almost zero experience with Python. If, as you say, you can steer me in the right direction, I might produce something of value? :)

BenGardiner123 avatar Feb 05 '24 23:02 BenGardiner123

Hi @tomquirk, count me in. I would love to contribute and see the outcome.

mhtkhan302 avatar Feb 11 '24 23:02 mhtkhan302

Hi @tomquirk, I would love to contribute as well.

ZYMoridae avatar Feb 12 '24 12:02 ZYMoridae

Is anyone working on this? Or does anyone want to start a new repo?

themachineworks avatar Mar 17 '24 04:03 themachineworks

I'd be happy to help contribute. I just don't know the details of the API.

aaronshenhao avatar May 04 '24 03:05 aaronshenhao

As far as I can tell, the issue seems to be some kind of bot protection / obfuscation that has been added to the website? Something called KPSDK?

angusturner avatar May 07 '24 05:05 angusturner

I didn't notice that. I also tried sending a single request using a Chrome user-agent, and I didn't get anything back. Correct me if I am wrong, but this website looks like a microservice architecture, so after presenting the initial HTML it calls different APIs to fill it in. My idea was to call all these APIs and construct the JSON at my end, but that doesn't seem to work.

themachineworks avatar May 07 '24 08:05 themachineworks

After some more digging, it seems that if you trigger the bot detection, then instead of the expected JSON response you get a 429 error code, along with an empty page and a bit of JavaScript that I think is meant to be some kind of automated challenge to check whether you are a bot.
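A rough way to tell this challenge response apart from a normal JSON response is sketched below. It is only a heuristic based on what has been observed in this thread (a 429 status plus a page that mentions KPSDK); the function name is made up for illustration.

```python
def looks_like_kasada_challenge(status_code: int, body: str) -> bool:
    """Heuristic: HTTP 429 plus a body that mentions KPSDK.

    Both signals come from observations in this thread, not from any
    documented behaviour, so treat a True result as a hint only.
    """
    return status_code == 429 and "KPSDK" in body


# Minimal stand-ins for the two kinds of response described above.
challenge_body = "<html><body><script>window.KPSDK={};</script></body></html>"
json_body = '{"data": {}}'

print(looks_like_kasada_challenge(429, challenge_body))
print(looks_like_kasada_challenge(200, json_body))
```

Checking this before trying to decode JSON would at least turn a confusing parse error into a clear "blocked by bot protection" message.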

Actually, it seems so aggressive that if I use Firefox I trigger the bot detection and the website doesn't even work. I have to use Chrome.

And I have had no luck making automated requests.

(By the way @themachineworks, you are correct about the general architecture: as you scroll / tab through the search results, the front-end makes requests to some kind of GraphQL service, which returns JSON-formatted results to populate the HTML.)

Edit: Just realized this is all old news, and has been discussed in #36.

🪦 🪦 🪦

angusturner avatar May 07 '24 16:05 angusturner

The same HTML seems to be returned even for the realestate.com.au landing page:

<!DOCTYPE html>
<html>
   <head></head>
   <body>
      <script>window.KPSDK={};KPSDK.now=typeof performance!=='undefined'&&performance.now?performance.now.bind(performance):Date.now.bind(Date);KPSDK.start=KPSDK.now();</script><script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP_UIDz-1=0IkyTraxWKJz8DAGZOeIWa82r4iOP4PHU3CRAN4saAqH6BjYlet5sRq86f9KakeK9WKVM6D6wlxmZeBVePGtnACkGWoOOBXVaV0hWvKZkJhTbwUolpjp1NSAFjtFnqB3zwExdRFLtLwnwbXiomoenEVtT7FRcJQcbQmZcDRU&amp;x-kpsdk-im=CiQ4ODJjNGNlNi05MzU5LTQwZTYtOGY4OC02ZTJkY2VhZmIyMGU"></script>
   </body>
</html>
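For anyone poking at this page, here is a small sketch that pulls the query parameters out of the challenge script URL using only the standard library. The function name and the shortened sample string are illustrative; the real parameter values are long opaque tokens like the ones above.

```python
import html
import re
from urllib.parse import parse_qs, urlsplit


def challenge_script_params(page: str) -> dict:
    """Return the query parameters of the first <script src="..."> in the page."""
    match = re.search(r'<script\s+src="([^"]+)"', page)
    if match is None:
        return {}
    src = html.unescape(match.group(1))  # turn &amp; back into &
    query = parse_qs(urlsplit(src).query)
    return {name: values[0] for name, values in query.items()}


# Shortened stand-in for the page above (placeholder path and values).
sample = (
    '<script>window.KPSDK={};</script>'
    '<script src="/a/b/ips.js?KP_UIDz-1=abc123&amp;x-kpsdk-im=xyz"></script>'
)
params = challenge_script_params(sample)
print(sorted(params))
```

Note that the parameter in the URL is named `KP_UIDz-1`, not `KP_UIDz`, and that the separator arrives HTML-escaped as `&amp;`.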

Maybe, what we need is some sort of cookie injection that injects the KP_UIDz cookie, from KASADA bot protection, into each request.

@tomquirk Did you write Fajita for cookie management for LinkedIn web scraping?

Since the RealestateComAu class extends Fajita, maybe a cookie_directory needs to be provided to Fajita.__init__, but currently this is hard-coded to the ROOT/cookies/ folder.

What is the required format of the Cookie Jar .jr? Something like this, extracted from curl:

#HttpOnly_www.realestate.com.au	FALSE	/	FALSE	1724382684	KP_UIDz	0G4NnTR24yDoAQbIDflysjJMoTt1YqXizzHXhC5CmtAnL8L8yLLgLZsXzdMfkbA6k7GWLjdHuQYi57X8DzTKmxSOwnz0uSwB1AP9QphWl094ZuxDHIX0kT42cmgQQJvL593ZNuEqqusZojo0hLXk11c1xPWmI8eUe10I319f
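For reference, the line above is in the Netscape cookie-file format that curl writes: seven tab-separated fields, with a `#HttpOnly_` prefix on HttpOnly cookies. Below is a minimal sketch of parsing one such line. Whether Fajita's .jr files use this format is not established here, and `CookieRecord` and `parse_netscape_cookie_line` are names made up for the example.

```python
from typing import NamedTuple, Optional


class CookieRecord(NamedTuple):
    domain: str
    include_subdomains: bool
    path: str
    secure: bool
    expires: int  # Unix timestamp
    name: str
    value: str
    http_only: bool


def parse_netscape_cookie_line(line: str) -> Optional[CookieRecord]:
    """Parse one line of a curl/Netscape cookie file; None for comments or bad lines."""
    line = line.rstrip("\r\n")
    http_only = line.startswith("#HttpOnly_")
    if http_only:
        line = line[len("#HttpOnly_"):]
    elif not line or line.startswith("#"):
        return None  # blank line or ordinary comment
    fields = line.split("\t")
    if len(fields) != 7:
        return None
    domain, subdomains, path, secure, expires, name, value = fields
    return CookieRecord(
        domain, subdomains == "TRUE", path, secure == "TRUE",
        int(expires), name, value, http_only,
    )


# Same shape as the line above, with a placeholder instead of the real token.
sample = "#HttpOnly_www.realestate.com.au\tFALSE\t/\tFALSE\t1724382684\tKP_UIDz\tEXAMPLE_VALUE"
record = parse_netscape_cookie_line(sample)
print(record.name, record.domain, record.http_only)
```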

Part of the challenge is reverse engineering KASADA. I don't really agree with the abstractions made in the codebase, but gotta work with it now.

MengLinMaker avatar Aug 22 '24 02:08 MengLinMaker

@MengLinMaker I had a quick try at this, but I couldn't get it to work:

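import re

import requests
from requests.cookies import cookiejar_from_dict

# `api` (assumed to be a RealestateComAu instance) and REQUEST_HEADERS are defined elsewhere.
res = requests.get('https://realestate.com.au', headers=REQUEST_HEADERS)

# In the page shown earlier the parameter appears as `KP_UIDz-1=...&amp;`,
# so allow an optional numeric suffix and stop at the first `&`.
reg = r'KP_UIDz(?:-\d+)?=([^&"]+)'
cookie = re.search(reg, res.text).group(1)
dict_cookie = {'KP_UIDz': cookie}
cookie_jar = cookiejar_from_dict(dict_cookie)

api._client._set_session_cookies(cookie_jar)

listings = api.search(locations=["brisbane, qld 4000"], channel="buy", limit=10)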

From my brief reading, Kasada looks quite gnarly to evade. It looks like there are a few headers involved, in addition to the cookie:

  • x-kpsdk-v
  • x-kpsdk-ct
  • x-kpsdk-cd
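When comparing a scripted request against one captured from a real browser, a quick check like the one below shows which of these headers are absent. This is only a sketch: it assumes the three header names all use the `x-kpsdk-` prefix, and the helper name is made up.

```python
# Header names as assumed above; matching is case-insensitive.
KASADA_HEADERS = ("x-kpsdk-v", "x-kpsdk-ct", "x-kpsdk-cd")


def missing_kasada_headers(headers: dict) -> list:
    """Return the Kasada-related header names that are not present in `headers`."""
    present = {name.lower() for name in headers}
    return [name for name in KASADA_HEADERS if name not in present]


# Hypothetical captured headers, with a placeholder token value.
captured = {"User-Agent": "Mozilla/5.0", "X-KPSDK-CT": "token-from-browser"}
print(missing_kasada_headers(captured))
```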

More here:

  • https://blog.csdn.net/zhzhsgg/article/details/135253952
  • https://github.com/unicorn-aio/kpsdk

I tried a request via the Bright Data Web Unblocker proxy and it worked, though that is a commercial alternative.

tomquirk avatar Sep 02 '24 14:09 tomquirk

Thank you for your investigation.

It looks like bypassing KASADA would require an automated browser to evaluate the KPSDK JavaScript bundle. Regardless, realestate.com.au limits the number of result pages you can paginate through.

MengLinMaker avatar Sep 02 '24 14:09 MengLinMaker