realestate-com-au-api
Is this project dead?
Hi guys, pretty sure that REDC has implemented changes that render this project useless, which is a shame.
Does anyone know otherwise?
Oh damn, that sucks.
Had a quick look at their current search query and it looks similar. I imagine it won't be too hard to bring this repo up to date.
@BenGardiner123 are you interested in contributing? Happy to help steer you in the right direction!
Hey @tomquirk, I could have a go for sure, but full disclosure: I have almost zero experience with Python. But if, as you say, you can steer me in the right direction, I might produce something of value? :)
Hi @tomquirk, count me in. Would love to contribute and see the outcome.
Hi @tomquirk, I would love to contribute as well.
Is anyone working on this? Or does anyone want to start a new repo?
I'd be happy to help contribute. I just don't know the details of the API.
As far as I can tell, the issue seems to be some kind of bot protection / obfuscation that has been added to the website? Something called KPSDK?
I didn't notice that. Also, I tried sending a single request using a Chrome user-agent, and I didn't get anything back. Correct me if I am wrong, but this website looks like a microservice architecture, so it calls different APIs to fill the HTML that it presents first. My idea was to call all these APIs and construct the JSON at my end, but that doesn't seem to work.
After some more digging, it seems that if you trigger the bot detection, then instead of the expected JSON response you get a 429 error code, along with an empty page and a bit of JavaScript that I think is meant to be some kind of automated challenge to see if you are a bot.
Actually, it seems so aggressive that if I use Firefox I trigger the bot detection and the website doesn't even work. I have to use Chrome.
And I have had no luck making automated requests.
(By the way @themachineworks, you are correct about the general architecture: as you scroll / tab through the search results, the front-end makes requests to some kind of GraphQL service, which returns JSON-formatted results to populate the HTML.)
Edit: Just realized this is all old news, and has been discussed in #36.
🪦 🪦 🪦
The same HTML even seems to be returned for the realestate.com.au landing page:
<!DOCTYPE html>
<html>
<head></head>
<body>
<script>window.KPSDK={};KPSDK.now=typeof performance!=='undefined'&&performance.now?performance.now.bind(performance):Date.now.bind(Date);KPSDK.start=KPSDK.now();</script><script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/ips.js?KP_UIDz-1=0IkyTraxWKJz8DAGZOeIWa82r4iOP4PHU3CRAN4saAqH6BjYlet5sRq86f9KakeK9WKVM6D6wlxmZeBVePGtnACkGWoOOBXVaV0hWvKZkJhTbwUolpjp1NSAFjtFnqB3zwExdRFLtLwnwbXiomoenEVtT7FRcJQcbQmZcDRU&x-kpsdk-im=CiQ4ODJjNGNlNi05MzU5LTQwZTYtOGY4OC02ZTJkY2VhZmIyMGU"></script>
</body>
</html>
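For anyone wanting to detect this case in code, here is a minimal sketch based only on what is described in this thread: a 429 status plus a body that mentions KPSDK. It also pulls the query parameters out of the ips.js script URL. The function names are made up for illustration and are not part of this library, and the check is a heuristic rather than anything Kasada documents.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the src attribute of the <script> tag that loads ips.js.
_SCRIPT_SRC = re.compile(r'<script[^>]*\bsrc="([^"]*ips\.js[^"]*)"')


def is_kasada_challenge(status_code: int, body: str) -> bool:
    """Heuristic: a 429 response whose body references KPSDK."""
    return status_code == 429 and "KPSDK" in body


def extract_challenge_params(body: str) -> dict[str, str]:
    """Return the query parameters of the ips.js script URL, or {} if absent."""
    match = _SCRIPT_SRC.search(body)
    if match is None:
        return {}
    query = urlsplit(match.group(1)).query
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}
```

Note that in the page above the parameter is named KP_UIDz-1, not KP_UIDz, so anything searching the body for a literal "KP_UIDz=" will not find it.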
Maybe what we need is some sort of cookie injection that injects the KP_UIDz cookie, from KASADA bot protection, into each request.
@tomquirk Did you write Fajita for cookie management for LinkedIn web scraping?
Since the RealestateComAu class extends Fajita, maybe a cookie_directory needs to be provided to Fajita.__init__, but currently this is hard coded to the ROOT/cookies/ folder.
What is the required format of the Cookie Jar .jr? Something like this, extracted from curl:
#HttpOnly_www.realestate.com.au FALSE / FALSE 1724382684 KP_UIDz 0G4NnTR24yDoAQbIDflysjJMoTt1YqXizzHXhC5CmtAnL8L8yLLgLZsXzdMfkbA6k7GWLjdHuQYi57X8DzTKmxSOwnz0uSwB1AP9QphWl094ZuxDHIX0kT42cmgQQJvL593ZNuEqqusZojo0hLXk11c1xPWmI8eUe10I319f
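For reference, the line above is in the Netscape cookie file format that curl writes: seven tab-separated fields (domain, include-subdomains flag, path, secure flag, expiry as a Unix timestamp, name, value), with a #HttpOnly_ prefix marking HttpOnly cookies. Below is a small sketch of parsing one such line. The function name is hypothetical, and whether Fajita's cookie jar accepts this format at all is not confirmed anywhere in this thread.

```python
from typing import Optional

HTTP_ONLY_PREFIX = "#HttpOnly_"


def parse_netscape_cookie_line(line: str) -> Optional[dict]:
    """Parse one line of a curl/Netscape cookie file; None for comments or junk."""
    line = line.rstrip("\n")
    http_only = line.startswith(HTTP_ONLY_PREFIX)
    if http_only:
        # curl marks HttpOnly cookies by prefixing the domain field.
        line = line[len(HTTP_ONLY_PREFIX):]
    elif not line.strip() or line.startswith("#"):
        return None  # blank line or ordinary comment
    fields = line.split("\t")
    if len(fields) != 7:
        return None
    domain, subdomains, path, secure, expires, name, value = fields
    return {
        "domain": domain,
        "include_subdomains": subdomains == "TRUE",
        "path": path,
        "secure": secure == "TRUE",
        "expires": int(expires),
        "name": name,
        "value": value,
        "http_only": http_only,
    }
```

The fields must be separated by real tab characters; the tabs in the line pasted above may have been flattened to spaces, in which case it would need to be re-exported from curl before parsing.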
Part of the challenge is reverse engineering KASADA. I don't really agree with the abstractions made in the codebase, but gotta work with it now.
@MengLinMaker I had a quick try at this, but I couldn't get it to work
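import re
import requests
from requests.cookies import cookiejar_from_dict

# REQUEST_HEADERS and api (presumably a RealestateComAu instance) are assumed to be defined earlier
res = requests.get('https://realestate.com.au', headers=REQUEST_HEADERS)
# Try to pull the KP_UIDz value out of the returned page
reg = r'KP_UIDz=(.*)&'
cookie = re.search(reg, res.text).group(1)
dict_cookie = {'KP_UIDz': cookie}
cookie_jar = cookiejar_from_dict(dict_cookie)
api._client._set_session_cookies(cookie_jar)
listings = api.search(locations=["brisbane, qld 4000"], channel="buy", limit=10)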
From my brief reading, Kasada looks quite gnarly to evade. It looks like there are a few headers involved, in addition to the cookie:
- x-kpsdk-v
- x-kpsdk-ct
- x-kpsdk-cd
More here:
- https://blog.csdn.net/zhzhsgg/article/details/135253952
- https://github.com/unicorn-aio/kpsdk
I tried a request via the Bright Data Web Unblocker proxy and it worked, though that is a commercial alternative.
Thank you for your investigation.
Looks like bypassing KASADA would require an automated browser to evaluate the KPSDK JavaScript bundle. Regardless, realestate.com.au limits the number of result pages you can paginate through.