zillow_real_estate
returns empty dataset using testcase
csv is blank
It's because it's getting blocked by a captcha if you look at the response.text
Yup, the csv output is empty because the request is being blocked by a captcha.
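A quick way to confirm this from Python is to check the response body for captcha markers before trying to parse it. This is only a minimal sketch: the marker strings are taken from the captcha page quoted in this thread, and the helper name is_captcha_page is made up for illustration. Zillow may change the page at any time, so treat the list as an assumption.

```python
# Marker strings copied from the captcha page quoted in this thread.
# They are an assumption about what the block page looks like today.
CAPTCHA_MARKERS = (
    "px-captcha",
    "captcha.px-cdn.net",
    "Please verify you're a human",
)


def is_captcha_page(html: str) -> bool:
    """Return True if the HTML looks like the captcha page rather than listings."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


if __name__ == "__main__":
    sample = '<div id="px-captcha" data-callback="handleCaptcha"></div>'
    print(is_captcha_page(sample))
```

In the scraper you would call is_captcha_page(response.text) right after the request and stop early instead of writing an empty CSV.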
I'm not sure that it's helpful, but here's what response.text looks like:
<html><head><meta name="robots" content="noindex, nofollow"/><link href="https://www.zillowstatic.com/vstatic/80d5e73/static/css/z-pages/captcha.css" type="text/css" rel="stylesheet" media="screen"/><script>
window._pxAppId = 'PXHYx10rg3';
window._pxJsClientSrc = '/HYx10rg3/init.js';
window._pxHostUrl = '/HYx10rg3/xhr';
window._pxFirstPartyEnabled = true;
window._pxreCaptchaTheme='light';
</script><script type="text/javascript" src="https://captcha.px-cdn.net/PXHYx10rg3/captcha.js?a=c&m=0"></script>
<script>
function getQueryString(name, url) {
if (!url) url = window.location.href;
name = name.replace(/[\[\]]/g, "\\$&");
var regex = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)"),
results = regex.exec(url);
if (!results) return null;
if (!results[2]) return '';
return decodeURIComponent(results[2].replace(/\+/g, " "));
}
document.addEventListener("DOMContentLoaded", function(e) {
var uuidVerifyRegExp = /^\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?$/i;
document.getElementById("uuid").innerText = "UUID: " + uuidVerifyRegExp.exec(getQueryString("uuid"));
});
function handleCaptcha(response) {
var vid = getQueryString("vid"); // getQueryString is implemented below
var uuid = getQueryString("uuid");
var name = '_pxCaptcha';
var cookieValue = btoa(JSON.stringify({r:response,v:vid,u:uuid}));
var cookieParts = [name, '=', cookieValue, '; path=/'];
cookieParts.push('; domain=' + window.location.hostname);
cookieParts.push('; max-age=10');//expire after 10 seconds
document.cookie = cookieParts.join('');
var originalURL = getOriginalUrl("url");
var originalHost = window.location.host;
var newHref = window.location.protocol + "//" + originalHost;
originalURL = originalURL || '/';
newHref = newHref + originalURL;
window.location.href = newHref;
}
function getOriginalUrl(name) {
var url = getQueryString(name);
if (!url) return null;
var regExMatcher = new RegExp("(([^&#@]*)|&|#|$)");
var matches = regExMatcher.exec(url);
if (!matches) return null;
return matches[0];
}
</script></head><body><main class="zsg-layout-content"><div class="error-content-block"><div class="error-text-content"><!-- <h1>Captcha</h1> --><h5>Please verify you're a human to continue.</h5><div id="content" class="captcha-container"><div id="px-captcha" data-callback="handleCaptcha"></div><img src="https://www.zillowstatic.com/static/logos/logo-65x14.png" width="65" alt="Zillow" height="14"></img></div></div></div></main><h4 id="uuid" class="uuid-string zsg-fineprint"></h4></body></html><!-- H:033 T:1ms S:2605 R:Mon Apr 20 17:12:59 PDT 2020 B:5.0.64443-master.122d0fb~delivery_ready.0d3d7d1b -->
Solution
To bypass the captcha, you can add cookies to your request. The following worked for me in bash, but I have not tried it in Python.
- Download the site's cookies with the Chrome cookies.txt extension.
- Run the following:
ZIP=<zip code>
URL="https://www.zillow.com/homes/${ZIP}_rb"
wget -x --load-cookies cookies.txt "$URL"
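For anyone who wants to try the same thing from Python, here is an untested-against-Zillow sketch using only the standard library. It assumes cookies.txt is in the Netscape format that wget reads. The function names are made up for illustration, and whether the exported cookies are enough to get past the captcha is not something this sketch can guarantee.

```python
import http.cookiejar
import urllib.request


def build_search_url(zip_code: str) -> str:
    """Build the same URL as the bash snippet: .../homes/<zip>_rb."""
    return f"https://www.zillow.com/homes/{zip_code}_rb"


def load_cookie_jar(path: str) -> http.cookiejar.MozillaCookieJar:
    """Load a Netscape-format cookies.txt, keeping session and expired cookies."""
    jar = http.cookiejar.MozillaCookieJar()
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return jar


def fetch_with_cookies(url: str, jar: http.cookiejar.CookieJar) -> str:
    """Fetch the URL while sending the cookies from the jar (performs a network call)."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be along the lines of fetch_with_cookies(build_search_url("<zip code>"), load_cookie_jar("cookies.txt")).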
Replace the following code accordingly to bypass the captcha. On line 79:
#search_results = json_data.get('searchResults').get('listResults', [])
search_results = json_data.get('cat1').get('searchResults').get('listResults', [])
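The chained .get() calls above raise AttributeError if 'cat1' or 'searchResults' is missing, since .get() then returns None. A more defensive sketch that accepts either layout is shown below. The helper name extract_listings is hypothetical, and the two layouts are assumed from the two lines quoted above, not from any documented Zillow API.

```python
def extract_listings(json_data: dict) -> list:
    """Return listResults from either the nested ('cat1') or the flat layout."""
    # Try the newer layout first, then fall back to the older one.
    for root in (json_data.get("cat1") or {}, json_data):
        search_results = root.get("searchResults") or {}
        listings = search_results.get("listResults")
        if listings:
            return listings
    # Neither layout matched (or the list was empty).
    return []
```

With this, line 79 could become search_results = extract_listings(json_data), and an empty list signals that the page structure changed again or the request was blocked.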
The Chrome extension will generate a cookies.txt file that contains the key (name)/value pairs. Turn those into a JSON string and load it as a dict. Then, to use the cookies, add the cookies=cookies_dict parameter to your requests.get() call.
@Luis1Madrid where do you see json_data.get('searchResults')?
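The cookie step described above could be sketched roughly like this. It assumes the exported file uses the tab-separated Netscape layout (seven fields per line, name and value last). The helper names and the example cookie names are made up, and requests is a third-party package imported only where the request is made.

```python
import json


def cookies_txt_to_dict(cookies_txt: str) -> dict:
    """Collect name/value pairs from Netscape-format cookies.txt content."""
    cookies = {}
    for line in cookies_txt.splitlines():
        # Some exporters prefix HttpOnly cookies with "#HttpOnly_".
        if line.startswith("#HttpOnly_"):
            line = line[len("#HttpOnly_"):]
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not a cookie line
        cookies[fields[5]] = fields[6]
    return cookies


def fetch_with_cookies(url: str, cookies_dict: dict) -> str:
    """Send the request with the cookies attached (performs a network call)."""
    import requests  # third-party; imported here so the parser works without it

    response = requests.get(url, cookies=cookies_dict, timeout=30)
    return response.text


if __name__ == "__main__":
    sample = ".zillow.com\tTRUE\t/\tFALSE\t2147483647\texample_name\texample_value\n"
    cookies_dict = cookies_txt_to_dict(sample)
    # The "string JSON" form mentioned above, if you want to store it.
    print(json.dumps(cookies_dict))
```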