
returns empty dataset using testcase

Open cliffordgreen opened this issue 5 years ago • 5 comments

csv is blank

cliffordgreen avatar Jul 01 '19 03:07 cliffordgreen

CSV is blank

Mehra-Ashish avatar Sep 07 '19 04:09 Mehra-Ashish

It's because the request is getting blocked by a captcha, as you can see if you look at response.text.

jchamish avatar Mar 22 '20 02:03 jchamish

Yup, the csv output is empty because the request is being blocked by a captcha.

I'm not sure that it's helpful, but here's what response.text looks like:

response.text
<html><head><meta name="robots" content="noindex, nofollow"/><link href="https://www.zillowstatic.com/vstatic/80d5e73/static/css/z-pages/captcha.css" type="text/css" rel="stylesheet" media="screen"/><script>
        window._pxAppId = 'PXHYx10rg3';
        window._pxJsClientSrc = '/HYx10rg3/init.js';
        window._pxHostUrl = '/HYx10rg3/xhr';
        window._pxFirstPartyEnabled = true;
        window._pxreCaptchaTheme='light';
    </script><script type="text/javascript" src="https://captcha.px-cdn.net/PXHYx10rg3/captcha.js?a=c&amp;m=0"></script>
    <script>
        function getQueryString(name, url) {
            if (!url) url = window.location.href;
            name = name.replace(/[\[\]]/g, "\\$&");
            var regex = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)"),
                results = regex.exec(url);
            if (!results) return null;
            if (!results[2]) return '';
            return decodeURIComponent(results[2].replace(/\+/g, " "));
        }
        document.addEventListener("DOMContentLoaded", function(e) {
            var uuidVerifyRegExp = /^\{?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}?$/i;
            document.getElementById("uuid").innerText = "UUID: " + uuidVerifyRegExp.exec(getQueryString("uuid"));
        });

        function handleCaptcha(response) {
            var vid = getQueryString("vid"); // getQueryString is implemented below
            var uuid = getQueryString("uuid");
            var name = '_pxCaptcha';
            var cookieValue =  btoa(JSON.stringify({r:response,v:vid,u:uuid}));
            var cookieParts = [name, '=', cookieValue, '; path=/'];
            cookieParts.push('; domain=' + window.location.hostname);
            cookieParts.push('; max-age=10');//expire after 10 seconds
            document.cookie = cookieParts.join('');
            var originalURL = getOriginalUrl("url");
            var originalHost = window.location.host;
            var newHref = window.location.protocol + "//" + originalHost;
            originalURL = originalURL || '/';
            newHref = newHref + originalURL;
            window.location.href = newHref;
        }

        function getOriginalUrl(name) {
            var url = getQueryString(name);
            if (!url) return null;
            var regExMatcher = new RegExp("(([^&#@]*)|&|#|$)");
            var matches = regExMatcher.exec(url);
            if (!matches) return null;
            return matches[0];
        }
    </script></head><body><main class="zsg-layout-content"><div class="error-content-block"><div class="error-text-content"><!-- <h1>Captcha</h1> --><h5>Please verify you're a human to continue.</h5><div id="content" class="captcha-container"><div id="px-captcha" data-callback="handleCaptcha"></div><img src="https://www.zillowstatic.com/static/logos/logo-65x14.png" width="65" alt="Zillow" height="14"></img></div></div></div></main><h4 id="uuid" class="uuid-string zsg-fineprint"></h4></body></html><!-- H:033  T:1ms  S:2605  R:Mon Apr 20 17:12:59 PDT 2020  B:5.0.64443-master.122d0fb~delivery_ready.0d3d7d1b -->
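Based on the markup above, a scraper could check for the captcha page before trying to parse listings. This is only a rough sketch: the marker strings are taken from this one response, and the helper name looks_like_captcha is made up for illustration, so the check may miss other block pages.

```python
# Substrings that appear in the captcha page shown above (assumed to be
# representative; other block pages may look different).
CAPTCHA_MARKERS = ("px-captcha", "captcha.css", "_pxAppId")


def looks_like_captcha(html: str) -> bool:
    """Return True if the response body resembles the captcha page."""
    return any(marker in html for marker in CAPTCHA_MARKERS)
```

Calling looks_like_captcha(response.text) before writing the CSV would at least make the empty output explainable.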

Solution

To bypass the captcha, you can add cookies to your request. The following worked for me in bash, but I have not tried it in Python.

  • Download the site's cookies with the Chrome cookie.txt extension.
  • Run the following:
ZIP=<zip code>
URL="https://www.zillow.com/homes/${ZIP}_rb"
wget -x --load-cookies cookies.txt "$URL"
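I have not verified this against Zillow, but a rough Python equivalent of the wget step might look like the sketch below, using only the standard library. The function names (build_search_url, make_opener, fetch) are hypothetical, and it assumes cookies.txt is in the Netscape format that wget's --load-cookies reads.

```python
import http.cookiejar
import urllib.request


def build_search_url(zip_code: str) -> str:
    # Same URL shape as the bash snippet: .../homes/<zip>_rb
    return f"https://www.zillow.com/homes/{zip_code}_rb"


def make_opener(cookie_file: str) -> urllib.request.OpenerDirector:
    # MozillaCookieJar parses Netscape-format cookies.txt files.
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    # Keep session and expired cookies too, so nothing from the export is dropped.
    jar.load(ignore_discard=True, ignore_expires=True)
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch(zip_code: str, cookie_file: str = "cookies.txt") -> str:
    # Sends the request with the exported cookies attached.
    opener = make_opener(cookie_file)
    with opener.open(build_search_url(zip_code), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Whether the cookies alone are enough to get past the captcha is not something this sketch can guarantee.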

twesleyb avatar Apr 21 '20 00:04 twesleyb

Replace the code at line 79 as follows to bypass the captcha:

#search_results = json_data.get('searchResults').get('listResults', [])
search_results = json_data.get('cat1').get('searchResults').get('listResults', [])
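Since the two lookups above differ only in the extra 'cat1' level, one option is a lookup that tolerates both layouts and a missing key. This is a sketch under that assumption; extract_list_results is a hypothetical helper, not part of the original script.

```python
def extract_list_results(json_data: dict) -> list:
    """Return listResults from either the 'cat1' layout or the older flat layout."""
    # Try the nested 'cat1' layout first, then fall back to the top level.
    for container in (json_data.get("cat1") or {}, json_data):
        search_results = container.get("searchResults") or {}
        if "listResults" in search_results:
            return search_results["listResults"] or []
    # Neither layout matched (for example, a captcha page was returned instead).
    return []
```

Unlike the chained .get() calls, this returns an empty list instead of raising AttributeError when 'cat1' or 'searchResults' is absent.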

Luis1Madrid avatar Jun 24 '21 20:06 Luis1Madrid

The Chrome extension will generate a cookie.txt file that contains the key (name)/value pairs. Turn those pairs into a dict (cookies_dict), then pass cookies=cookies_dict to your requests.get() call. @Luis1Madrid where do you see json_data.get('searchResults')?
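For the cookie step, a minimal sketch of turning the exported file into a dict could look like this. It assumes the extension writes the tab-separated Netscape format (seven fields per line); cookies_txt_to_dict is a hypothetical name.

```python
def cookies_txt_to_dict(text: str) -> dict:
    """Turn Netscape-format cookies.txt content into a {name: value} dict."""
    cookies_dict = {}
    for line in text.splitlines():
        # Skip blank lines and comments (the file starts with a "# ..." header).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Expected fields: domain, subdomain flag, path, secure, expiry, name, value.
        if len(fields) != 7:
            continue
        cookies_dict[fields[5]] = fields[6]
    return cookies_dict
```

The result could then be passed along as requests.get(url, cookies=cookies_dict), reading the file contents with open("cookie.txt").read() first.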

ingrid88 avatar Jul 18 '21 05:07 ingrid88