
Page hangs on function instead of redirecting

lime-n opened this issue 2 years ago • 0 comments

I am attempting an SSO login to a website (I have access to it) via scrapy-playwright, and my Playwright script hangs when I use wait_for_function: the same network requests are issued over and over inside the reactor, and each one is logged to the console. Eventually all tasks are left pending. Example output:

....

task: <Task pending name='Task-88505' coro=<_make_request_logger.<locals>._log_request() running at /Users//tealium_playwright/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py:463> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[AsyncIOEventEmitter._emit_run.<locals>.callback() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/pyee/asyncio.py:65, ProtocolCallback.__init__.<locals>.cb() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py:168]>
2022-12-10 21:06:09 [asyncio] ERROR: Task was destroyed but it is pending!
task: <Task pending name='Task-88624' coro=<_make_request_logger.<locals>._log_request() running at /Users//tealium_playwright/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py:463> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[AsyncIOEventEmitter._emit_run.<locals>.callback() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/pyee/asyncio.py:65, ProtocolCallback.__init__.<locals>.cb() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py:168]>

I have attempted the following script:

import scrapy
from scrapy_playwright.page import PageMethod
from pathlib import Path

class telSpider(scrapy.Spider):
    name = 'tel'
    start_urls = 'https://my.tealiumiq.com/login/sso/'

    custom_settings = {
        # hyphenated keys like 'CONTENT-TYPE' are not recognised Scrapy
        # settings and have no effect; use the real setting names instead
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
        'DEFAULT_REQUEST_HEADERS': {'Content-Type': 'application/json'},
    }

    def start_requests(self):
        yield scrapy.Request(
            self.start_urls,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', '.bodyMain', state='attached'),
                    # timeout=0 disables the timeout, so this waits forever
                    # if the page function never returns a truthy value
                    PageMethod('wait_for_function', """(function() {
                        const setValue = Object.getOwnPropertyDescriptor(
                            window.HTMLInputElement.prototype, "value").set;
                        const modifyInput = (name, value) => {
                            const input = document.getElementsByName(name)[0];
                            setValue.call(input, value);
                            input.dispatchEvent(new Event('input', { bubbles: true }));
                        };
                        modifyInput('email', "[email protected]");
                        document.querySelector("#submitBtn").click();
                        setTimeout(() => {
                            if (window.location.href.includes('https://okta.com/login/login.htm')) {
                                console.log(window.location.href);
                            } else {
                                console.log('not yet');
                            }
                        }, 5000);
                    }())""", timeout=0),
                    PageMethod("screenshot", path=Path(__file__).parent / "tealium1.png", full_page=True),
                ],
            ),
            callback=self.parse,
        )
 
    def parse(self, response):
        print(response)
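
Looking at the Playwright docs, wait_for_function resolves only once the page function returns a truthy value. My IIFE above returns undefined, so with timeout=0 I suspect the handler simply polls forever, which would explain the pending _log_request tasks. If that is right, a predicate along these lines should resolve (a sketch; the URL fragment is illustrative):

PageMethod(
    'wait_for_function',
    # must return a truthy value for the wait to resolve
    "() => window.location.href.includes('okta.com')",
    timeout=30000,  # fail after 30s instead of polling indefinitely
)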

Email me for a working email address to test with. However, if I replace wait_for_function with evaluate and use the same function, only the first step runs and the click is never triggered; if it were, I would see red text under the input saying the email is incorrect. Any idea why this might be happening?

P.S. The same function works absolutely fine in the browser console.
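
For what it's worth, I also considered splitting the single evaluate blob into separate page methods, so each step gets its own wait (an untested sketch; the email is the masked address from above and the wait_for_url pattern is a guess):

playwright_page_methods = [
    PageMethod('wait_for_selector', '.bodyMain', state='attached'),
    # fill() dispatches real input events, unlike a bare value assignment
    PageMethod('fill', "input[name='email']", '[email protected]'),
    PageMethod('click', '#submitBtn'),
    # wait for the SSO redirect rather than sleeping; pattern is a guess
    PageMethod('wait_for_url', '**/login/login.htm*'),
]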

Update: I eventually got it working by inserting multiple wait_for_timeout calls, which worked better than wait_for_function. I would still be interested to know why the latter keeps the crawler looping inside the reactor with unfinished tasks.
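
For reference, the working version looked roughly like this (a sketch; MODIFY_INPUT_JS stands for the email-filling function above, and the pause lengths are arbitrary):

playwright_page_methods = [
    PageMethod('wait_for_selector', '.bodyMain', state='attached'),
    PageMethod('evaluate', MODIFY_INPUT_JS),  # fill the email field
    PageMethod('wait_for_timeout', 1000),     # let the input event settle
    PageMethod('click', '#submitBtn'),
    PageMethod('wait_for_timeout', 5000),     # wait out the SSO redirect
    PageMethod('screenshot', path='tealium1.png', full_page=True),
]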

However, I then get the following page, indicating the CSRF token is invalid, i.e. the cookies were not set up properly. What do you advise? I have also attempted this with scrapy-splash, which redirects back to the original page (it is not supposed to). It seems to come down to assigning the cookies correctly, so your advice would be very helpful. (screenshot: tealium3)
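
Since playwright_include_page=True hands the callback the live page, this is how I would try to carry the browser's cookies back into Scrapy (a sketch; after_login and the follow-up URL are hypothetical):

async def parse(self, response):
    page = response.meta['playwright_page']
    # Scrapy's cookiejar never sees cookies set inside the browser
    # context, so read them from Playwright directly
    cookies = await page.context.cookies()
    await page.close()
    yield scrapy.Request(
        'https://my.tealiumiq.com/',  # follow-up URL, illustrative
        cookies={c['name']: c['value'] for c in cookies},
        callback=self.after_login,  # hypothetical next callback
    )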

lime-n · Dec 10 '22 21:12