scrapy-playwright
Page hangs on function instead of redirecting
I am attempting an SSO login to a website (which I have access to) via scrapy-playwright, and find that my Playwright script hangs when I use wait_for_function. This repeatedly produces the same network requests in the reactor, all of which are logged to the console. Eventually, all tasks are left pending -- example output:
....
task: <Task pending name='Task-88505' coro=<_make_request_logger.<locals>._log_request() running at /Users//tealium_playwright/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py:463> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[AsyncIOEventEmitter._emit_run.<locals>.callback() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/pyee/asyncio.py:65, ProtocolCallback.__init__.<locals>.cb() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py:168]>
2022-12-10 21:06:09 [asyncio] ERROR: Task was destroyed but it is pending!
task: <Task pending name='Task-88624' coro=<_make_request_logger.<locals>._log_request() running at /Users//tealium_playwright/venv/lib/python3.10/site-packages/scrapy_playwright/handler.py:463> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[AsyncIOEventEmitter._emit_run.<locals>.callback() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/pyee/asyncio.py:65, ProtocolCallback.__init__.<locals>.cb() at /Users//tealium_playwright/venv/lib/python3.10/site-packages/playwright/_impl/_connection.py:168]>
I have attempted the following script:
import scrapy
from scrapy_playwright.page import PageMethod
from pathlib import Path
from urllib.parse import urlencode


class telSpider(scrapy.Spider):
    name = 'tel'
    start_urls = 'https://my.tealiumiq.com/login/sso/'
    custom_settings = {
        'CONTENT-TYPE': 'application/json',
        'USER-AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
    }

    def start_requests(self):
        yield scrapy.Request(
            self.start_urls,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    # Wait for the login form container to be attached to the DOM.
                    PageMethod('wait_for_selector', selector='.bodyMain', state='attached'),
                    # Fill the email field via the native value setter, submit the form,
                    # and (after 5 s) log whether the Okta redirect has happened.
                    PageMethod('wait_for_function', """(function() {
                        const setValue = Object.getOwnPropertyDescriptor(
                            window.HTMLInputElement.prototype,
                            "value").set;
                        const modifyInput = (name, value) => {
                            const input = document.getElementsByName(name)[0]
                            setValue.call(input, value)
                            input.dispatchEvent(new Event('input', { bubbles: true }))
                        };
                        modifyInput('email', "[email protected]");
                        document.querySelector("#submitBtn").click();
                        setTimeout(() => {
                            if (window.location.href.includes('https://okta.com/login/login.htm')) {
                                console.log(window.location.href);
                            } else {
                                console.log('not yet');
                            }
                        }, 5000)
                    }())""", timeout=0),
                    PageMethod("screenshot", path=Path(__file__).parent / "tealium1.png", full_page=True),
                ],
            ),
            callback=self.parse,
        )

    def parse(self, response):
        print(response)
Email me for a working email to test. However, replacing wait_for_function with evaluate and using the script above, I find that only the first query is executed and the click is never triggered; otherwise, I would see red text under the input indicating that the email is incorrect. Any idea why this might be happening?
P.S. It works absolutely fine in the web browser's console.
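For reference, Playwright's wait_for_function polls the given expression until it returns a truthy value; the IIFE above returns undefined, so the call can never resolve, which would explain why the crawler hangs with timeout=0. A minimal sketch of an alternative structure (assuming the same .bodyMain selector, email field, and #submitBtn button; the URL predicate and the helper name are also assumptions, not part of the original script) is to do the fill-and-click with evaluate and wait for the redirect separately:

import scrapy
from scrapy_playwright.page import PageMethod

# JavaScript to fill the email field via the native value setter and submit.
# Selectors are taken from the script above and are assumptions about the page.
FILL_AND_SUBMIT = """() => {
    const setValue = Object.getOwnPropertyDescriptor(
        window.HTMLInputElement.prototype, "value").set;
    const input = document.getElementsByName("email")[0];
    setValue.call(input, "[email protected]");
    input.dispatchEvent(new Event("input", { bubbles: true }));
    document.querySelector("#submitBtn").click();
}"""


def sso_login_request(url):
    # Hypothetical helper that builds the request with one PageMethod per step.
    return scrapy.Request(
        url,
        meta=dict(
            playwright=True,
            playwright_include_page=True,
            playwright_page_methods=[
                PageMethod("wait_for_selector", selector=".bodyMain", state="attached"),
                # Runs once; evaluate resolves as soon as the function body returns.
                PageMethod("evaluate", FILL_AND_SUBMIT),
                # Polls a predicate that returns a boolean, so wait_for_function can finish.
                PageMethod("wait_for_function",
                           "() => window.location.href.includes('okta.com')",
                           timeout=30_000),
            ],
        ),
    )

Since evaluate returns as soon as its function body finishes, splitting the submit and the redirect check into separate PageMethods also avoids relying on setTimeout inside the page.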
--
I eventually got it working by including multiple wait_for_timeout calls, which worked better than wait_for_function; however, I would be interested to know why the latter keeps the crawler in a loop inside the reactor with unfinished tasks.
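For context, a rough sketch of what that workaround might look like as PageMethods (an assumption about how the fixed waits were inserted, reusing the hypothetical FILL_AND_SUBMIT expression from the sketch above, not the exact script used):

# Assumed PageMethod list with fixed waits between steps; the 5-second pauses
# and the reuse of FILL_AND_SUBMIT are illustrative only.
playwright_page_methods = [
    PageMethod("wait_for_selector", selector=".bodyMain", state="attached"),
    PageMethod("evaluate", FILL_AND_SUBMIT),   # fill the email field and click submit
    PageMethod("wait_for_timeout", 5000),      # give the SSO redirect time to start
    PageMethod("wait_for_timeout", 5000),      # and the next page time to load
    PageMethod("screenshot", path="tealium1.png", full_page=True),
]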
But I then get a page indicating that the CSRF token is invalid, so the cookies were not set up properly. What do you advise? I have also attempted this with scrapy-splash, where it redirects back to the original page (which it is not supposed to). It seems to be a matter of how to properly assign cookies, so your advice would be very helpful!
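On the cookie question, one possible direction (a sketch only, with placeholder spider name, callback names, and follow-up URL): keep playwright_include_page=True, take the page object out of response.meta after the login, and route the follow-up request through that same page via the playwright_page meta key, so the session cookies set during the SSO redirect stay within the same browser context:

import scrapy
from scrapy_playwright.page import PageMethod


class SsoCookieSpider(scrapy.Spider):
    # Hypothetical spider for illustration; reuse the settings from the script above.
    name = "sso_cookies"

    def start_requests(self):
        yield scrapy.Request(
            "https://my.tealiumiq.com/login/sso/",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", selector=".bodyMain", state="attached"),
                    # ... plus the login steps from the script above ...
                ],
            ),
            callback=self.after_login,
        )

    async def after_login(self, response):
        # The page that performed the login; its browser context holds the
        # session cookies set during the SSO redirect.
        page = response.meta["playwright_page"]
        yield scrapy.Request(
            "https://my.tealiumiq.com/",  # placeholder post-login URL
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page=page,  # reuse the logged-in page and its cookies
            ),
            callback=self.parse_landing,
        )

    async def parse_landing(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        self.logger.info("landed on %s", response.url)

By default scrapy-playwright sends all requests through a single persistent browser context, so cookies set there are shared by later Playwright requests; whether that alone resolves the CSRF error here is an assumption to verify.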