scrapy-splash
How to implement Scrapy Splash in a Virtual Machine
How do I run Scrapy Splash on a Linux virtual machine? Essentially, I have a Lua script that sends keys to a site to log in and then scrapes it.
I have installed Docker; however, I cannot get the scraper to work, as it won't connect to the server.
Are there any simple steps I can follow to get this working on a VM? What should I install, and what should I do before running scrapy crawl spider?
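For reference, besides pip-installing scrapy and scrapy-splash and running the Splash Docker container, the scrapy-splash README asks for a few project settings. A sketch of those settings.py additions is below; the SPLASH_URL assumes Splash is published on localhost:8050, as in the docker command further down.

```python
# Sketch of the settings.py additions from the scrapy-splash README.
# SPLASH_URL assumes the container maps Splash to localhost:8050.
SPLASH_URL = 'http://localhost:8050'

# Enable the Splash downloader middlewares (order values from the README).
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicate Splash arguments across requests.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make the dupefilter and HTTP cache aware of Splash requests.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

If the spider worked on the Mac, these are most likely already in place; the point is that nothing extra beyond Docker, Splash, and these settings is required on the VM.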
As for Docker, I have run the following as an administrator:
docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
However, this runs in the foreground, and I'd like it to run in the background. I cannot figure this out; I have tried:
docker run -d 8050:8050 scrapinghub/splash --max-timeout 3600
But I just get the error:
Unable to find image '8050:8050' locally
I believe fixing this may solve my issue, or perhaps I need some further installations. Please let me know; I really need expert guidance to figure this out.
I have opened another terminal instance while Docker was running in the first.
I get the following error when running the scrapy crawler:
2022-02-16 02:55:26 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'type': 'ScriptError', 'description': 'Error happened while executing Lua script', 'info':
{'type': 'JS_ERROR', 'js_error_type': 'TypeError', 'js_error_message': 'null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'js_error':
'TypeError: null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'message': '[string "..."]:12: error during JS function call: \'TypeEr
ror: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\'', 'source': '[string "..."]', 'line_number': 12, 'error': 'error during JS
function call: \'TypeError: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\''}}
2022-02-16 02:55:26 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://instagram.com/ via http://localhost:8050/execute> (referer: None)
2022-02-16 02:55:26 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://instagram.com/>: HTTP status code is not handled or not allowed
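The TypeError in this log means document.querySelector("button:nth-child(2)") returned null at the moment the Lua script touched it: the button had not rendered yet, or the site served the VM a different page than it serves the Mac. The usual remedy is to wait and re-check before using the element. As a rough Python sketch of that poll-and-retry idea (wait_for_element, query, and the timings are hypothetical stand-ins; the real Lua script would call splash:wait between attempts):

```python
import time

def wait_for_element(query, attempts=5, delay=0.5):
    """Poll query() until it returns a non-None element or attempts run out.

    query is a hypothetical stand-in for evaluating
    document.querySelector(...) against the rendered page.
    """
    for _ in range(attempts):
        el = query()
        if el is not None:
            return el
        time.sleep(delay)  # in a Splash script this would be splash:wait(delay)
    return None
```

Guarding the selector this way turns "element not there yet" into a retry instead of a TypeError.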
The scraper works perfectly fine on my Mac, so there's definitely an installation step I am missing somewhere.
To run the container in the background, use the -d flag (detached mode) in addition to -p, not instead of it. In your second command you dropped -p, so Docker parsed 8050:8050 as an image name, which is why it reported "Unable to find image '8050:8050' locally". Here's the updated command:
docker run -d -p 8050:8050 scrapinghub/splash --max-timeout 3600
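Once the container is detached, you can confirm it is up with docker ps, or hit Splash's /_ping health endpoint, which returns {"status": "ok"} when the service is healthy. A small stdlib-only check, assuming the default localhost:8050 mapping from the command above:

```python
import json
import urllib.error
import urllib.request

def splash_is_up(url="http://localhost:8050/_ping", timeout=3):
    """Return True if Splash answers its /_ping health-check endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (urllib.error.URLError, OSError):
        return False
```

If this returns False from inside the VM while the container is running, the problem is the network path (port mapping, firewall, or using localhost from the wrong host) rather than the spider itself.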