
How to implement Scrapy Splash in Virtual Machine

Open lime-n opened this issue 2 years ago • 1 comments

How do I run scrapy-splash on a Linux virtual machine? Essentially, I have a Lua script that needs to send keys to a site to log in and then scrape it.

I have installed Docker; however, I cannot get the scraper to work because it won't connect to the server.

Are there any simple steps that I can follow to get this to work on a VM? For example, what should I install, and what should I do before running scrapy crawl spider?

As for docker, I have implemented the following whilst in admin mode:

docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600

However, this runs in the foreground and I'd like it to run in the background. I cannot seem to figure this out; I have tried:

docker run -d 8050:8050 scrapinghub/splash --max-timeout 3600

But I just get the error:

Unable to find image '8050:8050' locally

Running it in the background may solve my issue, or I may need some further installations. Please let me know! I really need expert guidance to figure this out.

I have opened another instance whilst docker was running on the first instance.

I get the following error when running the scrapy crawler:

2022-02-16 02:55:26 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'type': 'ScriptError', 'description': 'Error happened while executing Lua script', 'info': {'type': 'JS_ERROR', 'js_error_type': 'TypeError', 'js_error_message': 'null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'js_error': 'TypeError: null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'message': '[string "..."]:12: error during JS function call: \'TypeError: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\'', 'source': '[string "..."]', 'line_number': 12, 'error': 'error during JS function call: \'TypeError: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\''}}
2022-02-16 02:55:26 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://instagram.com/ via http://localhost:8050/execute> (referer: None)
2022-02-16 02:55:26 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://instagram.com/>: HTTP status code is not handled or not allowed
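
Note on reading this log: the request went "via http://localhost:8050/execute" and came back with a 400 ScriptError, which suggests Splash itself was reached and the failure happened inside the Lua script (line 12), where document.querySelector("button:nth-child(2)") returned null in the page Splash rendered. A minimal sketch for condensing such a payload into one line is below; the helper name summarize_splash_error is hypothetical, and the payload is a trimmed copy of the fields shown in the log above.

```python
def summarize_splash_error(payload):
    """Condense a Splash ScriptError payload (as logged above) into one line.

    Hypothetical helper for reading logs; it is not part of scrapy-splash.
    """
    info = payload.get("info", {})
    # Prefer the JavaScript message; fall back to the generic error text.
    message = info.get("js_error_message", info.get("error"))
    return f"{payload.get('type')} at Lua line {info.get('line_number')}: {message}"


# Trimmed copy of the payload from the log above.
payload = {
    "error": 400,
    "type": "ScriptError",
    "description": "Error happened while executing Lua script",
    "info": {
        "type": "JS_ERROR",
        "js_error_type": "TypeError",
        "js_error_message": 'null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')',
        "line_number": 12,
    },
}

print(summarize_splash_error(payload))
```

If the summary points at a selector, as it does here, the element was not present when the script looked for it, so the Lua script and the rendered page are the first things to inspect rather than the Docker setup.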

The scraper works perfectly fine on my mac so there's definitely an installation that I am missing somewhere.

lime-n avatar Feb 16 '22 02:02 lime-n

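To run the container in the background, use the -d flag, which stands for detached mode. It has to be added alongside -p, not in place of it: without -p, Docker treats 8050:8050 as the image name, which is what produces the "Unable to find image '8050:8050' locally" error. Here's the updated command: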

docker run -d -p 8050:8050 scrapinghub/splash --max-timeout 3600
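
Once the container is detached there is no console output to watch, so it can help to confirm from the VM that Splash is answering before running the crawl. The following is a minimal sketch, assuming Splash is published on localhost:8050 as in the command above and that its /_ping endpoint replies with a JSON body containing "status": "ok"; the function name splash_is_up and the injectable opener parameter are illustrative choices, not part of scrapy-splash.

```python
import json
import urllib.error
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=5.0,
                 opener=urllib.request.urlopen):
    """Return True if Splash answers GET /_ping with status 'ok'.

    Assumption: base_url matches the port published with `-p 8050:8050`.
    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(f"{base_url}/_ping", timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as down.
        return False
    return isinstance(body, dict) and body.get("status") == "ok"


# Usage (run on the VM after starting the container):
#   print(splash_is_up())
```

If this returns False, the container or the port mapping is the problem; if it returns True and the crawl still fails with a 400, the Lua script is the more likely culprit.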

Ehsan-U avatar Mar 09 '23 04:03 Ehsan-U