ARGUS
ARGUS copied to clipboard
Argus is returning no text for some websites
Hi,
I am using the ARGUS tool in order to generate data for my master’s thesis. For some of the 500 websites I want to scrape, ARGUS is neither returning an error nor returning any text, even though there is text present on the website (e.g. Link, Link).
Is there any workaround to get the text from these missing websites?
Down below you can see:
-
My current settings for the run

-
The first lines of the output

Hi both sites seem to rely quite heavily on java script which may cause the problem. Another issue may be that the text is not enclosed by html tags ARGUS is using to extract text: https://github.com/datawizard1337/ARGUS/blob/4f61679595f305d3587caaedb030a1884c2f422e/build/lib/ARGUS/spiders/dualspider.py#L86-L107
You could try to add the missing tags. After doing that you have to deploy your updated project using the
scrapyd-deploy
command in your command line interface in the ARGUS main directory. See also: https://github.com/scrapy/scrapyd-client#scrapyd-deploy
Hi,
thank you for the quick response, really appreciate that!
I checked the websites again and the text is enclosed by html tags imo, which Argus is using (e.g. Link):

Could it be that the heavy dependence on JavaScript prevents ARGUS from scraping the site? Or is it that JavaScript is disabled, hence nothing is shown for the specified website when scraping?
Thanks a lot again!
Could be because of JavaScript or some kind of delayed loading. Not sure to be honest. How frequent is that issue in your dataset?
For roughly 43% it returns nothing & no error (207/472 websites). Could an explicit wait maybe be implemented somewhere to solve the delayed loading issue?
Yeah, you can change any scrapy related settings in the settings file: https://github.com/datawizard1337/ARGUS/blob/4f61679595f305d3587caaedb030a1884c2f422e/build/lib/ARGUS/settings.py
Check out https://docs.scrapy.org/en/latest/topics/settings.html for more info. And don't forget to deploy your changed project as described above.
I implemented different download delays and deployed the project mentioned above. Unfortunately this didn't help.
Then I checked, if some of the missing sites are loading with JavaScript disabled. None of the checked sites showed any text which this seems to be the issue, why no text is returned as no text is shown, when JavaScript is disabled.
Looking into the scrapy documentation I found the entry for prerendering sites with JavaScript implementation here: https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-javascript-rendering
This recommends using the scrapy-splash module. I am unsure if this will solve the issue and if the implementation is easy and quick within ARGUS. Do you have an opinion on that?
My knowledge about splash is very limited. I think the implementation is not straight forward, especially if you want to combine it with ARGUS. Sorry!