Argus is returning no text for some websites

Open DJW-TU opened this issue 3 years ago • 7 comments

Hi,

I am using the ARGUS tool in order to generate data for my master’s thesis. For some of the 500 websites I want to scrape, ARGUS is neither returning an error nor returning any text, even though there is text present on the website (e.g. Link, Link).

Is there any workaround to get the text from these missing websites?

Down below you can see:

  • My current settings for the run

  • The first lines of the output

DJW-TU avatar Jun 27 '22 21:06 DJW-TU

Hi, both sites seem to rely quite heavily on JavaScript, which may cause the problem. Another possibility is that the text is not enclosed by the HTML tags ARGUS uses to extract text: https://github.com/datawizard1337/ARGUS/blob/4f61679595f305d3587caaedb030a1884c2f422e/build/lib/ARGUS/spiders/dualspider.py#L86-L107

You could try to add the missing tags. After doing that you have to deploy your updated project using the scrapyd-deploy command in your command line interface in the ARGUS main directory. See also: https://github.com/scrapy/scrapyd-client#scrapyd-deploy
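To illustrate why a missing tag matters, here is a minimal sketch of tag-based text extraction using only the Python standard library. It is not the ARGUS code from dualspider.py: the tag list `EXTRACT_TAGS`, the class `TagTextExtractor`, and the function `extract_text` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Hypothetical tag list for illustration only; not copied from ARGUS.
EXTRACT_TAGS = {"p", "h1", "h2", "h3", "li", "span"}


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the given tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0   # number of listed tags currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Text outside every listed tag is silently dropped.
        if text and self.depth > 0:
            self.texts.append(text)


def extract_text(html, tags=EXTRACT_TAGS):
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return parser.texts


sample = "<div><p>Kept paragraph</p><article>Missed text</article></div>"
print(extract_text(sample))                            # only the <p> text
print(extract_text(sample, EXTRACT_TAGS | {"article"}))  # both texts
```

In this sketch the text inside `<article>` only shows up once `article` is added to the tag list, which is the kind of change suggested above. The sketch assumes well-formed HTML with closed tags.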

datawizard1337 avatar Jun 28 '22 07:06 datawizard1337

Hi,

thank you for the quick response, really appreciate that!

I checked the websites again, and as far as I can tell the text is enclosed by HTML tags that ARGUS is using (e.g. Link):

image

Could it be that the heavy dependence on JavaScript prevents ARGUS from scraping the site? Or is it that JavaScript is disabled, hence nothing is shown for the specified website when scraping?

Thanks a lot again!

DJW-TU avatar Jun 28 '22 10:06 DJW-TU

Could be because of JavaScript or some kind of delayed loading. Not sure to be honest. How frequent is that issue in your dataset?

datawizard1337 avatar Jun 28 '22 14:06 datawizard1337

For roughly 44% of the websites (207/472) it returns neither text nor an error. Could an explicit wait be implemented somewhere to solve the delayed loading issue?

DJW-TU avatar Jun 29 '22 11:06 DJW-TU

Yeah, you can change any scrapy related settings in the settings file: https://github.com/datawizard1337/ARGUS/blob/4f61679595f305d3587caaedb030a1884c2f422e/build/lib/ARGUS/settings.py

Check out https://docs.scrapy.org/en/latest/topics/settings.html for more info. And don't forget to deploy your changed project as described above.
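As a rough sketch, delay-related entries in settings.py could look like the fragment below. The setting names are standard Scrapy settings; the values are illustrative assumptions, not ARGUS defaults. Note that these settings only space out and time-limit requests. They do not make Scrapy wait for JavaScript to run on a page.

```python
# settings.py (fragment): standard Scrapy settings, illustrative values only

DOWNLOAD_DELAY = 2               # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
DOWNLOAD_TIMEOUT = 60            # seconds before a download is abandoned

AUTOTHROTTLE_ENABLED = True      # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1     # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30      # upper bound for the adaptive delay in seconds
```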

datawizard1337 avatar Jun 29 '22 12:06 datawizard1337

I tried different download delays and deployed the project as described above. Unfortunately, this didn't help.

Then I checked whether some of the missing sites load with JavaScript disabled. None of the checked sites showed any text, so this seems to be the issue: no text is returned because no text is shown when JavaScript is disabled.
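The manual check above could be approximated in code with a sketch like the following, which counts visible text in the raw HTML as it arrives without a browser. The names `VisibleTextCounter` and `looks_js_rendered` and the 50-character threshold are assumptions made for this example, and the heuristic is crude.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count characters of text outside script/style-like elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an element listed in SKIP
        self.chars = 0       # visible text characters seen so far
        self.scripts = 0     # number of <script> tags seen

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chars += len(data.strip())


def looks_js_rendered(html, min_chars=50):
    """Heuristic: scripts are present but almost no text is in the static HTML."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.scripts > 0 and counter.chars < min_chars


static_page = "<html><body><p>" + "Hello world. " * 10 + "</p></body></html>"
js_page = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(looks_js_rendered(static_page))
print(looks_js_rendered(js_page))
```

Running this over the raw responses of the sites that returned nothing would give a rough count of how many depend on JavaScript for their text.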

Looking into the Scrapy documentation, I found the entry on pre-rendering JavaScript here: https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-javascript-rendering

This recommends using the scrapy-splash module. I am unsure if this will solve the issue and if the implementation is easy and quick within ARGUS. Do you have an opinion on that?
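For reference, the settings that the scrapy-splash README describes look roughly like the fragment below. This is a sketch only: it assumes a Splash instance is reachable at `http://localhost:8050` (an assumed address), and it has not been tested with ARGUS, whose own settings.py may already define some of these keys and would need merging.

```python
# settings.py (fragment): scrapy-splash configuration as described in its README.
# Assumption: a Splash server is running at the address below.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"

# In a spider, requests would then be issued through Splash, for example:
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, self.parse, args={"wait": 0.5})
```

The spider would also have to issue `SplashRequest` objects instead of plain requests, so the change is not limited to the settings file.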

DJW-TU avatar Jun 29 '22 14:06 DJW-TU

My knowledge of Splash is very limited. I think the implementation is not straightforward, especially if you want to combine it with ARGUS. Sorry!

datawizard1337 avatar Jun 30 '22 08:06 datawizard1337