spider
Spider is a Web spidering library for Ruby. It handles the robots.txt, scraping, collecting, and looping so that you can just handle the data.
Would it be possible to make Spider aware of a link's `rel` attribute, such as `nofollow`?
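To make the request concrete, here is a minimal sketch of the check itself. `nofollow?` is a hypothetical helper, not part of Spider's API, and it assumes the `rel` value has already been extracted from the link. Since `rel` holds space-separated, case-insensitive tokens, the check splits the value rather than matching a substring.

```ruby
# Hypothetical helper (not part of Spider): returns true when a link's
# rel attribute value contains the "nofollow" token.
def nofollow?(rel)
  # rel is a space-separated, case-insensitive token list; nil means no attribute.
  rel.to_s.downcase.split.include?('nofollow')
end

# Example: keep only the links a polite crawler would follow.
links = [
  { href: 'https://example.com/a', rel: nil },
  { href: 'https://example.com/b', rel: 'noopener NoFollow' }
]
followable = links.reject { |link| nofollow?(link[:rel]) }
```

A crawler could apply such a check before queueing a URL; where that hook would live inside Spider is left open here.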
The function `generate_next_urls` scans every page, effectively downloading and loading *every* page into memory. This may not be a problem for small files, but it's completely inefficient, and makes...
I wrote a crawler, and it scans the same URL multiple times. Here is an example:
~~~
- https://www.jared.com/diamond-engagement-ring-78-carat-tw-roundcut-18k-white-gold/p/#
- https://www.jared.com/diamond-engagement-ring-78-carat-tw-roundcut-18k-white-gold/p/#skiptonavigation
- https://www.jared.com/diamond-engagement-ring-78-carat-tw-roundcut-18k-white-gold/p/#skip-to-content
~~~
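The three URLs above differ only in their fragment (the part after `#`), which is not sent to the server, so they all fetch the same page. One possible workaround, sketched with Ruby's standard `uri` library, is to strip the fragment before deciding whether a URL has been seen. `strip_fragment` is a hypothetical helper, not something Spider provides.

```ruby
require 'uri'

# Hypothetical helper (not part of Spider): drop the "#fragment" part so
# that URLs pointing at the same page compare equal.
def strip_fragment(url)
  uri = URI.parse(url)
  uri.fragment = nil
  uri.to_s
end

urls = [
  'https://www.jared.com/diamond-engagement-ring-78-carat-tw-roundcut-18k-white-gold/p/#',
  'https://www.jared.com/diamond-engagement-ring-78-carat-tw-roundcut-18k-white-gold/p/#skiptonavigation',
  'https://www.jared.com/diamond-engagement-ring-78-carat-tw-roundcut-18k-white-gold/p/#skip-to-content'
]

# Normalize first, then de-duplicate.
unique_urls = urls.map { |url| strip_fragment(url) }.uniq
```

With this normalization the three entries collapse into a single URL, so the page would be queued once.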
Right now, it only parses HTML to get the URLs, and while I have written code that parses JS (both inside an HTML file and in asset files), and gets all...