Cat Crawler

A webcrawler I'm writing in Golang that I can use to find and download cat pictures.

Installation

  • Make sure your GOPATH environment variable is set up properly: export GOPATH=$HOME/golib
  • Make sure the bin directory is in your path: PATH=$PATH:$GOPATH/bin
  • Now install the package: go get -v github.com/dmuth/cat-crawler

Running the crawler

cat-crawler [--seed-url url[,url[,url[...]]]] [--num-connections n] [--allow-urls url[,url[,...]]] [--search-string cat] [--stats]
    --seed-url What URL to start at? More than one URL may be 
        specified in comma-delimited format.
    --num-connections How many concurrent connections?
    --search-string A string we want to search for in ALT and TITLE attributes on images
    --allow-urls If specified, only URLs starting with the URLs listed here are crawled
    --stats Print out stats once a second using my stats package
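
The options above could be parsed with Go's standard flag package along these lines. This is only a sketch: the Config struct, the splitList and parseArgs helpers, and the default values are assumptions for illustration, not the crawler's actual code.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// Config holds the parsed options. The struct and its field names are hypothetical.
type Config struct {
	SeedURLs       []string
	NumConnections int
	SearchString   string
	AllowURLs      []string
	Stats          bool
}

// splitList splits a comma-delimited value and drops empty entries.
func splitList(s string) []string {
	var out []string
	for _, part := range strings.Split(s, ",") {
		if part = strings.TrimSpace(part); part != "" {
			out = append(out, part)
		}
	}
	return out
}

// parseArgs parses args (without the program name) into a Config.
// The defaults below are assumptions; the README does not state them.
func parseArgs(args []string) (Config, error) {
	fs := flag.NewFlagSet("cat-crawler", flag.ContinueOnError)
	seed := fs.String("seed-url", "", "comma-delimited URLs to start at")
	conns := fs.Int("num-connections", 1, "number of concurrent connections")
	search := fs.String("search-string", "cat", "string to look for in ALT and TITLE attributes")
	allow := fs.String("allow-urls", "", "comma-delimited URL prefixes to stay within")
	stats := fs.Bool("stats", false, "print stats once a second")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return Config{
		SeedURLs:       splitList(*seed),
		NumConnections: *conns,
		SearchString:   *search,
		AllowURLs:      splitList(*allow),
		Stats:          *stats,
	}, nil
}

func main() {
	cfg, err := parseArgs([]string{"--seed-url", "cnn.com,example.com", "--num-connections", "2"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

The flag package accepts both -name and --name, so the double-dash form shown in the usage line works unchanged.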

Examples

cat-crawler --seed-url cnn.com --num-connections 1

Get top stories. :-)

cat-crawler --seed-url (any URL) --num-connections 1000

This will saturate your download bandwidth. Seriously, don't do it.
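
To show why --num-connections matters, here is a minimal sketch of a bounded worker pool: a fixed number of goroutines pull URLs from a channel, so at most that many fetches are in flight at once. The crawl function and its fetch callback are hypothetical and stand in for the real download code.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl runs fetch over urls using at most numConnections goroutines
// and returns the results in completion order.
func crawl(urls []string, numConnections int, fetch func(string) string) []string {
	if numConnections < 1 {
		numConnections = 1 // always keep one worker so the queue drains
	}
	jobs := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < numConnections; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed the queue, then close it so the workers exit.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	urls := []string{"a", "b", "c", "d"}
	pages := crawl(urls, 2, func(u string) string { return "fetched " + u })
	fmt.Println(len(pages), "pages fetched")
}
```

With 1000 workers instead of 2, every worker would hold an open connection at the same time, which is where the bandwidth goes.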

cat-crawler --seed-url cnn.com  --num-connections 1 --allow-urls cnn.com

Don't leave CNN's website.

cat-crawler --seed-url cnn.com  --num-connections 1 --allow-urls foobar

After crawling the first page, nothing more will happen, because no URL found there starts with "foobar". Oops.
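
The --allow-urls behavior in the last two examples can be sketched as a simple prefix check. The isAllowed and stripScheme helpers below are hypothetical, and stripping the scheme before comparing is an assumption made so that "cnn.com" matches "http://cnn.com/..."; the crawler's real filter may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// stripScheme removes a leading http:// or https:// (an assumption, so that
// "cnn.com" and "http://cnn.com" compare equal).
func stripScheme(u string) string {
	u = strings.TrimPrefix(u, "https://")
	u = strings.TrimPrefix(u, "http://")
	return u
}

// isAllowed reports whether url starts with one of the allowed prefixes.
// An empty allow list means every URL is allowed.
func isAllowed(url string, allowed []string) bool {
	if len(allowed) == 0 {
		return true
	}
	u := stripScheme(url)
	for _, prefix := range allowed {
		if strings.HasPrefix(u, stripScheme(prefix)) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isAllowed("http://cnn.com/world", []string{"cnn.com"}))
	fmt.Println(isAllowed("http://cnn.com/world", []string{"foobar"}))
}
```

Note that a plain prefix check treats "www.cnn.com" and "cnn.com" as different sites, so an allow list usually needs every hostname variant spelled out.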

Sequence diagram

[Image: sequence diagram]

Development

go get -v github.com/dmuth/cat-crawler && cat-crawler [options]

Running the tests

go get -v -a github.com/dmuth/procedural-webserver # Dependency
go test -v github.com/dmuth/cat-crawler

You should see results like this:

=== RUN TestSplitHostnames
--- PASS: TestSplitHostnames (0.00 seconds)
=== RUN TestHtmlNew
--- PASS: TestHtmlNew (0.00 seconds)
=== RUN TestHtmlBadImg
--- PASS: TestHtmlBadImg (0.00 seconds)
=== RUN TestHtmlLinksAndImages
--- PASS: TestHtmlLinksAndImages (0.00 seconds)
=== RUN TestHtmlNoLinks
--- PASS: TestHtmlNoLinks (0.00 seconds)
=== RUN TestHtmlNoImages
--- PASS: TestHtmlNoImages (0.00 seconds)
=== RUN TestHtmlNoLinksNorImages
--- PASS: TestHtmlNoLinksNorImages (0.00 seconds)
=== RUN TestHtmlPortNumberInBaseUrl
--- PASS: TestHtmlPortNumberInBaseUrl (0.00 seconds)
=== RUN TestGetFilenameFromUrl
--- PASS: TestGetFilenameFromUrl (0.00 seconds)
=== RUN Test
--- PASS: Test (0.00 seconds)
=== RUN TestFilterUrl
--- PASS: TestFilterUrl (0.00 seconds)
=== RUN TestIsUrlAllowed
--- PASS: TestIsUrlAllowed (0.00 seconds)
PASS
ok      github.com/dmuth/cat-crawler    0.037s

Dependencies

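This repo uses other packages I wrote:

  • github.com/dmuth/procedural-webserver (needed to run the tests)
  • The stats package used by --stats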

Bugs

  • I am not accessing the maps in a goroutine-safe way.
    • Possible fix: move the maps into a separate source file, with a single goroutine that services requests through a channel
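
The fix proposed above could look roughly like this: one goroutine owns the map, and everyone else talks to it over a channel, so no two goroutines ever touch the map at once. The request type and the startSeenServer and markSeen helpers are hypothetical names used only for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// request asks the owning goroutine whether url was seen before,
// and marks it as seen.
type request struct {
	url   string
	reply chan bool
}

// startSeenServer starts the single goroutine that owns the map and
// returns the channel used to send it requests.
func startSeenServer() chan<- request {
	reqs := make(chan request)
	go func() {
		seen := make(map[string]bool) // only this goroutine touches the map
		for r := range reqs {
			r.reply <- seen[r.url]
			seen[r.url] = true
		}
	}()
	return reqs
}

// markSeen marks url as seen and reports whether it had been seen already.
func markSeen(reqs chan<- request, url string) bool {
	reply := make(chan bool)
	reqs <- request{url: url, reply: reply}
	return <-reply
}

func main() {
	reqs := startSeenServer()

	// Ten goroutines race to claim the same URL; only one gets "not seen".
	firsts := make(chan bool, 10)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			firsts <- !markSeen(reqs, "http://example.com/cat.jpg")
		}()
	}
	wg.Wait()
	close(firsts)

	winners := 0
	for first := range firsts {
		if first {
			winners++
		}
	}
	fmt.Println("goroutines that saw the URL first:", winners)
}
```

A sync.Mutex around the map would also work; the channel version just matches the approach described in the bullet.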

TODO

  • Rate limiting by domain in URL crawler
    • I could have a map of key=domain, value=count and a goroutine that decrements count regularly
      • Could get a bit crazy on the memory, though!
  • Write instrumentation to detect how many goroutines are active/idle
    • GoStatStart(key)
    • GoStatStop(key)
    • go GoStatDump(interval)
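
The rate-limiting idea from the TODO list can be sketched as a per-domain counter plus a goroutine that decrements the counts on a timer. Everything here (domainLimiter, allow, decay, run, and the limit of 2) is an assumption for illustration, not code from the crawler. Deleting entries once they reach zero is one way to keep the memory concern in check.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// domainLimiter counts recent requests per domain (hypothetical type).
type domainLimiter struct {
	mu     sync.Mutex
	counts map[string]int
	max    int
}

func newDomainLimiter(max int) *domainLimiter {
	return &domainLimiter{counts: make(map[string]int), max: max}
}

// allow reports whether domain is under its limit and, if so, counts the request.
func (l *domainLimiter) allow(domain string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[domain] >= l.max {
		return false
	}
	l.counts[domain]++
	return true
}

// decay decrements every count by one and drops domains that reach zero,
// so idle domains do not pile up in memory.
func (l *domainLimiter) decay() {
	l.mu.Lock()
	defer l.mu.Unlock()
	for d, n := range l.counts {
		if n <= 1 {
			delete(l.counts, d)
		} else {
			l.counts[d] = n - 1
		}
	}
}

// run calls decay once per interval until stop is closed.
func (l *domainLimiter) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			l.decay()
		case <-stop:
			return
		}
	}
}

func main() {
	limiter := newDomainLimiter(2)
	stop := make(chan struct{})
	go limiter.run(10*time.Millisecond, stop)

	for i := 0; i < 4; i++ {
		fmt.Println("request", i, "allowed:", limiter.allow("cnn.com"))
	}
	time.Sleep(50 * time.Millisecond) // give the decay goroutine time to run
	fmt.Println("after waiting, allowed:", limiter.allow("cnn.com"))
	close(stop)
}
```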
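
The instrumentation item in the TODO list names three functions: GoStatStart, GoStatStop, and GoStatDump. Since they are not written yet, the bodies below are only a guess at how they might work: a mutex-protected counter of active goroutines per key, dumped on a timer. The goStatSnapshot helper is an extra, hypothetical addition.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

var (
	statMu sync.Mutex
	active = map[string]int{} // key -> number of goroutines currently active
)

// GoStatStart records that a goroutine identified by key became active.
func GoStatStart(key string) {
	statMu.Lock()
	active[key]++
	statMu.Unlock()
}

// GoStatStop records that a goroutine identified by key went idle or exited.
func GoStatStop(key string) {
	statMu.Lock()
	active[key]--
	statMu.Unlock()
}

// goStatSnapshot returns a copy of the counters (hypothetical helper).
func goStatSnapshot() map[string]int {
	statMu.Lock()
	defer statMu.Unlock()
	snap := make(map[string]int, len(active))
	for k, v := range active {
		snap[k] = v
	}
	return snap
}

// GoStatDump prints the counters once per interval, forever.
// It is meant to be started with "go GoStatDump(interval)".
func GoStatDump(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		snap := goStatSnapshot()
		keys := make([]string, 0, len(snap))
		for k := range snap {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		for _, k := range keys {
			fmt.Printf("%s: %d active\n", k, snap[k])
		}
	}
}

func main() {
	go GoStatDump(10 * time.Millisecond)

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			GoStatStart("fetch")
			defer GoStatStop("fetch")
			time.Sleep(25 * time.Millisecond) // stand-in for real work
		}()
	}
	wg.Wait()
	fmt.Println("final counters:", goStatSnapshot())
}
```

This tracks only the active count per key; an idle count would follow from subtracting it from the number of workers started.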

Contact

Questions? Complaints? Here's my contact info: http://www.dmuth.org/contact