ragno
Common Lisp Web crawling library based on Psychiq.
Warning: This software is still ALPHA quality. The APIs are likely to change.
Usage
1. Writing a crawler class
;; A part of 'my-crawlers' system.
(defclass rakugo-kyokai (ragno:crawler) ()
  (:default-initargs
   :request-delay 5
   :user-agent "Rakugo-Kyokai-Crawler"))

(defmethod ragno:parse ((crawler rakugo-kyokai) response)
  (let* ((uri (ragno:response-uri response))
         (path (quri:uri-path uri)))
    (cond
      ((string= "/broadcast/" path)
       ;; Listing page: enqueue every linked page as a new job.
       (lquery:$ (lquery:initialize (ragno:response-body response))
         ".main .contents ul li a"
         (attr "href")
         (map (lambda (href)
                ;; Resolve the href against the current page's URI.
                (psy:enqueue 'rakugo-kyokai
                             (list (quri:render-uri (quri:merge-uris (quri:uri href) uri))))))))
      ((string= "/jyoseki/index.php" path)
       (parse-jyoseki (ragno:response-body response)))
      (t ;; Unknown page
       nil))))

(defun parse-jyoseki (html)
  ;; Print each heading followed by the names listed in its time table.
  (lquery:$ (lquery:initialize html)
    ".main .contents .member-detail"
    (combine "h2" ".time-table .yose")
    (map-apply (lambda (h2 table)
                 (format t "~A~%~{- ~A~%~}~%"
                         (lquery:$1 h2 (text))
                         (coerce (lquery:$ table
                                   "tr td.name a" (text)) 'list))))))
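
The link-following branch above resolves each href against the URI of the page it was found on before enqueueing it. A minimal sketch of that step in isolation may make it easier to see. It assumes Quicklisp is available to load QURI, and RESOLVE-HREF is a hypothetical helper name, not part of Ragno. Evaluate it at the REPL or with LOAD so that QURI is loaded before the later forms are read.

```lisp
;; Assumption: Quicklisp is installed. QURI is the URI library the crawler uses.
(ql:quickload :quri :silent t)

(defun resolve-href (href base)
  "Return HREF as an absolute URL string, resolved against the URI object BASE.
RESOLVE-HREF is a hypothetical helper for illustration, not a Ragno function."
  (quri:render-uri (quri:merge-uris (quri:uri href) base)))

;; Resolve a site-relative path and an already absolute URL against the same base.
(let ((base (quri:uri "http://rakugo-kyokai.jp/broadcast/")))
  (dolist (href '("/jyoseki/index.php"
                  "http://rakugo-kyokai.jp/broadcast/"))
    (format t "~A => ~A~%" href (resolve-href href base))))
```

Rendering the merged URI back to a string keeps the job arguments plain strings, which is what the enqueue call in the crawler passes along.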
2. Enqueue a job
(psy:enqueue 'rakugo-kyokai (list "http://rakugo-kyokai.jp/broadcast/"))
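
Enqueueing from a fresh Lisp image probably needs a Redis connection first. The sketch below assumes Psychiq's PSY:CONNECT-TOPLEVEL and PSY:DISCONNECT-TOPLEVEL, as described in Psychiq's own documentation, and a Redis server on localhost:6379. The *SEED-URLS* variable and its second entry are hypothetical.

```lisp
;; Sketch only: requires the my-crawlers system to be loaded and Redis running.
;; PSY:CONNECT-TOPLEVEL and PSY:DISCONNECT-TOPLEVEL are assumed from Psychiq's docs.
(defparameter *seed-urls*
  '("http://rakugo-kyokai.jp/broadcast/"
    "http://rakugo-kyokai.jp/jyoseki/index.php")
  "Hypothetical list of start pages for the rakugo-kyokai crawler.")

(psy:connect-toplevel :host "localhost" :port 6379)
(unwind-protect
     (dolist (url *seed-urls*)
       ;; Each job carries a one-element argument list holding the URL.
       (psy:enqueue 'rakugo-kyokai (list url)))
  ;; Close the connection even if an enqueue call signals an error.
  (psy:disconnect-toplevel))
```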
3. Start a worker process
$ psychiq --host localhost --port 6379 --system my-crawlers
Worker options
- :request-delay: Interval in seconds between processing each job (Default: 0)
- :user-agent: User-Agent header string sent when accessing web pages (Default: "Ragno-Crawler")
- :max-redirects: Maximum number of redirects to follow when requesting a web page (Default: 5)
- :concurrency-per-domain: Concurrency limit per URL domain (Default: 1)
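
As a sketch of how these options might be combined, the class below sets all four through :default-initargs, the same way the usage example sets :request-delay and :user-agent. It assumes :max-redirects and :concurrency-per-domain are accepted as initargs in the same manner. POLITE-CRAWLER and its values are hypothetical.

```lisp
;; POLITE-CRAWLER is a hypothetical class; the initargs are the four options listed above.
(defclass polite-crawler (ragno:crawler) ()
  (:default-initargs
   :request-delay 5               ; wait 5 seconds between jobs
   :user-agent "Polite-Crawler"   ; sent as the User-Agent header
   :max-redirects 3               ; follow at most 3 redirects per request
   :concurrency-per-domain 1))    ; one job at a time per domain
```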
See Also
Author
- Eitaro Fukamachi
Copyright
Copyright (c) 2017 Eitaro Fukamachi
License
Licensed under the LLGPL License.