Zeno
State-of-the-art web crawler 🔱
Closes #61 @CorentinB Please note, I'm not proficient in Go, so feedback is welcome; edit at will.
I built Zeno from source ([687b5d5](https://github.com/internetarchive/Zeno/commit/687b5d5982be433206b03022d0a03dc0a1227501)) and ran `Zeno get url`, only to be told I did not have enough space. It would be great if (1) this value was customizable...
```
panic: open jobs/warcs/SPNOUTLINKS-20221021045127671-00030-crawl900.us.archive.org.warc.gz.open: no such file or directory

goroutine 149 [running]:
github.com/CorentinB/warc.isFileSizeExceeded({0xc166f684e0?, 0xc0001b4520?}, 0x408f400000000000)
	/var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/utils.go:196 +0x10e
github.com/CorentinB/warc.recordWriter(0xc00057e0f0, 0x0?, 0x0?)
	/var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/warc.go:120 +0x499
created by github.com/CorentinB/warc.(*RotatorSettings).NewWARCRotator
	/var/www/go/pkg/mod/github.com/!corentin!b/[email protected]/warc.go:50 +0x75
```
This is quite a half-baked idea, but we'd be looking to implement some sort of hit counter for items in the hash table, allowing us to clean it up when...
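A minimal sketch of the hit-counter idea, under stated assumptions: the `SeenTable` type, its methods, and the "prune entries below a minimum hit count" rule are all hypothetical and are not taken from Zeno's code. The excerpt above does not say what should trigger the cleanup, so the threshold here is only one possible policy.

```go
package main

import "fmt"

// entry is a hypothetical hash-table value that carries a hit counter.
type entry struct {
	hits uint64
}

// SeenTable is a hypothetical stand-in for a crawler's hash table of seen items.
// It is not safe for concurrent use; a real crawler would need a mutex.
type SeenTable struct {
	items map[string]*entry
}

// NewSeenTable returns an empty table.
func NewSeenTable() *SeenTable {
	return &SeenTable{items: make(map[string]*entry)}
}

// Seen records a lookup for key and reports whether key was already present.
// The first sighting stores the key with zero hits; later lookups increment the counter.
func (t *SeenTable) Seen(key string) bool {
	if e, ok := t.items[key]; ok {
		e.hits++
		return true
	}
	t.items[key] = &entry{}
	return false
}

// Prune deletes every entry with fewer than minHits hits and returns how many were removed.
// Deleting from a map while ranging over it is permitted in Go.
func (t *SeenTable) Prune(minHits uint64) int {
	removed := 0
	for k, e := range t.items {
		if e.hits < minHits {
			delete(t.items, k)
			removed++
		}
	}
	return removed
}

// Len returns the number of entries currently stored.
func (t *SeenTable) Len() int { return len(t.items) }

func main() {
	table := NewSeenTable()
	table.Seen("https://example.com/a")
	table.Seen("https://example.com/a") // second lookup counts as one hit
	table.Seen("https://example.com/b") // never looked up again
	removed := table.Prune(1)
	fmt.Println("removed:", removed, "remaining:", table.Len())
}
```

Whether rarely hit entries are the right ones to drop depends on the workload; the counter only provides the signal, and the pruning policy stays a separate decision.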
Allow operators to define headers in a yml file per domain to allow for greater control over headers like User-Agent or similar headers that may need to be configured per...
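A rough sketch of how per-domain headers could be applied once the yml file has been decoded. The `DomainHeaders` type, the `newRequestFor` helper, and the host-to-headers layout are assumptions made for illustration, not Zeno's actual configuration format; the YAML parsing itself is left out because it would need a third-party package.

```go
package main

import (
	"fmt"
	"net/http"
)

// DomainHeaders maps a host name to the headers to set for requests to it.
// Hypothetical shape: roughly what a per-domain yml file might decode into.
type DomainHeaders map[string]map[string]string

// Apply sets the configured headers on req if its host has an entry.
// Hosts without an entry are left untouched (ranging over a nil map is a no-op).
func (d DomainHeaders) Apply(req *http.Request) {
	for name, value := range d[req.URL.Hostname()] {
		req.Header.Set(name, value)
	}
}

// newRequestFor is a hypothetical helper that builds a GET request
// and applies any per-domain headers to it.
func newRequestFor(cfg DomainHeaders, rawURL string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	cfg.Apply(req)
	return req, nil
}

func main() {
	cfg := DomainHeaders{
		"example.com": {"User-Agent": "ExampleBot/1.0"},
	}
	req, err := newRequestFor(cfg, "https://example.com/page")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("User-Agent"))
}
```

This version matches on the exact host only; whether subdomains should inherit a parent domain's headers is a design choice the excerpt leaves open.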
```
2024/08/15 07:49:36 http: panic serving 127.0.0.1:45290: runtime error: invalid memory address or nil pointer dereference
goroutine 212832422 [running]:
net/http.(*conn).serve.func1()
	/var/www/.go/src/net/http/server.go:1903 +0xbe
panic({0x1371ac0?, 0x21a7f70?})
```