Corentin Barreau
Happened after pressing CTRL+C during a crawl: it was in the finishing state when this panic occurred. ``` panic: send on closed channel panic: send on closed channel goroutine 778 [running]: github.com/internetarchive/Zeno/internal/pkg/crawl.(*Crawl).Capture.func1(0xc0016dd2c0) /X/Zeno/internal/pkg/crawl/capture.go:233 +0x50...
It would be interesting to try running OCR on images (as an option) to extract URLs from watermarks and the like.
Use case: running many Zeno instances on the same machine.
github.com/clbanning/mxj/v2 is used for XML processing; I think it can be replaced by code that relies only on the standard library.
So the idea is basically to "replicate" the excellent Heritrix3 web UI. We want to provide a way to start, stop, pause, and unpause the crawl, but also inject seeds, search...
I'm seeing a lot of DEBUG logs printed to stdout: ```time=2024-09-21T09:25:34.467+02:00 level=DEBUG msg="unable to extract URLs from JSON in script tag" error="invalid character 'l' after top-level value" url=https://old.reddit.com/r/PublicFreakout/comments/1fla2ks/another_video_of_israeli_soldiers_throwing/ time=2024-09-21T09:25:34.468+02:00 level=DEBUG...
If you use `get list` with a seed list that contains an empty line, Zeno won't start crawling.