Pluggable cache system?
Hi,
Thanks for Colly! :)
I have a task with hundreds of thousands of pages, so obviously I am using Colly's caching, but it's basically 'too much' for the filesystem. (Wasted disk space, a pain to manage, slow to back up, etc.)
I'd like to propose a pluggable cache system, similar to how you've made other Colly components.
Perhaps with an API like this:-
```go
type Cache interface {
	Init() error
	Get(url string) (*colly.Response, error)
	Put(url string, r colly.Response) error
	Remove(url string) error
}
```
...or...
```go
type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}
```
The first one won't be possible if you then wish to implement FileSystemCache in a subpackage of Colly, though: the subpackage would need to import colly for colly.Response while colly imports the subpackage, giving an import cycle.
The reason I also need a Remove method is that one project has a site that sometimes serves maintenance pages, and whilst I can detect these, Colly currently has no way of saying stop, or of rejecting a page after processing. Obviously, the last thing I want part way through a crawl is to have my cache poisoned. But that's probably a separate issue, and one I can live with if I can remove bad pages myself.
If pluggable caches were to be implemented, I have existing code from another project with a cache built using SQLite as a key-value store, compressing/decompressing the data with Zstandard (it's both surprisingly quick and super efficient on disk space), that I would happily port over. This could either become part of Colly, or a separate thing on my own GitHub.
I did start implementing this myself, but ran into a problem with how I went about it. (I followed the pattern you have established of having the separate components as subpackages; I then got bitten because my FileSystemCache couldn't easily reach into the Collector to get the CacheDir, as I was trying to preserve existing behaviour / API compatibility. Maybe that's not an issue. Maybe these bits shouldn't be subpackages. Obviously, once I started thinking things like that, I figured it was best to discuss before/instead of proceeding any further.)
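One way around that (just a sketch, with all names hypothetical) would be for the cache to take its directory as a constructor argument, rather than reaching into the Collector for CacheDir:

```go
package fscache

import (
	"crypto/sha1"
	"encoding/hex"
	"os"
	"path/filepath"
)

// FileSystemCache stores one file per URL under a base directory.
type FileSystemCache struct {
	Dir string
}

// NewFileSystemCache creates a cache rooted at dir (e.g. whatever the
// Collector's CacheDir is set to).
func NewFileSystemCache(dir string) *FileSystemCache {
	return &FileSystemCache{Dir: dir}
}

// Init creates the cache directory if it does not already exist.
func (c *FileSystemCache) Init() error {
	return os.MkdirAll(c.Dir, 0750)
}

// path maps a URL to a filename by hashing it.
func (c *FileSystemCache) path(url string) string {
	sum := sha1.Sum([]byte(url))
	return filepath.Join(c.Dir, hex.EncodeToString(sum[:]))
}

// Get returns the cached body for url, or an error if it is not cached.
func (c *FileSystemCache) Get(url string) ([]byte, error) {
	return os.ReadFile(c.path(url))
}

// Put stores the body for url.
func (c *FileSystemCache) Put(url string, data []byte) error {
	return os.WriteFile(c.path(url), data, 0640)
}

// Remove deletes the cached body for url, e.g. after detecting a maintenance page.
func (c *FileSystemCache) Remove(url string) error {
	return os.Remove(c.path(url))
}
```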
— Your thoughts?
Another option is to pass/return io.Reader across the interface, instead of []byte.
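For example, a streaming variant might look like this (sketch only; returning an io.ReadCloser so the backend can release resources when the caller is done):

```go
package cache

import "io"

// Cache is a streaming variant of the proposal, so large bodies never need
// to be held in memory as a single []byte.
type Cache interface {
	Init() error
	// Get returns a reader over the cached body; the caller must Close it.
	Get(url string) (io.ReadCloser, error)
	// Put consumes r and stores its contents against url.
	Put(url string, r io.Reader) error
	Remove(url string) error
}
```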
@jimsmart thanks for this great idea and for the detailed and clear specification.
I'm currently working on a storage interface (04d96ab7c5bbf879de632fcc683af592be4ba0b8) which follows pretty much the same concept. It allows us to use any kind of backend to store visited URLs and cookies (http://go-colly.org/docs/best_practices/storage/).
Let me describe its architecture; perhaps it can help to design a more sophisticated cache functionality.
The storage interface is a stand-alone module in colly (https://github.com/gocolly/colly/blob/master/storage/storage.go) and colly imports it. The storage module also contains the default naive implementation of the storage interface, to be backward compatible. I decided to create separate packages for different implementations of the Storage interface to reduce colly's dependencies. The Redis storage implementation can be found here: https://github.com/gocolly/redisstorage .
I think a similar concept could work with the 2nd cache interface from your specification. What do you think?
Also, I'd suggest an additional Close() error function in the interface, to be able to terminate connections.
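Something along these lines, as a sketch of your second interface plus Close:

```go
package cache

// Cache is the second proposed interface with an added Close method, so that
// implementations backed by a network service can tear down their connections.
type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
	Close() error
}
```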
Yes. That's what I was thinking, similar to package storage.

And have a SetCache(c cache.Cache) error method on Collector.
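From the user's side that would look something like this (purely hypothetical; neither SetCache nor the sqlitecache package exists yet, they are just placeholders for the proposal):

```go
package main

import (
	"log"

	"github.com/gocolly/colly"

	// hypothetical pluggable cache implementation
	sqlitecache "example.com/sqlitecache"
)

func main() {
	c := colly.NewCollector()

	// Proposed API: replace the built-in filesystem cache with a custom one.
	// SetCache would be expected to call the cache's Init() internally.
	cache, err := sqlitecache.New("cache.db")
	if err != nil {
		log.Fatal(err)
	}
	if err := c.SetCache(cache); err != nil {
		log.Fatal(err)
	}

	c.OnHTML("title", func(e *colly.HTMLElement) {
		log.Println("title:", e.Text)
	})

	if err := c.Visit("https://example.com/"); err != nil {
		log.Fatal(err)
	}
}
```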
I don't need Close() myself... Get(), Put() and Remove() should all be atomic operations, and each method would simply obtain a database connection on execution, and close it before returning. Go's database/sql uses a connection pool behind the scenes anyhow, so it's a good pattern to follow, has minimal overhead and is easy to reason about. It's better to do that than to have an open db connection hanging about for hours/days; releasing it back to the pool after every use means the driver can deal with broken or stale connections, etc.

My Init() method here wasn't for creating a connection; it was to create the db / db table if it doesn't already exist (or, for the file system, to create the cache dir).
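Roughly like this (a sketch, not the actual prototype code; database/sql with an SQLite driver, where *sql.DB is itself the pool):

```go
package sqlitecache

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // one possible SQLite driver
)

// Cache stores page bodies in a single SQLite database file.
type Cache struct {
	db *sql.DB
}

// New opens (or creates) the database file. *sql.DB is a connection pool,
// so no long-lived single connection is held.
func New(filename string) (*Cache, error) {
	db, err := sql.Open("sqlite3", filename)
	if err != nil {
		return nil, err
	}
	return &Cache{db: db}, nil
}

// Init creates the table if it doesn't already exist; it does not keep a
// connection open afterwards.
func (c *Cache) Init() error {
	_, err := c.db.Exec(`CREATE TABLE IF NOT EXISTS cache (
		url  TEXT PRIMARY KEY,
		data BLOB NOT NULL
	)`)
	return err
}

// Get fetches the cached body; each call borrows a connection from the pool
// and releases it before returning.
func (c *Cache) Get(url string) ([]byte, error) {
	var data []byte
	err := c.db.QueryRow(`SELECT data FROM cache WHERE url = ?`, url).Scan(&data)
	return data, err
}

// Put inserts or replaces the cached body for url.
func (c *Cache) Put(url string, data []byte) error {
	_, err := c.db.Exec(`INSERT OR REPLACE INTO cache (url, data) VALUES (?, ?)`, url, data)
	return err
}

// Remove deletes a cached entry, e.g. a maintenance page detected after the fact.
func (c *Cache) Remove(url string) error {
	_, err := c.db.Exec(`DELETE FROM cache WHERE url = ?`, url)
	return err
}
```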
Your redisstorage.Storage could then also implement Get(), Put() and Remove(), and would satisfy both the storage.Storage and cache.Cache interfaces.

Slight aside: I see you have a lot of stutter in your interface names. It might be better to move (or perhaps copy) them up to the parent package, then you'd have colly.Storage as the interface in the method calls.
> And have a SetCache(c cache.Cache) error method on Collector.

Yes.

> I don't need Close() myself... Get(), Put() and Remove() should all be atomic operations

You're right, I've removed it from Storage too, thanks.
> Your redisstorage.Storage could then also implement Get(), Put() and Remove(), and would satisfy both the storage.Storage and cache.Cache interfaces.

Storage is responsible for storing both visited URLs and cookies, so a simple Get(), Put() isn't enough in this case. Also, it is planned to add robots.txt handling to the Storage interface, which makes the interface even more complex.
> Slight aside: I see you have a lot of stutter in your interface names. It might be better to move (or perhaps copy) them up to the parent package, then you'd have colly.Storage as the interface in the method calls.

I agree, it is a bit redundant, but the package doesn't use the internals of the colly package, which is quite big. I'm not against moving the Storage/Cache interfaces into colly, especially because it would allow us to use the first proposed interface.
> Storage is responsible for storing both visited URLs and cookies, so a simple Get(), Put() isn't enough in this case.

Of course not! :) I wasn't suggesting it would be. I was suggesting that specific implementations of Storage could (optionally) also implement the cache methods (Get/Put/Remove) as well as the methods on the Storage interface. This would then give the user the option of using that store as both a cache and a store for state persistence, or just one or the other.
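In code, the idea is simply that one concrete type carries both method sets (rough sketch; the storage method signatures are approximate and all bodies are elided):

```go
package redisbackend

// Backend is a sketch of a single store that satisfies both the storage.Storage
// and the proposed cache.Cache interfaces at the same time.
type Backend struct {
	// redis client, key prefix, etc.
}

// Methods required by storage.Storage (visited URLs, cookies, ...).
func (b *Backend) Init() error                              { return nil }
func (b *Backend) Visited(requestID uint64) error           { return nil }
func (b *Backend) IsVisited(requestID uint64) (bool, error) { return false, nil }

// ...cookie-handling methods elided...

// Methods proposed for cache.Cache (page bodies).
func (b *Backend) Get(url string) ([]byte, error)    { return nil, nil }
func (b *Backend) Put(url string, data []byte) error { return nil }
func (b *Backend) Remove(url string) error           { return nil }
```

The same value could then be passed wherever either interface is expected.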
@jimsmart what exactly are you get/putting in the cache? My imagination is having a hard time going beyond image and CSS files, which aren't downloaded through Colly in any case. If it's just HTML, isn't the IsVisited feature enough? I was initially excited about the idea of a caching backend as well, but I couldn't find a clear enough use case.
I have several hundred thousand web pages I am processing to extract data.
Currently, if the task fails part way through, I just resume later, and rely on the cached data (in the current file system cache) to get reprocessed and then it simply resumes from where it got to.
But there are so many files, and most are hundreds of KB each. The cache folder is over 6 GB after a crawl. Compressed and stored in SQLite, that same data is a single file of under 1.5 GB. It's fast. It's easy to manage.
I've pushed two repos with working prototype code; run go test in each project folder to execute the tests.

- https://github.com/jimsmart/collysqlite - implements a cache using SQLite.
- https://github.com/jimsmart/collyzstandard - implements a compressing cache wrapper using Zstandard.

Both of these use []byte for the type of data being cached; it was the simplest solution to get a prototype up and running.
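The compressing wrapper is essentially just this shape (simplified sketch, not the exact repo code; the klauspost/compress/zstd package used here is just one possible Zstandard binding):

```go
package zstdcache

import (
	"github.com/klauspost/compress/zstd"
)

// Cache is the interface under discussion (Init omitted here for brevity).
type Cache interface {
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

// Wrapper compresses bodies before handing them to an underlying Cache,
// and decompresses them on the way back out.
type Wrapper struct {
	inner Cache
	enc   *zstd.Encoder
	dec   *zstd.Decoder
}

// New wraps inner with transparent Zstandard compression.
func New(inner Cache) (*Wrapper, error) {
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		return nil, err
	}
	dec, err := zstd.NewReader(nil)
	if err != nil {
		return nil, err
	}
	return &Wrapper{inner: inner, enc: enc, dec: dec}, nil
}

// Put compresses data and stores the compressed bytes.
func (w *Wrapper) Put(url string, data []byte) error {
	return w.inner.Put(url, w.enc.EncodeAll(data, nil))
}

// Get fetches the compressed bytes and decompresses them.
func (w *Wrapper) Get(url string) ([]byte, error) {
	compressed, err := w.inner.Get(url)
	if err != nil {
		return nil, err
	}
	return w.dec.DecodeAll(compressed, nil)
}

// Remove just passes through to the underlying cache.
func (w *Wrapper) Remove(url string) error {
	return w.inner.Remove(url)
}
```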
Basically, if I'm going to invest the time and resources to download that much data, I'd most certainly like to keep it around, for later reruns.
Scrapy, the leading Python web scraper, has pluggable caches. Maybe take a look?
FWIW, some log output from the collyzstandard tests showing compression stats:

```
2018/02/12 02:40:51 compressed 12930/5251, ratio 2.46, in 2.757465ms https://google.com/
2018/02/12 02:40:51 decompressed in 91.872µs https://google.com/
2018/02/12 02:40:53 compressed 351734/71301, ratio 4.93, in 78.481158ms https://facebook.com/
2018/02/12 02:40:53 decompressed in 1.093598ms https://facebook.com/
2018/02/12 02:40:54 compressed 148439/29126, ratio 5.10, in 54.710252ms https://twitter.com/
2018/02/12 02:40:54 decompressed in 338.833µs https://twitter.com/
```
@jimsmart impressive! Could you please open a PR?
Sure, I'll get something together ASAP, I'm just on the tail-end of a bit of a coding marathon, so it won't be today.
I don't think my code will integrate cleanly yet, it needs some more work. But I'm on with that already.
I'm also concerned about 'breaking changes' / whether you have any API compatibility guarantees, or anything similar?
In my local repo I also have an implementation of 99% of the 'storage' API... for SQLite.
> Sure, I'll get something together ASAP, I'm just on the tail-end of a bit of a coding marathon, so it won't be today. I don't think my code will integrate cleanly yet, it needs some more work. But I'm on with that already.

Awesome, there is no rush. Thanks!

> I'm also concerned about 'breaking changes' / whether you have any API compatibility guarantees, or anything similar?

Does this change introduce any backward incompatibility?
Um, maybe. But I want to minimise that, obviously. In an ideal world there'd be no breaking changes. That's partly why the code needs more work.
I'm new to the party, so I don't know much about any current plans for features/improvements/etc. So I think a slow and gentle approach to this is probably best.
> I'm new to the party, so I don't know much about any current plans for features/improvements/etc. So I think a slow and gentle approach to this is probably best.

Perfect! The project is relatively young, so we don't have strict rules about API changes. Try to avoid 'breaking changes' if possible, and be gentle otherwise, as you mentioned.
I wonder if a good solution is to use an interface like the one in go-cloud: https://github.com/google/go-cloud/tree/master/blob
Check out https://github.com/google/go-cloud/blob/master/blob/example_test.go for a "local/file" cache. Similarly, check out the GCS or S3 tests. That way, I could pass my own bucket to colly and it would work with non-local file systems.
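A colly cache backed by a bucket could then be quite small. A rough sketch (import paths here are from the current gocloud.dev module and may differ from the repo linked above; key handling is deliberately naive):

```go
package blobcache

import (
	"context"
	"crypto/sha1"
	"encoding/hex"

	"gocloud.dev/blob"
	// Blank-import a driver: fileblob for local files, or s3blob/gcsblob for cloud buckets.
	_ "gocloud.dev/blob/fileblob"
)

// Cache stores page bodies in any bucket the Go CDK supports.
type Cache struct {
	bucket *blob.Bucket
}

// New opens a bucket by URL, e.g. "file:///tmp/colly-cache" or "s3://my-bucket".
func New(ctx context.Context, bucketURL string) (*Cache, error) {
	b, err := blob.OpenBucket(ctx, bucketURL)
	if err != nil {
		return nil, err
	}
	return &Cache{bucket: b}, nil
}

// key hashes the URL so it is always a safe object name.
func key(url string) string {
	sum := sha1.Sum([]byte(url))
	return hex.EncodeToString(sum[:])
}

// Get reads the cached body for url from the bucket.
func (c *Cache) Get(url string) ([]byte, error) {
	return c.bucket.ReadAll(context.Background(), key(url))
}

// Put writes the body for url to the bucket.
func (c *Cache) Put(url string, data []byte) error {
	return c.bucket.WriteAll(context.Background(), key(url), data, nil)
}

// Remove deletes the cached body for url.
func (c *Cache) Remove(url string) error {
	return c.bucket.Delete(context.Background(), key(url))
}
```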
@jonathaningram I think the interface used in go-cloud/blob looks like a good model to follow! Having it as a hard dependency would bloat this project, but I think emulating their framework would be a good move.

@asciimoo @jimsmart, thoughts?
FYI, my use case: I want to cache the responses of POST requests to an API. Here is an example of a customizable cache handler: https://github.com/gocolly/colly/compare/master...michilu:customizable-cache-handler
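For POST requests the URL alone isn't a sufficient cache key, so the handler has to derive one from the method, URL and request body; roughly like this (sketch only):

```go
package cachekey

import (
	"crypto/sha256"
	"encoding/hex"
)

// Key derives a cache key for a request. For GET the URL alone is enough,
// but for POST the body has to be part of the key too, otherwise different
// API calls to the same endpoint would collide in the cache.
func Key(method, url string, body []byte) string {
	h := sha256.New()
	h.Write([]byte(method))
	h.Write([]byte{0}) // separator to avoid ambiguous concatenations
	h.Write([]byte(url))
	h.Write([]byte{0})
	h.Write(body)
	return hex.EncodeToString(h.Sum(nil))
}
```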
I just wanted to drop this here https://www.nayuki.io/page/self-encrypted-cache-structure
Hey, I am thinking of picking this up, but just wondering where this is at?
I should probably describe my use case. I ran into a blocker because we use Heroku, and their dynos' file systems are ephemeral. If you're not using a cloud storage mechanism (which you can, but it is quite limited under the free tier) then you'll lose the cache and will only have a day's worth of caching at most...
Hi, the repos mentioned in https://github.com/gocolly/colly/issues/103#issuecomment-364816457 (above) contain my prototype code. Feel free to open issues on those repos if you have any questions — happy to help, as best I can.
I recall there were also changes needed with Colly to make these work, but I don't believe I ever pushed those, sorry. But IIRC the interface was fairly straightforward.
Hi, just wanted to catch up on this issue.
@jimsmart I don't see any advantage to using SQLite as a cache here; you could even tar files on top of the filesystem, which could deliver similar performance to SQLite.
@asciimoo If you are seeking backward-compatible API changes, maybe you could replace the filesystem layer (currently using the "os" package) with https://github.com/spf13/afero, so we could easily switch to an in-memory FS with tar/gzip functionality on top of it.
Any thoughts?
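For example, a cache that writes through afero.Fs could be switched between the OS filesystem and an in-memory filesystem without changing the calling code (rough sketch):

```go
package aferocache

import (
	"path/filepath"

	"github.com/spf13/afero"
)

// Cache writes page bodies through an afero.Fs, so the same code can target
// the real OS filesystem, an in-memory filesystem, or any other afero backend.
type Cache struct {
	fs  afero.Fs
	dir string
}

// NewOS keeps the current behaviour: files on disk under dir.
func NewOS(dir string) *Cache { return &Cache{fs: afero.NewOsFs(), dir: dir} }

// NewMem keeps the whole cache in memory instead.
func NewMem(dir string) *Cache { return &Cache{fs: afero.NewMemMapFs(), dir: dir} }

// Init creates the cache directory if it does not already exist.
func (c *Cache) Init() error { return c.fs.MkdirAll(c.dir, 0750) }

// Get reads the cached body stored under name.
func (c *Cache) Get(name string) ([]byte, error) {
	return afero.ReadFile(c.fs, filepath.Join(c.dir, name))
}

// Put stores the body under name.
func (c *Cache) Put(name string, data []byte) error {
	return afero.WriteFile(c.fs, filepath.Join(c.dir, name), data, 0640)
}

// Remove deletes the cached body stored under name.
func (c *Cache) Remove(name string) error {
	return c.fs.Remove(filepath.Join(c.dir, name))
}
```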
@hsinhoyeh
For our use case, scraping many hundreds of thousands of HTML files (with a big variety in sizes), there is certainly a benefit to storing the files in SQLite: it's actually faster than using raw files on the file system, and it also takes much less disk space. There are docs on the SQLite website to support this, and our own benchmarks and metrics confirm it, without doubt.
I'm not sure why you say you don't see any advantage here. Have you actually tried using SQLite as a storage system in such cases, and compared the metrics? It's easy to flippantly dismiss something as a bad idea if one has not tried it, but as the old adage goes, "to measure is to know". Perhaps your dismissal is better reasoned than it first appears and you have made some decent measurements/comparisons, in which case I would certainly be interested to see your benchmarks/stats, and to know some details about the system they were produced on (e.g. the underlying file system used, etc.).
Furthermore, we also compress each page body individually before storing it in SQLite. We tried quite a few compression algorithms, including tar/gzip, but the benchmarks/metrics showed that Zstandard was in fact the best for our use case, both speed-wise and size-wise. Naturally, this works very well on top of the aforementioned SQLite storage layer, but it also produced improvements when used on top of the native file system.
Unfortunately, we no longer use Colly to build our scrapers; ours now have a completely different architecture, for a variety of reasons, and Colly was no longer the best fit for us, although I still think it is a great project. Because of this, I've not pursued any further development of these storage-related plugins, although the original prototype code is still available in my GitHub repos.
Hope that helps.
I really like the colly API for scraping, but having just the filesystem for storage is not nice. I came back to this issue a few times over the years and always went back to using my own simple crawler, which basically just uses http.Client and goquery. But I always end up adding caching with compression in some form once the number of pages gets above 10,000.
That's why I would like to bring this topic up again. Is it worth investigating this issue further? I think @jimsmart did some nice prototyping. Is there any chance of getting this included if we have a complete pull request, @hsinhoyeh? If so, I would like to work on this.
Bumping this up. I'd love to see a more pluggable caching/storage system, but I don't believe my Go-fu is strong enough to attempt the refactoring. I may need to anyway, because I would really use it, so I wonder if there are people at least willing to do some solid peer review.