Support parallel crawling
Since piping graphical PDFs without actual text through Tesseract is very time-intensive, support for parallel crawling would be great. This could be done in two ways:
- via a multi-threaded process model
- by having two distinct status codes for a document in the index
The latter should be relatively simple to do, and it would enable the operator to run as many fscrawlers on one particular directory as wanted. Maybe worth taking a look at.
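To make the second option concrete, here is a minimal sketch of what a "claim before OCR" step could look like, assuming each document carries a status field (here `crawl_status`) and workers coordinate through Elasticsearch's `_update` API. This is not part of FSCrawler today; the index name, field name, and HTTP handling are all placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: "claim" a document before running OCR, so several crawler processes
// sharing one directory do not process the same file twice. The claim is an atomic
// _update: if the document is already claimed or indexed, the script turns the
// update into a no-op and this worker skips the file.
public class ClaimDocument {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Returns true if this worker won the claim and should run OCR/indexing. */
    static boolean claim(String esUrl, String index, String docId) throws Exception {
        String body = """
            {
              "script": {
                "source": "if (ctx._source.crawl_status == 'claimed' || ctx._source.crawl_status == 'indexed') { ctx.op = 'noop' } else { ctx._source.crawl_status = 'claimed' }"
              },
              "upsert": { "crawl_status": "claimed" }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(esUrl + "/" + index + "/_update/" + docId))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 409) {
            // Version conflict: another worker touched the document at the same moment.
            return false;
        }
        // Elasticsearch reports "result":"noop" when the script skipped the update,
        // i.e. another worker had already claimed or indexed the document.
        return !response.body().contains("\"result\":\"noop\"");
    }
}
```

After OCR and indexing succeed, the worker would flip `crawl_status` to `indexed` in the same way; a stale "claimed" status would still need a timeout or cleanup pass, which is part of why this drifts toward statefulness, as the next comment argues.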
We recently started using fscrawler in our production application, and in our experience I'd actually call this an "anti-feature".
Here's why:
- Containerized workloads scale horizontally, or at least should.
- One of FSCrawler's assets and appeals (IMO) is that it is stateless. Adding state to the document with "status codes" starts to get into stateful territory beyond the scope of the crawler, IMO, and exposes data to the Elasticsearch clients "too soon".
- This is probably the biggest (for us/me): Threading in a container is non-deterministic. We have very little control and understanding of the internal scheduling of a given thread. This is a GREAT segue to...
- Running the fscrawler container on a formal container orchestration system allows you to auto-scale (Kubernetes HPA, or even a vertical scaler for CPU and memory allocation). Which is a segue to...
- Very robust scaling can be done on an arbitrary number of parameters (dynamic even) by the runtime.
- Being opinionated on this actually works against situations where sophisticated teams/deployment environments already have horizontal scale in place. It's usually an anti-pattern for microservices and/or containers to multi-thread in this way.
Long way of saying: horizontal scaling (which is basically what this is) is well out of the scope of fscrawler (or should be), IMO, as long as the project provides the right hooks/design for a devops team to deploy it in a way that can credibly be scaled horizontally. FSCrawler already provides containers, so it's quite easy to drop them into a Kubernetes Deployment with replicas and a Service (or LB) in front, plus an HPA that "just magically" scales whenever a given container gets saturated on CPU or memory. You can also very easily tie scaling to an arbitrary metric such as "requests", etc.
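For illustration only, a deployment along those lines might look roughly like the following. This is not something the project ships: the image name, REST port, resource numbers, and HPA thresholds are all assumptions, and it presumes the replicas are fed work (for example with FSCrawler's REST service enabled and documents pushed to them) rather than all crawling the same directory, which is exactly the duplication problem discussed further down.

```yaml
# Sketch only: FSCrawler replicas behind a Service, autoscaled by an HPA on CPU.
# Image, port, resource requests and thresholds are placeholders, not project defaults.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fscrawler
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fscrawler
  template:
    metadata:
      labels:
        app: fscrawler
    spec:
      containers:
        - name: fscrawler
          image: dadoonet/fscrawler:latest   # assumed image/tag
          ports:
            - containerPort: 8080            # assumed REST port, if the REST service is enabled
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: fscrawler
spec:
  selector:
    app: fscrawler
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fscrawler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fscrawler
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

Scaling on custom metrics instead of CPU (the next point) would swap the `Resource` metric for a `Pods` or `External` metric backed by a metrics adapter.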
If anything, I'd say having the containers and processes report such metrics would be useful, so the external frameworks/orchestrators (read: Kubernetes) that we use to operationalize fscrawler in production can scale on things other than CPU and RAM usage (although I am not sure that's even necessary, as many LBs balance based on response-time metrics).
I am building a huge archive of historic documents. They are from 1950-2000, written with typewriters and scanned by volunteers. I have considerable computing resources in my garage, with a couple of hundred Xeon cores that I only power on when the workload justifies the current energy costs.
I am not sure if I understand your motivation, or whether I have understood fscrawler's function correctly.
If I fire up 10 processes (be it Docker or native), it is my experience that all these processes will crawl all files, OCR them, and then submit them to Elasticsearch, without checking if the work has already been done or if the data is already in the index. The result is that all 10 processes crawl, OCR, and index every document once each.
I have added a simple check before OCR and indexing start, to see if the document has already been indexed: https://github.com/dadoonet/fscrawler/compare/master...slundell:fscrawler:master It helps somewhat, but another process may already be working on a document without having submitted it to the index yet.
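For readers who don't want to open the diff, the general shape of such a pre-check is just an existence test against the document id before spending time on OCR. A rough sketch follows (not the linked patch itself; index name and id handling are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: skip OCR/indexing if a document with this id already exists in the index.
// Note the race described above: a document that another process has crawled but not
// yet submitted will still look "missing" here.
public class AlreadyIndexed {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static boolean alreadyIndexed(String esUrl, String index, String docId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(esUrl + "/" + index + "/_doc/" + docId))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode() == 200;   // 404 means not indexed yet
    }
}
```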
> without checking if the work has already been done
The reason for this is that we normally check the date of the document first. Then if we detect that the document has been modified or created, it goes to the indexing stage.
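Purely as an illustration of that date check (not FSCrawler's actual code), the logic is essentially "skip anything not touched since the last run":

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

// Illustration only: re-index a file only if it was created or modified
// after the previous scan started.
public class DateCheck {
    static boolean needsIndexing(Path file, Instant lastScanStart) throws IOException {
        Instant modified = Files.getLastModifiedTime(file).toInstant();
        return modified.isAfter(lastScanStart);
    }
}
```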
> If I fire up 10 processes (be it Docker or native), it is my experience that all these processes will crawl all files, OCR them, and then submit them to Elasticsearch, without checking if the work has already been done or if the data is already in the index.
Yeah. I was assuming that the best practice, until we implement parallel processing, would be to start one FSCrawler instance per sub-directory, so you don't try to reindex the same data again.
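As a rough illustration of that workaround, each sub-directory gets its own job with its own settings file. Paths, job names, index name, and node URL below are placeholders, and the exact settings keys should be checked against the FSCrawler settings documentation for your version:

```yaml
# ~/.fscrawler/archive_part1/_settings.yaml
name: "archive_part1"
fs:
  url: "/archive/1950s"
elasticsearch:
  nodes:
    - url: "https://127.0.0.1:9200"
  index: "archive"

# ~/.fscrawler/archive_part2/_settings.yaml
name: "archive_part2"
fs:
  url: "/archive/1960s"
elasticsearch:
  nodes:
    - url: "https://127.0.0.1:9200"
  index: "archive"
```

Each job then runs as its own process or container (something like `fscrawler archive_part1` and `fscrawler archive_part2`), so no two instances ever look at the same files.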