SCAN,INDEX after SCANIDX seems to ignore directories with space(s) in the name
Describe the bug
Running Datashare version 18.2.1 with --mode SERVER (the same problem occurs with versions 18.0.0 and 18.1.1). I am trying to update the database with --stage SCANIDX followed by --stage SCAN,INDEX, as described here: https://icij.gitbook.io/datashare/server-mode/add-documents-from-the-cli
Unfortunately, the SCAN,INDEX step (after an initial SCANIDX) seems to ignore directories with space(s) in the name. The log does show "[main] INFO ScannerVisitor - Entering directory" for such directories, but no "INFO DocumentConsumer" line ever refers to a file from a directory with space(s) in the name.
To Reproduce
Set up a directory "no_spaces" containing one file and a second directory "one space" containing one file, then try to update the database with --stage SCANIDX followed by --stage SCAN,INDEX.
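The directory layout for the reproduction can be sketched as follows. This is a minimal sketch: the scratch location and the file names a.txt and b.txt are hypothetical, and the Datashare invocation itself is left out because it depends on the local setup.

```shell
# Hypothetical scratch location; point this at the directory Datashare scans.
root=$(mktemp -d)

# One directory without spaces and one with a space, each holding one file.
mkdir -p "$root/no_spaces" "$root/one space"
echo "file in a directory without spaces" > "$root/no_spaces/a.txt"
echo "file in a directory with a space" > "$root/one space/b.txt"

# List what was created; the quoting keeps the space in the path intact.
find "$root" -type f | sort
```

After running the SCANIDX and SCAN,INDEX stages against this directory, both files would be expected to show up in the log.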
Expected behavior
All directories should be checked.
Desktop (please complete the following information):
- OS: Ubuntu 24.04.1 LTS
- Version: Datashare version 18.2.1 with --mode SERVER
The problem seems to lie in the SCAN stage: it appends to the existing queue datashare:queue instead of overwriting it.
I inspected the Redis lists (note: my Redis container is named "datashare-redis-1"):
docker exec -it datashare-redis-1 redis-cli --bigkeys
Which showed:
[...]
Biggest list found 'datashare:queue:index' has 9387116 items
Biggest hash found 'report:queue' has 185510 fields
[...]
The number of fields in report:queue matches the on-disk count shown in the Datashare web interface under Insights:
448.7K documents among which 185.5K on disk
But the number of items in datashare:queue:index is far too high.
I wrote the contents of datashare:queue:index to a text file with:
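The two sizes can also be read directly, without a full --bigkeys scan, using the standard Redis commands LLEN (list length) and HLEN (number of hash fields). The sketch below assumes the container name from this thread; the helper names redis_cmd and queue_sizes are hypothetical.

```shell
# Assumption: Redis runs in the container named in this thread; override to suit.
REDIS_CONTAINER="${REDIS_CONTAINER:-datashare-redis-1}"

# Run one redis-cli command in the container.
# No -t flag, so redis-cli prints plain values without TTY formatting.
redis_cmd() {
  docker exec "$REDIS_CONTAINER" redis-cli "$@"
}

# Print the sizes of the two keys discussed above.
queue_sizes() {
  echo "datashare:queue:index: $(redis_cmd LLEN datashare:queue:index)"
  echo "report:queue: $(redis_cmd HLEN report:queue)"
}
```

Calling queue_sizes before and after each stage shows whether a stage grew or drained the queue.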
docker exec -it datashare-redis-1 redis-cli LRANGE datashare:queue:index 0 -1 > redis_datashare-queue-index.txt
This let me inspect the contents of datashare:queue:index, and I found that every filename appeared in it multiple times, along with many lines containing "POISON".
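A check of this kind can be sketched as below. It is only an illustration: the sample file and its paths are made up, and the helper normalise_dump is hypothetical. The helper assumes the dump may carry redis-cli's TTY formatting (a leading number, quotes, carriage returns), which the -it flags can introduce.

```shell
# Hypothetical helper: reduce a redis-cli LRANGE dump to one raw entry per line.
# Strips carriage returns, a leading ' 1) ' style number, and surrounding quotes.
normalise_dump() {
  tr -d '\r' < "$1" | sed -E 's/^ *[0-9]+\) //; s/^"(.*)"$/\1/'
}

# Small made-up sample standing in for redis_datashare-queue-index.txt.
sample=$(mktemp)
printf '%s\n' \
  '1) "/data/no_spaces/a.txt"' '2) "/data/one space/b.txt"' '3) "POISON"' \
  '4) "/data/no_spaces/a.txt"' '5) "/data/one space/b.txt"' '6) "POISON"' > "$sample"

total=$(normalise_dump "$sample" | wc -l | tr -d ' ')
poison=$(normalise_dump "$sample" | grep -c -x 'POISON')
unique=$(normalise_dump "$sample" | grep -v -x 'POISON' | sort -u | wc -l | tr -d ' ')
echo "total=$total poison=$poison unique_paths=$unique"

# Most-duplicated entries first.
normalise_dump "$sample" | sort | uniq -c | sort -rn | head -n 5
```

A total far above the number of unique paths, together with more than one "POISON" entry, is the pattern described above.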
I am not certain, but "POISON" might mark the end of a SCAN stage. I have run the SCAN stage multiple times, and apparently each run appended all the filenames to the existing datashare:queue:index instead of overwriting it.
I assume that the INDEX stage stops when it reaches "POISON", so the new files were never indexed and never added to Datashare.
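To illustrate that assumption (it is not confirmed Datashare behaviour), the sketch below counts how many entries sit behind the first "POISON" marker in a dump. It assumes the queue is consumed in LRANGE order starting at index 0, and the sample entries are made up.

```shell
# Made-up sample in LRANGE order (index 0 first), one raw entry per line.
queue=$(mktemp)
printf '%s\n' '/data/old/a.txt' '/data/old/b.txt' 'POISON' \
  '/data/new/c.txt' 'POISON' > "$queue"

# Line number of the first POISON marker (empty if there is none).
first_poison=$(grep -n -x -m 1 'POISON' "$queue" | cut -d: -f1)
total=$(wc -l < "$queue" | tr -d ' ')

# Entries after the first marker; zero if no marker was found.
after=$((total - ${first_poison:-$total}))
echo "first POISON at line ${first_poison:-none}; $after entries follow it"
```

If a consumer really stopped at the first marker, the entries counted in after would be the ones left unprocessed.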
So, I deleted report:queue and datashare:queue:index to be able to start with empty lists:
docker exec -it datashare-redis-1 redis-cli DEL report:queue
docker exec -it datashare-redis-1 redis-cli DEL datashare:queue:index
I then ran the SCANIDX and SCAN stages again, which resulted in a much smaller datashare:queue:index list:
[...] Biggest list found 'datashare:queue:index' has 291475 items [...]
Checking the last line in the list:
docker exec -it datashare-redis-1 redis-cli LRANGE datashare:queue:index -1 -1
Indeed it is:
- "POISON"
Currently I am running the INDEX stage and I will post the results when finished.
Hi @OpenCV-Peter, thanks for the detailed ticket. We have not seen any problem with paths with spaces but that deserves a proper look!
However, the SCAN stage behaves as it should when it does not overwrite the existing queue. This has been the design since the first versions of Extract (the text extraction engine behind Datashare). It has served us well so far, but I understand how it can be confusing.
We have been working for a few months on a complete revamp of task management in Datashare. Among other things, we are exploring an alternative to the classic SCAN + INDEX workflow, in which we would create parent tasks that ship with a set of configuration for a given path. Your feedback will definitely help us build this new feature!
Running SERVER version 18.2.1
I first ran the SCANIDX stage, followed by the SCAN stage and finally the INDEX stage. After all stages were done, Datashare had added all the missing files.
The datashare:queue:index list is now empty as well:
docker exec -it datashare-redis-1 redis-cli LRANGE datashare:queue:index 0 -1
Replies:
(empty list or set)