Content search in PDF
Hi! I have a folder with tons of articles mentioning the MHC (major histocompatibility complex), both in the title and in the body of the article. When I type c/MHC, only two or three files appear, while at least twenty of them match. I checked. It's as if broot had barely started to index the content, then stopped, and I don't know how to re-index. How does it work? Having more control over that, as in recoll-index, would be neat.
Latest broot, EndeavourOS.
Broot version 1.9.2, macOS
I ran into the same problem. I'm in a directory with 78 subdirectories full of Python files. Using c/import, I only get one result.
Using ctrl-s does reveal all the search results.
Did you try hitting ctrl-s ? When there are no shallow results and the disk is slow, the search may stop before everything is scanned.
EDIT: ok, you did.
Downgrading to 1.9.1 fixes the problem.
Uh ? That's interesting. Would the search be slower now ? I'll have to check.
Are you sure you're not just trying on a warm disk ?
ctrl-s does nothing: "search was already total".
So there's no other file matching. It can be related to hidden files, or files ignored because of .gitignore rules. You may try alt-i and alt-h to toggle showing those files too.
alt-j or alt-h do nothing. Even though the file does not match, it IS here, and it's not hidden. See the image. I wondered if it qualified as "NSFW", but probably not. Only a man of culture can program so well anyway ;-)
Could you explain, in broad terms, how the full-text search works? I wish we could configure which kinds of files are indexed. I mean, if you wish to replace recoll, these features have to be made configurable.
It's explained here: https://dystroy.org/broot/input/
You might be interested in this introduction too: https://dystroy.org/blog/broot-c-search/
Thanks, it was useful for refreshing my mind, but I had read that already, everything actually ;-) I was rather wondering about the internal implementation. I understand that it doesn't build a database but opens each file, which is very taxing on non-SSD disks, even though it's lighter on memory and spares you the hassle of indexing. Filters are more than enough for search, BUT could we also filter on file size or other metadata? It would be neat. Since I'm already in the folder where the file is, it's certainly not that.
The content search you use filters files to keep the ones containing exactly the searched pattern. It's an exact search, not a fuzzy one, nor a case-insensitive one.
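To make "exact" concrete, here is a minimal Python sketch of an exact, case-sensitive content filter. It is not broot's implementation (broot is written in Rust and its internals are not described in this thread); the names `file_contains` and `content_search` are hypothetical and only illustrate the matching semantics described above.

```python
import os


def file_contains(path, needle):
    """Return True if the file's bytes contain `needle` exactly (case-sensitive)."""
    try:
        with open(path, "rb") as f:
            return needle in f.read()
    except OSError:
        # Unreadable or missing files are treated as non-matching.
        return False


def content_search(root, pattern):
    """Return the sorted paths under `root` whose content contains `pattern` exactly."""
    needle = pattern.encode("utf-8")
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if file_contains(path, needle):
                matches.append(path)
    return sorted(matches)
```

With this logic, a search for `MHC` would not match a file that only contains `mhc`, which is one way a file can appear to be "missing" from the results.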
But I can't start to analyze the problem you mention without test data. Please upload somewhere the file that isn't found and show me the pattern which doesn't work.
/pdf$/i&(cr/male/i): removing "male" shows all files with a pdf extension, but whatever I put inside the slashes, even single letters, removes all the results. Removing the filename search part does allow for content search, though. But broot immediately freezes or slows down considerably. Content search alone shows nothing either. Try with this article: https://drive.google.com/file/d/1EaVCkwlpGJi9EkEZCQiCPwNr0HyM6Jml/view?usp=sharing
c/male in the ~ directory does show a few files, but no complex filtering pattern works, and even without one, in the directory where several files have "male" in either the name or the content, only one appears. Well, it did that once... I can't reproduce the behavior. Yesterday, a complex pattern made broot properly crash. I'll try to repeat that.
Broot doesn't search into binary files, only text files.
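A common way for tools to tell text from binary is to look for NUL bytes near the start of a file. The sketch below shows that heuristic in Python; whether broot uses this particular test is an assumption, not something stated in this thread, and `looks_binary` and `sample_size` are hypothetical names.

```python
def looks_binary(path, sample_size=8192):
    """Heuristic: treat a file as binary if its first bytes contain a NUL byte."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return b"\x00" in sample
```

Note that even a PDF that is "searchable" in a viewer usually stores its text in compressed streams, so a byte-level search for a word such as `male` would rarely match the raw file anyway.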
Ah! I overlooked that detail. Even though those PDFs are searchable... Weird that I didn't reach that conclusion. That makes full-text search unusable/useless except for programmers. Changing that is absolutely necessary, lest a big part of broot's potential appeal disappear.