Content search in PDF

Open evanescente-ondine opened this issue 3 years ago • 15 comments

Hi! I have a folder with tons of articles mentioning the MHC (major histocompatibility complex), both in the title and in the body of the article. When I type c/MHC, only two or three files appear, while at least twenty of them should match. I checked. It's as if broot had barely started to index the content, then stopped, and I don't know how to re-index. How does it work? Having more control over that, as with recoll's indexer, would be neat.

Latest broot, EndeavourOS.

evanescente-ondine avatar Feb 03 '22 17:02 evanescente-ondine

Broot version 1.9.2, macOS

I ran into the same problem. I'm in a directory with 78 subdirectories full of Python files. Using c/import, I only get one result.

dkao1978 avatar Feb 03 '22 19:02 dkao1978

Using ctrl-s does reveal all the search results.

dkao1978 avatar Feb 03 '22 19:02 dkao1978

Did you try hitting ctrl-s ? When there are no shallow results and the disk is slow, the search may stop before everything is scanned.

EDIT: ok, you did.

Canop avatar Feb 03 '22 19:02 Canop

Downgrading to 1.9.1 fixes the problem.

dkao1978 avatar Feb 03 '22 19:02 dkao1978

Uh ? That's interesting. Would the search be slower now ? I'll have to check.

Are you sure you're not just trying on a warm disk ?

Canop avatar Feb 03 '22 19:02 Canop

ctrl-s does nothing: "search was already total".

evanescente-ondine avatar Feb 03 '22 20:02 evanescente-ondine

So there's no other file matching. It can be related to hidden files, or files ignored because of .gitignore rules. You may try alt-i and alt-h to toggle showing those files too.

Canop avatar Feb 03 '22 20:02 Canop

alt-j or alt-h do nothing. Even though the file does not match, it IS here, and it's not hidden. See the screenshot: [image]. I wondered if it qualified as "NSFW", but probably not. Only a man of culture can program so well anyway ;-)

evanescente-ondine avatar Feb 03 '22 20:02 evanescente-ondine

Could you explain, in broad terms, how the full-text search works? I wish we could configure which kinds of files are indexed. I mean, if you wish to replace recoll, these features have to be made configurable.

evanescente-ondine avatar Feb 05 '22 17:02 evanescente-ondine

It's explained here: https://dystroy.org/broot/input/

You might be interested in this introduction too: https://dystroy.org/blog/broot-c-search/

Canop avatar Feb 05 '22 17:02 Canop

Thanks, it was useful to refresh my memory, but I had already read that, everything actually ;-) I was rather wondering about the internal implementation. I understand that it doesn't build a database but opens each file. On non-SSD disks that is quite taxing, even though it's lighter on memory and spares you the hassle of indexing. Filters are more than enough for searching, BUT it would be neat if they could also cover file size or other metadata. Being already in the folder where the file is, it's certainly not that.

evanescente-ondine avatar Feb 05 '22 20:02 evanescente-ondine

The content search you use filters files to keep the ones containing exactly the searched pattern. It's an exact search, not a fuzzy one, nor a case-insensitive one.
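To make "exact, not fuzzy, not case-insensitive" concrete, here is a minimal Python sketch of an index-free content scan. It is not broot's implementation (broot is written in Rust); the function names `content_matches` and `search_tree` are made up for illustration, and reading each file whole is a simplification.

```python
import os


def content_matches(path: str, pattern: str) -> bool:
    """Return True if the file's raw bytes contain `pattern` exactly.

    The comparison is byte-for-byte, so it is case-sensitive and
    tolerates no typos. Hypothetical helper, for illustration only.
    """
    needle = pattern.encode("utf-8")
    try:
        with open(path, "rb") as f:
            return needle in f.read()
    except OSError:
        # Unreadable files simply do not match.
        return False


def search_tree(root: str, pattern: str) -> list:
    """Walk `root` and open every file: no index is built or consulted."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if content_matches(path, pattern):
                hits.append(path)
    return sorted(hits)
```

With such a scan, `search_tree(".", "MHC")` would find a file containing "MHC" but not one containing only "mhc" or "M.H.C.", and every search re-reads the files from disk, which is why a cold or slow disk makes it feel sluggish.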

But I can't start to analyze the problem you mention without test data. Please upload somewhere the file that isn't found and show me the pattern which doesn't work.

Canop avatar Feb 06 '22 06:02 Canop

/pdf$/i&(cr/male/i): removing "male" shows all files with a pdf extension, but whatever I put inside the slashes, even single letters, removes all the results. Removing the filename search part does allow for content search, though. But Broot immediately freezes or slows down considerably. Content search alone shows nothing either. Try with this article: https://drive.google.com/file/d/1EaVCkwlpGJi9EkEZCQiCPwNr0HyM6Jml/view?usp=sharing

c/male in the ~ directory does show a few files, but no complex filtering pattern works, and even without one, in the directory where several files have "male" in either the name or the content, only one appears. Well, it did that once... I can't seem to reproduce the behavior. Yesterday, a complex pattern made broot properly crash. I'll try to repeat that.

evanescente-ondine avatar Feb 06 '22 12:02 evanescente-ondine

Broot doesn't search into binary files, only text files.
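For readers wondering how a tool can tell the two apart, here is a minimal Python sketch of one common heuristic: treat a sample as binary if it contains a NUL byte or is not valid UTF-8. This is an assumption for illustration, not broot's actual detection logic, and `looks_binary` is a hypothetical name.

```python
import codecs


def looks_binary(sample: bytes) -> bool:
    """Guess whether `sample` (e.g. the first few KB of a file) is binary.

    Heuristic (assumed, not taken from broot): a NUL byte or an invalid
    UTF-8 sequence means "binary".
    """
    if b"\x00" in sample:
        return True
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the
        # end of the sample instead of reporting it as invalid.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False
```

Under this heuristic a Python source file passes as text, while a PDF whose header is followed by the customary line of high-bit bytes (such as `%\xe2\xe3\xcf\xd3`) is classified as binary. PDFs also usually store page text in compressed streams, so a plain byte search would rarely find the words even if the file were scanned.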

Canop avatar Feb 06 '22 15:02 Canop

Ah! I overlooked that detail. Even though those PDFs are searchable... Weird that I didn't reach that conclusion. That makes full-text search unusable/useless except for programmers. Changing that is absolutely necessary, lest a big part of broot's potential appeal disappear.

evanescente-ondine avatar Feb 06 '22 16:02 evanescente-ondine