obsidian-omnisearch [Feature request] Search should also find non-md files (most importantly pdf)

Problem: The internal search allows you to find and open pdfs and other non-md files using keywords from their filename.

Solution: As the default search of Obsidian, Omnisearch should find non-md files when the typed keyword is part of their filename. Enter opens them. (The perfect solution would be to also index the content of pdf files ;-)).

May 24 '22 06:05 matthiashaldimann

Indexing PDFs is a long-term feature I'd like to implement :)

May 24 '22 06:05 scambier

Some of us are using the txt-as-md-obsidian plug-in to make files that are plain text but not .md files appear in Obsidian (I mostly use .rmd and .qmd files). Although this works and allows files to appear (and be editable) in Obsidian, they are not visible in the search (https://github.com/deathau/txt-as-md-obsidian/issues/1#issue-901311966).

It would be great if omnisearch could add that functionality. Just throwing my two cents in extending this enhancement to other file types beyond pdf. Since they are also just plain text, maybe it is even easier?

Thanks!

Sep 14 '22 22:09 jeffkimbrel

@jeffkimbrel

Added a new feature & setting to index plaintext files that are not markdown. This change should also make the plugin "txt-as-md" redundant, since Omnisearch itself will register txt (and other) files.

This change is deployed in 1.6.0-beta3, if you'd like to try it.

Sep 23 '22 18:09 scambier

Thank you very much, works as expected! Very much looking forward to pdf indexing ;-).

Sep 23 '22 21:09 matthiashaldimann

Awesome, this is great. Thanks so much!!

Sep 25 '22 21:09 jeffkimbrel

PDF support has been added in the latest beta build. There are a few bugs to iron out, but it's a start.

Don't hesitate to post feedback here 👍

Sep 29 '22 20:09 scambier

The feature is now published. It's not 100% complete yet, follow-up at #100

Sep 30 '22 21:09 scambier

Update - Omnisearch no longer registers the other files itself. It still indexes them, but Obsidian's behavior is respected, and the files will open in an external application by default. You may want to use another plugin (like plaintext for example) to view those files within Obsidian.

Oct 01 '22 13:10 scambier

There's a random Electron crash during PDF indexing, this change is reverted.

Oct 02 '22 20:10 scambier

Fixed and re-deployed as a BRAT beta version. Will need more feedback before this can go into release.

Oct 03 '22 12:10 scambier

Seems like downloading the files is considerably slow and does freeze Obsidian. It might be advisable to see if there is a way to do this in the background so that the main app is still responsive.

Searching is slowed down a little bit when searching with PDFs. However, while this was noticeable with my vault of about 1450 notes, it was never more than a minor inconvenience.

Searching inside of PDFs also appears to be working. Here I am searching for the first few words of the first paragraph in the PDF and this is finding a hit, albeit it isn't shown in the search results which is a little confusing:

This was done using all recommended settings, OmniSearch 1.6.5 Beta, and Obsidian 0.16.5 with the following plugins installed:

Annotator 0.2.6
AutoMoc 1.1.1
Calendar 1.5.10
DataView 0.5.46
ExcaliBrain 0.1.11
ExcaliDraw 1.7.22
File Explorer Note Count 1.2.0
Find orphaned files and broken links 1.8.0
Homepage 2.2.1
Note Refactor 1.7.1
Obsidian42 - BRAT 0.6.35
Omnisearch 1.6.5-beta
Raindrop Highlights - 0.0.14
Templater - 1.14.3

Default theme was used in case that helps (doubt it's relevant here though).

Total of 31 PDFs were in the vault I used for testing.

Loading delay and freeze was about 20-30 seconds without the "Store Index in File" option, with freeze occurring from about second 12 to second 30. Also tested using recommended settings of "BETA - Index PDFs" and "Store Index in File". Delay was also about 20-30 seconds when using the "Store Index in File" option, so this option did not seem to make a difference for my testing case.

Attempting to search before the indexing is complete does seem possible, so that is something to be aware of. In my experience this just led to missing results, which were corrected when I searched again a few seconds later, so it appears to be a minor issue.

Oct 09 '22 01:10 tekwizz123

Upping sample size to 231 PDFs whilst also not including the "Store Index in File" part produced similar results although this time I did get a warning from the plugin. It does disappear a little fast for my liking and only appeared for about 3-5 seconds so you do have to be quick to notice it.

Testing with 1000 files went over 200 seconds before I stopped counting. The left side of the screen on the file navigation bar would go in and out of being interactable but I was never able to use it to open any files.

Looking at task manager showed that Obsidian was stuck at somewhere between 37 and 45 percent CPU usage with about 700 MB to 1 GB of memory usage.

I thought this may be due to using Synology Drive to sync the files, so I paused it thinking the files might process quicker (I did see SearchFilterHost.exe using about 30% CPU at times). However, this didn't change things.

There should likely be a limit on the number of files or some way to cause this processing to occur in the background to prevent it freezing Obsidian itself.

Overall though I think this is a great addition, nice work! Just need to figure out how to optimize it to handle big PDF collections a bit better as it seems fine at lower PDF numbers but once the count goes up it becomes problematic.

Oct 09 '22 02:10 tekwizz123

Sorry realized the timing info earlier wasn't helpful so here is a quick update:

Using 967 PDF files with "BETA - Index PDFs" and "Store Index in File" takes 8 minutes. Without "Store Index in File" the time was similar, about 7 min 50 seconds.

Oct 09 '22 03:10 tekwizz123

@tekwizz123 Thanks for the feedback 👐 The indexing is already done in the background, so I'll split it into smaller chunks to ease the CPU usage.

Edit: actually it may not be done in the background; PDF.js seems not to load its worker correctly.
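A minimal sketch of what "splitting into smaller chunks" could look like, assuming the per-file work is exposed as an async function. This is not Omnisearch's code: `chunk`, `processInChunks`, and the `handle` callback are hypothetical names used only for illustration.

```typescript
// Hypothetical helper, not Omnisearch's code: split work items into fixed-size batches.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Handle one batch at a time and pause between batches so that queued
// UI events get a chance to run. `handle` stands in for the per-file work.
async function processInChunks<T>(
  items: readonly T[],
  size: number,
  handle: (item: T) => Promise<void>,
): Promise<void> {
  for (const batch of chunk(items, size)) {
    for (const item of batch) {
      await handle(item);
    }
    // A zero-delay timer hands control back to the event loop.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}
```

Chunking like this only shortens each blocking stretch. CPU-bound work still has to run in a worker to stay off the main thread entirely.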

Searching is slowed down a little bit when searching with PDFs

Indexing a PDF is done in 2 phases:

First the text is extracted. That's what takes time, and slows Obsidian.
Then the text is indexed in memory, just like a normal note.

So the search time shouldn't be impacted.
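The two phases can be sketched roughly as follows, assuming a pluggable extractor and a toy in-memory word index. `TinyIndex`, `Extractor`, and `indexPdf` are hypothetical names; the real plugin uses PDF.js for extraction and minisearch for indexing, neither of which is reproduced here.

```typescript
// Phase 1 is represented by a pluggable extractor (assumed to be the slow part).
type Extractor = (path: string) => string;

// Phase 2 target: a toy in-memory inverted index, word -> set of file paths.
// Tokenization is ASCII-only (\W+), which is a simplification.
class TinyIndex {
  private postings = new Map<string, Set<string>>();

  add(path: string, text: string): void {
    const tokens = text.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
    for (const token of tokens) {
      let paths = this.postings.get(token);
      if (!paths) {
        paths = new Set<string>();
        this.postings.set(token, paths);
      }
      paths.add(path);
    }
  }

  // Searching only reads the in-memory map; the extractor is never involved.
  search(word: string): string[] {
    const paths = this.postings.get(word.toLowerCase());
    return paths ? Array.from(paths).sort() : [];
  }
}

function indexPdf(path: string, extract: Extractor, index: TinyIndex): void {
  const text = extract(path); // phase 1: slow text extraction
  index.add(path, text);      // phase 2: index the text like a normal note
}
```

Because `search` touches only the in-memory map, the cost of extraction is paid once at indexing time and not on each query.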

Searching inside of PDFs also appears to be working. Here I am searching for the first few words of the first paragraph in the PDF and this is finding a hit, albeit it isn't shown in the search results which is a little confusing

I noticed some search results issues too, I'll have to check if it's related to PDFs or a more general bug 👍

Oct 09 '22 07:10 scambier

Beta update: https://github.com/scambier/obsidian-omnisearch/releases/tag/1.6.5-beta.3

PDF indexing is now correctly done in the background, and results are directly cached. You can safely close Obsidian even if the work is still in progress; it should just resume where it left off.

Oct 12 '22 19:10 scambier

Alright, PDF now also works on mobile. Unless some breaking bug is discovered by then, I'll release this feature by the end of next week. @tekwizz123 I'd be very thankful if you could put a stress test on it with your thousand PDFs :)

Oct 15 '22 16:10 scambier

@scambier Thanks, upgrading and testing this out now 👍

Oct 15 '22 20:10 tekwizz123

Still hitting some issues. Initially delays were only 2-5 seconds, but now they are getting longer. The delays are generally consistent in length; the more concerning part is that the left-hand pane for file navigation can stop responding at times. If that were all, it would be fine, but when I right-click on a file in the list and try to open it I just get this:

I also tried opening other files normally and got similar results. Opening up an Annotator file had PDF.js error out the first time, then the pages just showed up as blank/white pages after that.

For reference this was with a heavy stress test of just under 5000 files. The good news is CPU usage generally was around 30%-50%, so it has come down from earlier. Memory usage maxed out at around 4GB, and generally climbed from around 1.5GB to 4GB before dropping back to 1.5GB again.

Oct 15 '22 20:10 tekwizz123

Thank you. In the folder .obsidian\plugins\omnisearch, you should find a file named pdfCache.data. Could you tell me its size?

Oct 15 '22 20:10 scambier

Oops, looks like now Obsidian is black screen of deathing on me 😨

EDIT: Looks like in task manager there were 4 processes associated with Obsidian. It's now 3 processes, none of them are using CPU, and the memory usage is very low. Looks very much like something crashed and Obsidian didn't clean up the other processes 😕

Oct 15 '22 20:10 tekwizz123

Thank you. In the folder .obsidian\plugins\omnisearch, you should find a file named pdfCache.data. Could you tell me its size?

Huh I'm not seeing a file with that name, here is what I see:

Oct 15 '22 20:10 tekwizz123

Thank you. In the folder .obsidian\plugins\omnisearch, you should find a file named pdfCache.data. Could you tell me its size?

Huh I'm not seeing a file with that name, here is what I see:

🤔 The indexing is now done 100% in the background. The only task likely to freeze Obsidian is the cache file writing. It's throttled but I think it's just too big for the throttling delay, and it keeps overwriting itself - which explains why it's getting worse with time, since it's getting larger and larger. I was planning to do a cache refactor for the next version, but I'll have to do that sooner 😅
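To illustrate why a cache file that "keeps overwriting itself" gets slower as it grows, here is a small back-of-the-envelope model. It assumes each flush re-serializes the whole cache, and compares that with writing each entry once. The function names and the `entriesPerFlush` parameter (a stand-in for throttling) are hypothetical, not taken from the plugin.

```typescript
// Total bytes written if every flush rewrites the entire cache file.
// `entriesPerFlush` models throttling: one flush per N new entries,
// plus a final flush for any remainder.
function bytesWrittenWholeFile(entrySizes: readonly number[], entriesPerFlush = 1): number {
  if (!Number.isInteger(entriesPerFlush) || entriesPerFlush <= 0) {
    throw new RangeError("entriesPerFlush must be a positive integer");
  }
  let cacheSize = 0;
  let written = 0;
  entrySizes.forEach((size, i) => {
    cacheSize += size;
    const isLast = i === entrySizes.length - 1;
    if ((i + 1) % entriesPerFlush === 0 || isLast) {
      written += cacheSize; // the whole file is written again
    }
  });
  return written;
}

// Total bytes written if each entry is stored once, on its own.
function bytesWrittenPerEntry(entrySizes: readonly number[]): number {
  return entrySizes.reduce((sum, size) => sum + size, 0);
}
```

With four entries of 10 bytes each, rewriting the whole file on every entry writes 10 + 20 + 30 + 40 = 100 bytes, against 40 bytes for per-entry storage. Throttling reduces the number of flushes, but each flush still grows with the size of the cache.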

Thanks again, your stress tests are definitely helping me!

Oct 15 '22 20:10 scambier

@scambier For reference would that cache writing be affected by any settings? I did try disabling the use of the cache/index file in the settings but I'm still getting that issue when trying to open new files as well as the delay mentioned earlier.

Also saw when trying to open a file that one of the Obsidian subprocesses seemed to terminate for some reason, but not the one that seems to be running the PDF inspection code.

Oct 15 '22 20:10 tekwizz123

@tekwizz123

For reference would that cache writing be affected by any settings

No, extracted texts from PDFs are automatically saved in their own cache. Extraction is slow and eats a lot of CPU, so it'd be unusable without a cache.

Edit: unusable for users with a small number of PDFs, because the cache writing is certainly what breaks Obsidian in your case :p
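A minimal sketch of an extraction cache of this kind, assuming entries are keyed by file path and invalidated by modification time. The invalidation rule and the names `ExtractionCache` and `getText` are assumptions for illustration; the thread does not say how the real cache decides that an entry is stale.

```typescript
interface CachedText {
  mtime: number; // modification time the text was extracted at (assumed invalidation key)
  text: string;
}

// Hypothetical in-memory cache in front of a slow extractor.
class ExtractionCache {
  private entries = new Map<string, CachedText>();
  private extract: (path: string) => string;

  constructor(extract: (path: string) => string) {
    this.extract = extract;
  }

  // Returns cached text when the file is unchanged, otherwise re-extracts.
  getText(path: string, mtime: number): string {
    const hit = this.entries.get(path);
    if (hit && hit.mtime === mtime) {
      return hit.text;
    }
    const text = this.extract(path); // slow path, only on a miss or a changed file
    this.entries.set(path, { mtime, text });
    return text;
  }
}
```

Unchanged files never reach the extractor a second time, which is also what lets interrupted indexing pick up from already cached results.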

Oct 16 '22 07:10 scambier

I released a new beta version, the PDF cache is now handled by IndexedDB.

I ran some tests, and the only stutters left are caused by minisearch when indexing big files (e.g. ebooks). I'll see if I can defer that to a worker, but it's doubtful. Those stutters disappear when the "write index on disk" setting is enabled.

I think there's 0 chance that indexing 5000 PDFs is ever going to be 100% smooth, but I did everything I could to make it not crash :D

Edit: you should avoid enabling "write index in disk" though, you'll probably have the same issue as before due to the ultra-large index size. I didn't have time yet to refactor this part to IndexedDB.

Oct 16 '22 20:10 scambier

Thanks @scambier (sorry, writing from my work account). I agree 5000 PDFs will never be 100% smooth; that was more about testing for edge cases, like the crashes we experienced 👍 I'll see how this goes with the new update :)

Oct 17 '22 15:10 gwillcox-r7

Hmm still crashing and getting blank screens at the moment with just the PDF search capabilities on with latest version. Writing files to cache on disk seems to work for a bit longer but also hits the same issue.

Oct 17 '22 15:10 gwillcox-r7

Do you happen to know where I could download a bunch of different PDFs? It would be more efficient (and less annoying for you) if I could stress test myself.

Oct 17 '22 17:10 scambier

Do you happen to know where I could download a bunch of different PDFs? It would be more efficient (and less annoying for you) if I could stress test myself.

Sure, see https://corpora.tika.apache.org/base/packaged/pdfs/archive/pdfs_202002/ which has a nice corpus of tests for testing PDF parsers.

You can also use https://www.vm.ibm.com/library/pdfzip.html

Oct 17 '22 17:10 gwillcox-r7

Also more info at https://www.pdfa.org/stressful-pdf-corpus-grows/ if you want to see the related document listing more details on this.

Oct 17 '22 17:10 gwillcox-r7
