
Performance on large directories is bad

Open tsbertalan opened this issue 4 years ago • 10 comments

I have a project directory that includes a node_modules directory with a couple thousand markdown files. While I believe that a few thousand markdown files should be a doable task, in the short term, is there perhaps a way I could tell markdown-links to ignore some folders? If I move this folder out of my directory tree and restart, it's able to handle the remaining 400 or so files admirably.

In fact, the whole Foam stack, including this repo and the vscode-markdown-notes extension, can then successfully handle my ~450,000-file Dropbox (with only those 400 markdown files), which Zettlr can't yet do. (Although, for some reason, only 285 files are reported in the corner of the graph view, and some tasks, like finding backlinks, are a bit slow.)

tsbertalan avatar Jul 23 '20 03:07 tsbertalan

Ignoring files is in progress. @ingalless started a PR which should help significantly with it. Beyond that, we are planning to respect the .gitignore file, which should also cover node_modules in all or nearly all cases.
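For illustration, path-based ignoring can be as simple as skipping any file whose path contains an excluded directory segment. This is only a sketch (the segment list and helper name are hypothetical, not the actual PR):

```typescript
// Hypothetical sketch of directory-based ignoring; not the extension's actual code.
const IGNORED_SEGMENTS = ["node_modules", ".git"];

function shouldIgnore(filePath: string): boolean {
  // Normalize Windows separators so the check works on both platforms.
  const parts = filePath.replace(/\\/g, "/").split("/");
  return parts.some((part) => IGNORED_SEGMENTS.includes(part));
}
```

A real implementation would read the patterns from user settings or .gitignore rather than hard-coding them.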

Regarding files that are not reported: they might not have been discovered, which most likely means they are missing a Markdown H1 title. That was an early assumption that no longer holds for many projects, and we are gathering feedback on improving it. See #28.

Thanks for reporting the problem!

tchayen avatar Jul 24 '20 10:07 tchayen

I have the same issue: I've loaded about 1,200 files (which will grow to at least 3,000 soon), the graph takes minutes to load, and it is extremely laggy when zooming and panning.

Is it possible, at the very least, to save it all so that start-up is quick? I assume the slowness in general usage means that an entirely different way of generating it is needed. For example, Obsidian uses D3 and Pixi.js and the same files perform flawlessly in their graph.

nixsee avatar Jul 24 '20 19:07 nixsee

Do you know if they are using d3 just for data manipulation or d3-force for simulation too?

Here we currently use d3-force, rendered and manipulated as SVG. Moving it to WebGL would definitely help with performance. I have some experience with this exact kind of thing, so I will try playing with it.

> Is it possible, at the very least, to save it all so that start-up is quick?

You mean save the previous graph data so it might start from the cache and just do the simulation step? Actually, I have never measured performance, so the first thing would be to check what is really causing the slowness here.
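One way the caching idea could look, purely as an illustration (the shapes and function names below are hypothetical, not the extension's code): serialize each node's simulated x/y and feed it back as the starting position on the next run, so d3-force only has to settle the layout instead of computing it from scratch.

```typescript
// Hypothetical cache for simulated node positions.
interface CachedNode {
  id: string;
  x: number;
  y: number;
}

function savePositions(nodes: CachedNode[]): string {
  return JSON.stringify(nodes);
}

function restorePositions(
  json: string,
  currentIds: string[]
): Map<string, { x: number; y: number }> {
  const cached: CachedNode[] = JSON.parse(json);
  const positions = new Map<string, { x: number; y: number }>();
  for (const node of cached) {
    // Only reuse positions for notes that still exist in the workspace.
    if (currentIds.includes(node.id)) {
      positions.set(node.id, { x: node.x, y: node.y });
    }
  }
  return positions;
}
```

d3-force respects pre-set x/y on nodes as initial positions, so restoring them would also keep the layout roughly stable between runs.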

tchayen avatar Jul 25 '20 12:07 tchayen

I don't know what they do exactly, but am pretty sure they use d3-force. Seems like WebGL or Canvas would be a good solution to the data scale issue, so long as they don't come with other big tradeoffs.

> You mean save the previous graph data so it might start from the cache and just do the simulation step? Actually, I have never measured performance so the first thing would be to check what is really causing the slowness here.

Yes, I figured it would be worthwhile to cache the data rather than rebuild. I have no idea about any of this, but your idea to check where the bottleneck actually is sounds like a good plan!

Here's the dataset I was using: https://github.com/nickmilo/IMF-v3 (it's a great method/template to look through and learn from anyway!)

nixsee avatar Jul 25 '20 20:07 nixsee

@nixsee @tchayen Worth mentioning: performance is not the only reason you might want to cache node locations. It's also valuable for keeping their positions consistent from run to run, which improves usability.

tsbertalan avatar Jul 30 '20 22:07 tsbertalan

Just leaving another option I've been mulling over:

What about a config option for large datasets that makes the graph parser scan links only for the currently open file, perhaps only 1-2 nodes out? You would lose the ability to see the whole graph, and perhaps caching is a better option, but I thought I'd drop this here.

ingalless avatar Sep 07 '20 10:09 ingalless

@ingalless How does this differ from @ianjsikes's Focus graph? Ultimately I prefer that view, but full graph mode is also useful at times.

Aside from probably requiring a lot of code changes, what would be the drawback of using WebGL/canvas for better performance?

nixsee avatar Sep 07 '20 18:09 nixsee

@nixsee part of the performance issue seems to be not only rendering the graph, which changing the rendering engine would solve, but also parsing all the markdown files in a project to gather all the edges and nodes.

@ianjsikes's branch, to my knowledge, doesn't actually change how markdown-links parses the files. It "hides" nodes that aren't relevant, but do correct me if that understanding is wrong! The proposed solution was the ability to selectively switch to an "on demand" mode, where processing would only happen on node change, and then only for 1-2 edges out.
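That "on demand" mode could be sketched as a depth-limited traversal over the link graph; the adjacency map and function name here are hypothetical, just to make the idea concrete:

```typescript
// Hypothetical sketch: starting from the currently open file, collect only
// the nodes within `maxDepth` links of it, via a breadth-first traversal.
function nodesWithinDepth(
  links: Map<string, string[]>,
  start: string,
  maxDepth: number
): Set<string> {
  const visible = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of links.get(node) ?? []) {
        if (!visible.has(neighbor)) {
          visible.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return visible;
}
```

Only the files in the returned set would then need to be parsed and rendered on each node change.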

A combined solution would attack both performance bottlenecks, although switching to WebGL would at least give a smooth experience to those who are fine waiting for all the nodes to load.

I've not checked, but could you even batch load nodes so that the user can see something whilst distant nodes are loading?

This is an alternative, or even complementary, solution to the issue, as "fix performance" is very general.

ingalless avatar Sep 07 '20 19:09 ingalless

@ingalless I don't have any insight into how any of this works, so I assume you are correct - I was merely asking! I'll leave it to all you folks to figure out what is the best way forward. But it would seem to me that, in addition to an adjustable-distance Focus mode, being able to efficiently load and smoothly work with the full graph would be a desirable goal. Obsidian uses d3js and PixiJS for their rendering, and it works pretty quickly and smoothly with a couple thousand nodes.

nixsee avatar Sep 08 '20 00:09 nixsee

I can confirm: 1,200 .md files take around 20 seconds to render as a graph, and the result is already painful to interact with. There are no other files in the directory. All the ideas I have were already mentioned above: caching, and a smart scan from the current node to its neighbors (if it is possible to re-render nodes in real time, of course).

Atarity avatar Jan 27 '21 10:01 Atarity