lychee icon indicating copy to clipboard operation
lychee copied to clipboard

Faster way to read input files

Open mre opened this issue 4 years ago • 4 comments

After reading this comment I was wondering what would be the downsides of testing bstr for reading inputs.

The way I see it

  • we need a read-only view anyway
  • inputs don't necessarily have to be valid utf-8

This would probably be the fastest way to read inputs if there was a way to stream the input to the extractor (which would be a bigger change). Another alternative: memory maps.

This is just a thought for now. Would love to get people's opinions.

mre avatar Feb 16 '22 14:02 mre

where do you intend to use bstr?

we already use crates to extract links and traversing nodes of html. what else can we benefit from using bstr?

lebensterben avatar Feb 16 '22 16:02 lebensterben

We read the entire file into memory here and we pass that object around:

https://github.com/lycheeverse/lychee/blob/6d56c6b55cb01f0c0285419f3ccb89d86495f52a/lychee-lib/src/types/input.rs#L51

https://github.com/lycheeverse/lychee/blob/6d56c6b55cb01f0c0285419f3ccb89d86495f52a/lychee-lib/src/types/input.rs#L279-L281

I was thinking of ways to speed that up. Maybe bstr ist not the right abstraction for that. Maybe an mmap would be nicer for faster file I/O and less allocations? I would love to have a "view" into the file instead of allocating. Passing Cows around is tedious or at least it wasn't very elegant last time I tried. Just brainstorming ideas for now.

mre avatar Feb 16 '22 16:02 mre

The consumer of InputContent is in lychee-lib::extract module.

Most of time it's directly passed to html5gum or pulldown_cmark.

The performance improvement is possible but I doubt how much it would be.

More performance gain can be made if html5gum starts to allow unsafe code.

lebensterben avatar Feb 16 '22 17:02 lebensterben

We could test it and run a benchmark. It probably also depends on the platform. Ripgrep uses mmap, but for some reason not on macOS.

https://github.com/BurntSushi/ripgrep/blob/master/crates/searcher/src/searcher/mmap.rs

There are some general caveats mentioned in this article, but none we should be worried about for lychee. The article mentions that mmap was around 30% faster than pread. Not sure if that would be worth it. I still think it would be nice to stream the contents as read-only into the extractor without doing any allocations along the way.

mre avatar Feb 17 '22 12:02 mre

Converting this issue to a discussion, since it doesn't track any kind of planned work. The benchmark would still be valuable, but this is not fleshed out enough to be tackled as an actionable item.

mre avatar Nov 28 '22 23:11 mre