Paul Masurel

Results 327 comments of Paul Masurel

@PSeitz It is not working like that actually. Terms are stored in blocks. The first term is serialized using the scheme you copy pasted. It is quite wasteful. The other...

> This sounds to me like a feature outside of the scope of Tantivy and better suited to projects that implement a distributed search engine over it. Well we could...

@umitgunduz reopening. @umitgunduz can you detail what you mean by dynamic field?

Here is a slightly curated output of `strace -fT` for a process that simply adds a doc and commits. ``` mmap(NULL, 1523712, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 getcwd( ) = 0x7f9de3a8a000...

This confirms `fsync` is the most costly operation, and all `fsync` calls are "expensive". Then we have 28 fsync calls... Which is waaaaay too many. Here is some detail: ``` fsync(4...

First surprise: Because we keep the list of files in managed.json, we end up doing way too many flush calls. Second surprise: The atomic write dance does not do any...

For the second surprise: tempfile does not do that for us. We need to call sync_all manually. I think we can consider that a bug in tantivy.

We do not really care about metadata. sync_data should be ok. Besides: it does update file metadata when needed (like len).
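To make the point about the atomic write concrete, here is a minimal sketch using only the Rust standard library. It is not tantivy's actual code: the `atomic_write` function and the `.tmp` naming are hypothetical, and a real implementation would pick a unique temporary name in the same directory. It writes to a temporary file, calls `sync_data` on it manually, and only then renames it over the target.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: replaces the file at `path` with `data`
/// using a write-to-temp, sync, rename sequence.
fn atomic_write(path: &Path, data: &[u8]) -> io::Result<()> {
    // Assumption: a fixed ".tmp" sibling is good enough for this sketch.
    let tmp_path = path.with_extension("tmp");
    {
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(data)?;
        // Nothing syncs the temp file for us, so do it explicitly.
        // sync_data skips metadata that is not needed to read the data back.
        tmp.sync_data()?;
    }
    // The rename makes the new content visible under the final name.
    fs::rename(&tmp_path, path)?;

    // On Unix, also sync the parent directory so the rename itself is persisted.
    #[cfg(unix)]
    {
        let dir = match path.parent() {
            Some(p) if !p.as_os_str().is_empty() => p,
            _ => Path::new("."),
        };
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

Calling `atomic_write(Path::new("meta.json"), b"{}")` would then leave either the old or the new content on disk, assuming the filesystem honors the sync calls. Each call costs at least one sync, which is why the number of such writes per commit matters.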

Some work was done in https://github.com/quickwit-inc/tantivy/commit/c3cc93406d9656a5674c2f62159f00516034710d. It:
- fixes the bug by adding some syncs
- replaces all fsync by fdatasync
- removes some fsyncs.

On Linux on a...

Reducing the number of syncs further in #1228. Now we are down to 16.