1brc
Suggestion: clear disk cache between runs
Hi, in order to have a fair comparison between I/O methods, I think that when measuring multiple runs it would be better to clear the disk cache before every run. Otherwise only the first run actually reads the file from disk, while all subsequent reads are served directly from the page cache in RAM, which could lead to wrong overall conclusions.
To clear the disk cache you can run this command: sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
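To apply this across repeated measurements, the one-liner can be wrapped in a small harness that drops the cache before each timed run. This is only a sketch under a few assumptions: Linux, a root shell (the cache drop is skipped when /proc/sys/vm/drop_caches is not writable), and a hypothetical input file name standing in for the real measurements file.

```shell
#!/bin/sh
# Sketch: time N runs of a command with a cold page cache before each run.
# Assumes Linux; the cache drop only happens when running as root,
# otherwise it is silently skipped and the run is warm-cache.
set -eu

FILE="${1:-/tmp/sample_measurements.txt}"   # hypothetical input file
RUNS=5

# Create a small sample file if none is given, so the sketch is runnable.
[ -f "$FILE" ] || seq 1 100000 > "$FILE"

for i in $(seq 1 "$RUNS"); do
    sync
    if [ -w /proc/sys/vm/drop_caches ]; then
        echo 3 > /proc/sys/vm/drop_caches   # free page cache, dentries, inodes
    fi
    start=$(date +%s%N)
    cat "$FILE" > /dev/null                 # stand-in for the solution under test
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
done
```

With the cache dropped each iteration, every run pays the cold-read cost, so the timings stay comparable across runs.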
Btw very nice challenge!
The goal is to make it as fast as possible. If the disk cache helps, why disable it? The measurement is done 5 times and the lowest and highest values are discarded, afaik.
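The scoring described above (five runs, discard the fastest and slowest, average the rest) can be sketched like this; the run times below are made-up sample values, not real measurements:

```shell
#!/bin/sh
# Hypothetical run times in seconds; the real values come from the benchmark.
times="4.1 3.9 4.0 7.2 3.8"

# Sort numerically, drop the lowest and the highest, average the middle three.
avg=$(printf '%s\n' $times | sort -n | sed '1d;$d' | awk '{s+=$1} END {printf "%.2f", s/NR}')
echo "trimmed mean: $avg"
```

Discarding the extremes makes a single cold-cache outlier (like the 7.2 above) drop out of the score, which is exactly why the first, uncached run barely affects the result.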
Let me explain in more detail why I think the disk cache can be deceptive in this case.
The goal of the challenge is to explore how far modern Java can be pushed for aggregating one billion rows from a text file.
This means, in my opinion, that we should be measuring the end-to-end time required to:
- read the file from disk.
- perform the aggregations.
- print the output.
Now let's say there are two solutions, A and B. Assume that solution A is faster than solution B when reading the file from the disk cache, but slower when reading it from disk (the first, cold read).
With the current measurement approach, which doesn't take disk caching into account, solution A would win. But in a real-world application, where the file is read from disk only once, I think we would prefer solution B, because it would be faster.
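To make the A/B scenario concrete, here is a tiny sketch with hypothetical timings (all numbers are invented for illustration): A wins once the file is cached, B wins on a cold read, so the two measurement approaches crown different winners.

```shell
#!/bin/sh
# Hypothetical end-to-end times in seconds, chosen only to illustrate the point.
A_COLD=20; A_WARM=4    # solution A: slow cold read, very fast from page cache
B_COLD=12; B_WARM=6    # solution B: fast cold read, a bit slower from cache

[ "$A_WARM" -lt "$B_WARM" ] && echo "warm-cache winner: A"
[ "$B_COLD" -lt "$A_COLD" ] && echo "cold-cache winner: B"
```

A benchmark that only ever measures warm-cache runs would never surface the second line.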
Hence the suggestion to take disk caching into account by clearing the cache between subsequent runs, in order to remove this artifact.
Please note that even though this issue may seem academic, it leads to huge differences in timing, by a factor of 2 or even more.
I don't think it's academic at all; it's a reasonable request. But it would make for a very different challenge. The way it has been (intentionally) set up, disk speed does not matter. Existing entries to the challenge are implemented based on that assumption, so flushing the page cache at this point would significantly move the goalposts, which we should avoid. This would make an interesting topic for another challenge, at another time, though.
It would be awesome to check with I/O speed in mind. In SirixDB I'm having a hard time figuring out what's wrong with a Direct I/O based file backend, for instance (the Direct I/O API in general is very unintuitive IMHO... ;-))
I'm gonna close this, as I think it has been answered. Another challenge, at another time :)