
A few suggestions/ideas.

Open · balayanv opened this issue 1 year ago · 1 comment

Since our best times are in the realm of 2 seconds, IMO it would be better to run not just 5 times but 10-15, discard the outliers, and apply the same formula; the chance of getting lucky would be much lower.
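Just to illustrate what I mean by "more runs, same formula", here is a toy sketch of averaging N timed runs after dropping the extremes. The run count and trim amount here are made up for the example, not the actual evaluation parameters:

```java
// Toy sketch: average N timed runs after dropping the fastest and slowest few.
// The run count and trim amount are assumptions, not the real 1brc parameters.
import java.util.Arrays;

public class TrimmedMean {

    static double trimmedMeanSeconds(double[] runSeconds, int trimEachSide) {
        double[] sorted = runSeconds.clone();
        Arrays.sort(sorted);
        double sum = 0;
        int kept = 0;
        for (int i = trimEachSide; i < sorted.length - trimEachSide; i++) {
            sum += sorted[i];
            kept++;
        }
        return sum / kept;
    }

    public static void main(String[] args) {
        // 10 hypothetical run times in seconds; drop the 2 fastest and 2 slowest.
        double[] runs = {2.10, 1.98, 2.05, 2.40, 2.02, 2.03, 1.99, 2.07, 2.01, 2.06};
        System.out.printf("score: %.3f s%n", trimmedMeanSeconds(runs, 2));
    }
}
```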

Since we only use 8 out of 32 cores, it is also a matter of luck whether each of your fork-join threads actually gets a core to itself or is preempted by something else (I assume we don't have any thread affinity/isolation set up on the testing hardware).

Also, depending on which actual cores get allocated to the task, you may end up with good or bad core-to-core latency; AMD CPUs are worse at this than Intel. See https://chronicle.software/why-core-to-core-latency-matters/

Now here is one suggestion. When I run some of the best submissions locally (Mac M2 Pro) and log how quickly each core finishes processing its own segment, I see finish times varying by up to 10% (I did this on 100M rather than 1B rows to make sure all of the data fits in RAM; the spread should grow with the dataset). Very rarely do they finish uniformly. Granted, mine is a consumer machine and all sorts of other things may be impacting the cores, but I expect it to be similar on a server.

This means we can improve the numbers by making sure the workers that finish first do not just sit idle waiting for the slowest one. Instead, work stealing comes to the rescue. This can be done very efficiently with ForkJoinPool CountedCompleters, which do the minimal possible synchronization and give us complete control over how to break down tasks. For example, we could do the following.
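This is roughly the kind of measurement I did; a rough sketch, where processSegment() is just a stand-in for the real parsing work over a fixed per-worker split:

```java
// Rough sketch (processSegment() is a placeholder, not code from any actual
// submission): run one worker per core over a fixed split of the data and log
// when each one finishes, to see how uneven the finish times are.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SegmentTiming {

    public static void main(String[] args) throws InterruptedException {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long start = System.nanoTime();
        for (int i = 0; i < workers; i++) {
            int segment = i;
            pool.submit(() -> {
                processSegment(segment); // stand-in for parsing this worker's file chunk
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("worker %d finished after %d ms%n", segment, elapsedMs);
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    private static void processSegment(int segment) {
        // placeholder: parse and aggregate one fixed segment of the input file
    }
}
```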

Assume all 8 cores are actively processing their respective segments. Each worker could check a volatile boolean flag indicating that some worker has already finished its segment; that is our cue to start breaking down the remaining work so it can be stolen.

The first worker that manages to finish writes to that volatile field; every worker reads the field (which is about as cheap as a normal read) every N elements processed (the exact N is a matter of experimentation). As soon as the other workers observe that someone has finished, they break down their remaining task (create forks), which lets the idle workers join in and hopefully finish sooner.

The volatile flag should not have any significant impact, since we write to it only once and read it the rest of the time.
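For illustration, here is a minimal, untuned sketch of the idea with one CountedCompleter per segment. Names such as processRow(), CHECK_INTERVAL, and MIN_SPLIT are placeholders I made up, not code from any submission:

```java
// Sketch only: each worker owns a [lo, hi) range, polls a shared volatile flag
// every CHECK_INTERVAL rows, and once somebody has finished it forks off the
// far half of its remaining range so idle workers can steal it.
import java.util.concurrent.CountedCompleter;
import java.util.concurrent.ForkJoinPool;

public class StealableSegment extends CountedCompleter<Void> {

    static volatile boolean someoneFinished = false; // written rarely, read often
    static final long CHECK_INTERVAL = 1 << 16;      // rows between flag checks
    static final long MIN_SPLIT = 1 << 20;           // don't split tiny remainders

    private final long lo;
    private final long hi;

    StealableSegment(CountedCompleter<?> parent, long lo, long hi) {
        super(parent);
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    public void compute() {
        long pos = lo;
        long end = hi;
        while (pos < end) {
            long batchEnd = Math.min(pos + CHECK_INTERVAL, end);
            for (; pos < batchEnd; pos++) {
                processRow(pos); // placeholder for parsing/aggregating one row
            }
            // Someone is already idle and we still have a big chunk left:
            // fork the far half so it can be stolen.
            if (someoneFinished && end - pos > MIN_SPLIT) {
                long mid = (pos + end) >>> 1;
                addToPendingCount(1);
                new StealableSegment(this, mid, end).fork();
                end = mid;
            }
        }
        someoneFinished = true; // signal the other workers
        tryComplete();          // completes this task once all forked children are done
    }

    private static void processRow(long row) {
        // placeholder: parse one line and update the per-thread aggregation
    }

    public static void main(String[] args) {
        long totalRows = 100_000_000L;
        ForkJoinPool pool = ForkJoinPool.commonPool();
        int workers = pool.getParallelism();
        long chunk = (totalRows + workers - 1) / workers;
        StealableSegment[] tasks = new StealableSegment[workers];
        for (int i = 0; i < workers; i++) {
            long lo = i * chunk;
            long hi = Math.min(lo + chunk, totalRows);
            tasks[i] = new StealableSegment(null, lo, hi);
            pool.submit(tasks[i]);
        }
        for (StealableSegment task : tasks) {
            task.join(); // returns once the segment and any splits stolen from it are done
        }
    }
}
```

The splitting keeps halving the far end of the remaining range, so a slow worker sheds work repeatedly while fast workers pick it up from the fork-join deques; the only cross-thread coordination before that point is the occasional volatile read.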

balayanv · Jan 13 '24 13:01

Hey, it is not quite clear to me what you are suggesting (apart from running with more iterations than 5, which indeed I am already doing for the top contenders). Are you proposing a change to the test set-up, or rather a change to specific implementations? If the former, could you clarify what it is? If the latter, that's not something I could do myself, but some of the participants did indeed discuss work-stealing approaches. So far no one has implemented it, though. If you'd like to do it, your submission will be very welcome (note that by now we require implementations to complete in 10 sec or less on the reference machine, as it became quite tough to manage so many submissions).

gunnarmorling · Jan 13 '24 22:01

Yes, you are correct: for the contest I was only suggesting more iterations (I didn't know you already do that),

and for the other devs I was suggesting to measure the time difference between the first-finishing worker/core and the last; if it's big enough, then maybe there is an opportunity to improve there.

I'll give it a try myself.

balayanv · Jan 14 '24 04:01