opal
opal copied to clipboard
Process multiple cells in one iteration, add padding to database sequences, optimize for cache
This method is explained in Rognes's article, under "Processing four consecutive cells along the database sequences". They claim that by processing multiple cells they have better usage of SSE registers and caching. Another thing is adding padding to database sequences so they always have the length that is mutliple of 4. That way I can check every 4 columns if some sequence ended instead of doing it every column, which could make things faster.
First, I am going to pad database sequences so they have length that is multiple of 4, and I am going to do it for SW. When I have this, I will be able to do householding jobs (checking for overflow and sequence end) every 4 columns instead of every column, and I hope to get some speedup on that.
Later, I can try processing 4 cells in a row at once, for extra speedup.
While doing this, I did some learning about cache optimizations, so I also did some work on that. I also fixed few things that I found were not optimal. All together:
- I aligned all arrays accessed by SIMD instructions, and aligned them correctly
- I joined prevHs and prevEs into one structure
- I moved declaration of query profile outside of inner loop
- I simplified tracking of lengths of sequences Although I was hoping that this stuff will help, I did not get any improvement. However, I left the changes, since they seem right. I did some profiling on cache misses using cachegrind, and I get 0% misses for small query sequences, and about 8% for larger query sequences (size of database does not matter).
I also did some measurements of code speed using very primitive method: when I want to see how slow is some part of code, I duplicate it and see for how longer does program execute now: that is how much time this piece of code normally takes. I have to do this carefully, because duplicated code must not change results of execution (because it might affect execution time then) and it also can not be trivial because compiler may optimize it out. Based by those measurements, core loop takes about 70% of execution time, and building query profile takes about 25% of execution time! This is good to know, because I was not sure how much time does query profile building take. TLDR:
Core loop: 70% of time Query profile building: 25% of time