django-mysql
Smart Iteration Algorithms
The algorithm used by SmartChunkedIterator is currently restricted to the one from pt-online-schema-change, with filtering added. This works badly for really sparse distributions of results.
There are other algorithms we could use, listed in the pt-table-sync documentation. For example, the current algorithm appears there (approximately) as 'chunk'. The 'nibble' strategy would work quite well for sparse queries - it's the same one pt-archiver uses, relying on LIMIT to determine how many rows get fetched per step.
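To illustrate the difference, here is a rough sketch of the 'nibble' idea in plain Python (not django-mysql code). Instead of slicing the pk range into fixed-width chunks as 'chunk' does, each step asks for the pk of the n-th next matching row (in SQL, an `ORDER BY pk LIMIT n` query) and uses it as the chunk's upper bound, so every chunk covers roughly n matching rows no matter how sparse they are:

```python
def nibble_chunks(matching_pks, n):
    """Yield (lower, upper) pk bounds covering ~n matching rows each.

    `matching_pks` stands in for "SELECT pk ... ORDER BY pk"; in real
    SQL each step would be a single LIMIT n query against the table.
    """
    pks = sorted(matching_pks)
    i = 0
    while i < len(pks):
        window = pks[i : i + n]
        yield (window[0], window[-1])
        i += n


# Sparse, unevenly distributed ids: every chunk still holds <= 3 rows,
# where fixed-width chunks would mostly be empty or wildly uneven.
sparse = [1, 2, 500, 10_000, 10_001, 10_002, 90_000]
chunks = list(nibble_chunks(sparse, 3))
# → [(1, 500), (10000, 10002), (90000, 90000)]
```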
`algorithm` could be another argument to SmartChunkedIterator - maybe even defaulting to 'auto', with some heuristic to pick between the algorithms.
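As a hypothetical sketch of what that heuristic might look like (none of these names exist in django-mysql; the density threshold is an assumption), 'auto' could compare an approximate matching-row count against the pk span and fall back to 'nibble' when results look sparse:

```python
def pick_algorithm(algorithm, approx_matching, pk_span, sparse_threshold=0.01):
    """Hypothetical 'auto' heuristic: choose a chunking strategy from an
    estimate of result density (matching rows per unit of pk range)."""
    if algorithm != "auto":
        return algorithm
    if pk_span == 0:
        return "chunk"
    density = approx_matching / pk_span
    # Dense results: fixed-width pk chunks are fine.
    # Sparse results: nibble with LIMIT instead.
    return "chunk" if density >= sparse_threshold else "nibble"


print(pick_algorithm("auto", approx_matching=50, pk_span=1_000_000))
# prints 'nibble': 50 matches spread over a million ids is very sparse
```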
Hi @adamchainz
I know it's a bit of an old ticket, but I'd be interested in working on this. Would you have any suggestions for a setup to test the efficiency of different solutions outside of unit tests?
Many thanks for your great work!
I think you really need a big table with realistic fragmentation to investigate this properly. Also look in the source code for pt-table-sync.
Are you having any problems with the current algorithm?
We ran into some inefficiency when dealing with queries that have sparse results over big tables. This was due to the query conditions, but also due to deleted rows (with ids earlier in the table being more sparse, for example). We have found some workarounds for our cases (e.g. not using chunks at all for queries with smaller counts), and the iterator still works nicely most of the time, but I was personally interested in a better solution for sparse distributions.
Yeah sparse results are difficult because there's no predicting when the density of results changes. What the current algorithm does is just react to this by decreasing the chunk size rapidly.
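The reactive behaviour described above could be sketched like this (a simplified illustration, not the actual SmartChunkedIterator code; the parameter names and clamping factor are assumptions): aim for a target number of rows per chunk, and rescale the chunk width by the ratio of target to observed rows, so one unexpectedly dense chunk shrinks the width rapidly while growth into sparse regions stays capped:

```python
def adjust_chunk_size(current_size, rows_seen, target_rows,
                      min_size=1, max_growth=2.0):
    """Rescale the pk-range chunk width after processing one chunk.

    A chunk with far more rows than `target_rows` shrinks the width
    sharply (ratio < 1, unclamped); an empty or sparse chunk grows it,
    but only by at most `max_growth` per step.
    """
    if rows_seen == 0:
        # Empty chunk: grow, but cap the growth factor.
        return int(current_size * max_growth)
    ratio = min(target_rows / rows_seen, max_growth)
    return max(min_size, int(current_size * ratio))


# A chunk of width 1000 that held 10x the target shrinks to width 100.
print(adjust_chunk_size(1000, rows_seen=10_000, target_rows=1000))  # → 100
```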