
Smart Iteration Algorithms

Open adamchainz opened this issue 10 years ago • 4 comments

SmartChunkedIterator is currently restricted to the algorithm from pt-online-schema-change, with filtering added. This works badly for really sparse distributions of results.

There are other algorithms we could use, several of which are listed in the pt-table-sync documentation. For example, the current algorithm appears there (approximately) as 'chunk'. The 'nibble' strategy would work quite well for sparse queries - it's the same approach as pt-archiver, using LIMIT to determine how many rows get fetched.
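To make the difference concrete, here's a minimal sketch of a 'nibble'-style iterator - this is not django-mysql code, just an illustration of the LIMIT/keyset idea using an in-memory SQLite table (the `nibble` function name and signature are invented for this example). Each chunk is defined by "the next N matching rows after the last key seen", so chunk size in rows stays constant no matter how sparse the matches are:

```python
import sqlite3

def nibble(conn, table, key, where, chunk_size):
    """Yield chunks of keys via keyset pagination: each chunk holds up to
    chunk_size *matching* rows, regardless of gaps in the key space."""
    last_key = None
    while True:
        sql = f"SELECT {key} FROM {table} WHERE {where}"
        params = []
        if last_key is not None:
            # Resume strictly after the last key of the previous chunk.
            sql += f" AND {key} > ?"
            params.append(last_key)
        sql += f" ORDER BY {key} LIMIT ?"
        params.append(chunk_size)
        rows = [r[0] for r in conn.execute(sql, params)]
        if not rows:
            return
        yield rows
        last_key = rows[-1]

# Demo: a sparse filter -- only every 50th id matches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 1001)])
chunks = list(nibble(conn, "t", "id", "id % 50 = 0", chunk_size=5))
print(len(chunks))   # 4 chunks, each with exactly 5 matching keys
print(chunks[0])     # [50, 100, 150, 200, 250]
```

A range-based 'chunk' algorithm over the same filter would scan many id ranges containing zero matches before finding anything.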

algorithm could be another argument to SmartChunkedIterator - maybe even defaulting to 'auto', with some heuristic to pick between the algorithms.
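One possible shape for such an 'auto' heuristic - purely hypothetical, not anything django-mysql implements - would be to compare the number of matching rows against the span of the primary key and pick 'nibble' when matches are sparse:

```python
def pick_algorithm(matching_rows, min_key, max_key, sparse_threshold=0.1):
    """Hypothetical 'auto' heuristic. If matches are dense across the key
    range, range-based 'chunk' iteration is cheap; if they are sparse,
    LIMIT-based 'nibble' avoids scanning many empty ranges.
    The function name and threshold are illustrative only."""
    key_span = max(max_key - min_key + 1, 1)
    density = matching_rows / key_span
    return "chunk" if density >= sparse_threshold else "nibble"

print(pick_algorithm(900_000, 1, 1_000_000))  # dense table  -> 'chunk'
print(pick_algorithm(2_000, 1, 1_000_000))    # sparse match -> 'nibble'
```

In practice the inputs would come from cheap estimates (e.g. an approximate count plus MIN/MAX of the key), since an exact count defeats the purpose.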

adamchainz avatar Mar 21 '15 17:03 adamchainz

Hi @adamchainz

I know it's a bit of an old ticket, but I'd be interested in working on this. Would you have any suggestions on a setup to test the efficiency of different solutions outside of unit tests?

Many thanks for your great work!

Geekfish avatar Jul 27 '18 10:07 Geekfish

I think you really need a big table with realistic fragmentation to investigate this properly. Also look in the source code for pt-table-sync.

Are you having any problems with the current algorithm?

adamchainz avatar Jul 27 '18 11:07 adamchainz

We ran into some inefficiency when dealing with queries that have sparse results over big tables. This was partly due to the query conditions, but also due to deleted rows (for example, rows with earlier ids being sparser). We have found some workarounds for our cases (e.g. not using chunks at all for queries with smaller counts), and the iterator still works nicely most of the time, but I was personally interested in a better solution for sparse distributions.

Geekfish avatar Jul 27 '18 13:07 Geekfish

Yeah, sparse results are difficult because there's no predicting when the density of results changes. The current algorithm just reacts to this by decreasing the chunk size rapidly.
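The reactive-resizing idea can be sketched like this - a simplified illustration of the general technique, not SmartChunkedIterator's actual code, and all names are invented for the example. The next chunk's key range is scaled by target/actual so chunks converge toward a target row count:

```python
def adapt_chunk_size(chunk_size, rows_found, target, min_size=10):
    """Scale the next chunk's key range toward 'target' matching rows.
    Illustrative sketch only; parameter names are not django-mysql's."""
    if rows_found == 0:
        # An empty range: grow aggressively to skip over the gap.
        return chunk_size * 2
    ratio = target / rows_found
    # Dampen swings so one unusual chunk doesn't whipsaw the size.
    ratio = max(0.5, min(2.0, ratio))
    return max(min_size, int(chunk_size * ratio))

# A dense region (2000 matches) halves the range; a sparse one doubles it.
print(adapt_chunk_size(1000, rows_found=2000, target=1000))  # 500
print(adapt_chunk_size(1000, rows_found=100, target=1000))   # 2000
```

The weakness for sparse data is visible here: the size can only double per step, so after a long dense stretch it takes many near-empty chunks to grow back.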

adamchainz avatar Jul 27 '18 13:07 adamchainz