Replace usage of in_array() in MigrateExecutable::handleMissingSourceRows

Open mdolnik opened this issue 1 year ago • 0 comments

Describe the bug Usage of in_array() in MigrateExecutable::handleMissingSourceRows() is very inefficient for migrations with a very large number of rows.

To Reproduce Run any migration with a very large number of rows (e.g. 10,000+). While the migration itself shows a progress bar and lets you know when it's finished, the logic in handleMissingSourceRows() then makes the process appear frozen for an indeterminate amount of time.

Actual behavior Running a migration with many rows (in my case over 300,000 for upgrade_d7_file_private) took roughly 20-30 minutes for the actual migration, but then hung in MigrateExecutable::handleMissingSourceRows() for multiple hours until I had to stop the process manually.

Using in_array() can be very inefficient because it has to compare array values one by one until it finds a match. On top of that, the current logic is searching for an array within an array of arrays.
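
To illustrate the cost, here is a minimal standalone sketch of that lookup pattern. It is not the actual Drush code: the data, the `fid` key and the variable contents are made up for the example. Each in_array() call walks the list from the start and compares whole arrays, so running it once per row makes the total work grow roughly with the square of the row count.

```php
<?php
// Hypothetical data: one entry per source row, each holding that row's
// source ID values (the 'fid' key is an assumption for illustration).
$allSourceIdValues = [];
for ($i = 0; $i < 5000; $i++) {
    $allSourceIdValues[] = ['fid' => $i];
}

// Linear search: in_array() compares the needle array against each
// element in turn until one matches.
$found = in_array(['fid' => 4999], $allSourceIdValues);

// Worst case: the value is absent, so every element is compared.
$notFound = in_array(['fid' => 5000], $allSourceIdValues);

var_dump($found, $notFound);
```

The match in this sketch is the last element, so the first call scans nearly the whole list as well.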

Workaround Instead of using in_array(), the $allSourceIdValues property should be keyed by a unique ID so that isset() can be used.

A dedicated method that builds the key from the source ID values can then be used both when writing to the $allSourceIdValues property in MigrateExecutable::onPrepareRow() and when reading it in handleMissingSourceRows().
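
A rough sketch of this idea, outside of any class, could look like the following. The function name buildSourceIdKey(), the serialize()-based key format and the sample IDs are all assumptions for illustration, not the change that was actually made. The point is only that the same helper produces the key on both the write side and the read side.

```php
<?php
// Hypothetical helper (name and key format are assumptions): turns a
// row's source ID values into a string usable as an array key.
function buildSourceIdKey(array $sourceIdValues): string
{
    // Sort by field name so the key does not depend on element order.
    ksort($sourceIdValues);
    // Cast to strings so that 5 and "5" produce the same key.
    return serialize(array_map('strval', $sourceIdValues));
}

// Write side (the role onPrepareRow() plays in the issue): record each
// source row under its key.
$allSourceIdValues = [];
foreach ([['fid' => 1], ['fid' => 2], ['fid' => 3]] as $ids) {
    $allSourceIdValues[buildSourceIdKey($ids)] = true;
}

// Read side (the role handleMissingSourceRows() plays): find previously
// migrated rows that no longer exist in the source.
$mapRows = [['fid' => '2'], ['fid' => '4']];
$missing = [];
foreach ($mapRows as $ids) {
    // isset() is a hash lookup, so no scan over all rows is needed.
    if (!isset($allSourceIdValues[buildSourceIdKey($ids)])) {
        $missing[] = $ids;
    }
}

var_dump($missing);
```

Casting the values to strings is a design choice made for this sketch, on the assumption that IDs may arrive as integers from one place and as strings from another.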

With this change, the post-migration logic for the 300k-row example above finished within a few minutes instead of multiple hours.

mdolnik, Sep 25 '23 22:09