Enhancement: Adding multiprocessing to DeepDiff
Is your feature request related to a problem? Please describe. I'm running DeepDiff on a list of dicts with 500k elements. The runtime is quite time-sensitive. It would be great if I could improve performance with multiprocessing.
Describe the solution you'd like Faster processing time.
Describe alternatives you've considered I tried the following solution: split the list into smaller lists, with the number of smaller lists matching the CPU count. I used multiprocessing to hash these smaller lists with DeepHash, then merged the results back together. The shared list from multiprocessing caused some performance issues; an Array could be a better solution, but it requires further testing.
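Roughly, what I tried looks like this (a minimal sketch: hash_chunk and parallel_deephash are placeholder names of my own, and merging the hashes back into an actual diff is omitted):

from multiprocessing import Pool, cpu_count

from deepdiff import DeepHash

def hash_chunk(indexed_chunk):
    # DeepHash(obj)[obj] returns the hash of the top-level object,
    # so each worker produces a small {index: hash} map for its chunk.
    return {i: DeepHash(item)[item] for i, item in indexed_chunk}

def parallel_deephash(items):
    n = cpu_count()
    indexed = list(enumerate(items))
    # One chunk per CPU; striding keeps the chunk sizes balanced.
    chunks = [indexed[i::n] for i in range(n)]
    with Pool(n) as pool:
        partials = pool.map(hash_chunk, chunks)
    # Merge the per-chunk maps back into a single {index: hash} dict.
    merged = {}
    for part in partials:
        merged.update(part)
    return merged

Note that on platforms that spawn worker processes instead of forking (e.g. Windows), the Pool call has to run under an if __name__ == "__main__": guard.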
Hello,
Are you using ignore_order=True? Have you seen the optimizations page:
https://zepworks.com/deepdiff/current/optimizations.html#
Yes, multiprocessing is not supported yet.
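For context, the tuning knobs documented on that page look roughly like this (illustrative placeholder values, not recommendations):

from deepdiff import DeepDiff

t1 = [{"a": 1}, {"b": 2}]  # placeholder data
t2 = [{"a": 1}, {"b": 3}]

diff = DeepDiff(
    t1, t2,
    ignore_order=True,
    max_passes=10000,                   # cap how many passes the diff may take
    cache_size=5000,                    # enable the hash/distance caches
    cache_tuning_sample_size=500,       # let DeepDiff sample and auto-tune cache use
    cutoff_intersection_for_pairs=0.7,  # skip deep pairing of mostly dissimilar iterables
)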
Sep Dehpour
Hi Sep,
Most of my data looks like this, and the difference between the two lists of dicts is small; I would say less than 10% differs. I'm only looking at the "iterable_item_removed" output of ddiff (see the sketch after the sample below).
[ {"contact_id": 1, "launch_id": 1, "domain": "gmx.de", "email_sent_at": "2020-10-22T07:18:22.000Z", "campaign_type": "batch", "bounce_type": "block", "campaign_id": 1, "message_id": 1, "event_time": "2020-10-26T07:32:15.000Z", "customer_id": 1, "partitiontime": "2020-10-26T00:00:00.000Z", "loaded_at": "2020-10-26T07:32:16.353Z"}, {"contact_id": 2, "launch_id": 1, "domain": "gmx.de", "email_sent_at": "2020-10-25T08:26:00.000Z", "campaign_type": "batch", "bounce_type": "block", "campaign_id": 1, "message_id": 2, "event_time": "2020-10-26T08:36:05.000Z", "customer_id": 2, "partitiontime": "2020-10-26T00:00:00.000Z", "loaded_at": "2020-10-26T08:36:06.983Z"} ]
Caching in this case makes sense. I tried running it with the following settings, but performance did not improve:
DeepDiff(json_a, json_b, ignore_order=True, hasher=DeepHash.murmur3_128bit, cache_size=500, cache_tuning_sample_size=500)
or
DeepDiff(json_a, json_b, ignore_order=True, hasher=DeepHash.murmur3_128bit, cache_size=500)
Somehow I never hit the cache:
{'DIFF COUNT': 29, 'DISTANCE CACHE HIT COUNT': 0, 'MAX DIFF LIMIT REACHED': False, 'MAX PASS LIMIT REACHED': False, 'PASSES COUNT': 1}
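(These counters come from the diff's get_stats() method; a minimal sketch of how I collect them, with placeholder inputs:)

from deepdiff import DeepDiff

t1 = [{"a": 1}]  # placeholder data
t2 = [{"a": 2}]

ddiff = DeepDiff(t1, t2, ignore_order=True,
                 cache_size=500, cache_tuning_sample_size=500)
print(ddiff.get_stats())
# e.g. {'PASSES COUNT': 1, 'DIFF COUNT': 1, 'DISTANCE CACHE HIT COUNT': 0, ...}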
Thanks, Andor