
Enhancement: Adding multiprocessing to DeepDiff

Open andormarkus opened this issue 5 years ago • 2 comments

Is your feature request related to a problem? Please describe. I'm running DeepDiff on a list of dicts with 500k elements. The runtime is quite time-sensitive, so it would be great if I could improve performance with multiprocessing.

Describe the solution you'd like Faster processing time.

Describe alternatives you've considered I tried the following: split the list into smaller lists, one per CPU core, hash each sublist with DeepHash in a separate process, and merge the results afterwards. The shared list from multiprocessing caused some performance issues; a shared Array might be a better fit, but that needs further testing.

andormarkus avatar Oct 27 '20 10:10 andormarkus

Hello,

Are you using ignore_order=True? Have you seen the optimizations page:

https://zepworks.com/deepdiff/current/optimizations.html#

And yes, multiprocessing is not supported yet.

Sep Dehpour


seperman avatar Oct 27 '20 17:10 seperman

Hi Sep,

Most of my data looks like the sample below, and the difference between the two lists of dicts is small; I would say less than 10% differs. I'm only looking for the `iterable_item_removed` output of ddiff.

```json
[
  {"contact_id": 1, "launch_id": 1, "domain": "gmx.de", "email_sent_at": "2020-10-22T07:18:22.000Z", "campaign_type": "batch", "bounce_type": "block", "campaign_id": 1, "message_id": 1, "event_time": "2020-10-26T07:32:15.000Z", "customer_id": 1, "partitiontime": "2020-10-26T00:00:00.000Z", "loaded_at": "2020-10-26T07:32:16.353Z"},
  {"contact_id": 2, "launch_id": 1, "domain": "gmx.de", "email_sent_at": "2020-10-25T08:26:00.000Z", "campaign_type": "batch", "bounce_type": "block", "campaign_id": 1, "message_id": 2, "event_time": "2020-10-26T08:36:05.000Z", "customer_id": 2, "partitiontime": "2020-10-26T00:00:00.000Z", "loaded_at": "2020-10-26T08:36:06.983Z"}
]
```
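Since only removals matter here and every record is a flat dict of hashable values (as in the sample above), a plain set difference can serve as a quick baseline outside DeepDiff entirely. This is a sketch of an alternative approach, not DeepDiff's own algorithm:

```python
def removed_items(old, new):
    """Return items present in `old` but missing from `new`.

    Assumes each record is a flat dict whose values are hashable,
    which holds for the sample records above.
    """
    new_keys = {frozenset(d.items()) for d in new}
    return [d for d in old if frozenset(d.items()) not in new_keys]
```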

Caching makes sense in this case. I tried running it with the following settings, but performance did not improve:

```python
DeepDiff(json_a, json_b, ignore_order=True, hasher=DeepHash.murmur3_128bit, cache_size=500, cache_tuning_sample_size=500)
```

or

```python
DeepDiff(json_a, json_b, ignore_order=True, hasher=DeepHash.murmur3_128bit, cache_size=500)
```

Somehow I don't hit the cache:

```python
{'DIFF COUNT': 29, 'DISTANCE CACHE HIT COUNT': 0, 'MAX DIFF LIMIT REACHED': False, 'MAX PASS LIMIT REACHED': False, 'PASSES COUNT': 1}
```

Thanks, Andor

andormarkus avatar Oct 27 '20 20:10 andormarkus