Performance degrades when loading/training with large labeled training file to prepare_train()
I have a dedupe pipeline that works well with a small number of pre-labeled files, but as I increase the size of the labeled dataset the performance drops off steeply. Loading and training finish in under a minute with 1k observations; with 5k observations, I interrupted the program after more than 40 minutes of loading and training. I am loading my pre-labeled dataset via deduper.prepare_train().
Is dedupe designed to scale well with a large pre-labeled dataset? I have a dataset with ~40k observations that I would eventually like to load and train with. My machine is a 12-core Intel MacBook Pro, running dedupe 2.0.8 and Python 3.8.
Thanks for any guidance on this issue!
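For context, the kind of pipeline described above looks roughly like this (a sketch against the dedupe 2.0.x API; the field definition, records, and file name below are placeholders, not the actual project code):

```python
import dedupe

# Placeholder field definition -- the real fields are not shown in this issue.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

# Placeholder records; in the real pipeline this is the ~1k-40k record dict.
data = {
    1: {"name": "Acme Corp", "address": "123 Main St"},
    2: {"name": "ACME Corporation", "address": "123 Main Street"},
}

deduper = dedupe.Dedupe(fields, num_cores=12)

# Load the pre-labeled pairs (a file previously written by write_training())
# instead of labeling interactively.
with open("training.json") as training_file:
    deduper.prepare_train(data, training_file=training_file)

deduper.train()
```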
910 training examples load and train successfully. Training takes around 1 minute.
980 training examples gives the following error:
Process SpawnProcess-1:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/.../venv/lib/python3.8/site-packages/dedupe/core.py", line 132, in __call__
filtered_pairs: Optional[Tuple] = self.fieldDistance(record_pairs)
File "/.../venv/lib/python3.8/site-packages/dedupe/core.py", line 149, in fieldDistance
distances = self.data_model.distances(records)
File "/.../venv/lib/python3.8/site-packages/dedupe/datamodel.py", line 84, in distances
distances[i, start:stop] = compare(record_1[field],
File "affinegap/affinegap.pyx", line 111, in affinegap.affinegap.normalizedAffineGapDistance
File "affinegap/affinegap.pyx", line 124, in affinegap.affinegap.normalizedAffineGapDistance
ZeroDivisionError: normalizedAffineGapDistance cannot take two empty strings
Process SpawnProcess-13:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/core.py", line 189, in mergeScores
raise
RuntimeError: No active exception to reraise
Traceback (most recent call last):
File "/Users/howerc/Documents/repos/ECM-entity-deduplication/src/handlers/run_deduplication.py", line 156, in <module>
TrainDeduplicator.run(test_data,
File "/Users/howerc/Documents/repos/ECM-entity-deduplication/src/handlers/run_deduplication.py", line 75, in run
clustered_dupes = deduper.partition(test_data, 0.5)
File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/api.py", line 170, in partition
pair_scores = self.score(pairs)
File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/api.py", line 103, in score
matches = core.scoreDuplicates(pairs,
File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/core.py", line 256, in scoreDuplicates
raise ChildProcessError
1400 training examples load and "train", but no blocking rules are learned and it returns this error:
Starting training...
Blocking predicates:
()
Writing training file...
Writing settings file...
Releasing memory...
Exporting model...
Clustering...
Traceback (most recent call last):
File "/.../src/handlers/run_deduplication.py", line 156, in <module>
TrainDeduplicator.run(test_data,
File "/.../src/handlers/run_deduplication.py", line 75, in run
clustered_dupes = deduper.partition(test_data, 0.5)
File "/.../venv/lib/python3.8/site-packages/dedupe/api.py", line 170, in partition
pair_scores = self.score(pairs)
File "/.../venv/lib/python3.8/site-packages/dedupe/api.py", line 103, in score
matches = core.scoreDuplicates(pairs,
File "/.../venv/lib/python3.8/site-packages/dedupe/core.py", line 227, in scoreDuplicates
raise BlockingError("No records have been blocked together. "
dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?
this error is because your data has empty strings, you should cast those to None
the block learning subroutine has not been optimized for training sets this size, so i can believe that it is quite slow.
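A minimal pre-processing sketch along the lines of that advice (not from this thread; `data` is assumed to be the `{record_id: record_dict}` mapping passed to `prepare_train()`):

```python
def clean_record(record):
    """Cast empty or whitespace-only strings to None so dedupe treats them as missing."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if not value:
                value = None
        cleaned[field] = value
    return cleaned

# Apply the same cleaning to every record before handing the data to dedupe.
data = {record_id: clean_record(record) for record_id, record in data.items()}
```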
Thanks for the quick answer @fgregg. Is it expected that the blocking predicates would be empty after training with a larger amount of data?
As a workaround, is there a way to learn the blocking predicates with a small number of training pairs but train the classifier with more training data? If I could isolate the blocking training from the classifier training, that would be a good solution.
no it is not expected.
i suppose it is possible that could happen if there were a lot of predicates: you could hit the recursion limit or the max_calls limit before finding a solution that covers all the dupe pairs.
https://github.com/dedupeio/dedupe/blob/master/dedupe/training.py#L251
i would verify that you can get a solution with a smaller dataset.
if so, then play around with max_calls or try increasing the recursion limit.
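To make the second suggestion concrete, a sketch of raising the interpreter's recursion limit before training (the value is illustrative; max_calls itself is set inside dedupe/training.py at the line linked above, so changing it means editing or patching the installed package):

```python
import sys

# Python's default recursion limit is 1000; raise it before calling
# deduper.train(). The value here is arbitrary -- tune it for your data.
sys.setrecursionlimit(5000)

deduper.train()
```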
Hi -- I'm running into this issue too; any advice on how to train the model with a large dataset? Should I just pull a smaller subset of records in the initial dataset query to feed to the trainer?
I've had the same issue with the blocking phase failing to find any predicates, even with small training sets. The issue has been intermittent on my dataset and I haven't been able to reproduce it on one of the examples.
It seems to be an issue with dedupe.training.BlockLearner.random_forest_candidates, as I can run the train method over and over until it finds predicates. I've also had success switching from random_forest_candidates to simple_candidates.
I've also had issues with the active learning portion of dedupe struggling to find matching records - I wonder if these two things are related?
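A hedged sketch of that switch (this rebinds the method at import time and assumes the two candidate generators are call-compatible in the installed dedupe version, which the comment above suggests they are):

```python
import dedupe.training as training

# Monkeypatch: make block learning use the simpler candidate generator
# instead of the random-forest one. Verify that both methods exist with
# compatible signatures in the dedupe version you have installed.
training.BlockLearner.random_forest_candidates = training.BlockLearner.simple_candidates
```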
hi @cbhower, did you figure out why it was throwing the "no blocks found" error when you supplied a manual training JSON file?
I did not. I ended up going with recordlinkage and a classifier for my application since I had enough data. Good luck!
I have been getting this same exact issue when I provide more than ~250 training samples. I thought I was corrupting the data during processing, but the exact same file starts working if I remove enough labels.
I also get the following warnings before it reports empty predicates:
/home/evan/miniconda3/envs/tenant/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
(the same ConvergenceWarning is repeated several more times)
I also tried increasing the max_calls parameter here to 10_000 and it was able to find a predicate set (5000 was not sufficient).
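On the ConvergenceWarning above: it comes from the scikit-learn logistic regression that dedupe uses internally. The dedupe API docs describe the classifier attribute as replaceable by any object with fit and predict_proba methods, so one hedged sketch (not a fix confirmed in this thread) is to swap in a logistic regression with a larger iteration budget before training:

```python
from sklearn.linear_model import LogisticRegression

# Assumes deduper is a dedupe.Dedupe instance whose `classifier` attribute
# accepts a scikit-learn-compatible estimator, per the dedupe docs.
deduper.classifier = LogisticRegression(max_iter=5000)  # sklearn default is 100

deduper.train()
```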
hm... i think that what might be going on is that as we have more training data, the random forest candidates converge to too small a set of options to cover the whole set.
if anyone on this thread has this issue and wants to investigate, lmk, and we can figure it out.