Performance degrades when loading/training with large labeled training file to prepare_train()

Open cbhower opened this issue 3 years ago • 12 comments

I have a dedupe pipeline that works well with a small number of pre-labeled files, but as I increase the size of the labeled dataset, performance drops off steeply. Loading and training finish in under a minute with 1k observations; with 5k observations, they had run for more than 40 minutes before I interrupted the program. I am loading my pre-labeled dataset via deduper.prepare_train().

Is dedupe designed to scale well with a large pre-labeled dataset? I have a dataset with ~40k observations that I would like to load and train with eventually. My machine is a 12-core Intel MacBook Pro, running dedupe 2.0.8 and Python 3.8.
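
For reference, here is roughly how I am loading and training (a simplified sketch with placeholder field definitions and file names; prepare_training() is the documented spelling of the method I referred to above):

import dedupe

# Placeholder field definitions in the dedupe 2.x dict style.
fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]

# Toy records standing in for the real dataset.
data = {
    1: {'name': 'acme co', 'address': '123 main st'},
    2: {'name': 'acme company', 'address': '123 main street'},
}

deduper = dedupe.Dedupe(fields)

# Load the pre-labeled pairs, then train; both steps slow down sharply
# as the labeled training file grows.
with open('training.json') as training_file:
    deduper.prepare_training(data, training_file=training_file)
deduper.train()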

Thanks for any guidance on this issue!

cbhower avatar Jan 25 '22 21:01 cbhower

910 training examples load and train successfully. Training takes around 1 minute.

980 training examples give the following error:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/.../venv/lib/python3.8/site-packages/dedupe/core.py", line 132, in __call__
    filtered_pairs: Optional[Tuple] = self.fieldDistance(record_pairs)
  File "/.../venv/lib/python3.8/site-packages/dedupe/core.py", line 149, in fieldDistance
    distances = self.data_model.distances(records)
  File "/.../venv/lib/python3.8/site-packages/dedupe/datamodel.py", line 84, in distances
    distances[i, start:stop] = compare(record_1[field],
  File "affinegap/affinegap.pyx", line 111, in affinegap.affinegap.normalizedAffineGapDistance
  File "affinegap/affinegap.pyx", line 124, in affinegap.affinegap.normalizedAffineGapDistance
ZeroDivisionError: normalizedAffineGapDistance cannot take two empty strings
Process SpawnProcess-13:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/core.py", line 189, in mergeScores
    raise
RuntimeError: No active exception to reraise
Traceback (most recent call last):
  File "/Users/howerc/Documents/repos/ECM-entity-deduplication/src/handlers/run_deduplication.py", line 156, in <module>
    TrainDeduplicator.run(test_data,
  File "/Users/howerc/Documents/repos/ECM-entity-deduplication/src/handlers/run_deduplication.py", line 75, in run
    clustered_dupes = deduper.partition(test_data, 0.5)
  File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/api.py", line 170, in partition
    pair_scores = self.score(pairs)
  File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/api.py", line 103, in score
    matches = core.scoreDuplicates(pairs,
  File "/Users/howerc/Documents/repos/ECM-entity-deduplication/venv/lib/python3.8/site-packages/dedupe/core.py", line 256, in scoreDuplicates
    raise ChildProcessError

cbhower avatar Jan 26 '22 01:01 cbhower

1400 training examples load and "train", but no blocking rules are learned and it returns this error:

Starting training...
Blocking predicates: ()
Writing training file...
Writing settings file...
Releasing memory...
Exporting model...
Clustering...
Traceback (most recent call last):
  File "/.../src/handlers/run_deduplication.py", line 156, in <module>
    TrainDeduplicator.run(test_data,
  File "/.../src/handlers/run_deduplication.py", line 75, in run
    clustered_dupes = deduper.partition(test_data, 0.5)
  File "/.../venv/lib/python3.8/site-packages/dedupe/api.py", line 170, in partition
    pair_scores = self.score(pairs)
  File "/.../venv/lib/python3.8/site-packages/dedupe/api.py", line 103, in score
    matches = core.scoreDuplicates(pairs,
  File "/.../venv/lib/python3.8/site-packages/dedupe/core.py", line 227, in scoreDuplicates
    raise BlockingError("No records have been blocked together. "
dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?

cbhower avatar Jan 26 '22 01:01 cbhower

> 980 training examples give the following error:
> ZeroDivisionError: normalizedAffineGapDistance cannot take two empty strings

this error is because your data has empty strings; you should cast those to None.
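
for example, something like this before passing records to dedupe (a minimal sketch; clean_record is a made-up helper name):

def clean_record(record):
    # cast empty strings to None so affinegap never compares '' to ''
    return {key: (None if value == '' else value) for key, value in record.items()}

# data: {record_id: record_dict}, as passed to prepare_training()
data = {record_id: clean_record(row) for record_id, row in data.items()}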

fgregg avatar Jan 26 '22 13:01 fgregg

the block learning subroutine has not been optimized for training sets this size, so i can believe that it is quite slow.

fgregg avatar Jan 26 '22 13:01 fgregg

Thanks for the quick answer @fgregg. Is it expected that the blocking predicates would be empty after training with a larger amount of data?

As a workaround, is there a way to learn the blocking predicates with a small number of training pairs but train the classifier with more training data? If I could separate the blocking training from the classifier training, that would be a good solution.

cbhower avatar Jan 26 '22 20:01 cbhower

no it is not expected.

i suppose it is possible that could happen: if there were a lot of predicates, then you could hit the recursion limit or max_calls limit before finding a solution that covered all the dupe pairs.

https://github.com/dedupeio/dedupe/blob/master/dedupe/training.py#L251

i would verify that you can get a solution with a smaller dataset.

if so, then play around with max_calls or increase the recursion limit.
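
for example (a sketch, not a supported api: max_calls lives inside dedupe/training.py, so overriding it means patching or editing the source, and the exact names below are assumptions that may differ between versions):

import sys

import dedupe.training

# raise python's recursion limit before training (the default is 1000)
sys.setrecursionlimit(5000)

# max_calls is not exposed publicly; one version-dependent hack is to
# subclass BranchBound so every search gets a bigger budget.
_OriginalBranchBound = dedupe.training.BranchBound

class PatchedBranchBound(_OriginalBranchBound):
    def __init__(self, target, max_calls):
        # ignore the caller's hard-coded budget and search longer
        super().__init__(target, max_calls=10_000)

dedupe.training.BranchBound = PatchedBranchBound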

fgregg avatar Jan 26 '22 20:01 fgregg

Hi -- I'm running into this issue too; any advice on how to train the model with a large dataset? Should I just pull a smaller subset of records in the initial dataset query to feed to the trainer?

timstallmann avatar Sep 29 '22 16:09 timstallmann

I've had the same issue with the blocking phase failing to find any predicates, even with small training sets. The issue has been intermittent on my dataset and I haven't been able to reproduce it on one of the examples.

It seems to be an issue with dedupe.training.BlockLearner.random_forest_candidates, as I can re-run the train method over and over until it finds predicates. I've also had success switching from random_forest_candidates to simple_candidates.
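
One way to make that switch is a (version-dependent) monkeypatch along these lines; both method names are dedupe 2.x internals and may change between releases:

import dedupe.training

# Redirect the random-forest candidate generator to the simpler strategy
# before calling deduper.train().
dedupe.training.BlockLearner.random_forest_candidates = (
    dedupe.training.BlockLearner.simple_candidates
)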

I've also had issues with the active learning portion of dedupe struggling to find matching records - I wonder if these two things are related?

johnmarkpittman avatar Jan 25 '23 17:01 johnmarkpittman

> …of predicates, then you could hit the recursion limit or max_calls limit before finding a solution that covered all the dupe pairs

hi @cbhower, did you figure out why it throws the "no records have been blocked together" error when you supply a manual training JSON file?

sids07 avatar Feb 09 '23 10:02 sids07

I did not. I ended up going with recordlinkage and a classifier for my application since I had enough data. Good luck!

cbhower avatar Feb 22 '23 22:02 cbhower

> 1400 training examples load and "train", but no blocking rules are learned:
> dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?

I have been getting this exact same issue when I provide more than ~250 training samples. I thought I was changing the data during processing, but the same file starts working if I remove enough labels.

I also get the following warnings before it reports empty predicates:

/home/evan/miniconda3/envs/tenant/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

(the same ConvergenceWarning repeats roughly ten more times)

I also tried increasing the max_calls parameter here to 10_000, and it was then able to find a predicate set (5000 was not sufficient).

EvanOman avatar Apr 21 '23 04:04 EvanOman

hm... i think that what might be going on is that as we have more training data, the random forest candidates converge to too small a set of options to cover the whole set.

if anyone on this thread has this issue and wants to investigate, lmk, and we can figure it out.

fgregg avatar Dec 18 '23 15:12 fgregg