csvdedupe
Using existing training.json throws error
When I try to use an existing training.json file on a dataset, I get errors instead of output:
csvdedupe --config_file=processors/csvdedupe-config.json --training_file=training.json --settings_file=processors/learned_settings data/finished/arts-and-cultural-assets-massachusetts-clustered.csv > test2.csv
INFO:root:imported 2673 rows
INFO:root:using fields: ['Name', 'Municipality']
INFO:root:taking a sample of 1500 possible pairs
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (sortedAcronym, Municipality), SimplePredicate: (wholeFieldPredicate, Name))
INFO:root:reading labeled examples from training.json
INFO:dedupe.api:reading training from file
Traceback (most recent call last):
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 168, in __call__
doc_id = self.index._doc_to_id[doc]
AttributeError: 'NoneType' object has no attribute '_doc_to_id'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 650, in readTraining
self.markPairs(training_pairs)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 730, in markPairs
self.active_learner.mark(examples, y)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 359, in mark
learner.fit_transform(self.pairs, self.y)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 195, in fit_transform
recall=1.0)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 26, in learn
dupe_cover = Cover(self.blocker.predicates, matches)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 379, in __init__
self._cover(predicates, pairs)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 387, in _cover
in enumerate(pairs)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 389, in <setcomp>
set(predicate(record_2, target=True)))}
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 170, in __call__
raise AttributeError("Attempting to block with an index "
AttributeError: Attempting to block with an index predicate without indexing records
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mzagaja/.virtualenvs/dedupe-examples/bin/csvdedupe", line 8, in <module>
sys.exit(launch_new_instance())
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 180, in launch_new_instance
d.main()
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 110, in main
self.dedupe_training(deduper)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvhelpers.py", line 257, in dedupe_training
deduper.readTraining(tf)
File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 653, in readTraining
raise UserWarning('Training data has records not known '
UserWarning: Training data has records not known to the active learner. Read training in before initializing the active learner with the sample method, or use the prepare_training method.
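The warning above names two orderings that should avoid it. As a rough sketch only (this is not csvdedupe's actual code), and assuming the deduper object exposes `prepare_training(data, training_file)` as the message suggests, a hypothetical helper `train_from_existing` could hand the saved file to dedupe in a single call instead of sampling first and calling `readTraining` afterwards:

```python
import os


def train_from_existing(deduper, data, training_path):
    """Hypothetical helper: pass saved labels to the deduper in one call.

    Assumes `deduper` has a `prepare_training(data, training_file)` method,
    as named in the UserWarning; this is not csvdedupe's real code.
    Returns True if a training file was found and passed along.
    """
    if os.path.exists(training_path):
        with open(training_path) as tf:
            # Hand the labels over together with the data, so they should
            # not be read into an already-initialized active learner.
            deduper.prepare_training(data, training_file=tf)
        return True
    # No saved labels: only sample; labeling has to happen some other way.
    deduper.prepare_training(data)
    return False
```

Here `deduper` would be the object csvdedupe builds from the field definitions and `data` the records it imported; both names are placeholders for illustration.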
Allegedly resolved in https://github.com/dedupeio/dedupe/pull/761 on the dedupe side, but still manifesting here.
csvdedupe requires dedupe>=1.6,<2, which turns out to be 1.10.0.
This was released on 9th Jan 2020.
https://github.com/dedupeio/dedupe/pull/761 was merged on 10 Aug 2019, so in theory we should already be using it.
Perhaps this is a separate issue?
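For anyone double-checking the version reasoning, here is a small sketch of why 1.10.0 falls inside `>=1.6,<2`: the parts compare as numbers, not as strings. The helper names `parse_version` and `in_range` are made up for illustration, and the sketch only handles plain dotted numeric versions (no pre-release tags).

```python
def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so the parts compare as numbers."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, lower="1.6", upper="2"):
    """True if lower <= version < upper, mirroring 'dedupe>=1.6,<2'."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


print(in_range("1.10.0"))  # True: 10 > 6 when compared numerically
print(in_range("2.0.0"))   # False: excluded by the <2 upper bound
```

The installed version itself can be confirmed with `pip show dedupe`.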
Hello, I also recently ran csvdedupe for the first time. After I finished, a training.json file was created. When I tried running csvdedupe again, I got the same error as @mzagaja. I have dedupe v1.10.0 installed.
Replacing the readTraining function in dedupe/api.py with the following code fixes the issue. I will try to submit a patch to the maintainers.
    def readTraining(self, training_file):
        '''
        Read training from previously built training data file object

        Arguments:
        training_file -- file object containing the training data
        '''
        logger.info('reading training from file')
        self.training_pairs = json.load(training_file,
                                        cls=serializer.dedupe_decoder)