Using existing training.json throws error

Open mzagaja opened this issue 4 years ago • 3 comments

When I try to use an existing training.json file on a dataset, I get the following errors instead of output:

csvdedupe --config_file=processors/csvdedupe-config.json --training_file=training.json --settings_file=processors/learned_settings data/finished/arts-and-cultural-assets-massachusetts-clustered.csv > test2.csv
INFO:root:imported 2673 rows
INFO:root:using fields: ['Name', 'Municipality']
INFO:root:taking a sample of 1500 possible pairs
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (sortedAcronym, Municipality), SimplePredicate: (wholeFieldPredicate, Name))
INFO:root:reading labeled examples from training.json
INFO:dedupe.api:reading training from file
Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 168, in __call__
    doc_id = self.index._doc_to_id[doc]
AttributeError: 'NoneType' object has no attribute '_doc_to_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 650, in readTraining
    self.markPairs(training_pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 730, in markPairs
    self.active_learner.mark(examples, y)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 359, in mark
    learner.fit_transform(self.pairs, self.y)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 195, in fit_transform
    recall=1.0)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 26, in learn
    dupe_cover = Cover(self.blocker.predicates, matches)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 379, in __init__
    self._cover(predicates, pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 387, in _cover
    in enumerate(pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 389, in <setcomp>
    set(predicate(record_2, target=True)))}
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 170, in __call__
    raise AttributeError("Attempting to block with an index "
AttributeError: Attempting to block with an index predicate without indexing records

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/bin/csvdedupe", line 8, in <module>
    sys.exit(launch_new_instance())
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 180, in launch_new_instance
    d.main()
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 110, in main
    self.dedupe_training(deduper)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvhelpers.py", line 257, in dedupe_training
    deduper.readTraining(tf)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 653, in readTraining
    raise UserWarning('Training data has records not known '
UserWarning: Training data has records not known to the active learner. Read training in before initializing the active learner with the sample method, or use the prepare_training method.

This was reportedly resolved on the dedupe side in https://github.com/dedupeio/dedupe/pull/761, but it still occurs when running csvdedupe.
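The final UserWarning says the ordering matters: training data has to be read before the active learner is initialized. The toy model below illustrates that ordering problem only. `ToyLearner` and `ToyDeduper` are hypothetical names and this is not dedupe's actual code; it assumes a learner that only accepts labeled pairs whose records it already knows about.

```python
# Toy model only: these classes are hypothetical and are NOT dedupe's code.
class ToyLearner:
    """A learner that only knows the records it was built with."""

    def __init__(self, known_records):
        self.known = set(known_records)

    def mark(self, pairs):
        # Reject any labeled pair that mentions an unknown record.
        for a, b in pairs:
            if a not in self.known or b not in self.known:
                raise UserWarning(
                    "Training data has records not known to the learner"
                )


class ToyDeduper:
    def __init__(self):
        self.training_pairs = []
        self.learner = None

    def read_training(self, pairs):
        self.training_pairs.extend(pairs)
        # If the learner already exists, the pairs are checked right away.
        if self.learner is not None:
            self.learner.mark(pairs)

    def sample(self, records):
        # Records from training pairs read earlier are added to the sample,
        # so the learner knows them when it is created.
        known = set(records)
        for a, b in self.training_pairs:
            known.update((a, b))
        self.learner = ToyLearner(known)
        self.learner.mark(self.training_pairs)


labeled = [("acme inc", "acme incorporated")]
sampled = ["foo ltd", "bar llc"]

# Order 1: read training first, then sample. No error is raised.
ok = ToyDeduper()
ok.read_training(labeled)
ok.sample(sampled)

# Order 2: sample first, then read training. This raises UserWarning.
failed = False
late = ToyDeduper()
late.sample(sampled)
try:
    late.read_training(labeled)
except UserWarning:
    failed = True
```

In this sketch the second ordering fails because the labeled records were never part of the sample, which matches the sequence in the log above: the sample is taken before training.json is read.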

mzagaja avatar Apr 29 '20 18:04 mzagaja

csvdedupe requires dedupe>=1.6,<2, which currently resolves to 1.10.0. That version was released on 9 Jan 2020. https://github.com/dedupeio/dedupe/pull/761 was merged on 10 Aug 2019, so in theory the fix should already be included.
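For anyone wanting to confirm which dedupe version is installed and whether it falls inside that pin, here is a minimal sketch. The helper names (`parse`, `satisfies_pin`, `installed_dedupe_version`) are made up for illustration, the parser assumes plain numeric versions such as 1.10.0, and `importlib.metadata` needs Python 3.8 or later (the traceback above is from 3.7, where `pip show dedupe` gives the same information).

```python
from importlib import metadata


def parse(version):
    # Assumes plain numeric versions like "1.10.0" (no rc/dev suffixes).
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version, lower="1.6", upper="2"):
    # Mirrors the pin dedupe>=1.6,<2 using tuple comparison.
    # Comparing the raw strings would be wrong: "1.10.0" < "1.6" as text.
    return parse(lower) <= parse(version) < parse(upper)


def installed_dedupe_version():
    # Returns the installed version string, or None if dedupe is absent.
    try:
        return metadata.version("dedupe")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_dedupe_version()
    if found is None:
        print("dedupe is not installed")
    else:
        print(found, "within pin:", satisfies_pin(found))
```

With the version reported in this thread, `satisfies_pin("1.10.0")` is true, so the pin itself is not what keeps the fix out.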

Perhaps this is a separate issue?

ghost avatar Sep 21 '20 23:09 ghost

Hello, I also recently ran csvdedupe for the first time. After I finished, a training.json file was created. When I tried running csvdedupe again, I got the same error as @mzagaja. I have dedupe v1.10.0 installed.

chrismp avatar Oct 11 '21 01:10 chrismp

Replacing the readTraining function in dedupe/api.py with the following code fixes the issue. I will try to submit a patch to the maintainers.

    def readTraining(self, training_file):
        '''
        Read training from previously built training data file object

        Arguments:

        training_file -- file object containing the training data
        '''
        logger.info('reading training from file')
        self.training_pairs = json.load(training_file,
                                        cls=serializer.dedupe_decoder)

regel avatar Jan 21 '23 19:01 regel