Forest Gregg
I ended up not switching to random forest in #992 and continue to use regularized logistic regression. The loss in precision for the canonical case was a bit more than...
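For context, a hedged sketch (not the dedupe code itself) of how one might compare the two classifiers' precision on labelled record-pair features with scikit-learn; `X` (pair feature vectors) and `y` (match labels) are assumed inputs:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

def compare_precision(X, y):
    """Fit both classifiers on the same split and report held-out precision."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    results = {}
    for name, clf in [
        ("logistic regression (L2)", LogisticRegression(C=1.0, max_iter=1000)),
        ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]:
        clf.fit(X_train, y_train)
        results[name] = precision_score(y_test, clf.predict(X_test))
    return results
```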
If I want to roll my own scoring: https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/
Got a spike going here: https://github.com/dedupeio/dedupe/tree/sqlite_index_predicate. This uses FTS5, which comes with bm25 as the default scorer. Unfortunately, bm25 is not a normalized score, so we can't have a threshold defined...
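A minimal sketch of the scoring issue, assuming a Python `sqlite3` build with the FTS5 extension compiled in (table and column names are hypothetical): the built-in `bm25()` auxiliary function returns raw, unbounded relevance scores rather than anything on a fixed 0–1 scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE names USING fts5(name)")
conn.executemany(
    "INSERT INTO names (name) VALUES (?)",
    [("forest gregg",), ("forest park",), ("gregg shorthand institute",)],
)

# bm25() yields raw negative scores: smaller means more relevant, but there
# is no fixed range, so a global cutoff threshold has no natural meaning.
rows = conn.execute(
    "SELECT name, bm25(names) AS score FROM names "
    "WHERE names MATCH ? ORDER BY score",
    ("gregg",),
).fetchall()
for name, score in rows:
    print(name, score)
```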
FTS5 matchinfo implementation: https://github.com/sqlite/sqlite/blob/master/ext/fts5/fts5_test_mi.c
The goal is to be able to partition the data such that we can treat each partition as a separate dedupe problem, because we know that co-referent pairs will...
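A minimal sketch of that idea, assuming a hypothetical partitioning field (here `"state"`) that all co-referent pairs are known to share; the per-partition dedupe call is only indicated in a comment and its names are illustrative.

```python
from collections import defaultdict

def partition_records(records, key):
    """Group records by a field that all co-referent pairs are assumed to share."""
    partitions = defaultdict(dict)
    for record_id, record in records.items():
        partitions[record[key]][record_id] = record
    return partitions

# Each partition is then a self-contained dedupe problem, e.g. (names hypothetical):
#
# for part in partition_records(data, "state").values():
#     clustered = deduper.partition(part, threshold=0.5)
```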
> 910 training files load and train successfully. Training takes around 1 minute.
>
> 980 training examples gives the following error:
>
> ```
> Process SpawnProcess-1:
> Traceback...
> ```
The block learning subroutine has not been optimized for training sets of this size, so I can believe that it is quite slow.
No, it is not expected. I suppose it is possible that could happen if there were a *lot* of predicates; then you could hit the recursion limit or max_calls...
Hmm... I thought we addressed this before.

* [issue](https://github.com/dedupeio/dedupe/issues/833)
* [resolving PR](https://github.com/dedupeio/dedupe/pull/809)

Hmm. Something strange happened, because @fjsj's commit doesn't seem to be in the git log.
Or rather, only part of his change (the test is in there): https://github.com/dedupeio/dedupe/commit/55fd8bf9633e09f8200cf28e492c722e8745590a