Forest Gregg
I ended up not switching to random forest in #992 and continue to use regularized logistic regression. The loss in precision for the canonical case was a bit more than...
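For context, a hedged sketch (not the dedupe code itself) of how one might compare the two classifiers' precision on labelled record-pair features with scikit-learn; `X` (pair feature vectors) and `y` (match labels) are assumed inputs:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

def compare_precision(X, y):
    """Fit both classifiers on the same split and report held-out precision."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    results = {}
    for name, clf in [
        ("logistic regression (L2)", LogisticRegression(C=1.0, max_iter=1000)),
        ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]:
        clf.fit(X_train, y_train)
        results[name] = precision_score(y_test, clf.predict(X_test))
    return results
```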
If I want to roll my own scoring: https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/
Got a spike going here: https://github.com/dedupeio/dedupe/tree/sqlite_index_predicate. This uses FTS5, which comes with bm25 as the default scorer. Unfortunately, bm25 is not a normalized score, so we can't have a threshold defined...
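A minimal sketch of the scoring issue, assuming a Python `sqlite3` build with the FTS5 extension compiled in (table and column names are hypothetical): the built-in `bm25()` auxiliary function returns raw, unbounded relevance scores rather than anything on a fixed 0–1 scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE names USING fts5(name)")
conn.executemany(
    "INSERT INTO names (name) VALUES (?)",
    [("forest gregg",), ("forest park",), ("gregg shorthand institute",)],
)

# bm25() yields raw negative scores: smaller means more relevant, but there
# is no fixed range, so a global cutoff threshold has no natural meaning.
rows = conn.execute(
    "SELECT name, bm25(names) AS score FROM names "
    "WHERE names MATCH ? ORDER BY score",
    ("gregg",),
).fetchall()
for name, score in rows:
    print(name, score)
```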
FTS5 matchinfo implementation: https://github.com/sqlite/sqlite/blob/master/ext/fts5/fts5_test_mi.c
The goal is to be able to partition the data such that we can treat each partition as a separate dedupe problem, because we know that co-referent pairs will...
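A minimal sketch of that idea, assuming a hypothetical partitioning field (here `"state"`) that all co-referent pairs are known to share; the per-partition dedupe call is only indicated in a comment and its names are illustrative.

```python
from collections import defaultdict

def partition_records(records, key):
    """Group records by a field that all co-referent pairs are assumed to share."""
    partitions = defaultdict(dict)
    for record_id, record in records.items():
        partitions[record[key]][record_id] = record
    return partitions

# Each partition is then a self-contained dedupe problem, e.g. (names hypothetical):
#
# for part in partition_records(data, "state").values():
#     clustered = deduper.partition(part, threshold=0.5)
```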
> 910 training files load and train successfully. Training takes around 1 minute.
>
> 980 training examples gives the following error:
>
> ```
> Process SpawnProcess-1:
> Traceback...
> ```
The block learning subroutine has not been optimized for training sets of this size, so I can believe that it is quite slow.
No, it is not expected. I suppose it is possible that could happen if there were a *lot* of predicates; then you could hit the recursion limit or max_calls...
Hmm... I thought we addressed this before.

* [issue](https://github.com/dedupeio/dedupe/issues/833)
* [resolving PR](https://github.com/dedupeio/dedupe/pull/809)

Hmm. Something strange happened, because @fjsj's commit doesn't seem to be in the git log.
Or rather, only part of his change (the test is in there): https://github.com/dedupeio/dedupe/commit/55fd8bf9633e09f8200cf28e492c722e8745590a