Nick Crews comments

Results: 425 comments of Nick Crews

Refactor labeler.py

@fgregg I fixed two of your requests. Take a look at the final one, which I didn't change because I wanted clarification.

Refactor labeler.py

Sweet, thanks for keeping the momentum going, @fgregg!

Document how to create a new variable type

@lnoel-everlaw If it's still relevant, you can see how the optional variables (such as names and addresses) are implemented and follow their example. See https://github.com/search?q=org%3Adedupeio+dedupe-variable But still this...

Document how to create a new variable type

Maybe we say that https://docs.dedupe.io/en/latest/Troubleshooting.html#extending-dedupe is an adequate explanation, and we can close this? @jade-feret and @lnoel-everlaw did you ever figure this out? Were those examples adequate for you?

Consider removing distinction between `Static` and non-`Static` APIs

I like the idea of splitting up the ActiveLearner portion and the Dedupe/RecordLinkage/Gazetteer model portion. I'm not very familiar with it, but it looks like this is what https://github.com/scikit-activeml/scikit-activeml does. Instead of ActiveLearner.train()...

Benchmark runs with training separately from runs that use settings file

Yes, that makes sense and is possible. I will try to get to this in the next couple of days. Also, do you think adding a larger dataset would be useful? Not sure...

Consider holding data in sqlite table

In general, I agree that trying to get things out of memory and moving to a disk-based model seems like the next logical step. Multiple ways to do this of...
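
A minimal sketch of what a disk-based model could look like, using Python's built-in `sqlite3` module. The table name, columns, and helper functions here are hypothetical illustrations, not part of dedupe; the example uses an in-memory database for convenience, and a file path passed to `sqlite3.connect` would keep the data on disk instead.

```python
import sqlite3


def load_records(conn, records):
    """Store records in a table (hypothetical schema for illustration)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(record_id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO records (record_id, name, price) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


def iter_records(conn, batch_size=2):
    """Yield records in small batches so only one batch is in memory at a time."""
    cursor = conn.execute(
        "SELECT record_id, name, price FROM records ORDER BY record_id"
    )
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# ":memory:" keeps the sketch self-contained; use a file path for on-disk storage.
conn = sqlite3.connect(":memory:")
load_records(conn, [(1, "apple", 0.0), (2, "pear", 2.0), (3, "plum", 3.0)])
batches = list(iter_records(conn))
```

Reading in batches with `fetchmany` is one way to bound memory use; the batch size is an arbitrary choice for the sketch.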

Consider holding data in sqlite table

Yes, oops, you're totally right, it doesn't change the complexity. I was treating each "call into C code" as the atomic operation that I was counting, which sometimes for wall...

Consider holding data in sqlite table

> i don't follow how you can throw out duplicate operations? for example with the price comparator: `price1 = 0`, `other_prices = [0,0,2,3,3,2,0,2]`, only need to do the comparison for...
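
One possible reading of the quoted example, sketched under the assumption that the idea is to run the comparator once per distinct value and reuse the result for repeats. The `compare_price` function below is a hypothetical stand-in, not dedupe's actual price comparator.

```python
def compare_price(a, b):
    # Hypothetical comparator for illustration only: absolute difference.
    return abs(a - b)


def compare_against_unique(price, other_prices, comparator):
    """Compare `price` with each entry, calling the comparator once per distinct value."""
    cache = {}
    calls = 0
    results = []
    for other in other_prices:
        if other not in cache:
            cache[other] = comparator(price, other)
            calls += 1
        results.append(cache[other])
    return results, calls


price1 = 0
other_prices = [0, 0, 2, 3, 3, 2, 0, 2]
results, calls = compare_against_unique(price1, other_prices, compare_price)
print(calls)  # 3 comparator calls for 8 values, since only 0, 2 and 3 are distinct
```

This reduces the number of comparator calls, not the number of output values: every entry in `other_prices` still gets a result.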

Consider holding data in sqlite table

Another option that seems very intriguing: [vaex](https://vaex.io/blog/a-hybrid-apache-arrow-numpy-dataframe-with-vaex-version-4)
- Supports memmapping as a core feature, so disk size is the limit. No serialization costs
- Uses arrow as backing array, so you can...