Nick Crews

Results: 425 comments by Nick Crews

@fgregg I fixed two of your requests. Take a look at the final one, which I left unchanged because I wanted clarification.

Sweet, thanks for keeping momentum going @fgregg !

@lnoel-everlaw If it's still relevant, you can see how the optional variables (such as names and address) are implemented and you can follow their example. See https://github.com/search?q=org%3Adedupeio+dedupe-variable But still this...

Maybe we say that https://docs.dedupe.io/en/latest/Troubleshooting.html#extending-dedupe is an adequate explanation, and we can close this? @jade-feret and @lnoel-everlaw did you ever figure this out? Were those examples adequate for you?

I like the idea of splitting up the ActiveLearner portion and the Dedupe/RecordLinkage/Gazetteer model portion. I'm not very familiar with it, but it looks like this is what https://github.com/scikit-activeml/scikit-activeml does. Instead of ActiveLearner.train()...

Yes, that makes sense and is possible. Will try to get to this in the next couple of days. Also, do you think adding a larger dataset would be useful? Not sure...

In general, I agree that trying to get things out of memory and moving to a disk-based model seems like the next logical step. Multiple ways to do this of...

Yes, oops, you're totally right, it doesn't change the complexity. I was treating each "call into C code" as the atomic operation that I was counting, which sometimes for wall...

> i don't follow how you can throw out duplicate operations? for example with the price comparator:
>
> `price1 = 0`
> `other_prices = [0,0,2,3,3,2,0,2]`
>
> only need to do the comparison for...
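
To make the idea in the quote concrete, here is a minimal sketch of skipping duplicate comparisons by caching the result for each distinct value. It is not dedupe's implementation: `compare_price` and `compare_with_cache` are hypothetical names made up for illustration, and the absolute-difference comparator is an assumption standing in for whatever the real price comparator computes.

```python
def compare_price(a, b):
    # Hypothetical comparator (an assumption): stands in for any
    # expensive pairwise comparison.
    return abs(a - b)


def compare_with_cache(value, others, comparator):
    """Compare `value` against every item in `others`, calling
    `comparator` only once per distinct item."""
    cache = {}
    results = []
    for other in others:
        if other not in cache:
            cache[other] = comparator(value, other)
        results.append(cache[other])
    # Also return how many real comparator calls were made.
    return results, len(cache)


price1 = 0
other_prices = [0, 0, 2, 3, 3, 2, 0, 2]
results, n_calls = compare_with_cache(price1, other_prices, compare_price)
```

With these inputs the comparator runs once per distinct price (0, 2, 3) rather than once per element. Every element is still visited once to look up its cached result, so the asymptotic complexity is unchanged; only the number of expensive comparator calls drops.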

Another option that seems very intriguing: [vaex](https://vaex.io/blog/a-hybrid-apache-arrow-numpy-dataframe-with-vaex-version-4)
- Supports memmapping as a core feature, so disk size is the limit. No serialization costs
- Uses arrow as backing array, so you can...