Dirk Groeneveld
Not sure how to interpret those graphs. Does that say that after de-duping a single snapshot, we should expect less than 30% of the original content to remain? The fact...
Is it measuring by number of paragraphs removed, or number of characters? It makes sense that small paragraphs (1-2 words) would be duplicated a lot.
We can also make the false positive rate smaller by using a bigger filter. 150GB is not very big.
Wait, the 0.3% false positive rate is per ngram. But a paragraph needs to have 80% of its ngrams come up positive to be removed. That should result in a...
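To put a rough number on the point above, here is a minimal sketch of how the per-ngram rate compounds at the paragraph level. It assumes that false positives on different ngrams are independent (likely not exactly true in practice) and that the paragraph is genuinely new, so every positive is a false one. The function name and the example ngram counts are hypothetical; only the 0.3% rate and the 80% threshold come from the discussion.

```python
from math import ceil, comb


def paragraph_false_removal_prob(num_ngrams, per_ngram_fp=0.003, threshold=0.8):
    """Probability that a brand-new paragraph is wrongly removed.

    Assumes each ngram independently comes up as a false positive with
    probability `per_ngram_fp`, and that the paragraph is removed when at
    least `threshold` of its ngrams come up positive (binomial upper tail).
    """
    # Smallest number of positive ngrams that triggers removal.
    # The tiny offset guards against floating-point rounding in the product.
    needed = ceil(threshold * num_ngrams - 1e-9)
    return sum(
        comb(num_ngrams, k)
        * per_ngram_fp ** k
        * (1.0 - per_ngram_fp) ** (num_ngrams - k)
        for k in range(needed, num_ngrams + 1)
    )


# Hypothetical paragraph lengths, measured in ngrams.
for n in (1, 5, 10, 50):
    print(n, paragraph_false_removal_prob(n))
```

Under these assumptions only a one-ngram paragraph is removed at the full per-ngram rate; the probability falls off very quickly as the number of ngrams grows.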
One more thought: The false positive rate it shows is the rate at the end of the filtering, i.e., for the last ngram it puts in. For the first ngram the...
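Both effects mentioned in these comments, the rate growing as the filter fills up and shrinking with a bigger filter, follow from the standard Bloom filter approximation p ≈ (1 - e^(-kn/m))^k. The sketch below uses that textbook formula, not the actual tool's code. It assumes "150GB" means 150e9 bytes of filter bits; the number of hash functions and the ngram counts are hypothetical placeholders.

```python
from math import exp


def bloom_fp_rate(num_inserted, num_bits, num_hashes):
    """Approximate false positive rate of a Bloom filter.

    Textbook estimate (1 - exp(-k * n / m)) ** k for n inserted items,
    m bits, and k hash functions.
    """
    return (1.0 - exp(-num_hashes * num_inserted / num_bits)) ** num_hashes


# Assumption: the 150GB filter is 150e9 bytes, i.e. 1.2e12 bits.
filter_bits = 150e9 * 8
num_hashes = 7          # hypothetical, not taken from the discussion
total_ngrams = 1e11     # hypothetical number of ngrams inserted overall

# The rate seen by an ngram depends on how full the filter is at that point.
for fraction in (0.01, 0.5, 1.0):
    rate = bloom_fp_rate(fraction * total_ngrams, filter_bits, num_hashes)
    print(f"after {fraction:.0%} of inserts: {rate:.3e}")

# Doubling the filter size lowers the end-of-run rate.
print("2x filter:", bloom_fp_rate(total_ngrams, 2 * filter_bits, num_hashes))
```

Since the reported figure is the end-of-run rate, the average rate over the whole run is lower under this model, because early ngrams are checked against a nearly empty filter.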
I gave this "medium" difficulty because you have to figure out how to run on LUMI for this.
@epwalsh, is this done? Scaling logits, do we care?
https://github.com/pytorch/pytorch/issues/97436
Shane and I found that we may just be able to run this on Python 3.12 without the GIL and it might magically be fast!
As far as I know this was a ROCm-only problem, and AMD already fixed it. If it works with Torch+ROCm 5.7, I would consider this finished. In fact, we could...