Dirk Groeneveld

Results 200 comments of Dirk Groeneveld

Not sure how to interpret those graphs. Does that say that after de-duping a single snapshot, we should expect less than 30% of the original content to remain? The fact...

Is it measuring by number of paragraphs removed, or number of characters? It makes sense that small paragraphs (1-2 words) would be duplicated a lot.

We can also make the false positive rate smaller by using a bigger filter. 150GB is not very big.

Wait, the 0.3% false positive rate is per ngram. But a paragraph needs to have 80% of its ngrams come up positive to be removed. That should result in a...
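
A rough sketch of that arithmetic, under two assumptions that are not stated in the thread: false positives on different ngrams are independent, and the paragraph contains no ngrams that are genuinely duplicated. The function name and the example paragraph lengths are made up for illustration. Under those assumptions, the chance that a paragraph is removed by mistake is a binomial tail:

```python
import math

def paragraph_false_positive_rate(num_ngrams, per_ngram_fp=0.003, threshold=0.8):
    """Probability that at least `threshold` of a paragraph's ngrams are
    false positives, assuming each ngram is an independent false positive
    with probability `per_ngram_fp` (an assumption, not a measured fact)."""
    # Smallest number of positive ngrams that triggers removal.
    needed = math.ceil(threshold * num_ngrams)
    p = per_ngram_fp
    # Binomial tail: P(X >= needed) for X ~ Binomial(num_ngrams, p).
    return sum(
        math.comb(num_ngrams, k) * p**k * (1 - p) ** (num_ngrams - k)
        for k in range(needed, num_ngrams + 1)
    )

if __name__ == "__main__":
    for n in (1, 2, 5, 10, 50):
        print(n, paragraph_false_positive_rate(n))
```

With a single ngram the paragraph-level rate equals the per-ngram rate, and it drops off very quickly as the paragraph gets longer, so under these assumptions the per-ngram rate mostly matters for very short paragraphs.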

One more thought: The false positive rate it shows is the rate at the end of the filtering, i.e., for the last ngram it puts in. For the first ngram the...
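
To put numbers on how the rate changes as the filter fills up, here is a minimal sketch using the standard Bloom filter approximation, (1 - e^(-kn/m))^k, for n inserted items, m bits, and k hash functions. The function name, the number of hash functions, and the ngram counts are placeholders, not the settings used in the actual run; only the 150GB size comes from the thread.

```python
import math

def bloom_fp_rate(num_inserted, num_bits, num_hashes):
    """Approximate false positive rate of a Bloom filter after
    `num_inserted` items have been added (standard approximation)."""
    return (1.0 - math.exp(-num_hashes * num_inserted / num_bits)) ** num_hashes

if __name__ == "__main__":
    bits_150gb = 150 * 10**9 * 8   # 150GB filter expressed in bits
    k = 7                          # hypothetical number of hash functions
    total_ngrams = 10**11          # hypothetical total number of ngrams inserted
    # The rate starts at zero for an empty filter and climbs as ngrams are added.
    for fraction in (0.0, 0.25, 0.5, 1.0):
        n = int(total_ngrams * fraction)
        print(fraction, bloom_fp_rate(n, bits_150gb, k))
    # Doubling the filter size lowers the rate at the end of the run.
    print(bloom_fp_rate(total_ngrams, 2 * bits_150gb, k))
```

The rate reported at the end of a run is therefore an upper bound on the rate any individual ngram saw, and a larger filter lowers it across the whole run.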

I gave this "medium" difficulty because you have to figure out how to run in LUMI for this.

@epwalsh, is this done? Scaling logits, do we care?

https://github.com/pytorch/pytorch/issues/97436

Shane and I found that we may just be able to run this on Python 3.12 without the GIL and it might magically be fast!

As far as I know this was a ROCm-only problem, and AMD already fixed it. If it works with Torch+ROCm 5.7, I would consider this finished. In fact, we could...