Catherine Koshka
This is also an issue for me.
Hi everyone, I have a notebook with a temporary solution to this issue here: https://github.com/fastelectronicvegetable/aitextgen_notebooks/blob/main/Encoding_very_large_text_files%20(2).ipynb It uses a much more efficient training and tokenisation process, and I was able to...
Never mind, all you need to do is train the tokeniser using YTTM, take the vocab file it outputs, strip out the numbers, and use it as the training file...
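As a rough sketch of that trick, assuming the YouTokenToMe (YTTM) Python API and placeholder file names (`corpus.txt`, `bpe.model`, `vocab_only.txt`): train a BPE model, dump just the subword strings (so no numeric ids have to be stripped by hand), and use that file as the downstream training file instead of the full corpus.

```python
import youtokentome as yttm

# Train a BPE model on the raw corpus (file names are placeholders)
yttm.BPE.train(data="corpus.txt", model="bpe.model", vocab_size=10000)

# Load the model and write out only the subword strings, dropping the
# numeric ids that the model file stores alongside them
bpe = yttm.BPE(model="bpe.model")
with open("vocab_only.txt", "w", encoding="utf-8") as f:
    for subword in bpe.vocab():
        f.write(subword + "\n")

# vocab_only.txt can then serve as the (much smaller) training file
```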
I've been following this project for quite a while now and I'm happy to see v3 finally happen. I will see if I can port it to rust and from...
v3, that's right. Though I could start with v2 since it might help me understand the changes in context. And ya, I think I know that feeling. Sometimes when I'm...
Just piggybacking on this - it was interesting seeing Finnish, Hungarian and Polish documents in the samples. I sent them to a couple of friends and so far as they...
(disclaimer for the following: I am dyscalculic so I stumbled into this one more or less accidentally at 2am while making [a combinatorial analogue of Anki that represents concepts latently...
@ddh0 My understanding is that contrastive search decoding just does this at each step: 1. Take all the tokens in the input and mean_pool their embeddings 2. Look at the...
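For reference, here is a minimal sketch of the per-step scoring rule from the published contrastive search formulation, which penalises a candidate by its maximum cosine similarity to the hidden states of the tokens already generated (rather than a mean-pooled embedding). The tensors, `alpha`, and the choice of top-k candidates are placeholders, not anything confirmed in the thread above.

```python
import torch
import torch.nn.functional as F

def contrastive_search_step(context_hidden: torch.Tensor,
                            candidate_hidden: torch.Tensor,
                            candidate_logprobs: torch.Tensor,
                            alpha: float = 0.6) -> torch.Tensor:
    """Pick one of the top-k candidate tokens for a single decoding step.

    context_hidden:     (seq_len, dim) hidden states of the context so far
    candidate_hidden:   (k, dim) hidden state each candidate would contribute
    candidate_logprobs: (k,) model log-probabilities of the k candidates
    """
    # Degeneration penalty: max cosine similarity of each candidate
    # to any token representation already in the context
    ctx = F.normalize(context_hidden, dim=-1)    # (seq_len, dim)
    cand = F.normalize(candidate_hidden, dim=-1)  # (k, dim)
    sim = cand @ ctx.T                            # (k, seq_len)
    penalty = sim.max(dim=-1).values              # (k,)

    # Trade off model confidence against repeating the context
    scores = (1.0 - alpha) * candidate_logprobs.exp() - alpha * penalty
    return scores.argmax()  # index of the chosen candidate among the top-k
```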