Catherine Koshka
This is also an issue for me.
Hi everyone, I have a notebook with a temporary solution to this issue here: https://github.com/fastelectronicvegetable/aitextgen_notebooks/blob/main/Encoding_very_large_text_files%20(2).ipynb It uses a much more efficient training and tokenisation process, and I was able to...
Never mind, all you need to do is train the tokeniser using YTTM, take the vocab file it outputs, strip out the numbers, and use it as the training file...
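As a rough sketch of that trick, assuming the YouTokenToMe (YTTM) Python API and placeholder file names (`corpus.txt`, `bpe.model`, `vocab_only.txt`): train a BPE model, dump just the subword strings (so no numeric ids have to be stripped by hand), and use that file as the downstream training file instead of the full corpus.

```python
import youtokentome as yttm

# Train a BPE model on the raw corpus (file names are placeholders)
yttm.BPE.train(data="corpus.txt", model="bpe.model", vocab_size=10000)

# Load the model and write out only the subword strings, dropping the
# numeric ids that the model file stores alongside them
bpe = yttm.BPE(model="bpe.model")
with open("vocab_only.txt", "w", encoding="utf-8") as f:
    for subword in bpe.vocab():
        f.write(subword + "\n")

# vocab_only.txt can then serve as the (much smaller) training file
```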
I've been following this project for quite a while now and I'm happy to see v3 finally happen. I will see if I can port it to rust and from...
v3, that's right. Though I could start with v2 since it might help me understand the changes in context. And ya, I think I know that feeling. Sometimes when I'm...
Just piggybacking on this - it was interesting seeing Finnish, Hungarian and Polish documents in the samples. I sent them to a couple of friends and so far as they...
(disclaimer for the following: I am dyscalculic so I stumbled into this one more or less accidentally at 2am while making [a combinatorial analogue of Anki that represents concepts latently...
@ddh0 My understanding is that contrastive search decoding just does this at each step: 1. Take all the tokens in the input and mean_pool their embeddings 2. Look at the...
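For reference, here is a minimal sketch of the per-step scoring rule from the published contrastive search formulation, which penalises a candidate by its maximum cosine similarity to the hidden states of the tokens already generated (rather than a mean-pooled embedding). The tensors, `alpha`, and the choice of top-k candidates are placeholders, not anything confirmed in the thread above.

```python
import torch
import torch.nn.functional as F

def contrastive_search_step(context_hidden: torch.Tensor,
                            candidate_hidden: torch.Tensor,
                            candidate_logprobs: torch.Tensor,
                            alpha: float = 0.6) -> torch.Tensor:
    """Pick one of the top-k candidate tokens for a single decoding step.

    context_hidden:     (seq_len, dim) hidden states of the context so far
    candidate_hidden:   (k, dim) hidden state each candidate would contribute
    candidate_logprobs: (k,) model log-probabilities of the k candidates
    """
    # Degeneration penalty: max cosine similarity of each candidate
    # to any token representation already in the context
    ctx = F.normalize(context_hidden, dim=-1)    # (seq_len, dim)
    cand = F.normalize(candidate_hidden, dim=-1)  # (k, dim)
    sim = cand @ ctx.T                            # (k, seq_len)
    penalty = sim.max(dim=-1).values              # (k,)

    # Trade off model confidence against repeating the context
    scores = (1.0 - alpha) * candidate_logprobs.exp() - alpha * penalty
    return scores.argmax()  # index of the chosen candidate among the top-k
```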