Add tokenizer

Open Koeng101 opened this issue 1 year ago • 2 comments

This PR adds a tokenizer to the dnadesign library. It is primarily for tokenizing amino acid sequences for consumption by an LLM, in particular llm.c.

Koeng101 avatar Jun 19 '24 22:06 Koeng101

I'd like to make the shard-writer a little smaller and more specific: it should just receive tokens and write them. Maybe as a concurrent process.
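A minimal sketch of what such a narrower shard-writer could look like, assuming tokens arrive on a channel and a shard is just a fixed-size byte slice. The `shardWriter` function and its `flush` callback are hypothetical names for illustration, not the code in this PR.

```go
package main

import "fmt"

// shardWriter receives tokens on a channel and hands them off in fixed-size
// shards. flush is a hypothetical callback (for example, "write this shard to
// a file"); it must not retain the slice, because the buffer is reused.
func shardWriter(tokens <-chan uint8, shardSize int, flush func(index int, shard []byte) error) error {
	buf := make([]byte, 0, shardSize)
	index := 0
	var firstErr error
	for t := range tokens {
		if firstErr != nil {
			continue // keep draining so the producer never blocks
		}
		buf = append(buf, t)
		if len(buf) == shardSize {
			if err := flush(index, buf); err != nil {
				firstErr = err
				continue
			}
			index++
			buf = buf[:0]
		}
	}
	if firstErr != nil {
		return firstErr
	}
	if len(buf) > 0 {
		return flush(index, buf) // final, possibly short, shard
	}
	return nil
}

func main() {
	tokens := make(chan uint8)
	done := make(chan error, 1)

	// Run the writer concurrently with the producer.
	go func() {
		done <- shardWriter(tokens, 4, func(index int, shard []byte) error {
			fmt.Printf("shard %d: %v\n", index, shard)
			return nil
		})
	}()

	for i := 0; i < 10; i++ {
		tokens <- uint8(i)
	}
	close(tokens)

	if err := <-done; err != nil {
		panic(err)
	}
}
```

Keeping the writer ignorant of where tokens come from means the tokenizer and the disk I/O can run in separate goroutines.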

I want to be able to encode the Pfam family as a prefix to each peptide: [PFAM][AA seq][EOS]. The idea is that you could prompt the model with a PFAM token and have it predict the amino acid tokens that follow.
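A rough sketch of the [PFAM][AA seq][EOS] layout. Everything here is an assumption for illustration: the token IDs, the `encodeRecord` name, the dense `pfamIndex`, and the use of `uint16` so that family tokens have room beyond the 20 amino acids. The PR's actual vocabulary may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical vocabulary layout; the PR's real token IDs may differ.
const (
	eosToken   uint16 = 0
	aminoAcids        = "ACDEFGHIKLMNPQRSTVWY"
	aaOffset   uint16 = 1                                  // amino acids occupy IDs 1..20
	pfamOffset uint16 = aaOffset + uint16(len(aminoAcids)) // PFAM tokens start after them
)

// encodeRecord lays one record out as [PFAM][AA seq][EOS].
// pfamIndex is a hypothetical dense index assigned to each Pfam family.
func encodeRecord(pfamIndex uint16, seq string) ([]uint16, error) {
	tokens := make([]uint16, 0, len(seq)+2)
	tokens = append(tokens, pfamOffset+pfamIndex)
	for i := 0; i < len(seq); i++ {
		idx := strings.IndexByte(aminoAcids, seq[i])
		if idx < 0 {
			return nil, fmt.Errorf("unknown amino acid %q at position %d", seq[i], i)
		}
		tokens = append(tokens, aaOffset+uint16(idx))
	}
	tokens = append(tokens, eosToken)
	return tokens, nil
}

func main() {
	tokens, err := encodeRecord(0, "MKV")
	if err != nil {
		panic(err)
	}
	fmt.Println(tokens)
}
```

At inference time, the prompt would be just the single PFAM token, with the model sampling amino acid tokens until it emits EOS.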

Koeng101 avatar Jun 23 '24 05:06 Koeng101

According to https://www.biorxiv.org/content/10.1101/2024.06.06.597716v1.full.pdf: "Using the UniParc database with 250 million protein sequences, research on ESM [72] shows that the datasets UR50/S and UR50/D, with 45M and 65M unique sequences respectively, outperform Uniref100 in perplexity (PPL) on a ~670M parameter MLM model."

If you take a look at figure 1 from that paper, they show that there are quite significant diminishing returns from using anything beyond UniRef50. It notes later that UniRef90/50 are basically the best. This is interesting for training sparser models.

In UniRef90 there are roughly 65B tokens. Encoded as uint8 (one byte per token), that's about 65 GB, or roughly 60 GiB, plus I bet I could shave off a little if I zstd-compressed it.
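The storage estimate can be checked with a few lines of Go. This is only the arithmetic, assuming one byte per token and the rough 65B token count above; `datasetSize` is a hypothetical helper, and the compression gain is left out because it would have to be measured.

```go
package main

import "fmt"

// datasetSize returns the storage needed for the given number of tokens,
// both in GB (10^9 bytes) and GiB (2^30 bytes).
func datasetSize(tokens, bytesPerToken uint64) (gb, gib float64) {
	total := float64(tokens * bytesPerToken)
	return total / 1e9, total / (1 << 30)
}

func main() {
	// Assumed figures from the comment above: ~65B tokens, uint8 encoding.
	gb, gib := datasetSize(65_000_000_000, 1)
	fmt.Printf("%.1f GB, %.1f GiB\n", gb, gib)
}
```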

Koeng101 avatar Jul 04 '24 04:07 Koeng101

I don't really have the time to pursue this. The code works, but I'd like more documentation, so I'm going to close for now.

Koeng101 avatar Oct 24 '24 22:10 Koeng101