
Various Scale Things

Open mitchellgordon95 opened this issue 2 years ago β€’ 8 comments

Hey Lucid,

I've been working on scaling the DB up to contain the whole Pile in my free time. En route to this, I've made a few changes that you might be interested in merging:

  • Ditch autofaiss in favor of manually constructing the FAISS index to be as close to SCANN as possible
  • Add support for training the index on GPUs
  • Parallelize chunks_to_embeddings_ by adding a "worker_id" param
  • chunks_to_precalculated_knn_ should be able to reuse pre-computed embeds etc.
  • Add a BertEmbeds class that supports memmap'd files > 16 TB (max file size on ext4)
  • Add support for going from jsonl->chunks (which is what the Pile is) in addition to txt->chunks (just for convenience)
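For context on the first bullet, the "manually constructed FAISS index as close to SCANN as possible" idea can be sketched with FAISS's index-factory string: SCANN is roughly an OPQ rotation plus an IVF coarse quantizer plus product quantization. A minimal sketch of choosing those knobs (the `nlist ≈ sqrt(n)` heuristic and the 64-byte PQ code are illustrative assumptions, not the PR's actual settings):

```python
import math

def scann_like_factory(n_vectors: int, pq_bytes: int = 64) -> str:
    """Build a FAISS index-factory string roughly mimicking SCANN:
    OPQ rotation + IVF coarse quantizer + product quantization.
    nlist ~ sqrt(n), rounded to a power of two, is a common heuristic
    for the number of IVF cells."""
    nlist = 2 ** round(math.log2(math.sqrt(n_vectors)))
    return f"OPQ{pq_bytes},IVF{nlist},PQ{pq_bytes}"

# e.g. for 64M training vectors:
print(scann_like_factory(64_000_000))  # -> OPQ64,IVF8192,PQ64
```

The resulting string would be passed to `faiss.index_factory(dim, ...)`, and the trained CPU index can be cloned onto GPUs with `faiss.index_cpu_to_all_gpus` for faster training/search.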

Index creation at scale isn't really tested (but I have tested it at smaller scales). I'm running embedding at scale right now and I think it works. Anyway, I don't really expect you to just merge this but figured I'd mention it before I get too far off master.
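The jsonl→chunks path mentioned in the list above could look roughly like this (a hedged sketch only: the `"text"` field name matches The Pile's jsonl layout, but the fixed character-level chunking is an illustrative simplification of the repo's token-level chunking):

```python
import json

def jsonl_to_chunks(lines, chunk_len=2048):
    """Yield fixed-size chunks from a .jsonl corpus like The Pile,
    where each line is a JSON object with a "text" field."""
    for line in lines:
        text = json.loads(line)["text"]
        for i in range(0, len(text), chunk_len):
            yield text[i:i + chunk_len]

docs = ['{"text": "abcdef"}', '{"text": "xyz"}']
print(list(jsonl_to_chunks(docs, chunk_len=4)))  # -> ['abcd', 'ef', 'xyz']
```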

mitchellgordon95 avatar May 07 '22 23:05 mitchellgordon95

Also random, but did you work on Bliss at Uber? I think I might have been an intern on your team lol.

mitchellgordon95 avatar May 07 '22 23:05 mitchellgordon95

@mitchellgordon95 ohhh my god, yes, I worked on Bliss ... I think I remember you now! Lol 🦦 🦦 🦦 will never forget the otter branding

lucidrains avatar May 08 '22 02:05 lucidrains

@mitchellgordon95 this looks good, but i hesitate to merge it because of so many changes. we could always keep it open and i can gradually incorporate some of the ideas (parallelizing chunks to embedding is a great idea!)

thank you for sharing regardless!

lucidrains avatar May 08 '22 02:05 lucidrains

@mitchellgordon95 how much of a speed up are you seeing removing autofaiss in favor of something closer to scann? some benchmarks would definitely help sway me towards more complicated code :)

lucidrains avatar May 08 '22 02:05 lucidrains

Sure! I'm planning on testing out the indexing as soon as the embedding is finished, so I can run some benchmarks on auto-faiss in parallel.

It's actually been so many weekends since I switched away from auto-faiss that I forget exactly why I did it πŸ˜…, but I know I must have had a good reason because it was a PITA to set up.

I think one reason was that auto-faiss doesn't support index training on GPUs (which can be very slow for 64M training examples). Another reason was that I wanted to make sure we fully optimized memory usage, since The Pile is around 5.8B chunks. Even with full PQ compression etc., it still ends up being 8 bytes per embedding, ~= 43 GB of RAM to store the index.
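The back-of-the-envelope memory math above checks out: with product quantization down to 8 bytes per vector, ~5.8B chunks need roughly 43 GiB just for the PQ codes (ignoring IVF/OPQ overhead):

```python
n_chunks = 5_800_000_000   # ~ The Pile at chunk granularity
bytes_per_code = 8         # PQ-compressed embedding
total_gib = n_chunks * bytes_per_code / 2**30
print(f"{total_gib:.1f} GiB")  # -> 43.2 GiB
```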

mitchellgordon95 avatar May 08 '22 05:05 mitchellgordon95

@mitchellgordon95 ok cool! i'll take a look at the new faiss indexing method this coming week. benchmarks would definitely help! i definitely like your other changes, so if you are willing to break them up into separate PRs, we could merge them in before bringing in https://github.com/lucidrains/RETRO-pytorch/pull/22/files#diff-91e76e2663878e2a72d63398db46c8fa835e402b4f44c0c87010b48f790fc021R320

thank you for all this!

lucidrains avatar May 09 '22 15:05 lucidrains

@mitchellgordon95 who are you training RETRO for btw? are you working at latitude games? not still at Uber I hope lol

lucidrains avatar May 09 '22 17:05 lucidrains

Yeah I'm at Latitude. It's not a priority project, but I've used my last 3 hackathons to work on it lol

mitchellgordon95 avatar May 09 '22 17:05 mitchellgordon95

@lucidrains @mitchellgordon95 hi gang! I have been working on a training loop for this code, which you can see here: https://github.com/artificialwisdomai/origin/pull/50

We are doing something atypical, and may need to fork and reimplement both of your implementations. I am aware of the minimum requirements of the license, but I wanted to ask, would you take offense? We will reference the upstreams that generated our ideas (ie this repo, and this PR).

We may not submit the work upstream as it would be a fresh start, although our code is licensed with ASL2.

Also, do either of you have any examples of inferencing or use cases where this type of retroformer is of use?

Thank you, -steve

sdake avatar Jul 04 '23 18:07 sdake

Hello,

I have done a basic benchmark of training. If you can provide other benchmark suggestions, happy to provide A/B/A comparisons and report in this PR.

Overview

Artificial Wisdomβ„’ Retrieval Transformer benchmark results

  • Test A: PR #22
  • Test B: https://github.com/lucidrains/RETRO-pytorch@ab3c4a6

System under test

baseline

wallclock

(baseline) sdake@beast-06:~/repos/origin/retrieval$ REPROCESS=1 python train.py
Artificial Wisdomβ„’ Retreival Transformer Training
β€’ retrieval_model=artificialwisdomai/retroformer β€’ foundation_model=mosaicml/mpt30b β€’
Epoch 0 100%   ━━━━━━━━━━━━━━━━━━━━━━ β€’ retrieved=65568 β€’ loss=3.89 β€’ 0:14:15 β€’ 0:00:00

observation

  • GPU consumed memory: 33.2 GB @ 1 A40
  • GPU utilization: 50-70% @ 1 A40

PR rework

wallclock:

Artificial Wisdomβ„’ Retreival Transformer Training
β€’ retrieval_model=artificialwisdomai/retroformer β€’ foundation_model=mosaicml/mpt30b β€’
Epoch 0 100%   ━━━━━━━━━━━━━━━━━━━━━━ β€’ retrieved=65568 β€’ loss=3.82 β€’ 0:14:25 β€’ 0:00:00

Observation

  • GPU consumed memory: 33.2 GB @ 1 A40
  • GPU utilization: 50-70% @ 1 A40
  • Three other GPUs: 2 GB @ 3 A30
  • GPU utilization on three A30s = 0%.
  • a system without autofaiss is valuable for many reasons.

My general observation is that FAISS does not appear to use GPU compute, only GPU memory. There may be a defect in our build, or in the implementation.

cc @rstarmer @MostAwesomeDude.

Thank you, -steve

sdake avatar Jul 04 '23 19:07 sdake

yes absolutely! the giving is unconditional, thus MIT

lucidrains avatar Jul 04 '23 19:07 lucidrains

have you seen Nvidia's follow-up Retro2 paper yet?

lucidrains avatar Jul 04 '23 19:07 lucidrains

@lucidrains I haven't; if you can share it or have the title, I would love to see it! Robert found [Megatron RETRO](https://github.com/NVIDIA/Megatron-LM#retro), although I don't know if this is what you were referencing.

We are building a library that composes the following things:

  • Three retrieval transformer architectures
  • Three large language models
  • Three vector stores

You pick one from each of the three categories and can use them in composition. Unlike langchain, our work is designed around using shared memory for API communication instead of HTTPS, i.e. more like the monolithic Linux kernel (kernel.org) and less like the microkernel-style Windows NT.

Would love to have further suggestions for the idea proposed here. I understand that if the MIT license requirements are met, then the software is licensed under those terms. As an open source dev, as long as someone using software I wrote complies with the terms, I was always good with however they used it.

What I am asking is a little different. If I were to use your code as a reference for A/B/A testing, and also to learn from, but didn't integrate your library, would you take offense? I never did, and I was and am all-in on open source. Many don't understand the finer mechanics of ASL/MIT/BSD-revised licensing, and do take offense. It sounds like you are very experienced in this area, which is awesome!

Your code is an invaluable resource to the engineering community. Thank you for your gifts.

Thank you, -steve

sdake avatar Jul 05 '23 01:07 sdake

I did notice, after switching to a larger dataset, specifically https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample, that the act of building the embedding indexes is "different", and possibly faster. I will publish an updated comparison when I have one to give.

sdake avatar Jul 05 '23 01:07 sdake

thanks for the kind words

you are free to use this repository however you wish, no conditions

https://arxiv.org/abs/2304.06762 this is the paper. I haven't gone through it, but the authors apparently found some further simplifications. what they have in the Megatron repo should be this Retro++

lucidrains avatar Jul 05 '23 01:07 lucidrains