Open-Assistant
Index subset of Pile dataset with doc2query + BM25
Create a doc2query + BM25 index on a small subset of the Pile for testing. I'm working with @ontocord to get this going.
The plan is to use one shard of the Pile, generate 15-25 queries per document with doc2query (msmarco-t5-small), and use Elasticsearch for BM25 indexing.
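As a rough illustration of the query-generation step, here is a minimal sketch using a doc2query T5 checkpoint from Hugging Face. The exact model name (`doc2query/msmarco-t5-small-v1`), the sampling settings, and the query count are assumptions, not the final configuration.

```python
# Sketch: generate expansion queries for one Pile document with a doc2query T5 model.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "doc2query/msmarco-t5-small-v1"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

doc = "The Pile is an 825 GiB diverse, open-source language modelling dataset."
inputs = tokenizer(doc, max_length=512, truncation=True, return_tensors="pt")

# Sample ~20 queries per document (the plan above proposes 15-25).
outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,
    top_k=10,
    num_return_sequences=20,
)
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(queries)
```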
Once the data is indexed, a Python script will talk to the Elasticsearch server, and the index data can be hosted on Google Drive for importing. Setting up Elasticsearch with Docker is quite straightforward (a rough sketch follows below).
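For concreteness, here is a minimal sketch of what such a script could look like with the official Python Elasticsearch client: the document text and its generated queries are stored together, and BM25 (the default similarity for text fields) is used at search time. The index name, field names, and Docker image tag are placeholders, not the actual setup.

```python
# Sketch: index a document plus its doc2query expansions and query it with BM25.
# A local single-node server can be started with, for example:
#   docker run -p 9200:9200 -e "discovery.type=single-node" \
#       docker.elastic.co/elasticsearch/elasticsearch:7.17.9
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Store the original text and the generated queries in searchable text fields.
es.index(
    index="pile_shard0",
    id="doc-0",
    document={
        "text": "The Pile is an 825 GiB open-source language modelling dataset.",
        "expanded_queries": "what is the pile dataset; how big is the pile",
    },
)
es.indices.refresh(index="pile_shard0")

# BM25-scored retrieval over both fields.
hits = es.search(
    index="pile_shard0",
    query={
        "multi_match": {
            "query": "size of the pile dataset",
            "fields": ["text", "expanded_queries"],
        }
    },
)
print(hits["hits"]["hits"][0]["_source"]["text"])
```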
Any comments are welcome.
Thank you @lakshaykc. This is a good first experiment to see whether we can use this as a basis for adding background context to our model. Thank you!
I indexed the whole Pile once with Elasticsearch; I needed 1.2 TB for the index, and it took 3 days on 4 cores with 32 GB RAM. Single queries took on the order of minutes. I transferred the data to a 12-core, 128 GB RAM machine with 1x PCIe Gen4 NVMe SSD and queries still took around a minute. I'm not sure I had the best setup, but it seems you would need high-memory machines and/or a lot of IO (multiple NVMe SSDs). Amazon's Elasticsearch instances are configured pretty much like this. That was all single-node, of course.
I would recommend against using Google Drive for anything serious. By the way, the Pile v2 is nearing completion. The Pile is also oversampled, so you have duplicated documents. If we build such a search backend, I think we should regularly ingest the most recent dumps of the component datasets (like Stack Exchange). The processing code should be available and could be ported from the pile / pile-v2 repos, as far as I can tell.
Thanks @flowpoint, these are very helpful comments. Currently, we are planning to index only one shard of the Pile dataset, mostly to build and test out doc2query for retrieval. For actual use cases of hooking up Elasticsearch with Open-Assistant for retrieval, you are right, it won't be trivial. DeepMind has an interesting paper on a similar problem.
And I agree that eventually we would need to ingest the sources on a regular basis to stay up to date.
Closing old data issue.