
Index subset of Pile dataset with doc2query + BM25

Open lakshaykc opened this issue 2 years ago • 3 comments

Create an index of doc2query + BM25 on a small subset of Pile for testing. I'm working with @ontocord to get this going.

The plan is to use one shard of the Pile and generate 15-25 queries per document with doc2query, using msmarco-tf-small. I will use Elasticsearch for BM25 indexing.
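To make the idea concrete, here is a minimal pure-Python sketch of doc2query-style expansion followed by BM25 scoring. It is not the planned pipeline: the generated queries are hand-written stand-ins for model output, the `expand` and `BM25` names are hypothetical, and the scoring uses the textbook BM25 formula with `k1=1.2` and `b=0.75`, which may differ in detail from what Elasticsearch computes.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real index would use a proper analyzer.
    return text.lower().split()


def expand(doc, queries):
    # doc2query-style expansion: append the generated queries to the document text.
    return doc + " " + " ".join(queries)


class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.n = len(docs)
        self.avgdl = sum(self.lens) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term):
        df = self.df.get(term, 0)
        return math.log(1 + (self.n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        tf, dl = self.tfs[i], self.lens[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Document positions ordered from best to worst score.
        return sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)


docs = [
    "The mitochondria produces ATP through oxidative phosphorylation.",
    "Paris is the capital of France.",
]
# Hand-written stand-ins for doc2query output (assumption, not model output).
generated = [
    ["how do cells make energy", "what is the powerhouse of the cell"],
    ["what is the capital of france"],
]

plain = BM25(docs)
expanded = BM25([expand(d, q) for d, q in zip(docs, generated)])
query = "how do cells make energy"
```

With these toy documents, the query shares no terms with the original text of either document, so only the expanded index can match it. That vocabulary gap is what doc2query is meant to close.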

Once the data is indexed, there will be a Python script to talk to the Elasticsearch server, and the data can be hosted on Google Drive for importing. Setting up Elasticsearch with Docker is quite straightforward.
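As a rough sketch of what such a script might send, the helpers below build the request bodies for the Elasticsearch `_bulk` and `_search` endpoints using only the standard library. The index name `pile-shard-test`, the field name `expanded_text`, and the helper names are assumptions for illustration, and no network call is made here.

```python
import json


def bulk_body(index, docs):
    # Build an NDJSON payload for POST /_bulk: one action line, then one source line.
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


def match_query(field, text, size=10):
    # Body for POST /<index>/_search: a full-text match query, scored with BM25 by default.
    return {"size": size, "query": {"match": {field: text}}}


# Hypothetical index and field names, for illustration only.
payload = bulk_body(
    "pile-shard-test",
    [
        ("0", {"expanded_text": "first document plus its generated queries"}),
        ("1", {"expanded_text": "second document plus its generated queries"}),
    ],
)
search = match_query("expanded_text", "generated queries", size=5)
```

The bulk payload would be posted with the `Content-Type: application/x-ndjson` header, and the search body as regular JSON.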

Any comments are welcome.

lakshaykc avatar Jan 12 '23 03:01 lakshaykc

Thank you, @lakshaykc. This is a good first experiment to see whether we can use this as a basis for testing the addition of background context to our model. Thank you!

huu4ontocord avatar Jan 12 '23 03:01 huu4ontocord

I indexed the whole Pile once with Elasticsearch. The index needed 1.2 TB and took 3 days to build on 4 cores with 32 GB of RAM. Single queries took on the order of minutes. I transferred the data to a machine with 12 cores, 128 GB of RAM, and one PCIe Gen4 NVMe SSD, and queries still took around a minute. I'm not sure I had the best setup, but it seemed that you'd need high-memory machines and/or a lot of I/O (multiple NVMe SSDs). Amazon's Elasticsearch instances are configured pretty much like this. That was all on a single node, of course.

I would recommend against using Google Drive for anything serious. By the way, the Pile v2 is nearing completion. The Pile is also oversampled, so you will have duplicated documents. If we build such a search backend, I think we should regularly ingest the most recent dumps of the components (like StackExchange). The processing code should already exist and could be ported from the pile / pile-v2 repos, as far as I can tell.
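On the duplication point, a minimal sketch of dropping exact duplicates before indexing could look like the following. It only catches documents that are identical after whitespace and case normalization. Near-duplicates would need something like MinHash, which is not shown, and the `dedupe` helper is a hypothetical name rather than code from the pile repos.

```python
import hashlib


def dedupe(docs):
    # Yield each document once, keyed by a hash of its normalized text.
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


unique = list(dedupe(["a b", "A  b", "c"]))
```

Storing hashes rather than full texts keeps the memory cost per document constant, which matters at the scale of a full Pile shard.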

flowpoint avatar Feb 05 '23 13:02 flowpoint

Thanks, @flowpoint. These are very helpful comments. Currently, we are planning to index only one shard of the Pile dataset, mostly to build and test doc2query for retrieval. For actual use cases that hook up Elasticsearch with Open Assistant for retrieval, you are right, it won't be trivial. DeepMind has an interesting paper on a similar problem.

And I agree that eventually we would need to incorporate sources on a regular basis to stay up to date.

lakshaykc avatar Feb 07 '23 22:02 lakshaykc

Closing old data issue.

andreaskoepf avatar Jun 14 '23 08:06 andreaskoepf