
Add dataset_validation tests

daverigby opened this issue 1 year ago · 2 comments

Problem

We have at least one dataset with inconsistencies: `langchain-python-docs-text-embedding-ada-002` contains an extra, duplicated .parquet file, so the dataset ends up with twice the number of vectors it should have.
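One quick way to spot this kind of inconsistency by hand is to sum the row counts of a dataset's parquet files and compare against the documented total. A minimal sketch, assuming a local copy of the data files; the directory path shown is illustrative:

```python
from pathlib import Path

import pyarrow.parquet as pq

# Hypothetical local checkout of the affected dataset's document files.
data_dir = Path("langchain-python-docs-text-embedding-ada-002/documents")

# Sum row counts from the parquet footers (no need to load the data).
total = sum(
    pq.ParquetFile(path).metadata.num_rows
    for path in sorted(data_dir.glob("*.parquet"))
)
print(total)  # with the duplicated file present this is 2x the expected count
```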

Solution

Add tests to validate that all public datasets are valid. These live in their own directory, as they can be slow to run and need a large amount of RAM to hold each dataset.

The first test added (`test_all_datasets_valid`) performs some basic validation of each dataset (a sketch of the test follows the list below):

  • Does the number of vectors in the data files match what the metadata says?

  • Are there any duplicate ids?

This only checks datasets with 2M or fewer vectors, as larger ones require more than 32 GB of RAM to load and validate. This currently means two datasets are skipped:

  • Skipping dataset 'ANN_DEEP1B_d96_angular' which is larger than 2,000,000 vectors (has 9,990,000)

  • Skipping dataset 'msmarco-v1-bm25-allMiniLML6V2' which is larger than 2,000,000 vectors (has 8,841,823)
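For illustration, a minimal sketch of such a test, assuming the library's `list_datasets()`/`load_dataset()` helpers, that `Dataset.documents` lazily loads the data files into a pandas DataFrame with an `id` column, and that `metadata.documents` records the documented vector count (attribute names are illustrative where not confirmed by the source):

```python
import pytest
from pinecone_datasets import list_datasets, load_dataset

MAX_VECTORS = 2_000_000  # larger datasets need >32 GB of RAM to validate


# list_datasets() is called at collection time, once per test run.
@pytest.mark.parametrize("name", list_datasets())
def test_all_datasets_valid(name):
    ds = load_dataset(name)
    expected = ds.metadata.documents  # documented vector count
    if expected > MAX_VECTORS:
        pytest.skip(
            f"Skipping dataset '{name}' which is larger than "
            f"{MAX_VECTORS:,} vectors (has {expected:,})"
        )
    docs = ds.documents  # loads the .parquet data files into memory
    # Check 1: vector count in the data files matches the metadata.
    assert len(docs) == expected
    # Check 2: there are no duplicate ids.
    assert docs["id"].is_unique
```

Skipping before touching `ds.documents` keeps the oversized datasets from ever being loaded, which is what makes the 2M-vector cutoff practical on a 32 GB machine.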

Type of Change

  • [x] None of the above: new tests

daverigby · Feb 09 '24 16:02