pinecone-datasets
[Bug] Unable to load yfcc-10M-filter-euclidean dataset
Is this a new bug?
- [X] I believe this is a new bug
- [X] I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
Loading the yfcc-10M-filter-euclidean dataset fails with the error: FileNotFoundError: Dataset does not exist. Please check the path or dataset_id
Expected Behavior
The dataset should load, since it is listed by list_datasets().
Steps To Reproduce
from pinecone_datasets import list_datasets, load_dataset
datasets = list_datasets()
dataset_name = "yfcc-10M-filter-euclidean"
assert dataset_name in datasets, "Dataset does not exist!"
dataset = load_dataset(dataset_name)
Relevant log output
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[6], line 1
----> 1 load_dataset('yfcc-10M-filter-euclidean')
File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/public.py:59, in load_dataset(dataset_id, **kwargs)
57 raise FileNotFoundError(f"Dataset {dataset_id} not found in catalog")
58 else:
---> 59 return Dataset.from_catalog(dataset_id, **kwargs)
File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:89, in Dataset.from_catalog(cls, dataset_id, catalog_base_path, **kwargs)
83 catalog_base_path = (
84 catalog_base_path
85 if catalog_base_path
86 else os.environ.get("DATASETS_CATALOG_BASEPATH", cfg.Storage.endpoint)
87 )
88 dataset_path = os.path.join(catalog_base_path, f"{dataset_id}")
---> 89 return cls(dataset_path=dataset_path, **kwargs)
File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:190, in Dataset.__init__(self, dataset_path, **kwargs)
188 self._dataset_path = dataset_path
189 if not self._fs.exists(self._dataset_path):
--> 190 raise FileNotFoundError(
191 "Dataset does not exist. Please check the path or dataset_id"
192 )
193 else:
194 self._fs = None
FileNotFoundError: Dataset does not exist. Please check the path or dataset_id
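For anyone debugging this, the path resolution visible in the traceback (dataset.py lines 83-88) can be reproduced in isolation to see which location is being checked. This is only a sketch of that logic: `resolve_dataset_path` and `PLACEHOLDER_ENDPOINT` are hypothetical names, and the placeholder stands in for `cfg.Storage.endpoint`, whose actual value is not shown in the log.

```python
import os

# Placeholder only: the library reads its real default from cfg.Storage.endpoint,
# whose value does not appear in the traceback.
PLACEHOLDER_ENDPOINT = "gs://example-catalog"


def resolve_dataset_path(dataset_id, catalog_base_path=None,
                         default_endpoint=PLACEHOLDER_ENDPOINT):
    """Mirror the lookup order shown in the traceback:
    explicit argument, then DATASETS_CATALOG_BASEPATH, then the default endpoint."""
    base = (
        catalog_base_path
        if catalog_base_path
        else os.environ.get("DATASETS_CATALOG_BASEPATH", default_endpoint)
    )
    return os.path.join(base, f"{dataset_id}")


# Shows the path that would be passed to the existence check.
print(resolve_dataset_path("yfcc-10M-filter-euclidean"))
```

If the printed path differs from where the data actually lives, setting `DATASETS_CATALOG_BASEPATH` or passing `catalog_base_path` changes the base, as the traceback shows both are consulted before the default.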
Environment
- **OS**: macOS 14.4.1
- **Language version**: Python 3.10.10
- **Pinecone client version**: 0.7.0
Additional Context
Looking at the catalog metadata for this dataset:
from pinecone_datasets import list_datasets, load_dataset
datasets = list_datasets(as_df=True)
dataset_name = "yfcc-10M-filter-euclidean"
datasets.query('name == @dataset_name').to_dict()
The result shows that the catalog entry has no bucket set (bucket is None), so the data appears to be missing from the bucket:
{'name': {27: 'yfcc-10M-filter-euclidean'},
'created_at': {27: '2023-08-24 13:51:29.136759'},
'documents': {27: 10000000},
'queries': {27: 100000},
'source': {27: 'big-ann-challenge 2023'},
'license': {27: None},
'bucket': {27: None},
'task': {27: None},
'dense_model': {27: {'name': 'yfcc', 'tokenizer': None, 'dimension': 192}},
'sparse_model': {27: None},
'description': {27: 'Dataset from the 2023 big ann challenge - filter track. Distance: Euclidean. see https://big-ann-benchmarks.com/neurips23.html'},
'tags': {27: None},
'args': {27: None}}
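To check whether other catalog entries are affected in the same way, a small helper can scan a dict of the shape shown above for names whose bucket is None. This is a rough sketch, not part of pinecone_datasets: `entries_without_bucket` is a hypothetical name, and it assumes missing values appear as None, as they do in the output above (pandas may represent them as NaN in other cases).

```python
def entries_without_bucket(catalog):
    """Return dataset names whose 'bucket' value is None.

    `catalog` is a column-oriented dict such as DataFrame.to_dict() produces:
    {"name": {row: value, ...}, "bucket": {row: value, ...}, ...}
    """
    names = catalog.get("name", {})
    buckets = catalog.get("bucket", {})
    # A row with no bucket key at all is also treated as missing.
    return [name for row, name in names.items() if buckets.get(row) is None]


# Sample taken from the metadata shown above (only the relevant columns).
catalog = {
    "name": {27: "yfcc-10M-filter-euclidean"},
    "bucket": {27: None},
}
print(entries_without_bucket(catalog))  # ['yfcc-10M-filter-euclidean']
```

Passing `list_datasets(as_df=True).to_dict()` to the helper should list every entry with the same symptom, under the assumption about None stated above.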
Hello, I'm having the same issue. Has it been resolved?