Using a custom dataset fails with a 'NoneType' object error
Hello, I am trying to execute vectordbbench with a custom dataset. For ease of use, I used the openAI50k dataset parquet files that were downloaded during a previous run into /var/vectordb_bench/datasets.
When I execute vectordbbench pgvectorhnsw --config-file custom_config.yml, it throws the following error:
INFO: INIT_SEARCH_RUNNER (task_runner.py:259) (3057207)
WARNING: test_data None (task_runner.py:260) (3057207)
WARNING: Failed to run performance case, reason = 'NoneType' object has no attribute 'columns' (task_runner.py:192) (3057207)
Traceback (most recent call last):
File "/home/VectorDBBench/vectordb_bench/backend/task_runner.py", line 179, in _run_perf_case
self._init_search_runner()
File "/home/VectorDBBench/vectordb_bench/backend/task_runner.py", line 261, in _init_search_runner
log.info(f"test_data {self.ca.dataset.test_data.columns}")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'columns'
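A quick way to narrow this down is to list what is actually in the dataset directory; test_data ends up None when no test file is loaded. This is a minimal diagnostic sketch, and the directory path plus the expectation of a test parquet file are assumptions based on the config shown further down, not confirmed loader behavior:

```python
import os
import pandas as pd

# Adjust to the custom_dataset_dir from the config file.
dataset_dir = "/tmp/vectordb_bench/dataset/openai/openai_small_50k/"

# Print every parquet file with its shape and columns; if no test file
# shows up here, test_data will be None inside VectorDBBench.
for name in sorted(os.listdir(dataset_dir)):
    if name.endswith(".parquet"):
        df = pd.read_parquet(os.path.join(dataset_dir, name))
        print(name, df.shape, list(df.columns))
```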
How do I go ahead and use the custom dataset?
@nsai1 It looks like you downloaded the code via git clone. Could you show me the current git log? I couldn't find the line in the latest code.
File "/home/VectorDBBench/vectordb_bench/backend/task_runner.py", line 261, in _init_search_runner log.info(f"test_data {self.ca.dataset.test_data.columns}")
Maybe fetching the latest code directly will help~
This is the git log:
commit 6a634163101396f79272908f7261e69886b0d91e (HEAD -> main, origin/main, origin/HEAD)
Author: Teynar <[email protected]>
Date: Wed Dec 4 03:09:57 2024 +0100
Add Milvus auth support through user_name and password fields (#416)
* tweak(milvus): add auth support through user_name and password fields (like zilliz cloud)
* tweak(frontend): make username and password for milvus optional fields
commit 4bb299424099190e6e833746bb74a34c9eb9361a
Author: Sheharyar Ahmad <[email protected]>
Date: Fri Nov 29 18:59:44 2024 +0500
fix: invalid value for --max-num-levels when using CLI.
This is the error from line 258 of task_runner.py:
2024-12-11 04:50:46,807 | WARNING: Failed to run performance case, reason = 'NoneType' object is not subscriptable (task_runner.py:191) (3188648)
Traceback (most recent call last):
File "/data1/nsai/vectordbbench2/vectordb_bench/backend/task_runner.py", line 178, in _run_perf_case
self._init_search_runner()
File "/data1/nsai/vectordbbench2/vectordb_bench/backend/task_runner.py", line 258, in _init_search_runner
test_emb = np.stack(self.ca.dataset.test_data["emb"])
~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
TypeError: 'NoneType' object is not subscriptable
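For context, line 258 expects the test split to have loaded as a DataFrame whose emb column holds one vector per row, so np.stack can rebuild the (n, dim) query matrix. A small illustration of that expectation (the sample shapes are chosen arbitrarily):

```python
import numpy as np
import pandas as pd

# Shape of data that np.stack(test_data["emb"]) expects: an 'emb' column
# holding one list-of-floats vector per row.
test_data = pd.DataFrame({
    "id": [0, 1],
    "emb": [[0.1] * 1536, [0.2] * 1536],
})
test_emb = np.stack(test_data["emb"])
print(test_emb.shape)  # (2, 1536)

# The TypeError above means test_data itself was None, i.e. the test
# parquet was never loaded, not that the 'emb' column was malformed.
```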
This is a snippet of my config file:
case_type: PerformanceCustomDataset
custom_case_name: test_case
custom_case_description: this is a customized case
custom_dataset_name: test_openai
custom_dataset_dir: /tmp/vectordb_bench/dataset/openai/openai_small_50k/
custom_dataset_size: 50000
custom_dataset_dim: 1536
custom_dataset_metric_type: "COSINE"
custom_dataset_file_count: 1
custom_dataset_use_shuffled: False
custom_dataset_with_gt: False
load_timeout: 1000000
optimize_timeout: 1000000
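Given that config, a cross-check worth running is whether the files on disk actually match custom_dataset_dim and contain the emb column the runner stacks. A sketch, assuming the test split is named test.parquet (the file name is an assumption, not confirmed from the source):

```python
import numpy as np
import pandas as pd

dataset_dir = "/tmp/vectordb_bench/dataset/openai/openai_small_50k/"
expected_dim = 1536  # custom_dataset_dim from the config above

# "test.parquet" is an assumed file name; adjust to whatever is on disk.
test = pd.read_parquet(dataset_dir + "test.parquet")
emb = np.stack(test["emb"])
assert emb.shape[1] == expected_dim, f"dim mismatch: {emb.shape[1]}"
print("queries:", emb.shape[0], "dim:", emb.shape[1])
```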
If I try the same with a completely different dataset, with files populated according to the requirements, I get the same NoneType error.
Closing the issue; fetching the latest code directly helped.
Has this problem been solved?
@LoveYou3000 could you please share more details about the issue you are experiencing?
version:
log:
milvus-config.yml:
milvushnsw:
  case_type: PerformanceCustomDataset
  db_label: custom_128D_10K
  custom_case_name: custom_128D_10K
  custom_case_description: custom_128D_10K
  custom_dataset_name: custom_128D_10K
  custom_dataset_dir: /Users/zhang/Code/github.com/vectordb-bench/data/custom/vector_128D_10000C
  custom_dataset_size: 10000
  custom_dataset_dim: 128
  custom_dataset_metric_type: L2
  custom_dataset_file_count: 1
  uri: http://localhost:19530
  m: 30
  ef_construction: 360
  ef_search: 100
data: random vectors generated by np.random; code:

```python
import os
import numpy as np
import pandas as pd

def get_custom_case_file_path(dim, total, file_path, file_name):
    directory = f'{file_path}/custom/vector_{dim}D_{total}C'
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, file_name)

def generate_custom_case(num, dim, file_path):
    train_file_path = get_custom_case_file_path(dim, num, file_path, 'train.parquet')
    test_file_path = get_custom_case_file_path(dim, num, file_path, 'test.parquet')
    if not os.path.exists(train_file_path) or not os.path.exists(test_file_path):
        generate_parquet_file(num, dim, train_file_path, test_file_path)

def generate_parquet_file(total_num, dim, train_path, test_path, train_ratio=0.8):
    np.random.seed(42)
    data = np.random.random((total_num, dim)).astype(np.float32)
    df = pd.DataFrame({
        'id': np.arange(total_num),
        'emb': data.tolist()
    })
    train_df = df.sample(frac=train_ratio, random_state=42)
    test_df = df.drop(train_df.index)
    train_df.to_parquet(train_path)
    test_df.to_parquet(test_path)

generate_custom_case(10000, 128, './data')
```
cmd: vectordbbench milvushnsw --config-file milvus-config.yml --skip-custom-dataset-with-gt
@alwayslove2013 Have you got a clue about this problem? What should I do to use this feature normally?
@LoveYou3000 Thank you for bringing up these issues.
- vectordbbench currently does not support a custom_dataset_dir containing uppercase letters. I will address this later, but for now, please switch it to lowercase.
- There is a bug with custom_dataset when with_gt=False, which has been fixed in PR #511. You can fetch the latest PR, or I will release v0.0.27 once it is merged.
Thank you for your understanding!
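Until that release, one possible workaround is to generate a ground-truth file by brute force and run with with_gt=True instead. The sketch below computes exact L2 nearest neighbors for the random data generated above; the output file name neighbors.parquet and its columns id / neighbors_id are assumptions modeled on the bundled datasets, not confirmed from the source:

```python
import numpy as np
import pandas as pd

base = "./data/custom/vector_128d_10000c/"  # lowercased per the note above
train_df = pd.read_parquet(base + "train.parquet")
test_df = pd.read_parquet(base + "test.parquet")

train = np.stack(train_df["emb"])  # (n_train, 128)
k = 100
neighbors = []
for q in np.stack(test_df["emb"]):
    # Exact squared-L2 distances to every train vector, then the top-k ids.
    dist = ((train - q) ** 2).sum(axis=1)
    neighbors.append(train_df["id"].to_numpy()[np.argsort(dist)[:k]].tolist())

# Assumed ground-truth layout: one row per query with its neighbor ids.
pd.DataFrame({"id": test_df["id"].to_numpy(), "neighbors_id": neighbors}) \
    .to_parquet(base + "neighbors.parquet")
```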
Sure, thanks.