
Using custom dataset fails with error None type object

Open nsai1 opened this issue 1 year ago • 9 comments

Hello, I am trying to execute vectordbbench with a custom dataset. For ease of use, I used the openAI50k dataset parquet files that were downloaded by a previous run into /var/vectordb_bench/datasets. When I execute vectordbbench pgvectorhnsw --config-file custom_config.yml, it throws the following error:

INFO: INIT_SEARCH_RUNNER (task_runner.py:259) (3057207)
WARNING: test_data None (task_runner.py:260) (3057207)
WARNING: Failed to run performance case, reason = 'NoneType' object has no attribute 'columns' (task_runner.py:192) (3057207)
Traceback (most recent call last):
  File "/home/VectorDBBench/vectordb_bench/backend/task_runner.py", line 179, in _run_perf_case
    self._init_search_runner()
  File "/home/VectorDBBench/vectordb_bench/backend/task_runner.py", line 261, in _init_search_runner
    log.info(f"test_data {self.ca.dataset.test_data.columns}")
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'columns'

How do I go ahead and use the custom dataset?

nsai1 avatar Dec 10 '24 22:12 nsai1

@nsai1 It looks like you downloaded the code via git clone. Could you show me the current git log? I couldn't find the line in the latest code.

File "/home/VectorDBBench/vectordb_bench/backend/task_runner.py", line 261, in _init_search_runner log.info(f"test_data {self.ca.dataset.test_data.columns}")

Maybe fetching the latest code directly will help~

alwayslove2013 avatar Dec 11 '24 02:12 alwayslove2013

This is the git log

commit 6a634163101396f79272908f7261e69886b0d91e (HEAD -> main, origin/main, origin/HEAD)
Author: Teynar <[email protected]>
Date:   Wed Dec 4 03:09:57 2024 +0100

    Add Milvus auth support through user_name and password fields (#416)

    * tweak(milvus): add auth support through user_name and password fields (like zilliz cloud)

    * tweak(frontend): make username and password for milvus optional fields

commit 4bb299424099190e6e833746bb74a34c9eb9361a
Author: Sheharyar Ahmad <[email protected]>
Date:   Fri Nov 29 18:59:44 2024 +0500

    fix: invalid value for --max-num-levels when using CLI.

This is the error from line 258 of task_runner.py:

2024-12-11 04:50:46,807 | WARNING: Failed to run performance case, reason = 'NoneType' object is not subscriptable (task_runner.py:191) (3188648)
Traceback (most recent call last):
  File "/data1/nsai/vectordbbench2/vectordb_bench/backend/task_runner.py", line 178, in _run_perf_case
    self._init_search_runner()
  File "/data1/nsai/vectordbbench2/vectordb_bench/backend/task_runner.py", line 258, in _init_search_runner
    test_emb = np.stack(self.ca.dataset.test_data["emb"])
                        ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
TypeError: 'NoneType' object is not subscriptable

This is a snippet of my config file

  case_type: PerformanceCustomDataset
  custom_case_name: test_case
  custom_case_description: this is a customized case
  custom_dataset_name: test_openai
  custom_dataset_dir: /tmp/vectordb_bench/dataset/openai/openai_small_50k/
  custom_dataset_size: 50000
  custom_dataset_dim: 1536
  custom_dataset_metric_type: "COSINE"
  custom_dataset_file_count: 1
  custom_dataset_use_shuffled: False
  custom_dataset_with_gt: False
  load_timeout: 1000000
  optimize_timeout: 1000000

If I try the same with a completely different dataset, with files populated according to the requirements, I get the same NoneType error.

nsai1 avatar Dec 11 '24 04:12 nsai1

Closing the issue; fetching the latest code directly helped.

nsai1 avatar Dec 11 '24 05:12 nsai1

Has this problem been solved?

LoveYou3000 avatar Apr 29 '25 07:04 LoveYou3000

@LoveYou3000 could you please share more details about the issue you are experiencing?

alwayslove2013 avatar Apr 29 '25 07:04 alwayslove2013

version: (screenshot attachment)

log: (screenshot attachment)

milvus-config.yml:

milvushnsw:
  case_type: PerformanceCustomDataset
  db_label: custom_128D_10K
  custom_case_name: custom_128D_10K
  custom_case_description: custom_128D_10K
  custom_dataset_name: custom_128D_10K
  custom_dataset_dir: /Users/zhang/Code/github.com/vectordb-bench/data/custom/vector_128D_10000C
  custom_dataset_size: 10000
  custom_dataset_dim: 128
  custom_dataset_metric_type: L2
  custom_dataset_file_count: 1
  uri: http://localhost:19530
  m: 30
  ef_construction: 360
  ef_search: 100

data: random vectors generated by np.random; code:

import os
import numpy as np
import pandas as pd

def get_custom_case_file_path(dim, total, file_path, file_name):
    directory = f'{file_path}/custom/vector_{dim}D_{total}C'
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, file_name)

def generate_custom_case(num, dim, file_path):
    train_file_path = get_custom_case_file_path(dim, num, file_path, 'train.parquet')
    test_file_path = get_custom_case_file_path(dim, num, file_path, 'test.parquet')
    if not os.path.exists(train_file_path) or not os.path.exists(test_file_path):
        generate_parquet_file(num, dim, train_file_path, test_file_path)

def generate_parquet_file(total_num, dim, train_path, test_path, train_ratio=0.8):
    np.random.seed(42)
    data = np.random.random((total_num, dim)).astype(np.float32)
    df = pd.DataFrame({
        'id': np.arange(total_num),
        'emb': data.tolist()
    })

    train_df = df.sample(frac=train_ratio, random_state=42)
    test_df = df.drop(train_df.index)

    train_df.to_parquet(train_path)
    test_df.to_parquet(test_path)

generate_custom_case(10000, 128, './data')

cmd: vectordbbench milvushnsw --config-file milvus-config.yml --skip-custom-dataset-with-gt

LoveYou3000 avatar Apr 29 '25 07:04 LoveYou3000

@alwayslove2013 Have you found a clue about this problem? What should I do to use this feature normally?

LoveYou3000 avatar Apr 30 '25 01:04 LoveYou3000

@LoveYou3000 Thank you for bringing up these issues.

  1. vectordbbench currently does not support a custom_dataset_dir containing uppercase letters. I will address this later; for now, please switch it to lowercase.

  2. There is a bug with custom_dataset when with_gt=False, which has been fixed in PR #511. You can fetch the latest PR, or I will release v0.0.27 once it is merged.
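
Until uppercase paths are supported, one workaround is to mirror the existing dataset directory to an all-lowercase path and point custom_dataset_dir at the copy. This is just a sketch, assuming lowercasing the whole path string is acceptable on your filesystem:

```python
import os
import shutil

def lowercase_dataset_dir(src: str) -> str:
    """Copy a custom dataset directory to an all-lowercase path.

    Workaround sketch for the restriction that custom_dataset_dir may not
    contain uppercase letters; the copy target is simply src.lower().
    """
    dst = src.lower()
    if dst != src and not os.path.exists(dst):
        shutil.copytree(src, dst)  # creates intermediate directories as needed
    return dst
```

For example, /Users/zhang/.../vector_128D_10000C would become /users/zhang/.../vector_128d_10000c, which you would then use in the config file.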

Thank you for your understanding!

alwayslove2013 avatar Apr 30 '25 05:04 alwayslove2013

Sure, thanks.

LoveYou3000 avatar Apr 30 '25 09:04 LoveYou3000