new dataset generated with write_output()

Open francomarianardini opened this issue 4 years ago • 8 comments

Hello,

Thanks for your framework. Very interesting!

We want to define a new dataset, and we use your "write_output()" function defined in ann_benchmarks/datasets.py to convert numpy arrays into an hdf5 file. After building the data files, the framework fails when testing the algorithms on it. I got:

Trying to instantiate ann_benchmarks.algorithms.n2.N2(['angular', {'M': 36, 'efConstruction': 500}])
got a train set of size (1000 * 128)
got 1000 queries
Traceback (most recent call last):
  File "run_algorithm.py", line 3, in <module>
    run_from_cmdline()
  File "/home/app/ann_benchmarks/runner.py", line 215, in run_from_cmdline
    run(definition, args.dataset, args.count, args.runs, args.batch)
  File "/home/app/ann_benchmarks/runner.py", line 117, in run
    X_train, X_test = dataset_transform(D)
TypeError: 'dict' object is not callable

It seems like the dictionary structure in the dataset leads to problems when running the framework. Maybe it is an old format that is no longer supported?

Thanks for your help.

Best regards,

Franco Maria

francomarianardini avatar Jun 28 '21 16:06 francomarianardini

Hi @francomarianardini.

I think I need more details to give useful feedback. Could you share the parts of the code before calling write_output?

What comes to mind is that dataset_transform underwent a change recently. Could it be that you created the dataset with an old version, but ran it with a more recent version?
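
To illustrate what I mean, here is a sketch of the mismatch (not the actual ann-benchmarks source): if the installed code still binds dataset_transform to a dict while the call site in runner.py expects a function, you get exactly that TypeError.

# Sketch only; the names mirror runner.py, but the bodies are illustrative.
dataset_transform = {'dense': lambda X: X}   # old style: a mapping of transforms

D = {'train': [[0.0]], 'test': [[0.0]]}      # stand-in for the loaded hdf5 dataset
X_train, X_test = dataset_transform(D)       # new-style call site
# -> TypeError: 'dict' object is not callable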

maumueller avatar Jun 28 '21 18:06 maumueller

Thanks @maumueller for the prompt reply.

We are building the dataset with the code below. As you can see, we are using the write_output() function to write the dataset. The dataset is created using the same version that I then use to run the framework.

What do you think? Thanks,

FM

--

import sys
import numpy
import torch
from ann_benchmarks.datasets import train_test_split, write_output

in_fn = sys.argv[1] # input filename with .pt extension (pytorch)
out_fn = sys.argv[2] # output filename

# vectors should be a numpy array of numpy arrays
vectors = torch.load(in_fn)
vectors = numpy.float32(vectors)
print('dataset size: %9d * %4d' % vectors.shape)

X_train, X_test = train_test_split(vectors, test_size=10000)
write_output(X_train, X_test, out_fn + "_angular.hdf5", 'angular')
write_output(X_train, X_test, out_fn + "_euclidean.hdf5", 'euclidean')
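
For completeness, this is how we sanity-check the generated file afterwards (assuming write_output produces the usual layout with train/test/neighbors/distances datasets and a distance attribute; out_fn is the variable from the script above):

import h5py

# Inspect the generated file; the dataset names below are what we assume
# write_output creates.
with h5py.File(out_fn + "_angular.hdf5", 'r') as f:
    print(dict(f.attrs))                                # e.g. {'distance': 'angular', ...}
    print(f['train'].shape, f['test'].shape)            # train and query vectors
    print(f['neighbors'].shape, f['distances'].shape)   # ground-truth nearest neighbors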

francomarianardini avatar Jun 29 '21 08:06 francomarianardini

Hello @maumueller,

what do you think of the code above? Let me know if you need anything more from my side.

thanks, best,

FM

francomarianardini avatar Jul 05 '21 17:07 francomarianardini

Franco,

sorry for the late reply, I indeed missed your first response.

Running python test.py data/test with

$ cat test.py
import sys
import numpy
from ann_benchmarks.datasets import train_test_split, write_output

out_fn = sys.argv[1] # output filename

# vectors should be a numpy array of numpy arrays
vectors = numpy.array([[1.0 + i for _ in range(100)] for i in range(2000)])
print('dataset size: %9d * %4d' % vectors.shape)

X_train, X_test = train_test_split(vectors, test_size=10)
write_output(X_train, X_test, out_fn + "-angular.hdf5", 'angular')
write_output(X_train, X_test, out_fn + "-euclidean.hdf5", 'euclidean')

produces hdf5 files that work on my setup. Could you check?

There was indeed a bug #251 that I've fixed, but it didn't give me the error message you received above.

maumueller avatar Jul 06 '21 14:07 maumueller

Hello @maumueller, what do you mean by "hdf5 files that work on my setup" exactly?

The above script "test.py" produces correct hdf5 files; however, I was unable to figure out how to run the benchmark on a custom dataset without modifying the code itself.

Is there a better way? Have you tried to actually run the benchmark on such test data after its creation?

I think everyone would expect to run "python3 run.py --dataset [dataset_name]" after creating the hdf5 file in data/, but unfortunately that does not work for me.

Thanks!

jermp avatar Jul 21 '21 16:07 jermp

Technically you are supposed to write your code in https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/datasets.py and add your dataset to DATASETS at the very bottom. Maybe that's the source of the confusion?
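
A sketch of what that registration could look like (the dataset name and the random data are just placeholders for your own loader):

# In ann_benchmarks/datasets.py; train_test_split, write_output and DATASETS
# are already defined in that module.
import numpy

def my_custom(out_fn):
    # Replace the random data with loading your own vectors.
    vectors = numpy.random.rand(2000, 100).astype(numpy.float32)
    X_train, X_test = train_test_split(vectors, test_size=10)
    write_output(X_train, X_test, out_fn, 'angular')

DATASETS['my-custom-angular'] = my_custom

After that, python run.py --dataset my-custom-angular should create and use the file in data/.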

For the test above, I commented out https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/main.py#L60 and the other occurrence to make it load any kind of file.
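
Roughly, that local change looks like this (a sketch; the exact argument definition in main.py may have changed since):

# In ann_benchmarks/main.py; the default value here is illustrative.
parser.add_argument(
    '--dataset',
    metavar='NAME',
    help='the dataset to load training points from',
    default='glove-100-angular',
    # choices=DATASETS.keys(),   # commented out so any hdf5 file in data/ loads
)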

maumueller avatar Jul 21 '21 19:07 maumueller

Ah ok, that's actually what I did. Perhaps you could consider commenting out the code you reference above in the argument parser and just suggest predefined datasets if the user does not provide any. Thanks.

jermp avatar Jul 22 '21 08:07 jermp

But isn't it doing exactly that? It seems to me that if anything should be added, it is some clean documentation of how to add a dataset. :-)

maumueller avatar Jul 22 '21 09:07 maumueller