ann-benchmarks
new dataset generated with write_output()
Hello,
Thanks for your framework, it is very interesting!
We want to define a new dataset, so we use your "write_output()" function defined in ann_benchmarks/datasets.py to convert numpy arrays into an HDF5 file. After building the data files, the framework fails when testing the algorithms on it. I get:
Trying to instantiate ann_benchmarks.algorithms.n2.N2(['angular', {'M': 36, 'efConstruction': 500}])
got a train set of size (1000 * 128)
got 1000 queries
Traceback (most recent call last):
File "run_algorithm.py", line 3, in
It seems like the dictionary structure in the dataset leads to problems when running the framework. Maybe it is an old format that is no longer supported?
thanks for your help.
best regards,
Franco Maria
Hi @francomarianardini.
I need a few more details to give useful feedback. Could you share the parts of your code that run before the call to write_output?
What comes to mind is that dataset_transform underwent a change recently. Could it be that you created the dataset with an old version, but ran it with a more recent version?
Thanks @maumueller for the prompt reply.
We are building the dataset with the code below. As you can see, we use the write_output() method to write the dataset. The dataset is created with the same version of the framework that I then use to run it.
What do you think? Thanks,
FM
--
import sys
import numpy
import torch
from ann_benchmarks.datasets import train_test_split, write_output
in_fn = sys.argv[1] # input filename with .pt extension (pytorch)
out_fn = sys.argv[2] # output filename
# vectors should be a numpy array of numpy arrays
vectors = torch.load(in_fn)
vectors = numpy.float32(vectors)
print('dataset size: %9d * %4d' % vectors.shape)
X_train, X_test = train_test_split(vectors, test_size=10000)
write_output(X_train, X_test, out_fn + "_angular.hdf5", 'angular')
write_output(X_train, X_test, out_fn + "_euclidean.hdf5", 'euclidean')
Hello @maumueller,
What do you think of the code above? Let me know if you need anything more from my side.
thanks, best,
FM
Franco,
sorry for the late reply, I indeed missed your first response.
Running python test.py data/test with
$ cat test.py
import sys
import numpy
from ann_benchmarks.datasets import train_test_split, write_output
out_fn = sys.argv[1] # output filename
# vectors should be a numpy array of numpy arrays
vectors = numpy.array([[1.0 + i for _ in range(100)] for i in range(2000)])
print('dataset size: %9d * %4d' % vectors.shape)
X_train, X_test = train_test_split(vectors, test_size=10)
write_output(X_train, X_test, out_fn + "-angular.hdf5", 'angular')
write_output(X_train, X_test, out_fn + "-euclidean.hdf5", 'euclidean')
produces hdf5 files that work on my setup. Could you check?
There was indeed a bug (#251) that I've fixed, but it didn't produce the error message you received above.
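If it helps with checking, here is a minimal sketch of how to inspect a generated file with h5py, assuming the attribute and dataset names that write_output uses (the 'distance' attribute plus the 'train', 'test', 'neighbors' and 'distances' datasets); the script name is hypothetical:
$ cat inspect_dataset.py
import sys
import h5py
# print the distance attribute and the shape/dtype of each dataset in the file
f = h5py.File(sys.argv[1], 'r')
print('distance:', f.attrs['distance'])
for name in ('train', 'test', 'neighbors', 'distances'):
    print(name, f[name].shape, f[name].dtype)
f.close()
Running python inspect_dataset.py data/test-angular.hdf5 should show the metric you passed to write_output and a (10, 100) test set for the toy data above.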
Hello @maumueller, what do you mean by "hdf5 files that work on my setup" exactly?
The above script "test.py" produces correct hdf5 files; however, I was unable to figure out how to run the benchmark on a custom dataset without modifying the code itself.
Is there a better way? Have you tried to actually run the benchmark on such test data after its creation?
I think everyone would expect to run "python3 run.py --dataset [dataset_name]" after creating the hdf5 file in data/ but unfortunately that does not work for me.
Thanks!
Technically you are supposed to write your code in https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/datasets.py and add your dataset to DATASETS at the very bottom. Maybe that's the source of the confusion?
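For illustration, a minimal sketch of such an entry, added at the bottom of ann_benchmarks/datasets.py where numpy, train_test_split and write_output are already in scope (the dataset name and the .npy path below are hypothetical):
def my_custom_angular(out_fn):
    # load your vectors from wherever they live and convert them to float32
    vectors = numpy.load('my_vectors.npy').astype(numpy.float32)
    X_train, X_test = train_test_split(vectors, test_size=10000)
    write_output(X_train, X_test, out_fn, 'angular')
DATASETS['my-custom-angular'] = my_custom_angular
python run.py --dataset my-custom-angular should then pick it up like any of the predefined datasets.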
For the test above, I commented out https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/main.py#L60 and the other occurrence to make it load any kind of file.
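For reference, the restriction comes from the --dataset argument being declared with choices=DATASETS.keys(); the declaration looks roughly like this (paraphrased, the exact layout may differ between versions):
parser.add_argument(
    '--dataset',
    metavar='NAME',
    help='the dataset to load training points from',
    default='glove-100-angular',
    choices=DATASETS.keys())
Removing the choices=DATASETS.keys() part here and in the other occurrence should let run.py accept any dataset name for which a matching hdf5 file already exists in data/.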
Ah ok, that's actually what I did. Perhaps you could consider commenting out the code you reference above in the argument parser and just suggesting the predefined datasets if the user does not provide any. Thanks.
But isn't it doing exactly that? It seems to me that if anything should be added, it's some clean documentation of how to add a dataset. :-)