big-ann-benchmarks
Custom dataset functionality
I need to implement a custom dataset and its handling, and I have been thinking about the easiest way to approach it.
I've implemented a half-way solution that kept me going and allowed me to plug in a custom dataset -- in fact, a dataset derived from BIGANN by reducing its dimensionality with a neural network.
I will show the code I needed to change and am happy to discuss this further!
@DmitryKey I'm not sure where the actual dimensionality reduction happens in #43. It seems you just needed to add the entry to datasets.py, which is the right approach.
What is the pipeline that you had in mind that could improve the process?
@maumueller there is no implementation of the dimensionality-reduction step here -- it is done elsewhere, in a separate neural network produced by a teammate.
To try this new, reduced-dimensionality dataset, I need to treat it as a 7th dataset, if that makes sense, because it has a different dtype than the original (non-reduced) one and a different (lower) number of dimensions.
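To make the idea concrete, here is a minimal sketch of what registering such a variant could look like. All names here (`CustomDataset`, the `DATASETS` dictionary shape, the field names) are assumptions for illustration only, not the framework's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch: a reduced-dimensionality BIGANN variant described
# alongside the built-in datasets. The class and field names are illustrative
# assumptions, not the real big-ann-benchmarks API.
@dataclass
class CustomDataset:
    name: str
    nb_dims: int    # reduced dimensionality, e.g. 64 instead of the original 128
    dtype: str      # may differ from the original dataset's dtype
    base_url: str   # where the dataset files actually live

# Register the new entry next to the existing ones, keyed by a custom name.
DATASETS = {
    "bigann-reduced-64": lambda: CustomDataset(
        name="bigann-reduced-64",
        nb_dims=64,
        dtype="float32",
        base_url="file:///data/bigann-reduced",  # illustrative local path
    ),
}

ds = DATASETS["bigann-reduced-64"]()
```

The point is that the reduced dataset differs from the original in both dtype and dimensionality, so it needs its own entry rather than reusing an existing one.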
So I was thinking that in addition to changing datasets.py, I'd also need to change the I/O, because my dataset can live somewhere else, such as local disk or blob storage.
One other issue I ran into is that I still had to give my dataset a name recognized by the framework -- ideally I would like to control this part as well, but by changing the DATASETS dictionary alone I don't see how that connects to the I/O, e.g. the dataset path.
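One way the naming and I/O concerns could be decoupled is to let each dataset entry carry its own source location, so a custom name resolves to a local directory or a remote URL rather than a hard-coded download root. This is only a sketch under that assumption; the dictionary, function, and URLs below are all hypothetical:

```python
import os
from urllib.parse import urlparse

# Hypothetical sketch: map each dataset name to its own base location, so a
# custom entry can live on local disk or in blob storage. The names and URLs
# are illustrative assumptions, not the framework's real configuration.
DATASET_SOURCES = {
    "bigann-1B": "https://example.com/bigann",        # placeholder remote root
    "bigann-reduced-64": "file:///mnt/data/bigann-reduced",  # local directory
}

def resolve_path(dataset_name: str, filename: str) -> str:
    """Return a fetchable location for one of the dataset's files."""
    base = DATASET_SOURCES[dataset_name]
    parsed = urlparse(base)
    if parsed.scheme == "file":
        # Local file: strip the scheme and join as a filesystem path.
        return os.path.join(parsed.path, filename)
    # Remote source: keep it as a URL for the downloader.
    return f"{base}/{filename}"
```

With something like this, registering a new name in DATASETS and pointing it at a custom location would be a single, self-contained change.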