
Custom dataset functionality

Open DmitryKey opened this issue 4 years ago • 2 comments

I need to implement a custom dataset and its handling, and I've been thinking about the easiest way to approach it.

I've implemented a half-way solution that kept me going and allowed me to plug in a custom dataset -- in fact, it is a dataset derived from BIGANN by reducing its dimensionality with a neural network.

I will show the code of what I needed to change, and I'm happy to discuss this further!

DmitryKey avatar Sep 30 '21 09:09 DmitryKey

@DmitryKey I'm not sure where the actual dimensionality reduction happens in #43. It seems that you just needed to add the entry to datasets.py, which is the right approach.
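For illustration, registering such a dataset could look roughly like this (the class name, constructor arguments, and attributes below are hypothetical, not the framework's actual API):

```python
# Hypothetical sketch of registering a custom entry in datasets.py.
# ReducedBigANNDataset and its parameters are illustrative only.

class ReducedBigANNDataset:
    """BIGANN-derived dataset with neural-network-reduced dimensionality."""

    def __init__(self, nb_M=1000, d=64):
        self.nb = nb_M * 10**6      # number of base vectors
        self.d = d                  # reduced dimensionality
        self.dtype = "float32"      # differs from the original BIGANN uint8

# Registry entries are zero-argument callables, so a dataset is only
# constructed lazily when selected by name.
DATASETS = {
    "bigann-reduced-1B": lambda: ReducedBigANNDataset(nb_M=1000, d=64),
}

ds = DATASETS["bigann-reduced-1B"]()
```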

What is the pipeline that you had in mind that could improve the process?

maumueller avatar Oct 04 '21 07:10 maumueller

@maumueller there is no implementation of the dimensionality reduction step -- it is done elsewhere, in a separate neural network built by a teammate.

In order to try this new reduced-dimensionality dataset, I need to treat it as a 7th dataset, if that makes sense, because it will have a different dtype than the original (non-reduced) one and a different (lower) number of dimensions.
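As a concrete sketch of what "different dtype and dimensions" means in practice, a reader that takes the dtype as a parameter could look like this. The binary layout assumed here -- an 8-byte header of two little-endian uint32 counts (npoints, ndims) followed by row-major vector data -- reflects my understanding of the big-ann file format; treat it as an assumption:

```python
import struct
import tempfile

import numpy as np

def read_bin(path, dtype):
    """Read a big-ann-style binary file with a caller-specified dtype.

    Assumed layout: two little-endian uint32 header fields
    (npoints, ndims), then the row-major vector data.
    """
    with open(path, "rb") as f:
        npts, ndims = struct.unpack("<II", f.read(8))
        return np.fromfile(f, dtype=dtype, count=npts * ndims).reshape(npts, ndims)

# Demo: write a tiny 3x2 float32 file (the reduced dataset would use
# float32 instead of BIGANN's uint8) and read it back.
vecs = np.arange(6, dtype=np.float32).reshape(3, 2)
with tempfile.NamedTemporaryFile(suffix=".fbin", delete=False) as f:
    f.write(struct.pack("<II", *vecs.shape))
    f.write(vecs.tobytes())
    tmp_path = f.name

loaded = read_bin(tmp_path, np.float32)
```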

So I was thinking that, in addition to changing datasets.py, I'd need to change the I/O, because my dataset can live somewhere else, such as local disk or blob storage. Another issue I ran into was that I still had to give my dataset a name recognized by the framework -- ideally I would like to control this part as well, but just by changing the DATASETS dictionary I don't see how that connects to the I/O, e.g. the dataset path.
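Roughly what I mean by controlling the I/O, as a hypothetical sketch (the class and method names here are made up for illustration, not the framework's actual interface):

```python
import os
import tempfile

class LocalDiskDataset:
    """Sketch of a dataset whose files live outside the framework's
    default download directory. 'basedir' and the method names are
    illustrative assumptions, not an existing interface."""

    def __init__(self, basedir, fname):
        self.basedir = basedir
        self.fname = fname

    def get_dataset_fn(self):
        # Resolve the on-disk path directly, instead of deriving it
        # from the registered dataset name.
        return os.path.join(self.basedir, self.fname)

    def prepare(self):
        # No download step: just verify the file is already present.
        if not os.path.exists(self.get_dataset_fn()):
            raise FileNotFoundError(self.get_dataset_fn())

# Demo with a temporary directory standing in for local storage.
tmpdir = tempfile.mkdtemp()
open(os.path.join(tmpdir, "reduced.fbin"), "wb").close()
ds = LocalDiskDataset(tmpdir, "reduced.fbin")
ds.prepare()  # succeeds because the file exists
```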

DmitryKey avatar Oct 05 '21 14:10 DmitryKey