
[RFC] Handling predictions too big for the database

Open aboucaud opened this issue 6 years ago • 0 comments

This is a summary of a discussion we just had with @kegl, on which we'd like comments, opinions, and ideas from @jorisvandenbossche @glemaitre @agramfort.

In the near future, we might be faced with RAMP problems whose target dimension is too big to be handled by the existing workflow without making the database explode. A simple example is an image-to-image workflow. These problems need a huge training / testing sample, making each prediction similarly large (on the order of a few GB), while the current database size is 100 GB.

This brings us down to two options:

  1. modify the database model and migrate it,
  2. find a smart way of storing and scoring the predictions for these specific problems.

We would like for now to avoid option 1 if possible, so here is our take on option 2.

Since the target is a pixel-by-pixel prediction, we would sample the prediction, e.g. take a sub-grid of pixels on which to compute the score. To avoid cheating, we would use a different random sub-grid for the public and the backend datasets. Practically, this would mean creating a specific SamplingScore class that uses a hash of the input dataset as a seed to generate the scoring grid, and then passes the grid to the scoring method via y_pred.
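
To make the idea concrete, here is a minimal sketch of what such a class could look like. The SamplingScore name comes from the proposal above, but everything else (the constructor arguments, the sha256-based seeding, the `__call__` signature, and wrapping an existing `base_score`) is an illustrative assumption, not the actual rampwf scoring API:

```python
import hashlib

import numpy as np


class SamplingScore:
    """Hypothetical wrapper: score an image-to-image prediction on a
    random sub-grid of pixels instead of the full-resolution target.

    The sub-grid is seeded with a hash of the input data, so the
    public and backend datasets yield different, non-guessable grids.
    """

    def __init__(self, base_score, n_pixels=10_000):
        self.base_score = base_score  # e.g. an RMSE-like callable
        self.n_pixels = n_pixels

    def _grid(self, X, image_shape):
        # Derive a deterministic seed from a hash of the input dataset.
        digest = hashlib.sha256(np.ascontiguousarray(X).tobytes()).digest()
        seed = int.from_bytes(digest[:4], "little")
        rng = np.random.RandomState(seed)
        # Draw a fixed-size random sub-grid of pixel indices.
        n_total = image_shape[0] * image_shape[1]
        flat_idx = rng.choice(
            n_total, size=min(self.n_pixels, n_total), replace=False)
        return np.unravel_index(flat_idx, image_shape)

    def __call__(self, X, y_true, y_pred):
        # Evaluate the base score only on the sampled pixels.
        rows, cols = self._grid(X, y_true.shape[1:3])
        return self.base_score(y_true[:, rows, cols], y_pred[:, rows, cols])
```

With this kind of wrapper, only the sampled pixels of each prediction ever need to be compared (and potentially stored), which is what keeps the per-submission footprint small without touching the database model.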

aboucaud, Sep 24 '18 13:09