
many-examples: remove kaggle dependency

Open alexcg1 opened this issue 4 years ago • 4 comments

As discussed in various meetings with @lusloher , @aga11313 , @FionnD

Kaggle makes a user jump through a lot of hoops just to get an example working: install it, set up a key, run the data getter script.

It's also work for us: We have to ensure datasets haven't moved or changed a lot, and we sometimes have to perform extra steps to process them.

These datasets are generally under Creative Commons licenses or similar. There's no reason why we can't:

  • Download a subset for example purposes (this keeps things light)
  • Process that subset ourselves (saves users time and effort)
  • Store it either in data/ (for light stuff like text which can go directly in repo) or use get_data.sh to download from somewhere we control (for larger stuff like images)
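As a rough illustration of the first and last steps above, here is a minimal sketch that cuts a CSV dataset down to its first rows and writes the result under data/. The function name `write_subset`, the file paths, and the CSV format are assumptions made for the example; none of the affected examples are confirmed to work this way.

```python
import csv
from itertools import islice
from pathlib import Path


def write_subset(source: Path, target: Path, max_rows: int = 100) -> int:
    """Copy the header plus the first `max_rows` data rows of a CSV file.

    Hypothetical helper for illustration only. Returns the number of
    data rows written (the header is not counted).
    """
    # Create the target directory (e.g. data/) if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with source.open(newline="", encoding="utf-8") as src, \
            target.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty source file, nothing to copy
        writer.writerow(header)
        written = 0
        # islice stops after max_rows rows without reading the whole file.
        for row in islice(reader, max_rows):
            writer.writerow(row)
            written += 1
    return written
```

A maintainer could run something like `write_subset(Path("full_dataset.csv"), Path("data/subset.csv"), max_rows=500)` once and commit the output, so users never need to touch the original source. The file names here are placeholders.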

Affected examples

  • [ ] wikipedia-sentences
  • [ ] multires-lyrics-search
  • [ ] cross-modal-search
  • [ ] query-while-indexing

alexcg1, May 04 '21 10:05

Thanks for creating the issue Alex!

Just to clarify for any engineer: ⚠️ This issue should not be worked on until https://github.com/jina-ai/examples/issues/447 and https://github.com/jina-ai/examples/issues/512 are completed. ⚠️

FionnD, May 04 '21 11:05

audio-search no longer has a dependency on Kaggle.

nan-wang, May 16 '21 14:05

Where could we store the example data? Do we have "somewhere we control" to download from?

jakobkruse1, Aug 05 '21 13:08

I propose using Hugging Face datasets where possible. They are extremely easy to use, and very performant too.

tadejsv, Aug 05 '21 13:08