Add the ability to train from a Pandas dataframe

Open somnathrakshit opened this issue 7 years ago • 5 comments

Currently, training is done from files by default. But when the training set is small, it is sometimes easier to train directly from a Pandas dataframe. I think it would be helpful to train on small data sets directly from RAM instead of having to create files.

somnathrakshit avatar Jan 23 '18 13:01 somnathrakshit

@somnathrakshit thanks for the suggestion! You're right - even though deep learning models usually need vast amounts of data in order to produce good representations, a method for training from RAM would be useful here as well.

I'm not sure if anyone from the CERN team is working on it, though. If you need it and feel like contributing, a PR is more than welcome!

jstypka avatar Jan 23 '18 15:01 jstypka

Sure! But I don't have time now. I'll start working on it from next month. But I may need some guidance from you in order to understand the code base.

somnathrakshit avatar Jan 23 '18 17:01 somnathrakshit

Sure thing, feel free to reach out if I can help in any way! :)

jstypka avatar Jan 23 '18 18:01 jstypka

@somnathrakshit I actually have a large training set of ~3.8M examples, which amounts to ~7.6M files and ~35GB of data. This makes it nearly impossible to store in memory or even process the documents. It would be amazing to be able to just train from a pandas DF. Have you made any headway here? Can anyone think of alternatives for large datasets?

dorg-ekrolewicz avatar Aug 29 '18 23:08 dorg-ekrolewicz

You can store each document in a file. In my opinion, this is the easiest and cleanest way out. Otherwise, your RAM is going to explode.
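
For anyone starting from a dataframe, here is a minimal sketch of the one-file-per-document approach. It is not part of magpie: the helper `dataframe_to_files`, the column names `text` and `labels`, and the `.txt`/`.lab` file pairing are all assumptions made for illustration, so adjust them to whatever layout your file-based training expects.

```python
import tempfile
from pathlib import Path

import pandas as pd


def dataframe_to_files(df, out_dir, text_col="text", labels_col="labels"):
    """Write one text file and one label file per dataframe row.

    Hypothetical helper: the .txt/.lab naming and the column names
    are assumptions, not something the library is known to require.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for idx, row in df.iterrows():
        # The document body goes into <index>.txt
        (out / f"{idx}.txt").write_text(str(row[text_col]), encoding="utf-8")
        # The labels go into <index>.lab, one label per line
        (out / f"{idx}.lab").write_text("\n".join(row[labels_col]), encoding="utf-8")
    return out


# Tiny made-up frame, only to show the resulting layout
df = pd.DataFrame(
    {
        "text": ["first document", "second document"],
        "labels": [["physics", "hep"], ["astro"]],
    }
)

out_dir = dataframe_to_files(df, tempfile.mkdtemp())
written = sorted(p.name for p in out_dir.iterdir())
print(written)
```

For a dataset too large to hold in one dataframe, the same helper could be called once per chunk (for example, chunks read with `pd.read_csv(..., chunksize=...)`), provided the index values stay unique across chunks so that files are not overwritten.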

somnathrakshit avatar Sep 05 '18 06:09 somnathrakshit