magpie
Add the ability to train from a Pandas dataframe
Currently, training is done from files by default. But when the training set is small, it is sometimes easier to train directly from a Pandas dataframe. I think it would be helpful to train on small data sets directly from RAM instead of having to create files.
@somnathrakshit thanks for the suggestion! You're right - even though deep learning models usually need vast amounts of data in order to produce good representations, a method for training from RAM would be useful here as well.
I'm not sure if anyone from the CERN team is working on it, though. If you need it and feel like contributing, a PR is more than welcome!
Sure! But I don't have time now. I'll start working on it next month. I may need some guidance from you to understand the code base, though.
Sure thing, feel free to reach out if I can help in any way! :)
@somnathrakshit I actually have a large training set of ~3.8M examples, which amounts to ~7.6M files and ~35GB of data. This makes it nearly impossible to store in memory or even process the documents. It would be amazing to be able to just train from a pandas DF. Have you made any headway here? Can anyone think of alternatives for large datasets?
You can store each document in its own file. In my opinion, this is both the easiest and the cleanest way out. Otherwise, your RAM is going to explode.
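For anyone who starts from a DataFrame, here is a rough sketch of that file-per-document workaround. It assumes the training directory holds paired `.txt` (document text) and `.lab` (one label per line) files, which matches the two-files-per-example count mentioned above, but please check the layout against the README before relying on it. The helper `dump_dataframe` and the column names `text` and `labels` are made up for illustration and are not part of magpie.

```python
import tempfile
from pathlib import Path

import pandas as pd


def dump_dataframe(df, out_dir, text_col="text", labels_col="labels"):
    """Write one .txt/.lab file pair per row of df into out_dir.

    Hypothetical helper: the column names and the .txt/.lab layout
    are assumptions, not something provided by the library.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Iterate row by row so only one document is handled at a time.
    for i, (text, labels) in enumerate(zip(df[text_col], df[labels_col])):
        (out_dir / f"{i}.txt").write_text(str(text), encoding="utf-8")
        # One label per line in the accompanying .lab file.
        (out_dir / f"{i}.lab").write_text("\n".join(labels), encoding="utf-8")
    return len(df)


# Toy data, invented for this example.
df = pd.DataFrame({
    "text": ["Higgs boson decay channels", "Dark matter halo profiles"],
    "labels": [["Experiment-HEP"], ["Astrophysics", "Theory-HEP"]],
})
train_dir = tempfile.mkdtemp()
n_written = dump_dataframe(df, train_dir)
```

After this runs, `train_dir` can be used as the training directory in the usual file-based workflow. For data that does not fit in RAM, the same helper could be called on one chunk at a time (with an index offset so file names do not collide), so the full DataFrame never has to be loaded at once.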