In-memory large dataframe processing
Metaflow tries to make the life of data scientists easier; this sometimes means providing ways to optimize common but expensive operations. Processing large dataframes in memory can be difficult, and Metaflow could provide ways to do this more efficiently.
@romain-intel is the idea to support this locally or on an AWS instance?
Wondering if the idea is to integrate more tightly with Apache Spark (via pyspark), or to find something like PyTorch's IterableDataset: split loading among workers and have the data loaded at model time. I imagine the difficulty might be in the atomicity of a @step, given that a feature selection & engineering step would be wholly separated from the model step. I know from experience that there are still a lot of pandas fans out there.

Would be curious to hear your thoughts on this.
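For concreteness, here is a rough sketch of what worker-sharded loading could look like today with Metaflow's existing foreach construct; the shard file names are hypothetical placeholders, not anything Metaflow provides.

```python
from metaflow import FlowSpec, step


class ShardedLoadFlow(FlowSpec):
    """Sketch: shard dataframe loading across foreach workers,
    loosely analogous to PyTorch's IterableDataset sharding."""

    @step
    def start(self):
        # Hypothetical shard list; in practice these might be
        # partitioned Parquet files sitting in S3.
        self.shards = ["part-%04d.parquet" % i for i in range(4)]
        self.next(self.load, foreach="shards")

    @step
    def load(self):
        import pandas as pd

        # Each foreach task loads only its own shard, so no single
        # worker has to hold the full dataframe in memory.
        self.df = pd.read_parquet(self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        import pandas as pd

        # Concatenating re-materializes the full frame; a real
        # pipeline might aggregate per shard instead.
        self.df = pd.concat(i.df for i in inputs)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ShardedLoadFlow()
```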
@tduffy000 We have an in-house implementation of dataframe which provides faster primitive operations with a lower memory footprint than Pandas. It is supported both on local instances and in the cloud. You can use this implementation inside a step, or even outside of Metaflow entirely (just like the metaflow.s3 client).
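For readers unfamiliar with it, the metaflow.S3 client mentioned above already works standalone today; a minimal sketch (the bucket path and keys are placeholders):

```python
import pandas as pd
from metaflow import S3

# The S3 client can be used inside a step or in a plain Python
# script, independent of any flow. Paths below are placeholders.
with S3(s3root="s3://my-bucket/dataset/") as s3:
    objs = s3.get_many(["part-0000.parquet", "part-0001.parquet"])
    # Each result exposes a local temp-file path pandas can read;
    # the files are cleaned up when the with-block exits.
    df = pd.concat(pd.read_parquet(obj.path) for obj in objs)
```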
Maybe Metaflow could somehow be combined with Dask, which supports bigger-than-memory dataframes, to solve this issue. I am not sure if/how it would be possible to serialize and restore Dask's big, lazily evaluated dataframes between steps, though.
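One possible workaround for the serialization question, sketched below: materialize at the step boundary by writing the Dask dataframe to Parquet and re-opening it lazily in the next step, so only a path crosses between steps. Paths and the column name are placeholders.

```python
import dask.dataframe as dd

# Step A: build a lazy Dask dataframe, then materialize it to
# Parquet instead of pickling the task graph as an artifact.
df = dd.read_csv("s3://my-bucket/raw/*.csv")   # placeholder path
cleaned = df[df["value"] > 0]                  # placeholder column
cleaned.to_parquet("s3://my-bucket/staged/")   # placeholder path

# Step B: re-open lazily; only the path needs to pass between steps.
staged = dd.read_parquet("s3://my-bucket/staged/")
print(staged.describe().compute())
```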
Maybe something like a dataflow transferred between steps, as in Bonobo.

Also, here is another example of a software product that uses datapickle and Dask to run dataflows on clusters in the cloud.
I think the possibility to use Apache Spark within Metaflow would be extremely useful. When your feature engineering workflow is written in pyspark, it's a pain to translate everything to pandas, and it's hard to predict how well the translation will perform on large datasets.
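Nothing stops a step from creating its own SparkSession today, although that is not the same as first-class support; a rough sketch, assuming pyspark is installed wherever the step runs:

```python
from metaflow import FlowSpec, step


class SparkStepFlow(FlowSpec):

    @step
    def start(self):
        from pyspark.sql import SparkSession

        # local[*] is used purely for illustration; a cluster master
        # would be configured here in practice.
        spark = (SparkSession.builder
                 .master("local[*]")
                 .appName("feature-engineering")
                 .getOrCreate())
        sdf = spark.range(1_000_000).withColumnRenamed("id", "x")
        # Convert only the small aggregate result to pandas before
        # storing it as a Metaflow artifact.
        self.summary = sdf.describe().toPandas()
        spark.stop()
        self.next(self.end)

    @step
    def end(self):
        print(self.summary)


if __name__ == "__main__":
    SparkStepFlow()
```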
Would something like https://vaex.readthedocs.io/en/latest/index.html be a possible solution here?
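For context, vaex memory-maps column-oriented files, so dataframes larger than RAM stay lazy until evaluated; a tiny sketch with a placeholder file and column:

```python
import vaex

# Opening a memory-mapped HDF5/Arrow file is cheap even when the
# file is much larger than RAM; expressions evaluate out-of-core.
df = vaex.open("big_dataset.hdf5")  # placeholder file
print(df["value"].mean())           # placeholder column
```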
@savingoyal any update on releasing the dataframe implementation?
Adding modin as a distributed drop-in replacement for pandas dataframes.
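Modin's pitch is that only the import line changes; a minimal sketch (file and column names are placeholders):

```python
# Drop-in: only the import differs from plain pandas.
import modin.pandas as pd

df = pd.read_parquet("data.parquet")  # placeholder file
print(df.groupby("key").size())       # runs on Ray/Dask workers
```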
Another mention: pandas API on Spark, https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
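Since Spark 3.2 the pandas API ships inside pyspark itself, though it still needs a Spark context underneath; a brief sketch with a placeholder path:

```python
# pandas API on Spark (bundled with pyspark >= 3.2); a Spark
# context is created implicitly underneath.
import pyspark.pandas as ps

psdf = ps.read_parquet("s3://my-bucket/data/")  # placeholder path
print(psdf.head())
```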
Agree the pandas-on-Spark reference by @talebzeghmi would be valuable, but you would still need a Spark context. I think being able to declare that your task runs in AWS Glue would potentially allow for both pandas on Spark and vanilla pyspark within a step.
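To make the idea concrete: Metaflow already routes individual steps to AWS Batch with the @batch decorator, so the Glue proposal might look analogous. The @glue decorator below is purely hypothetical and does not exist in Metaflow today.

```python
from metaflow import FlowSpec, batch, step


class GlueIdeaFlow(FlowSpec):

    # Real today: @batch ships this step to AWS Batch.
    # The proposal would be an analogous, purely hypothetical
    # decorator, e.g. @glue(workers=10), handing the step a
    # Spark context managed by AWS Glue.
    @batch(cpu=4, memory=16000)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    GlueIdeaFlow()
```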
@savingoyal any update on releasing the dataframe implementation?
Still interested! Would appreciate any update, especially if it's "yeah, we're not going to do this in the foreseeable future after all".