vaex icon indicating copy to clipboard operation
vaex copied to clipboard

Feature: Tensorflow Integration

Open JovanVeljanoski opened this issue 4 years ago • 4 comments

This PR realises #616 The idea is enable vaex DataFrames to be used as input data sources for tensorflow (tf) data sources.

This is made possible in large part thanks to tensorflow-io via arrow.

  • [x] A method to convert vaex DataFrames to tf datasets, to be used as input for tensorflow Estimators. This is now possible via df.ml.tensorflow.to_dataset()
  • [x] A convenience function meant as a direct data source for tf Estimators. Note that tf Estimator consume an input function that returns a dataset object, rather than the dataset object itself.
  • [x] Enabling the to_dataset() method do make the dataset compatible with Keras, so that one can use the dataset as an input for the Keras model .fit() method.
  • [ ] A high level wrapper for the tf Estimators, which will make them easily consumable on the vaex side, while also turning them into vaex transformers, and also serialise them, and make them compatible with the state transfer system.
  • [ ] A high level wrapper that will turn a pre-defined keras model into a vaex transformer.
  • [ ] Unit tests (providing enough coverage & passing)
  • [ ] All new methods have helpful docstrings
  • [ ] Update the CHANGELOG when/if this PR is considered to be accepted
  • [ ] Code review approval

Important notes:

  • It is not readily obvious what the API of the high level wrapper for the tf Estimators should be. Typically, such wrappers need a features argument, used to select which features/columns of the vaex DataFrame should be used. To instantiate tf Estimators one need to provide one or more (in the case of DNNLinearCombinedEstimator) lists of feature_columns which give more detail to tf about the features, i.e. options on whether the features and numerical, categorical, providing various options of handing categorical variables, as well as options for engineering new features from the primary ones. In addition, some tf Estimators require various kwargs to be instantiated. Thus, over the next few days I will add some example code as comments here, from the user side, so we can decide on the type of API we should go with.
  • The test currently named test_to_dataset_keras_model_train that tests the successful passing of a vaex DataFrame to keras model via a tf dataset fails with a segmentation fault. At this point, I do not know whether this arises due to problems in vaex, tensorflow, tensorflow-io, or the code committed in this PR. However, my independent testing have also found other segmentation faults happening when making an input function or a dataset to be used for tf Estimators. Thus it is important that we investigate this issue.
  • At this point in time, we are using tensorflow-io.arrow to pass vaex data to keras, but this is not strictly necessary. It might be useful to investigate performance when using the tensorflow-io.arrow connection vs simply using vaex iterators to pass numpy arrays to the keras models .fit method, since it can consume generators that return numpy arrays,

JovanVeljanoski avatar Mar 01 '20 01:03 JovanVeljanoski

Substantial update to this PR. Now we have support for multi-class classification and multi-target regression. One can pass to_dataset() datasets to both tf estimators and keras models. For keras models, one can construct the dataset as dicts (similar as in the case of the tf estimators), or as tensors.

For passing data to Keras models, now we can also just use vaex as the main generator, no need to go through tensorflow-io. (I have not checked which one is faster).

For comprehensive examples on how all this works, please see the unit-tests. There are now more detailed unit tests, which test both the creation of the data sources, and how they are passed on to models as well.

JovanVeljanoski avatar Mar 04 '20 15:03 JovanVeljanoski

This would be such a killer feature! Any news on this?

solalatus avatar Jun 30 '20 13:06 solalatus

Hi @solalatus

We are happy you think so!

Yeah, it is ready. Perhaps some cleanup is needed. We are doing some internal testing on big-ish data together with an involved (vaex based) pre-processing pipeline to make sure everything will actually work as intended.

If you are comfortable with a dev install you can try this branch. Otherwise, if all goes well I estimate 1-2 weeks before it is merged and released.

JovanVeljanoski avatar Jun 30 '20 15:06 JovanVeljanoski

Marvelous news! I will be talking about data iterators soon in a teaching context, so I am now more confident to hint at Vaex! Keep pushing guys, it is cool stuff!

solalatus avatar Jun 30 '20 15:06 solalatus