machinelearning
Fit() method should be optimized when used with DatabaseLoader or IEnumerable (with SqlDataReader)
Is your feature request related to a problem? Please describe.
The Fit() method is not optimized when used with DatabaseLoader or IEnumerable (SqlDataReader with yield return), and this causes delays in processing.
The training data that our ML.NET (v1.7) project refers to resides in a SQL Server database, and I do not want to load the entire dataset into memory for fitting. In order to fetch this data while training, I have tried 2 ways:
- using Databaseloader which has a stored procedure as command text
- using IEnumerable which calls sqldatareader(with yield return)(this also calls a stored procedure) and then creating IDataView using LoadFromEnumerable()
The training data has 6 columns, which are transformed as follows:
- FeaturizeText is called for 4 columns
- OneHotEncoding is called for 2 columns
- Concatenation of all the transformed columns above
Issue
In both of the above ways of fetching, the underlying stored procedure is called 15 times during each Fit() call. I observed that the calls are due to the transforms being applied: if I reduce the number of transforms, the number of calls to the stored procedure decreases accordingly. This causes the Fit() method to take a considerable amount of time.
Describe the solution you'd like
1. Ideally, the data should be fetched just once and all preprocessing done on the fetched data.
2. Please share a sample that uses IEnumerable with SqlDataReader, as this is my preferred approach; I would compare it with my implementation.
Describe alternatives you've considered None
Additional context None
Can you share the pipeline you are using? The number of times it gets the data is going to directly relate to the transforms (as you mentioned) and if the transform itself needs to access all the data. If the transform does need to access all the data, or loop through the data multiple times, it may end up calling your stored procedure multiple times. There is no way for us to know if 15 is too many calls without seeing the actual pipeline. With the pipeline though, we can look more into it.
One workaround would be to split your pipeline into 2 different pipelines, one for data prep and one for the actual training. Once the first pipeline is done, you can either cache the transformed data (though this would keep it all in memory), or write the transformed data out to a binary file and then load that back in for the actual training pipeline.
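A rough sketch of that two-pipeline workaround, assuming the `mlContext`, `dataProcessPipeline`, `trainer`, and `trainingDataView` names used later in this thread (the file name is a placeholder):

```csharp
// Sketch only: requires the Microsoft.ML NuGet package;
// MultiFileSource lives in Microsoft.ML.Data.

// 1) Fit the data-prep pipeline once and get a transformed view.
//    Note: IDataView is lazy, so the database is actually read when
//    the data is consumed below.
ITransformer dataPrepModel = dataProcessPipeline.Fit(trainingDataView);
IDataView transformedData = dataPrepModel.Transform(trainingDataView);

// 2a) Either cache the transformed rows in memory...
IDataView cachedData = mlContext.Data.Cache(transformedData);

// 2b) ...or materialize them to an IDV binary file and load that back,
//     so the training pipeline never touches the database again.
using (var stream = File.Create("transformed.idv"))
    mlContext.Data.SaveAsBinary(transformedData, stream);
IDataView reloaded = mlContext.Data.LoadFromBinary(new MultiFileSource("transformed.idv"));

// 3) Train on the materialized data (pick cachedData or reloaded).
ITransformer model = trainer.Fit(reloaded);
```

With option 2b the stored procedure should be hit only during the data-prep passes, since the trainer then reads rows from the local IDV file.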
Thanks! Appreciate the prompt response.
Here's the code snippet:
var trainingDataView = mlContext.Data.LoadFromEnumerable(ienumerableData); // ienumerableData is returned by a method that calls the stored procedure via SqlDataReader (with yield return)
IEstimator<ITransformer> dataProcessPipeline =
    mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Col1Featurized", inputColumnName: "Col1")
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Col2Featurized", inputColumnName: "Col2"))
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Col3Featurized", inputColumnName: "Col3"))
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Col4Featurized", inputColumnName: "Col4"))
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Col5Featurized", inputColumnName: "Col5"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding(new InputOutputColumnPair[] { new InputOutputColumnPair("Col6Encoded", "Col6") }))
    .Append(mlContext.Transforms.Concatenate(outputColumnName: "Features", "Col1Featurized", "Col2Featurized", "Col3Featurized", "Col4Featurized", "Col5Featurized", "Col6Encoded"));
// we now have 5 featurized columns and 1 encoded
var trainer = mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features");
var trainingPipeline = dataProcessPipeline.Append(trainer);
trainedModel = trainingPipeline.Fit(trainingDataView);
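For context, the `ienumerableData` sequence above is produced roughly along these lines (a simplified sketch; the connection string, stored-procedure name, row class, and column types are placeholders, not the actual implementation):

```csharp
// Hypothetical row type matching the stored procedure's result set.
public class TrainingRow
{
    public string Col1; public string Col2; public string Col3;
    public string Col4; public string Col5; public string Col6;
    public bool Label;
}

// Streams rows from SQL Server one at a time via yield return.
// Important: every enumeration of this sequence re-executes the
// stored procedure, which is why multi-pass transforms re-query
// the database.
public static IEnumerable<TrainingRow> ReadRows(string connectionString)
{
    using var connection = new SqlConnection(connectionString);
    using var command = new SqlCommand("dbo.GetTrainingData", connection)
    {
        CommandType = CommandType.StoredProcedure
    };
    connection.Open();
    using var reader = command.ExecuteReader();
    while (reader.Read())
    {
        yield return new TrainingRow
        {
            Col1 = reader.GetString(0),
            Col2 = reader.GetString(1),
            Col3 = reader.GetString(2),
            Col4 = reader.GetString(3),
            Col5 = reader.GetString(4),
            Col6 = reader.GetString(5),
            Label = reader.GetBoolean(6),
        };
    }
}
```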
Hello, do you have any update on this?
Hi, can you give some inputs to optimize this?
Hi, are there any updates on this?
You said that you don't want to load all the data into memory for fitting, which I completely understand. But you also said that you would expect the data to be fetched only once and then have all the pre-processing done on that data. For that to happen, the data would have to be stored in memory once the pre-processing is done (or maybe right after you have fetched it from the database). Otherwise, any time the data is needed again (say, for looping over the data, which may happen depending on the transformers you pick), it will have to be fetched again and the transforms re-run on it. Would caching the data locally before the Fit() call be OK in this case?
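Local caching before the fit can be sketched with ML.NET's cache APIs, reusing the `mlContext`, `ienumerableData`, `dataProcessPipeline`, and `trainer` names from the snippet above. The first full pass still executes the stored procedure once; later passes read cached rows from memory:

```csharp
// Sketch only: requires the Microsoft.ML NuGet package.

// Option 1: wrap the streaming IDataView in an in-memory cache so
// repeated passes re-read cached rows instead of re-executing the
// stored procedure.
IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(ienumerableData);
IDataView cachedView = mlContext.Data.Cache(trainingDataView);

// Option 2: insert a cache checkpoint inside the pipeline so that
// everything upstream of the checkpoint runs only once, and the
// trainer's multiple passes hit the cache instead of the database.
IEstimator<ITransformer> pipelineWithCache =
    dataProcessPipeline
        .AppendCacheCheckpoint(mlContext)
        .Append(trainer);

var trainedModel = pipelineWithCache.Fit(trainingDataView);
```

Note the memory trade-off: both options keep the cached rows in memory, which is exactly the cost being weighed in the comment above.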
Have you tried seeing what happens if you have the same number of transformers but a different number of columns? I'm wondering if it's being called extra times based on the number of columns you have (I haven't verified that; it's just a guess).
@luisquintanilla I haven't done a ton of investigation into this yet, but we probably have a chance to optimize the DB code a fair amount. It will need more investigation to figure out exactly what's going on, but if 7 transformers and FastTree are calling the stored procedure 15 times, it seems like something isn't quite right there.