
Add ability to concatenate 2 IDataViews.

Open torronen opened this issue 2 years ago • 8 comments

I have:

  • IDataView trainingData
  • IDataView testData

I want to combine trainingData and testData, for example: IDataView combinedData = trainingData + testData;

Finally, I want to retrain the model with it. Q1. Is there a way to combine multiple IDataViews without converting to DataFrame?

Q2. What is the syntax for combining IDataViews converted to DataFrames? I am trying to do something like this:

DataFrame dfTrain = trainingData.ToDataFrame(-1);
DataFrame df2 = testdata.ToDataFrame(-1);
df.Add(df2.Rows);

The sample in machinelearning-samples re-reads the data from a file source. However, the data might not always be in a file, or the file may be very big.

torronen avatar Mar 20 '22 14:03 torronen

One way to make file reading faster is to use binary IDV files.

However, this also fails: var combinedDataView = mlContext.Data.LoadFromBinary(new MultiFileSource(TrainDataPath + ".idv", TestDataPath + ".idv")); It throws 'binary loader must be created with one file Arg_ParamName_Name'.

Q 3: How to read and combine multiple IDV files?

torronen avatar Mar 21 '22 07:03 torronen

Q1. Is there a way to combine multiple IDataViews without converting to DataFrame?

I'm not actually sure about that. Let me take a look and see. @luisquintanilla are you aware of any way to do it? If not, we should file a feature request for it.

Q2. What is the syntax for combining IDataViews converted to DataFrames? I am trying to do something like this:

You should just be able to do var newDf = df.Append(df2.Rows); By default the Append call returns a DataFrame allocated with new memory. If you set inPlace to true, though, it will update the original one.

An IDataView lazily loads the data when it's asked for it and doesn't keep it in memory unless it's specifically cached. That's why you can stream datasets that don't fit in memory using it. That, currently, is not true for DataFrames. The data needs to fit into memory, so this approach will only work when the data is small enough to do that.
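As a rough illustration of the DataFrame route described above, here is a minimal sketch. It assumes the Microsoft.Data.Analysis package is referenced; the class and method names (DataFrameAppendSketch, Combine) are made up for this example, and both inputs are assumed to have the same columns and to fit in memory.

```csharp
using Microsoft.Data.Analysis;
using Microsoft.ML;

// Hypothetical helper, not part of ML.NET.
public static class DataFrameAppendSketch
{
    // Materializes both views as DataFrames and appends the rows of the second
    // to a copy of the first. Everything is held in memory at once.
    public static DataFrame Combine(IDataView first, IDataView second)
    {
        // -1 (as in the question) is passed to request all rows rather than
        // the default row cap.
        DataFrame left = first.ToDataFrame(-1);
        DataFrame right = second.ToDataFrame(-1);

        // With the default inPlace: false, Append returns a new DataFrame
        // and leaves 'left' untouched.
        return left.Append(right.Rows);
    }
}
```

Since DataFrame implements IDataView, the result of Combine can be passed wherever an IDataView is expected, for example to Fit when retraining.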

Q 3: How to read and combine multiple IDV files?

This is the same as answer 1. Since the IDataViews are lazy, as long as we have a way to combine them, then reading them in and concatenating them would be the same for the processor as reading them while combining them. I'm still checking whether we have a way to combine them (you are the first person I remember asking me this).

michaelgsharp avatar Mar 23 '22 00:03 michaelgsharp

Ok, one workaround for now is to turn the 2 IDataViews into IEnumerables, combine them with LINQ, and then load that IEnumerable into a new data view. It works, but it's not really straightforward. You also need to create the schema as an actual class first.

var enumerable1 = mlContext.Data.CreateEnumerable<Type>(IDV1, false);
var enumerable2 = mlContext.Data.CreateEnumerable<Type>(IDV2, false);

var enumerable3 = enumerable1.Concat(enumerable2);
var combinedIDV = mlContext.Data.LoadFromEnumerable(enumerable3);
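For anyone who wants a fuller picture, the workaround above might be wrapped up roughly as follows. This is only a sketch: SampleRow, DataViewConcat, ConcatRows and Demo are hypothetical names invented for the example, and the schema class has to be replaced with one that matches your own columns.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical schema class; replace with one matching your data.
public class SampleRow
{
    public float Feature { get; set; }
    public bool Label { get; set; }
}

// Hypothetical helper, not part of ML.NET.
public static class DataViewConcat
{
    // Concatenates the rows of several IDataViews that all map onto TRow.
    public static IDataView ConcatRows<TRow>(MLContext mlContext, params IDataView[] views)
        where TRow : class, new()
    {
        // CreateEnumerable is lazy, so rows are only pulled from each source
        // view when the combined view is actually enumerated.
        IEnumerable<TRow> combined = views.SelectMany(
            view => mlContext.Data.CreateEnumerable<TRow>(view, reuseRowObject: false));

        return mlContext.Data.LoadFromEnumerable(combined);
    }

    // Small in-memory demonstration: returns the row count of the combined view.
    public static int Demo()
    {
        var mlContext = new MLContext();

        IDataView train = mlContext.Data.LoadFromEnumerable(new[]
        {
            new SampleRow { Feature = 1f, Label = true },
            new SampleRow { Feature = 2f, Label = false },
        });
        IDataView test = mlContext.Data.LoadFromEnumerable(new[]
        {
            new SampleRow { Feature = 3f, Label = true },
            new SampleRow { Feature = 4f, Label = false },
        });

        IDataView all = ConcatRows<SampleRow>(mlContext, train, test);
        return mlContext.Data.CreateEnumerable<SampleRow>(all, reuseRowObject: false).Count();
    }
}
```

Only columns that have a matching member on the schema class survive the round trip, so any extra columns in the source views are dropped by this approach.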

michaelgsharp avatar Mar 23 '22 00:03 michaelgsharp

Thanks for looking into this @michaelgsharp . I looked into it a bit and didn't find an ML.NET Transform that achieves this task.

@torronen I think Michael's solution to concat two IEnumerables is probably the "simplest" way of doing it without using DataFrames. This also in part solves the problem of data being potentially too big which DataFrames may have trouble handling.

luisquintanilla avatar Mar 29 '22 17:03 luisquintanilla

So @torronen @luisquintanilla I was looking into something completely different and actually realized that we do have code that does this already in ML.NET. It works as long as all the dataviews share the same schema anyways. BUT, we don't have it exposed publicly and only use it internally in a few places. https://github.com/dotnet/machinelearning/blob/510f0112d4fbb4d3ee233b9ca95c83fae1f9da91/src/Microsoft.ML.Data/DataView/AppendRowsDataView.cs

It is currently not set up as a transformer, it is just a utility method that does it. If we just want to expose the utility method, it is a trivial amount of work (we just have to figure out exactly how we want to represent the API, since basically everything we have is a transformer and this is not). If we want to make it an actual transformer that you could throw into a pipeline, it would take more work, but not a lot, as the core code is already implemented.

I think @luisquintanilla that we should for sure expose it as just a utility method. I think that it would be used more as a utility method than something people want in the pipeline personally. If we want to do that, I think we could even move it up and get it in the next release instead of just "Future" as the work required should be pretty small. Thoughts?

michaelgsharp avatar May 03 '22 19:05 michaelgsharp

Thanks @michaelgsharp for the additional investigation. @torronen I'm curious if in your testing, you tried loading multiple binary files using wildcards like the snippet in this article describes.

https://docs.microsoft.com/dotnet/machine-learning/how-to-guides/load-data-ml-net#load-data-from-multiple-files

If so, what limitations did you run into there?

luisquintanilla avatar May 03 '22 20:05 luisquintanilla

@luisquintanilla I don't recall trying a wildcard. I only tried with a list of filenames as parameters: https://github.com/dotnet/machinelearning/issues/6134#issuecomment-1073554230 I will put it on my task list to test, but the error message would point to only one file being allowed.

torronen avatar May 03 '22 21:05 torronen

+1 from me on this one!

beccamc avatar May 13 '22 20:05 beccamc

@luisquintanilla Sorry for the long delay, I was busy with other projects. This seems to be already tracked in todo, but I thought I would still post an update as I am working on this part.

Wildcard does not work for LoadFromBinary, at least not on Linux, and given the error message I would not expect it to work on Windows either. I may try again on Windows later just in case.

Microsoft.ML is from the daily NuGet preview feed.

Ubuntu 22.04: var multisource = new MultiFileSource(trainingDataFiles.ToArray()); data = ctx.Data.LoadFromBinary(multisource);

Unhandled exception. System.ArgumentOutOfRangeException: binary loader must be created with one file (Parameter 'files')
   at Microsoft.ML.Runtime.Contracts.CheckParam(Boolean f, String paramName, String msg)
   at Microsoft.ML.Data.IO.BinaryLoader.OpenStream(IMultiStreamSource files)
   at Microsoft.ML.Data.IO.BinaryLoader..ctor(IHostEnvironment env, Arguments args, IMultiStreamSource file)
   at Microsoft.ML.BinaryLoaderSaverCatalog.LoadFromBinary(DataOperationsCatalog catalog, IMultiStreamSource fileSource)
   at Kwork.MLTrainer2023.Program.Train(List`1 trainingDataFiles, String LabelColumn, String saveTo, UInt32 trainingTimeSeconds) in /root/torronen/Kwork.MLTrainer2023/Kwork.MLTrainer2023/Program.cs:line 128
   at Kwork.MLTrainer2023.Program.Main(String[] args) in /root/torronen/Kwork.MLTrainer2023/Kwork.MLTrainer2023/Program.cs:line 52
   at Kwork.MLTrainer2023.Program.<Main>(String[] args)

Ubuntu 22.04: data = ctx.Data.LoadFromBinary("/mnt/xshare/prepared/*.idv"); (multiple .idv files exist at this path)

Unhandled exception. System.ArgumentOutOfRangeException: File does not exist at path: /mnt/xshare/prepared/*.idv (Parameter 'path')
   at Microsoft.ML.BinaryLoaderSaverCatalog.LoadFromBinary(DataOperationsCatalog catalog, String path)
   at Kwork.MLTrainer2023.Program.Train(List`1 trainingDataFiles, String LabelColumn, String saveTo, UInt32 trainingTimeSeconds) in /root/torronen/Kwork.MLTrainer2023/Kwork.MLTrainer2023/Program.cs:line 150
   at Kwork.MLTrainer2023.Program.Main(String[] args) in /root/torronen/Kwork.MLTrainer2023/Kwork.MLTrainer2023/Program.cs:line 73
   at Kwork.MLTrainer2023.Program.<Main>(String[] args)
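Given that both attempts above fail, one possible stopgap is to load each .idv file on its own and combine the results with the IEnumerable workaround described earlier in the thread. The sketch below assumes all files share one schema; IdvRow, MultiIdvLoader and LoadAll are hypothetical names for this example, not ML.NET APIs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;

// Hypothetical schema class; it must match the columns stored in the .idv files.
public class IdvRow
{
    public float Feature { get; set; }
    public bool Label { get; set; }
}

// Hypothetical helper, not part of ML.NET.
public static class MultiIdvLoader
{
    // Loads every file matching 'pattern' in 'directory' with the single-file
    // binary loader, then exposes all rows as one IDataView.
    public static IDataView LoadAll<TRow>(MLContext mlContext, string directory, string pattern = "*.idv")
        where TRow : class, new()
    {
        // The wildcard is expanded here, since LoadFromBinary takes one path.
        string[] paths = Directory.GetFiles(directory, pattern);
        Array.Sort(paths, StringComparer.Ordinal); // deterministic file order

        // One loader per file, created once; the loaders stay alive for as
        // long as the combined view is in use.
        IDataView[] views = paths.Select(path => mlContext.Data.LoadFromBinary(path)).ToArray();

        // Rows are streamed lazily from each file in turn.
        IEnumerable<TRow> rows = views.SelectMany(
            view => mlContext.Data.CreateEnumerable<TRow>(view, reuseRowObject: false));

        return mlContext.Data.LoadFromEnumerable(rows);
    }
}
```

A call might look like MultiIdvLoader.LoadAll&lt;IdvRow&gt;(ctx, "/mnt/xshare/prepared"). Every row passes through a typed object on the way, so this is likely slower than a native multi-file loader would be.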

torronen avatar May 08 '23 20:05 torronen

@luisquintanilla @michaelgsharp could you please go ahead and expose some functionality that takes any number of IDataViews with the same schema and concatenates them together?

superichmann avatar Jun 25 '23 08:06 superichmann