machinelearning
Easier way to create dynamic DataViews
Is your feature request related to a problem? Please describe.
At my company we want to add ML blocks to our arsenal (built with Blockly) with which you could train and run models. I've read in the docs and in some issues that the model must be defined beforehand by declaring a class with some attributes, and apparently it's not easy to create a dynamic model. Our idea is to feed SQL DataSets to the trainer.
Describe the solution you'd like
As a user I would like to define the model based on the shape of the input data, for example a SQL DataSet or a CSV. After that, each column's metadata could be added programmatically.
Describe alternatives you've considered
Both seem overcomplicated to me:
https://stackoverflow.com/questions/56761728/add-custom-column-to-idataview-in-ml-net
https://stackoverflow.com/questions/66893993/ml-net-create-prediction-engine-using-dynamic-class/66913705#66913705
Additional context
Experienced C# developer, new to ML.NET.
@luisquintanilla @briacht I have seen this come up several times now, so it's obviously something people want. I'm not sure how it will fit with our priorities, but it's definitely something we should look into more.
@vgb1993 could you give an example of what you would like to see/what you are thinking?
Yes, I've seen this request quite a few times now too! I think this would be good to investigate.
Here's some raw brainstorming:
- Create a sample showing how to build a dynamic model programmatically with the existing tools.
- Create a new fluent API to complement (rather than replace) the attribute-based configuration, as in EF Core. Expose it as a NuGet package and document it.
- Reference the columns by name or by index instead of strongly typed property access.
- Create a NuGet package that encapsulates the solutions described in the Stack Overflow links above.
- Create a connector for simple cases, like SQL datasets and CSVs. The connector could figure out the input schema types on its own based on the data.
- Create a schema specification using json. Perhaps even the pipeline?
Any thoughts? Any preferences? Any drawbacks? I'm not familiar with the internal implementation of ML.NET, so perhaps someone has better ideas.
At the end of the day, what we want is to create and run ML models at runtime. If we can define a dynamic model, we can build software around it, which ultimately makes ML.NET more accessible.
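As a rough illustration of the "reference the columns by name or by index" idea using tools that already exist, here is a minimal sketch that builds a TextLoader schema from a column list known only at runtime. The column names, kinds, and sample file are made up for the example, and this is only one possible approach, not an official recommendation.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical runtime description of the input: names and kinds that
// would normally come from configuration, a SQL result set, etc.
var columnSpecs = new (string Name, DataKind Kind)[]
{
    ("A", DataKind.Single),
    ("B", DataKind.Single),
    ("C", DataKind.String),
};

// Build the loader columns by position instead of declaring a class.
TextLoader.Column[] columns = columnSpecs
    .Select((spec, index) => new TextLoader.Column(spec.Name, spec.Kind, index))
    .ToArray();

var mlContext = new MLContext();
var loader = mlContext.Data.CreateTextLoader(new TextLoader.Options
{
    Columns = columns,
    HasHeader = true,
    Separators = new[] { ',' },
});

// Write a tiny sample file so the sketch is self-contained.
string path = Path.Combine(Path.GetTempPath(), "dynamic-sample.csv");
File.WriteAllText(path, "A,B,C\n1.0,2.0,x\n3.0,4.0,y\n");

IDataView data = loader.Load(path);

// Columns are addressed through the schema, not through typed properties.
foreach (var column in data.Schema)
    Console.WriteLine($"{column.Name}: {column.Type}");
```

The same runtime list of names could then be passed to transforms such as Concatenate, which take column names as strings.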
Yes! I ran into this problem today.
For @vgb1993's points above,
1 - Yes
3 - Yes
5 - Yes, CSV for us
Others are maybe/sure.
Currently the easiest way to do a dynamic dataview is using the Microsoft.Data.Analysis.DataFrame because it can dynamically load in a text file and create the schema automatically and then use that in ML.NET. Take a look at this for an example.
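For readers who want a concrete starting point, a minimal sketch of the DataFrame approach might look like the following. The file name and its contents are invented for illustration, and the snippet assumes the Microsoft.Data.Analysis and Microsoft.ML packages are referenced.

```csharp
using System;
using System.IO;
using Microsoft.Data.Analysis;
using Microsoft.ML;

// Write a small sample file so the sketch is self-contained (made-up data).
string path = Path.Combine(Path.GetTempPath(), "dataframe-sample.csv");
File.WriteAllText(path, "Size,Rooms,Price\n50,2,100000\n80,3,180000\n");

// LoadCsv reads the header and infers the column types from the data.
DataFrame frame = DataFrame.LoadCsv(path);

// DataFrame implements IDataView, so it can be handed to ML.NET directly.
IDataView dataView = frame;

// Inspect the schema that was created without any hand-written class.
foreach (var column in dataView.Schema)
    Console.WriteLine($"{column.Name}: {column.Type}");
```

Note that the whole file is loaded into memory in this approach.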
That being said, some of the other approaches mentioned above are things we are considering, but don't have a timeline for them as of yet.
@michaelgsharp, I do agree that a Microsoft.Data.Analysis.DataFrame may be a solution; the issue is that it cannot be streamed (the entire DataFrame has to fit into memory to use it).
Are you aware of any partitioning work on DataFrames, similar to what the Python Dask library does for pandas?
Personally I am not, but I haven't really looked into it much. @luisquintanilla are you aware of anything?
Yeah, the memory thing can be an issue. I'm not aware of a workaround for now, but this is something we are keeping in mind for future work.
@michaelgsharp we have the same issue with dynamic data loading (mostly from a SQL database) and with dynamically creating models for each label/value prediction scenario. I think this is the main problem: we need a separate model for each scenario.
- Predict C from A, B
- Predict B from C, A
- Predict X from D, F
Every combination of form fields a user could potentially want is a new scenario, and needs its own classes, model, and project.
Here are our ideas so far:
- Building classes and models at runtime using "dynamic".
- Reflection.
- Generating C# code dynamically (StringBuilder etc.), then compiling it and saving the output model to the database.
I guess all of the above should work, but all of them are ridiculous... Do you know of any API update or ETA for generic model creation?
I was hoping this would have been added in 2.0.0. I'm currently using one huge class that contains all my potential features, loading my data from stored procedures, and then using ML.Transforms.DropColumns to remove the fields that were not found before training. It works, but it's far from ideal and has become stifling. Has anyone found a better workaround?
I am using only Microsoft.AutoML, and I was able to work around this by loading data dynamically from SQL and using the query's column names as the input/label column names. I also use SQL "placeholder" selects to fix the column types:
SELECT '0' as InputOrLabelColumn /* for text */
SELECT CAST(0 as real) as InputOrLabelColumn /* for numbers */
I then load the IDataView from a normal SQL query, mapping the SQL columns to ML.NET data types:
SELECT Text, CAST(intcol as real) intcol, Result FROM Table
For prediction, you just need to match the data types with a fake one-row SELECT statement to get Result:
SELECT 'text', CAST(0 as real) intcol
This works for both training and prediction with the AutoML API. I have not used the regular Featurize/Fit/Transform methods yet.
I'm not sure how many people have struggled with this, but I managed to solve one of the runtime type problems without using JSON, CSV, or text files as my DataView source: a runtime dynamic DataView over a live IEnumerable of objects.
In my case I already have defined classes, but I have hundreds of them. They are all used in the same way, but having to write a separate training method for each one, without being able to implement a common interface, meant essentially rewriting my whole program each time a new class was added. Now, each time I hard-code a new type, I just implement an interface and it can train using the same method as my existing types, and my columns, labels, etc. are generated dynamically. Any common variables I don't want in my DataView I can put in the interface and use reflection to ignore them, by comparing the class's properties to the interface's properties.
https://github.com/BurnOutTrader/DynamicMLPipeline/tree/main
git@github.com:BurnOutTrader/DynamicMLPipeline.git
A similar approach could probably be taken to create objects from .csv files etc. and apply schema properties dynamically, but I think ML.NET has that built in now.
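To make the "compare properties to interface properties" idea concrete, here is a minimal sketch using only reflection. It is not the code from the linked repository; the interface, class, and helper names (ITrainable, HouseSample, ColumnDiscovery) are hypothetical and exist only for this example.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Usage: discover the column names for a data type at runtime.
string[] names = ColumnDiscovery.GetColumnNames(typeof(HouseSample), typeof(ITrainable));
Console.WriteLine(string.Join(", ", names));

// Hypothetical interface: its members are bookkeeping, not model columns.
public interface ITrainable
{
    string Id { get; }
}

// Hypothetical data class; any class implementing ITrainable would work.
public class HouseSample : ITrainable
{
    public string Id { get; set; } = "";
    public float Size { get; set; }
    public float Rooms { get; set; }
    public float Price { get; set; }
}

public static class ColumnDiscovery
{
    // Returns the public instance properties of dataType whose names are
    // not declared on markerInterface.
    public static string[] GetColumnNames(Type dataType, Type markerInterface)
    {
        var excluded = markerInterface
            .GetProperties()
            .Select(p => p.Name)
            .ToHashSet();

        return dataType
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => !excluded.Contains(p.Name))
            .Select(p => p.Name)
            .ToArray();
    }
}
```

The resulting names could then be used as column names when building the pipeline, so a new class only needs to implement the interface.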