Add Support for Microsoft.Data.Analysis.DataFrame
Micrsoft has included in .Net a class for DataFrame in the spirit of Pandas for Python. It is a natural way of exchanging data with Excel. It would be nice to be able to pass DataFrame as argument to C# functions through ExcdlDna. Is it an addition we could hope to see anytime soon ?
I'm not sure how widely the Microsoft DataFrame is used, but it's certainly relevant to Microsoft-friendly data processing in Excel.
I think there are different use cases.
One like to make a function like:
public static DataFrame ProcessData(DataFrame input) {...}
instead of
public static object[,] ProcessData(object[,] input) {...}
For this we just need to define the two conversions object[,] -> DataFrame and DataFrame -> object[,].
Then the wrapper function does
public static object[,] ProcessDataExcel(object[,] input) =>
ConvertToDataFrame(ProcessData(ConvertFromDataFrame(input));
These wrappers can be automatically generated at runtime, using the ExcelDna.Registration extension. So if you can defined the two conversions, the rest is done already.
Another interpretation is that you want to work with large DataFrames that might not be sitting in an Excel sheet. So you want some kind of handle mechanism to use Excel function to define processing on DataFrames, but the DataFrames live in memory behind the scenes.
So you'd have function like
DataFrameHandle LoadDataFrmaeFromFiles(string path) { ... }
This returns an opaque handle (maybe just a string that is a key into a Dictionary) that can be used further
DataFrameHandle SliceDataFrame(DataFrameHandle input, object[,] indices) {...}
The main issue with the object handle schemes is to manage the lifetime of the objects. Excel-DNA can help with this through an RTD-based handle mechanism. There are some example, but the best write-up is still the ACQ add-in that I described here: https://excel-dna.net/2016/06/10/add-in-spotlight-acq-for-interpolation/
There is a more sophisticated, but perhaps now too complicated, sample which allows external Refresh of the handle object etc. here: https://github.com/Excel-DNA/Samples/tree/master/ObjectHandles
I'd say the problem with DataFrame support in Excel-DNA is not really in the plumbing or implementation, but understanding the requirements better. Do you use these extensively? Have you tried to use them in Excel in ways similar to the above?
@govert thank you for the detailed answer.
Regarding your first question. I am in the first use case you described, passing DataFrame fom/to Excel (the second use case is not an issue as the function DataFrame.LoadCSV already exist and allow to manipulate large csv....).
Regarding your second question, I recently discovered the DataFrame package from Microsoft (which is fairly new) and I see potential to ditch python/xlwings for .net/Exceldna. I quite like python/xlwings for macro but for UDF I see Exceldna much more robust and efficient. I believe this package will be used intensly by data scientist.
These wrappers can be automatically generated at runtime, using the ExcelDna.Registration extension. So if you can define the two conversions, the rest is done already.
I am really not sure I have the technical skills to do this but if you guide me I could give it a try. However, it is likely it will be faster for you to do so than to give me the required background information :-)
I think the main problems are identifying some real use cases, and deciding how it should work for those, and then to answer the many little questions that arise in how to manage the impedance mismatch between Excel and the data frame library.
You might try to think about how to write a function that takes an object[,] of values from an Excel sheet and makes a DataFrame. The Excel array can have any combination of string, double, bool, ExcelEmpty, ExcelError. Now decide how to build a DataFrame, when to make errors, how to get dates (since your input will never have dates, just doubles with the special Excel date values), what to do for column names (always pass in). Do you make up some 'configuration' string format that you also pass in define the schema. And so it goes on.
And then a similar story for DataFrame -> object[,].
And then you need some use for it to try it out.
what about object[,] -> csv string using existing nugget and then use the function DataFrame.LoadCSV(string csv) ?
That would be much simpler, right ?
I'm not sure which NuGet package you mean, but you can try it. I have no idea how DataFrame.LoadCsv works.
@govert as you advised, in order to export DataFrame instead of returning the DataFrame df, I am simply returning CustomMarshaler.dataFrameToObjects(df) with the code below.
Is it the kind of conversion function you were mentioning above? Would you consider this as a wrapper than can be implemented in ExcelDna or is it a too naive approach?
Otherwise, is there something I can do on my side to have this wrapper automatically applied so I can directly return DataFrame object to Excel ?
public static object[,] dataFrameToObjects(DataFrame df)
{
var iRows = df.Rows.Count;
var iCols = df.Columns.Count;
var res = new object[iRows + 1, iCols];
for (var j = 0; j < iCols; j++)
{
res[0, j] = df.Columns[j].Name;
}
for (var i = 0; i < iRows; i++)
{
for (var j = 0; j < iCols; j++)
{
res[i + 1, j] = df[i, j];
}
}
return res;
}
This kind of marshaling is all you need. You might consider whether you need to export the column names.
The part that can be done with the Registration extension is not the marshaling, but making the wrapper that takes this unsupported function
public static DataFrame CalculateDataFrame(DateFrame input) {...}
into this supported function (this is what I'm calling the wrapper)
public static object[,] CalculateDataFrameExcel(object[,] input)
{
return dataFrameToObjects(CalculateDataFrame(objectsToDataFrame(input));
}
The work you still need to do is to define those two functions dataFrameToObjects and objectsToDataFrame. Nothing does that part automatically.
For the first ten or twenty function, writing the wrapper explicitly is easiest. But if you have hundreds of functions, it's nicer to have the little wrappers that call the conversion to be generated automatically. That's what the Registration extensions could do. There are type conversion examples in the Registration library source code: https://github.com/Excel-DNA/Registration But again I would suggest leaving that part for later, and just try to figure out the basic conversions, and then see how you would actually use this.
The function dataFrameToObjects is very easy, it is the one I put in my previous post.
I also implemented objectsToDataFrame in my code. To do so I had also to have as parameters the type of object I want.
I doubt my implementation is very efficient (but if you are interested I can share it with you), but it allows me to interact with DataFrame from/to excel ans I havent detected any performance issue.
Maybe you can make a GitHub repository with your code and some samples?
I believe this package will be used intensely by data scientists.
You just have to start somewhere.
It sounds strange – maybe some dependencies that are different. Does it make a difference whether you are running the debug version under the debugger or directly?
If you can make a small project that has this problem and post it somewhere then I can try to reproduce.
-Govert
From: akasolace @.> Sent: 13 March 2022 12:13 To: Excel-DNA/ExcelDna @.> Cc: Govert van Drimmelen @.>; Mention @.> Subject: Re: [Excel-DNA/ExcelDna] Add Support for Microsoft.Data.Analysis.DataFrame (Issue #440)
@goverthttps://github.com/govert I would like to do so but I have been strugling with a bug I never experienced before with ExcelDna. My code work as expected in Debug mode (e.g calling my functions from Excel load and export DataFrame as expected). But When I build my addin in release mode the functions are throwing an Error. This is the first time I have a different behaviour between Debug and Release mode. Hence, I don't know how to fix it because there is nothing to debug ... @goverthttps://github.com/govert any idea what could cause this ? Or how I can investigate what is going on ....
— Reply to this email directly, view it on GitHubhttps://github.com/Excel-DNA/ExcelDna/issues/440#issuecomment-1066067525, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AADFHQ5Y5YY6M2YG7KO2KALU7W5R7ANCNFSM5QL25KQA. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.Message ID: @.@.>>