Microsoft.Data.Analysis DataFrameColumns.SetName method is weird
.NET Core 3.1, Microsoft.Data.Analysis NuGet package version 0.19.1
This would be the logical way to call the method:
dataFrame.Columns["Date_left"].SetName("Date");
However, this does not rename the column properly.
The method needs to be called like this in order to work properly:
dataFrame.Columns["Date_left"].SetName("Date", dataFrame);
Here's a program to reproduce the issue:
using Microsoft.Data.Analysis;
using System;
using System.Linq;

namespace TestDataFRame
{
    internal class Program
    {
        static void Main(string[] args)
        {
            DateTime?[] dates1 = { new DateTime(2022, 03, 01), new DateTime(2022, 03, 02), new DateTime(2022, 03, 03) };
            double?[] closePrices = { 10.5, 12.4, 11.3 };
            DateTime?[] dates2 = { new DateTime(2022, 03, 01), new DateTime(2022, 03, 02), new DateTime(2022, 03, 03), new DateTime(2022, 03, 04) };
            double[] shortPercentages = { 2.34, 2.36, 3.01, 3.04 };

            DataFrame dataFrame1 = new DataFrame();
            dataFrame1.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("Date", dates1));
            dataFrame1.Columns.Add(new DoubleDataFrameColumn("ClosePrice", closePrices));
            var numbers1 = dataFrame1.Columns.GetDoubleColumn("ClosePrice").ToArray();

            DataFrame dataFrame2 = new DataFrame();
            dataFrame2.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("Date", dates2));
            dataFrame2.Columns.Add(new DoubleDataFrameColumn("ShortPercentage", shortPercentages));
            var numbers2 = dataFrame2.Columns.GetDoubleColumn("ShortPercentage").ToArray();

            DataFrame dataFrame = dataFrame1.Merge<DateTime>(dataFrame2, "Date", "Date", joinAlgorithm: JoinAlgorithm.Left);
            dataFrame.Columns["Date_left"].SetName("Date");
            var dates = dataFrame.Columns.GetPrimitiveColumn<DateTime>("Date").ToArray();
        }
    }
}
It looks like the columns themselves don't hold a reference to the parent DataFrame, hence this behavior. I think the fix would be to give each column a reference to its DataFrame so that it can perform the update, as you point out. I don't think it would be a big change. @luisquintanilla
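To illustrate the mechanism described above, here is a minimal toy model. It assumes the column collection keeps a name-to-index lookup that only gets updated when the column is handed its owner. ToyColumn and ToyColumnCollection are hypothetical types made up for this sketch; they are not the Microsoft.Data.Analysis implementation.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: hypothetical types, not the Microsoft.Data.Analysis source.
class ToyColumn
{
    public string Name { get; private set; }

    public ToyColumn(string name) { Name = name; }

    // Same shape as SetName(newName, dataFrame = null): the owner is optional.
    public void SetName(string newName, ToyColumnCollection owner = null)
    {
        // Without an owner the column cannot tell the collection about the rename.
        if (owner != null)
        {
            owner.UpdateIndex(Name, newName);
        }
        Name = newName;
    }
}

class ToyColumnCollection
{
    private readonly List<ToyColumn> _columns = new List<ToyColumn>();
    private readonly Dictionary<string, int> _indexByName = new Dictionary<string, int>();

    public void Add(ToyColumn column)
    {
        _indexByName[column.Name] = _columns.Count;
        _columns.Add(column);
    }

    public bool Contains(string name)
    {
        return _indexByName.ContainsKey(name);
    }

    // Moves the lookup entry from the old name to the new name.
    internal void UpdateIndex(string oldName, string newName)
    {
        int index = _indexByName[oldName];
        _indexByName.Remove(oldName);
        _indexByName[newName] = index;
    }
}

class Program
{
    static void Main()
    {
        var columns = new ToyColumnCollection();
        var first = new ToyColumn("Date_left");
        var second = new ToyColumn("Value_left");
        columns.Add(first);
        columns.Add(second);

        // Rename without the owner: the column changes, the lookup goes stale.
        first.SetName("Date");
        Console.WriteLine(columns.Contains("Date"));       // False
        Console.WriteLine(columns.Contains("Date_left"));  // True

        // Rename with the owner: the lookup is updated as well.
        second.SetName("Value", columns);
        Console.WriteLine(columns.Contains("Value"));      // True
    }
}
```

In this model the first rename succeeds on the column object but the collection still answers to the old name, which matches the symptom in the report. Whether the real library fails for the same reason is an assumption.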
OK, the other option (or we could do both): dataFrame.Columns also has a SetColumnName method, but for some reason it requires you to pass in the whole column: SetColumnName(DataFrameColumn column, string newName). We could add an overload that takes just the current column name and the new name: SetColumnName(string curName, string newName). Then you could simply call dataFrame.Columns.SetColumnName("cur", "new");. Thoughts?
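Until such an overload exists, roughly the same thing can be sketched as an extension method over the existing SetColumnName(DataFrameColumn column, string newName). RenameColumn and the containing class are hypothetical names chosen for this example, not part of the library, and the sketch assumes package version 0.19.1 as in the report.

```csharp
using Microsoft.Data.Analysis;

public static class DataFrameColumnCollectionExtensions
{
    // Hypothetical helper, not part of Microsoft.Data.Analysis.
    // Looks the column up by its current name and delegates to the
    // existing SetColumnName(DataFrameColumn, string) on the collection.
    public static void RenameColumn(
        this DataFrameColumnCollection columns,
        string currentName,
        string newName)
    {
        DataFrameColumn column = columns[currentName];
        columns.SetColumnName(column, newName);
    }
}
```

With this in scope, the call from the repro would become dataFrame.Columns.RenameColumn("Date_left", "Date");. Because the rename goes through the collection, the caller no longer has to remember to pass the DataFrame.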
Thanks for reporting this issue @olavt.
@michaelgsharp Thanks for looking into this and providing those solutions. Tracking this issue as part of DataFrame improvements.
Note that because the DataFrame parameter of SetName defaults to null, if a user calls dataFrame["oldColumnName"].SetName("newColumnName"), the rename not only fails but also throws no exception, resulting in a bug that is hard to catch.
You can guess who the user is... 😀