Microsoft.Data.Analysis DataFrameColumn.SetName method is weird

Open olavt opened this issue 3 years ago • 4 comments

.NET Core 3.1, Microsoft.Data.Analysis NuGet package version: 0.19.1

This would be the logical way to call the method:

dataFrame.Columns["Date_left"].SetName("Date");

However, this does not rename the column properly.

The method needs to be called like this in order to work properly:

dataFrame.Columns["Date_left"].SetName("Date", dataFrame);

Here's a program to reproduce the issue:

using Microsoft.Data.Analysis;
using System;
using System.Linq;

namespace TestDataFRame
{
    internal class Program
    {
        static void Main(string[] args)
        {
            DateTime?[] dates1 = { new DateTime(2022, 03, 01), new DateTime(2022, 03, 02), new DateTime(2022, 03, 03) };
            double?[] closePrices = { 10.5, 12.4, 11.3 };

            DateTime?[] dates2 = { new DateTime(2022, 03, 01), new DateTime(2022, 03, 02), new DateTime(2022, 03, 03), new DateTime(2022, 03, 04) };
            double[] shortPercentages = { 2.34, 2.36, 3.01, 3.04 };

            DataFrame dataFrame1 = new DataFrame();
            dataFrame1.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("Date", dates1));
            dataFrame1.Columns.Add(new DoubleDataFrameColumn("ClosePrice", closePrices));

            var numbers1 = dataFrame1.Columns.GetDoubleColumn("ClosePrice").ToArray();

            DataFrame dataFrame2 = new DataFrame();
            dataFrame2.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("Date", dates2));
            dataFrame2.Columns.Add(new DoubleDataFrameColumn("ShortPercentage", shortPercentages));

            var numbers2 = dataFrame2.Columns.GetDoubleColumn("ShortPercentage").ToArray();

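            // Left-join the two frames on "Date"; the left key column appears in the result as "Date_left".
            DataFrame dataFrame = dataFrame1.Merge<DateTime>(dataFrame2, "Date", "Date", joinAlgorithm: JoinAlgorithm.Left);
            // Rename without passing the DataFrame: this is the call that does not work properly.
            dataFrame.Columns["Date_left"].SetName("Date");

            // Look the column up under its new name.
            var dates = dataFrame.Columns.GetPrimitiveColumn<DateTime>("Date").ToArray();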
        }
    }
}

olavt, Mar 14 '22 20:03

Looks like the columns themselves don't have a reference to the parent DataFrame, hence this behavior. I think the way to fix this would be to give each column a reference to its DataFrame so it can perform the update as you point out. I don't think it would be a big change. @luisquintanilla
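To make the explanation above concrete, here is a minimal sketch of the mechanism. ToyColumn and ToyFrame are hypothetical stand-ins written for this illustration, not the Microsoft.Data.Analysis types, and the real implementation may differ in detail. The sketch only assumes what the thread describes: the frame looks columns up by name, and a column has no reference to its frame unless one is passed to SetName.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a column: it knows its own name, not its owner.
class ToyColumn
{
    public string Name { get; private set; }

    public ToyColumn(string name) { Name = name; }

    // The owning frame is optional, mirroring the call shape in this thread.
    public void SetName(string newName, ToyFrame frame = null)
    {
        if (frame != null)
        {
            // Only when the frame is supplied can its name lookup be updated.
            frame.RenameInIndex(this, newName);
        }
        Name = newName;
    }
}

// Hypothetical stand-in for a frame: it keeps a name-to-column lookup.
class ToyFrame
{
    private readonly Dictionary<string, ToyColumn> _byName =
        new Dictionary<string, ToyColumn>();

    public void Add(ToyColumn column) { _byName.Add(column.Name, column); }

    public bool Contains(string name) { return _byName.ContainsKey(name); }

    public void RenameInIndex(ToyColumn column, string newName)
    {
        _byName.Remove(column.Name);
        _byName.Add(newName, column);
    }
}

static class Program
{
    static void Main()
    {
        // Rename without the frame: the lookup keeps the old key.
        var stale = new ToyFrame();
        var first = new ToyColumn("Date_left");
        stale.Add(first);
        first.SetName("Date");
        Console.WriteLine(stale.Contains("Date"));      // prints False
        Console.WriteLine(stale.Contains("Date_left")); // prints True

        // Rename with the frame: the lookup is updated as well.
        var updated = new ToyFrame();
        var second = new ToyColumn("Date_left");
        updated.Add(second);
        second.SetName("Date", updated);
        Console.WriteLine(updated.Contains("Date"));    // prints True
    }
}
```

In this model the first call raises no error, yet the frame can no longer find the column under its new name. That matches the symptom reported in the issue.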

michaelgsharp, Mar 18 '22 19:03

OK, the other option (or we could do both): dataFrame.Columns already has a SetColumnName method, but for some reason it requires you to pass in the whole column: SetColumnName(DataFrameColumn column, string newName). We could add an overload that takes the name of the current column and the name you want to set it to: SetColumnName(string curName, string newName). Then you could just call dataFrame.Columns.SetColumnName("cur", "new");. Thoughts?
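Until such an overload exists, the same call shape can be approximated in user code with an extension method. This is only a sketch: the extension class and the two-string method are hypothetical and not part of the package. It assumes the existing SetColumnName(DataFrameColumn column, string newName) described above and the string indexer on dataFrame.Columns used earlier in this issue, as they appear in version 0.19.1.

```csharp
using Microsoft.Data.Analysis;

// Hypothetical helper, not part of Microsoft.Data.Analysis.
public static class DataFrameColumnCollectionExtensions
{
    // Renames a column by its current name by delegating to the existing
    // SetColumnName(DataFrameColumn, string) on the collection.
    public static void SetColumnName(
        this DataFrameColumnCollection columns, string curName, string newName)
    {
        // Look the column up by its current name using the string indexer.
        DataFrameColumn column = columns[curName];

        // Resolves to the existing instance method, which takes the column itself.
        columns.SetColumnName(column, newName);
    }
}
```

With this helper in scope, the repro above could call dataFrame.Columns.SetColumnName("Date_left", "Date"); instead of SetName. If a later package version ships its own overload with this signature, the instance method takes precedence over the extension.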

michaelgsharp, Mar 18 '22 20:03

Thanks for reporting this issue @olavt.

@michaelgsharp Thanks for looking into this and providing those solutions. Tracking this issue as part of DataFrame improvements.

luisquintanilla, Mar 29 '22 17:03

Note that, because SetName has a default value of null for the DataFrame parameter, if a user tries dataFrame["oldColumnName"].SetName("newColumnName"), the rename will not only fail but will also not throw an exception, resulting in a bug that is hard to catch.

You can guess who the user is... 😀

chrisxfire, Aug 10 '22 21:08