machinelearning Accessing data in a DataFrameColumn is insanely slow.

Open DrDryg opened this issue 3 years ago • 1 comments

System Information (please complete the following information):

Windows 10
Microsoft.Data.Analysis 0.18.0
.NET Framework 4.7.2

Describe the bug

Accessing data in a PrimitiveDataFrameColumn<> through its indexer is extremely slow.

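To Reproduce

// Element type assumed to be int; the generic type argument did not survive in the post.
int n = 1_000_000;
PrimitiveDataFrameColumn<int> column = new PrimitiveDataFrameColumn<int>("Name", n);

for (int i = 0; i < n; i++)
    column[i] = 1;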

Expected behavior

Filling in values in a column should cost only a few clock cycles per value, so at least 100 million values per second should be achievable on a normal computer. Instead, 1 million elements take around 0.5 s on a new high-performance laptop, which is about 2 million values per second.

Is it simply that nullable objects are this slow? If that is the case, why did you go for such a technology for a data processing library where performance is a key factor?

For perspective, writing the data to disk is 10 times faster!

Oct 10 '21 09:10 DrDryg

> Is it simply that nullable objects are this slow?

Just my initial impression here: Are you able to test this by doing the following?

int n = 1_000_000;
Int32DataFrameColumn column = new Int32DataFrameColumn("Name", n);

for (int i = 0; i < n; i++)
    column[i] = 1;

My guess is that it will be much faster.
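One way to check this guess is a small Stopwatch harness like the one below. It is only a rough sketch, not a rigorous benchmark: the class and method names (ColumnFillTiming, FillColumn, FillArray) are made up for illustration, and it assumes the project references the Microsoft.Data.Analysis package. It times the same indexer loop on both column types and on a plain int[] as a baseline.

```csharp
using System;
using System.Diagnostics;
using Microsoft.Data.Analysis;

// Hypothetical helper for illustration; not part of Microsoft.Data.Analysis.
public static class ColumnFillTiming
{
    // Writes 1 into every row through the column indexer and returns the elapsed time.
    // Int32DataFrameColumn derives from PrimitiveDataFrameColumn<int>, so both types fit here.
    public static TimeSpan FillColumn(PrimitiveDataFrameColumn<int> column)
    {
        var sw = Stopwatch.StartNew();
        for (long i = 0; i < column.Length; i++)
            column[i] = 1;
        sw.Stop();
        return sw.Elapsed;
    }

    // Baseline: the same loop over a plain array.
    public static TimeSpan FillArray(int[] array)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < array.Length; i++)
            array[i] = 1;
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        const int n = 1_000_000;
        var generic = new PrimitiveDataFrameColumn<int>("Name", n);
        var typed = new Int32DataFrameColumn("Name", n);
        var array = new int[n];

        // Warm-up pass so JIT compilation is not included in the measured runs.
        FillColumn(generic);
        FillColumn(typed);
        FillArray(array);

        Console.WriteLine($"PrimitiveDataFrameColumn<int>: {FillColumn(generic).TotalMilliseconds:F1} ms");
        Console.WriteLine($"Int32DataFrameColumn:          {FillColumn(typed).TotalMilliseconds:F1} ms");
        Console.WriteLine($"int[]:                         {FillArray(array).TotalMilliseconds:F1} ms");
    }
}
```

Run it in a Release build; a single Stopwatch measurement is noisy, so treat the numbers as an order-of-magnitude indication only.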

Jan 28 '22 06:01 pgovind

Comparing indexed reads of a double column with indexed reads of a double array: reading the array is roughly 40 times faster (about 38x in the results below):

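// Requires: using System.Linq; using BenchmarkDotNet.Attributes; using Microsoft.Data.Analysis;
// Fields used below; their declarations (and the value of ItemsCount) were not included in the post.
private DoubleDataFrameColumn _doubleColumn;
private double[] _doubleArr;

[GlobalSetup]
public void SetUp()
{
    var values = Enumerable.Range(1, ItemsCount).ToArray();

    _doubleColumn = new DoubleDataFrameColumn("Column2", values.Select(v => (double)v));
    _doubleArr = new double[ItemsCount];
}

[Benchmark]
public void Indexing_Double_Column()
{
    // The column indexer returns a nullable value (double?).
    double? a = 0;
    for (int i = 0; i < _doubleColumn.Length; i++)
        a = _doubleColumn[i];
}

[Benchmark]
public void Indexing_Double_Array()
{
    double a = 0;
    for (int i = 0; i < _doubleColumn.Length; i++)
        a = _doubleArr[i];
}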

|                 Method |        Mean |     Error |    StdDev |
|----------------------- |------------:|----------:|----------:|
| Indexing_Double_Column | 10,970.0 us | 190.83 us | 178.50 us |
|  Indexing_Double_Array |    291.5 us |   0.62 us |   0.55 us |
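One caveat on the numbers above: neither benchmark uses the value it reads, so the JIT may optimize away part of the work, and that can skew the ratio. A possible variant that sums the values and returns the result is sketched below. The class name, the method names and the ItemsCount value of 1,000,000 are assumptions made for illustration (the size used for the results above is not stated), and the sketch assumes the BenchmarkDotNet and Microsoft.Data.Analysis packages are referenced.

```csharp
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Data.Analysis;

public class IndexingBenchmarks
{
    // Hypothetical size; the original post does not state ItemsCount.
    [Params(1_000_000)]
    public int ItemsCount;

    private DoubleDataFrameColumn _doubleColumn;
    private double[] _doubleArr;

    [GlobalSetup]
    public void SetUp()
    {
        // Both containers hold the same values 1..ItemsCount.
        _doubleArr = Enumerable.Range(1, ItemsCount).Select(v => (double)v).ToArray();
        _doubleColumn = new DoubleDataFrameColumn("Column2", _doubleArr);
    }

    [Benchmark]
    public double Sum_Double_Column()
    {
        double sum = 0;
        for (long i = 0; i < _doubleColumn.Length; i++)
            sum += _doubleColumn[i] ?? 0.0;   // indexer returns double?; null counts as 0
        return sum;                            // returning the result keeps the loop live
    }

    [Benchmark(Baseline = true)]
    public double Sum_Double_Array()
    {
        double sum = 0;
        for (int i = 0; i < _doubleArr.Length; i++)
            sum += _doubleArr[i];
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<IndexingBenchmarks>();
}
```

With Baseline = true on the array method, the BenchmarkDotNet summary includes a Ratio column, so the slowdown of the column indexer can be read directly instead of being computed by hand.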

Oct 01 '23 08:10 asmirnov82
