machinelearning
Accessing data in a DataFrameColumn is insanely slow.
System Information (please complete the following information):
- Win 10
- Microsoft.Data.Analysis 0.18.0
- .NET Framework 4.7.2
Describe the bug
Accessing data in a PrimitiveDataFrameColumn<> is very slow.
To Reproduce
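int n = 1_000_000;
// The declaration was truncated in the original post; the element type and column name are assumed.
PrimitiveDataFrameColumn<int> column = new PrimitiveDataFrameColumn<int>("Name", n);
for (int i = 0; i < n; i++)
    column[i] = 1;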
Expected behavior
Filling in values in a column should cost only a few clock cycles per value, so at least 100 million values per second should be achievable on a normal computer. Instead, 1 million elements take around 0.5 s on a new high-performance laptop.
Is it simply that nullable objects are this slow? If that is the case, why did you go for such a technology for a data processing library where performance is a key factor?
For perspective, writing the data to disk is 10 times faster!
> Is it simply that nullable objects are this slow?
Just my initial impression here: Are you able to test this by doing the following?
int n = 1_000_000;
Int32DataFrameColumn column = new Int32DataFrameColumn("Name", n);
for (int i = 0; i < n; i++)
    column[i] = 1;
My guess is that it will be much faster.
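To separate the cost of Nullable<T> itself from the cost of the column indexer, one option is to time the same loop on a plain `int[]` and on an `int?[]`, with no DataFrame types involved. The following is only a minimal sketch using `System.Diagnostics.Stopwatch`; the class and method names (`NullableCheck`, `FillPlain`, `FillNullable`, `BestSeconds`) are made up for illustration, and a proper measurement would use BenchmarkDotNet in a Release build.

```csharp
using System;
using System.Diagnostics;

public static class NullableCheck
{
    // Writes 1 into every slot of a plain int array.
    public static void FillPlain(int[] target)
    {
        for (int i = 0; i < target.Length; i++)
            target[i] = 1;
    }

    // Same loop, but each element is a Nullable<int> (value plus HasValue flag).
    public static void FillNullable(int?[] target)
    {
        for (int i = 0; i < target.Length; i++)
            target[i] = 1;
    }

    // Runs the action once to warm up the JIT, then returns the
    // fastest of several timed runs, in seconds.
    public static double BestSeconds(Action action, int repeats = 5)
    {
        action();
        double best = double.MaxValue;
        for (int r = 0; r < repeats; r++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            best = Math.Min(best, sw.Elapsed.TotalSeconds);
        }
        return best;
    }

    public static void Main()
    {
        const int n = 1_000_000;
        var plain = new int[n];
        var nullable = new int?[n];

        double tPlain = BestSeconds(() => FillPlain(plain));
        double tNullable = BestSeconds(() => FillNullable(nullable));

        Console.WriteLine($"int[]  : {n / tPlain:N0} elements/s");
        Console.WriteLine($"int?[] : {n / tNullable:N0} elements/s");
    }
}
```

If the two figures come out close to each other, the slowdown is more likely in the column's indexer path than in the nullable type itself.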
Comparing indexed reads on a double column and a double array, reading the array is roughly 40 times faster (about 38x in the run below):
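// Fragment of a BenchmarkDotNet class; ItemsCount, _doubleColumn and _doubleArr
// are members declared elsewhere in the class (not shown in the post).
[GlobalSetup]
public void SetUp()
{
    var values = Enumerable.Range(1, ItemsCount).ToArray();
    _doubleColumn = new DoubleDataFrameColumn("Column2", values.Select(v => (double)v));
    _doubleArr = new double[ItemsCount];
}

[Benchmark]
public void Indexing_Double_Column()
{
    // The column indexer returns a nullable value.
    double? a = 0;
    for (int i = 0; i < _doubleColumn.Length; i++)
        a = _doubleColumn[i];
}

[Benchmark]
public void Indexing_Double_Array()
{
    double a = 0;
    // The column and the array hold the same number of elements.
    for (int i = 0; i < _doubleColumn.Length; i++)
        a = _doubleArr[i];
}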
| Method | Mean | Error | StdDev |
|---|---|---|---|
| Indexing_Double_Column | 10,970.0 us | 190.83 us | 178.50 us |
| Indexing_Double_Array | 291.5 us | 0.62 us | 0.55 us |