Pandas.NET icon indicating copy to clipboard operation
Pandas.NET copied to clipboard

Should DataFrame really be a NDArray child?

Open dotChris90 opened this issue 5 years ago • 13 comments

Sorry I already rise an issue while all is under construction >.<.

We should not let dataframe be a child of ndarray. In Pandas the dataframe is a child of a general pandas object and has no inheritance connection to NDArray. I think we will face same problems if we in heritage from ndarray.

  • Data frame is a collection of multiple NDArrays so it is not a array itself it is a collection
  • the behaviour is different from an array especially in indexing
  • frame indexing is like a dictionary df['column1'] or df['col2']
  • honestly spoken I think we need a dictionary object as property so we can store the different arrays
  • the indexing of dataframe must be via strings like in dictionary (not sure if possible but i think it must be possible)

dotChris90 avatar Nov 05 '18 20:11 dotChris90

What do you think fo inheriting from DynamicObject?

Oceania2018 avatar Nov 05 '18 23:11 Oceania2018

I am not sure. Dynamic object gives much flexibility to a class.

Maybe should not inheritage at moment from anything.

I think in future it could implement stuff like IEnumerable so we get something like index columns pairs.

But for start maybe should not inheritage from anything and see while implementing if we need inheritage.

But if u already found reason for inheritage let me know :)

dotChris90 avatar Nov 06 '18 07:11 dotChris90

@Oceania2018 if you do not mind I add a branch for play around with the data frame. Just because the frame class is the most important class in pandas.

dotChris90 avatar Nov 06 '18 16:11 dotChris90

@dotChris90 Go ahead. DataFrame is critical. Will see your experiment.

Oceania2018 avatar Nov 06 '18 16:11 Oceania2018

@Oceania2018 haha ah! now got your point why need DynamicObject! I just saw that each dataframe object has some properties which are dynamic. If the columns are A, B, C, D so it will have properties A,B,C,D. Ok - yes totally agree now with you.

dotChris90 avatar Nov 06 '18 16:11 dotChris90

But the cons is we won't get the strong type tips once we inherit from DynamicObject.

Oceania2018 avatar Nov 06 '18 17:11 Oceania2018

@Oceania2018 thats true --> at the end I experimented little bit without DynamicObj. Performance counts since this is the strongest benefit of Numsharp and the corresponding projects (static types are better and faster). ;)

You can still reach the columns with df['column1']

dotChris90 avatar Nov 06 '18 20:11 dotChris90

Hi, Great idea and promising project! It is very meaningful for the people who is familiar with pandas API, and want to use C# to do data analysis. I have watch this project since I found it, and hope I could contribute some code in the future.

Here's a advice for this issue:

Actually, pandas DataFrame could store different types of columns in a DataFrame. So I think it may be not appropriate to define the TData generic type for the class DataFrame, neither to use the whole NDArray as a internal data container of DataFrame.

There are two libraries which could be reference for you:

  1. Library Deedle only defines <TRowKey, TColumnKey> as generic type, but no TData. It implements most of common operations for DataFrame and Series.
  2. Machinelearning_DataFrame store different type data in different data chunk (a List of IDataColumn in the project, and we can use NDArray<T>). As far as I know, pandas also store data like this.

VanyTang avatar Nov 07 '18 14:11 VanyTang

@VanyTang thanks for the advice. Is this true? O.o Omg - honestly I had no idea that pandas is so dynamic .... this explains why their performance is so bad. Before I was really hoping that the columns have at least the same data type. Yes actually that is quite critical information.

@Oceania2018 maybe we should at least look Deedle + ML.Dataframe and also pandas source code itself. Honestly spoken I really hate this pandas "my columns can be anything" for performance reason. But at least we should again think about the pros and cons.

@VanyTang by the way - thanks for the kind words. A Numerical Stack is really something that is missing .NET world. Java and Python were the key languages in this area but I think it is time to show we .NET developers are also interested into this. ;)

dotChris90 avatar Nov 07 '18 16:11 dotChris90

@dotChris90 Deedle is designed for F# and complained for performance. Let's do DataFrame<TIndex,TData>, we might add a new type for Y(label) column, think about DataFrame<TIndex, Tx, Ty>

@VanyTang Thanks for you information and welcome to discuss and contribute.

Our goal is mocking python pandas in .NET, transfer python machine learning code into C# in no effort as less as possible.

Oceania2018 avatar Nov 07 '18 17:11 Oceania2018

@VanyTang and one more to mention. :) no matter if sharing codes, ideas, discussions, articles, considerations, links,....

We welcome everybody to share their knowledge. We are dotnet developers, we are open source nerds, we are all just humans and if we really want to make our dotnet framework great in machine learning , our ideas and our wishes come true, so we need every possible suggestion, hint, etc from everybody of you. So please feel always free to post issues and suggestions.

😊

dotChris90 avatar Nov 08 '18 10:11 dotChris90

@dotChris90 I think the dynamic column data type is necessary.

Oceania2018 avatar Nov 08 '18 13:11 Oceania2018

Yeah probably.... But I have no idea how we shall handle this in a clean way.

need some more investigation

dotChris90 avatar Nov 08 '18 21:11 dotChris90