
DataChain objects nomenclature

Open tibor-mach opened this issue 1 year ago • 1 comments

There are a few terms used in DataChain which I think we need to define more clearly/consistently, especially in docs and any blog posts etc. (or maybe I just missed the fact that we already have this nomenclature :-))

DataChain:

  • the product name and the python library :-)
  • But also the object we work with in DataChain, basically what elsewhere is usually called a dataframe or a table ... maybe we can call these DataChain tables?

Dataset:

  • A dataset is a persisted DataChain? Should we call it a DataChain dataset? I would probably just say a persisted datachain or a persisted table (if we call the instances of DataChain class tables)

Column vs signal:

  • we have tables with hierarchical columns and we sometimes call them columns and sometimes signals
  • I would just use the word columns everywhere and forget about signals, because everyone knows what table columns are, while signals are more vague.
  • We need to be able to clearly differentiate between a column like file, which contains a collection of lower-level columns (file.path, file.version, or even file.foo.bar), and single-level columns (e.g. ones created by the users). Pandas has a similar concept with the index, and they call it a MultiIndex (or a hierarchical index). So we could perhaps call this a multicolumn vs a column?
  • but then we also have the DataModel class which basically corresponds to a group of columns or a subschema (specifying a collection of column names and their types) ... also the built-in File class is used that way
  • So what should we call the instances of DataModel (and File)? If we used MultiColumn instead of DataModel (that would mean renaming it which is a bit annoying ... and I know it is not technically a column, but from the user perspective that's how you work with it) then we could just call those all multicolumns (even if there is some ambiguity whether we mean the actual columns or the instance of this class) and we could call File instances something like "built-in" multicolumns.
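
To make the column vs multicolumn distinction concrete, here is a minimal sketch in plain Python. It does not use the DataChain API: the `File`, `Foo`, and `Row` dataclasses and the `leaf_columns` and `multicolumns` helpers are hypothetical stand-ins, and the only assumption taken from the discussion above is that nested columns are addressed with dotted names such as file.path or file.foo.bar.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Foo:
    # Hypothetical nested group, only here to produce file.foo.bar.
    bar: int


@dataclass
class File:
    # Hypothetical stand-in for a built-in group of columns.
    # This is NOT DataChain's File class.
    path: str
    version: str
    foo: Foo


@dataclass
class Row:
    file: File   # a "multicolumn": a group of lower-level columns
    label: str   # a single-level column, e.g. one created by a user


def leaf_columns(model, prefix=""):
    """Return the dotted names of all leaf (single-level) columns."""
    names = []
    for f in fields(model):
        name = f"{prefix}{f.name}"
        if is_dataclass(f.type):
            # Nested group: recurse and extend the dotted prefix.
            names.extend(leaf_columns(f.type, prefix=name + "."))
        else:
            names.append(name)
    return names


def multicolumns(model):
    """Return the top-level names that are groups of columns."""
    return [f.name for f in fields(model) if is_dataclass(f.type)]


print(leaf_columns(Row))   # ['file.path', 'file.version', 'file.foo.bar', 'label']
print(multicolumns(Row))   # ['file']
```

In this sketch `file` is the only name a user would need a separate word for (multicolumn, group, subschema), while `label` and every dotted leaf are ordinary columns.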

tibor-mach, Oct 22 '24 11:10

DataChain: I like DataChain tables. Certainly on first pass in docs or presentations, to signify that they are special and distinct from dataframes or other tables.

Dataset: I like DataChain dataset and persisted datachain. They get the term used more, which solidifies it in the user's mind.

Column vs. signal: I'm not sure on this. I need to actually work with it to understand it better. My first impression is that, unlike Multi-index, Multicolumn does not really suggest anything different from the plural of column. Index carries a different connotation, that of a layer of some kind, so Multi-index implies something greater is happening.

jendefig, Oct 22 '24 15:10

Closing this for now. We have improved it and can keep improving as we go.

shcheklein, Sep 10 '25 22:09