DataChain objects nomenclature
There are a few terms used in DataChain which I think we need to define more clearly and consistently, especially in docs, blog posts, etc. (or maybe I just missed the fact that we have this nomenclature already :-))
DataChain:
- The product name and the Python library :-)
- But also the object we work with in DataChain, basically what elsewhere is usually called a dataframe or a table ... maybe we can call these DataChain tables?
Dataset:
- A dataset is a persisted DataChain? Should we call it a DataChain dataset? I would probably just say a persisted datachain or a persisted table (if we call the instances of DataChain class tables)
Column vs signal:
- we have tables with hierarchical columns and we sometimes call them columns and sometimes signals
- I would just use the word columns everywhere and forget about signals, because everyone knows what table columns are, while signals are more vague.
- We need to be able to clearly differentiate between a column like `file`, which contains a collection of lower-level columns (`file.path`, `file.version`, or even `file.foo.bar`), and single-level columns (e.g. ones created by the users). Pandas has a similar concept with the index, which they call a `MultiIndex` (or a hierarchical index). So we could then perhaps call this a multicolumn vs a column?
- But then we also have the `DataModel` class, which basically corresponds to a group of columns or a subschema (specifying a collection of column names and their types) ... also the built-in `File` class is used that way.
- So what should we call the instances of `DataModel` (and `File`)? If we used `MultiColumn` instead of `DataModel` (that would mean renaming it, which is a bit annoying ... and I know it is not technically a column, but from the user perspective that's how you work with it), then we could just call those all multicolumns (even if there is some ambiguity whether we mean the actual columns or the instance of this class), and we could call `File` instances something like "built-in" multicolumns.
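To make the multicolumn vs column distinction concrete, here is a minimal sketch using only standard-library dataclasses. It does not use the DataChain API: the `File`, `Foo` and `Row` classes and the `leaf_columns` helper are hypothetical stand-ins that only mimic the shape of the schema discussed above (a `file` group containing `path`, `version` and `foo.bar`, next to a single-level, user-created column).

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Foo:
    bar: str


@dataclass
class File:
    # Hypothetical stand-in, NOT DataChain's real File class.
    path: str
    version: str
    foo: Foo


@dataclass
class Row:
    file: File    # a "multicolumn": a group of lower-level columns
    score: float  # a single-level column, e.g. one created by a user


def leaf_columns(schema, prefix=""):
    """Return the dotted names of all leaf columns under a dataclass schema."""
    names = []
    for f in fields(schema):
        name = f"{prefix}{f.name}"
        if is_dataclass(f.type):
            # Nested group: recurse and extend the dotted prefix.
            names.extend(leaf_columns(f.type, prefix=f"{name}."))
        else:
            names.append(name)
    return names


columns = leaf_columns(Row)
print(columns)
```

In this sketch `file` is both a class instance and a prefix shared by several leaf columns, which is exactly the ambiguity a term like multicolumn would have to cover.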
DataChain: I like DataChain tables, certainly on first pass in docs or presentations, to signify that they are special and distinct from dataframes or other tables.
Dataset: I like DataChain dataset and persisted datachain. It gets the term used more and solidifies it in the user's mind.
Column vs. signal: I'm not sure about this one. I need to actually work with it to understand it better. My first impression is that, unlike MultiIndex, multicolumn does not really suggest anything other than the plural of column. Index carries with it the idea of a layer of some kind, so MultiIndex implies something greater is happening.
Closing this for now; we have improved it and can keep improving as we go.