etable
Data table structure in Go, now developed at https://github.com/cogentcore/core/tree/main/tensor
etable: DataTable / DataFrame structure in Go
etable (or eTable) provides a DataTable / DataFrame structure in Go (golang), similar to pandas and xarray in Python, and Apache Arrow Table, using etensor n-dimensional columns aligned by common outermost row dimension.
The e-name derives from the emergent neural network simulation framework, but e is also extra-dimensional, extended, electric, easy-to-use -- all good stuff.. :)
See examples/dataproc for a full demo of how to use this system for data analysis, paralleling the example in Python Data Science using pandas, to see directly how that translates into this framework.
See Wiki for how-to documentation, etc. and Cheat Sheet below for quick reference.
As a general convention, it is safest, clearest, and quite fast to access columns by name instead of index (there is a map that caches the column indexes), so the base access method names generally take a column name argument, and those that take a column index have an Idx suffix. In addition, we adopt the GoKi Naming Convention of using the Try suffix for versions that return an error message. It is a bit painful for the writer of these methods but very convenient for the users..
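The naming convention above can be illustrated with a minimal, standard-library-only sketch. The miniTable type and its methods here are hypothetical stand-ins written for this illustration, not the etable implementation: a map caches each column name's index, the Try version returns an error, and the plain version simply returns -1 when the name is missing.

```go
package main

import "fmt"

// miniTable is a hypothetical stand-in for a table. It only tracks column
// names plus a map that caches each name's index.
type miniTable struct {
	colNames []string
	colIdx   map[string]int // name -> index cache
}

// newMiniTable builds the name -> index cache once, up front.
func newMiniTable(names ...string) *miniTable {
	t := &miniTable{colNames: names, colIdx: make(map[string]int, len(names))}
	for i, nm := range names {
		t.colIdx[nm] = i
	}
	return t
}

// ColIdxTry returns the index of the named column, or an error if absent.
func (t *miniTable) ColIdxTry(name string) (int, error) {
	i, ok := t.colIdx[name]
	if !ok {
		return -1, fmt.Errorf("column named %q not found", name)
	}
	return i, nil
}

// ColIdx is the convenience version: it drops the error and returns -1
// for a missing column (an assumption of this sketch).
func (t *miniTable) ColIdx(name string) int {
	i, _ := t.ColIdxTry(name)
	return i
}

func main() {
	t := newMiniTable("Name", "Data")
	fmt.Println(t.ColIdx("Data"))
	if _, err := t.ColIdxTry("Missing"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Writing both versions is the "painful for the writer" part: every accessor exists twice, but callers can choose between terse code and explicit error handling.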
The following packages are included:
- bitslice is a Go slice of bytes []byte that has methods for setting individual bits, as if it was a slice of bools, while being 8x more memory efficient. This is used for encoding null entries in etensor, and as a Tensor of bool / bits there as well, and is generally very useful for binary (boolean) data.
- etensor is a Tensor (n-dimensional array) object. etensor.Tensor is an interface that applies to many different type-specific instances, such as etensor.Float32. A tensor is just an etensor.Shape plus a slice holding the specific data type. Our tensor is based directly on the Apache Arrow project's tensor, and it fully interoperates with it. Arrow tensors are designed to be read-only, and we needed some extra support to make our etable.Table work well, so we had to roll our own. Our tensors also interoperate fully with Gonum's 2D-specific Matrix type for the 2D case.
- etable has the etable.Table DataTable / DataFrame object, which is useful for many different data analysis and database functions, and also for holding patterns to present to a neural network, and logs of output from the models, etc. An etable.Table is just a slice of etensor.Tensor columns that are all aligned along the outermost row dimension. Index-based indirection, which is essential for efficient Sort, Filter, etc., is provided by the etable.IdxView type, which is an indexed view into a Table. All data processing operations are defined on the IdxView.
- eplot provides an interactive 2D plotting GUI in GoGi for Table data, using the gonum plot plotting package. You can select which columns to plot and specify various basic plot parameters.
- etview provides an interactive tabular, spreadsheet-style GUI using GoGi for viewing and editing etable.Table and etensor.Tensor objects. The etview.TensorGrid also provides a colored grid display of higher-dimensional tensor data.
- agg provides standard aggregation functions (Sum, Mean, Var, Std, etc.) operating over etable.IdxView views of Table data. It also defines standard AggFunc functions such as SumFunc, which can be used for Agg functions on either a Tensor or IdxView.
- tsragg provides the same agg functions as in agg, but operating on all the values in a given Tensor. Because of the indexed, row-based nature of tensors in a Table, these are not the same as the agg functions.
- split supports splitting a Table into any number of indexed sub-views and aggregating over those (i.e., pivot tables), grouping, summarizing data, etc.
- metric provides similarity / distance metrics such as Euclidean, Cosine, or Correlation that operate on slices of []float64 or []float32.
- simat provides similarity / distance matrix computation methods operating on etensor.Tensor or etable.Table data. The SimMat type holds the resulting matrix and labels for the rows and columns, and has a special SimMatGrid view in etview for visualizing labeled similarity matrices.
- pca provides principal-components-analysis (PCA) and covariance matrix computation functions.
- clust provides standard agglomerative hierarchical clustering, including the ability to plot results in an eplot.
- minmax is home of the basic Min / Max range struct, and norm has lots of good functions for computing standard norms and normalizing vectors.
- utils has various table-related command-line utility tools, including etcat, which combines multiple table files into one file, including an option for averaging column data.
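To make the bitslice idea from the list above concrete, here is a minimal sketch of packing booleans into a []byte, eight per byte. The bits type and its method names are hypothetical and chosen for this illustration; the real bitslice package may differ in layout and API (for example, in how it tracks the exact number of bits).

```go
package main

import "fmt"

// bits is a hypothetical bit-packed slice: element i lives in bit i%8
// of byte i/8, so n flags need only (n+7)/8 bytes.
type bits []byte

// newBits allocates enough bytes to hold n flags, all initially false.
func newBits(n int) bits { return make(bits, (n+7)/8) }

// Set turns flag i on or off without touching its neighbors.
func (b bits) Set(i int, on bool) {
	mask := byte(1) << uint(i%8)
	if on {
		b[i/8] |= mask
	} else {
		b[i/8] &^= mask // AND NOT clears just this bit
	}
}

// Index reports whether flag i is set.
func (b bits) Index(i int) bool {
	mask := byte(1) << uint(i%8)
	return b[i/8]&mask != 0
}

func main() {
	b := newBits(20) // 20 flags fit in 3 bytes rather than 20
	b.Set(3, true)
	b.Set(17, true)
	b.Set(3, false)
	fmt.Println(len(b), b.Index(3), b.Index(17))
}
```

The same layout works as a null mask: one bit per cell marks whether that cell's value is missing.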
Cheat Sheet
et is the etable pointer variable for examples below:
Table Access
Scalar columns:
val := et.CellFloat("ColName", row)
str := et.CellString("ColName", row)
Tensor (higher-dimensional) columns:
tsr := et.CellTensor("ColName", row) // entire tensor at cell (a row-level SubSpace of column tensor)
val := et.CellTensorFloat1D("ColName", row, cellidx) // cellidx is 1D index into cell tensor
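As a rough picture of what a cell and a 1D cell index mean for a tensor column, the following sketch stores a column as a shape plus a flat row-major slice, with the row as the outermost dimension. The column type and its methods are hypothetical illustrations, not the etensor API.

```go
package main

import "fmt"

// column is a hypothetical tensor column: a shape whose first (outermost)
// dimension is the row, plus a flat row-major slice of values.
type column struct {
	shape []int
	vals  []float64
}

// cellSize is the number of values in one row's cell, i.e. the product
// of all dimensions after the row dimension.
func (c *column) cellSize() int {
	n := 1
	for _, d := range c.shape[1:] {
		n *= d
	}
	return n
}

// cell returns the row's values as a sub-slice that shares memory with
// the column (a row-level subspace, not a copy).
func (c *column) cell(row int) []float64 {
	sz := c.cellSize()
	return c.vals[row*sz : (row+1)*sz]
}

// cellFloat1D reads one value using a flat index within the row's cell.
func (c *column) cellFloat1D(row, cellidx int) float64 {
	return c.vals[row*c.cellSize()+cellidx]
}

func main() {
	// 3 rows, each cell a 2x2 tensor, filled with 0..11
	c := &column{shape: []int{3, 2, 2}, vals: make([]float64, 12)}
	for i := range c.vals {
		c.vals[i] = float64(i)
	}
	fmt.Println(c.cell(1))           // [4 5 6 7]
	fmt.Println(c.cellFloat1D(2, 3)) // 11
}
```

Because rows are the outermost dimension, each cell is a contiguous run of the flat slice, which is what makes row-wise access cheap.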
Set Table Value
et.SetCellFloat("ColName", row, val)
et.SetCellString("ColName", row, str)
Tensor (higher-dimensional) columns:
et.SetCellTensor("ColName", row, tsr) // set entire tensor at cell
et.SetCellTensorFloat1D("ColName", row, cellidx, val) // cellidx is 1D index into cell tensor
Find Value(s) in Column
Returns all rows where value matches given value, in string form (any number will convert to a string)
rows := et.RowsByString("ColName", "value", etable.Contains, etable.IgnoreCase)
Other options are etable.Equals instead of Contains to search for an exact full string, and etable.UseCase if case should be used instead of ignored.
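The matching options can be pictured with a small standard-library sketch. The rowsByString function below is a hypothetical stand-in operating on a plain []string column, with two booleans playing the role of the Contains / Equals and IgnoreCase / UseCase options; it is not the library's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// rowsByString returns the indexes of all rows in col whose string value
// matches val. contains selects substring vs. exact matching, and
// ignoreCase selects case-insensitive vs. case-sensitive comparison.
func rowsByString(col []string, val string, contains, ignoreCase bool) []int {
	if ignoreCase {
		val = strings.ToLower(val)
	}
	var rows []int
	for i, s := range col {
		if ignoreCase {
			s = strings.ToLower(s)
		}
		match := s == val
		if contains {
			match = strings.Contains(s, val)
		}
		if match {
			rows = append(rows, i)
		}
	}
	return rows
}

func main() {
	col := []string{"Alpha", "beta", "ALPHABET", "gamma"}
	fmt.Println(rowsByString(col, "alpha", true, true))  // [0 2]
	fmt.Println(rowsByString(col, "beta", false, false)) // [1]
}
```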
Index Views (Sort, Filter, etc)
The IdxView provides a list of row-wise indexes into a table, and Sorting, Filtering and Splitting all operate on this index view without changing the underlying table data, for maximum efficiency and flexibility.
ix := etable.NewIdxView(et) // new view with all rows
Sort
ix.SortColName("Name", etable.Ascending) // etable.Ascending or etable.Descending
SortedTable := ix.NewTable() // turn an IdxView back into a new Table organized in order of indexes
or:
nmcl := et.ColByName("Name") // nmcl is an etensor of the Name column, cached
ix.Sort(func(t *etable.Table, i, j int) bool {
	return nmcl.StringVal1D(i) < nmcl.StringVal1D(j)
})
Filter
nmcl := et.ColByName("Name") // column we're filtering on
ix.Filter(func(t *etable.Table, row int) bool {
	// filter return value is for what to *keep* (=true), not exclude
	// here we keep any row with a name that contains the string "in"
	return strings.Contains(nmcl.StringVal1D(row), "in")
})
Splits ("pivot tables" etc)
Create a table of mean values of the "Data" column grouped by unique entries in the "Name" column; the resulting table will be called "DataMean":
byNm := split.GroupBy(ix, []string{"Name"}) // column name(s) to group by
split.Agg(byNm, "Data", agg.AggMean) // compute the mean of Data within each group
gps := byNm.AggsToTable(etable.AddAggName) // etable.AddAggName or etable.ColNameOnly for naming cols
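For readers who want to see what this group-and-aggregate step computes, here is a plain-Go sketch on two parallel slices. The groupMean function is a hypothetical illustration of the idea (split row indexes by name, then average within each split), not how the split and agg packages are implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// groupMean splits rows by their name and returns the unique names
// (sorted) together with the mean of data within each group.
func groupMean(names []string, data []float64) ([]string, []float64) {
	groups := map[string][]int{} // name -> row indexes (one split per name)
	for row, nm := range names {
		groups[nm] = append(groups[nm], row)
	}
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	means := make([]float64, len(keys))
	for i, k := range keys {
		sum := 0.0
		for _, row := range groups[k] {
			sum += data[row]
		}
		means[i] = sum / float64(len(groups[k]))
	}
	return keys, means
}

func main() {
	names := []string{"a", "b", "a", "b", "c"}
	data := []float64{1, 2, 3, 6, 5}
	keys, means := groupMean(names, data)
	fmt.Println(keys, means) // [a b c] [2 4 5]
}
```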