Select columns from DynamicTables using slice-indexing by column name
Feature Request
Select a column from a DynamicTable using slice-indexing by column name (e.g. table['column_name']).
Problem
Currently, you have to use list-copy notation (e.g. table['column_name'][:]), which is non-intuitive for such a common operation. Slicing by column name alone returns a VectorIndex or VectorData object, which the typical end user will likely not need to think about.
Example
- Build a simple DynamicTable with both indexed and non-indexed columns.
import pynwb
dt = pynwb.core.DynamicTable(name='test', description='a table with both indexed and non-indexed columns')
dt.add_column(name='indexed_col', description='an indexed column', index=True)
dt.add_column(name='non_indexed_col', description='a non-indexed column') # index defaults to False
dt.add_row(indexed_col=[1, 2, 3, 4, 5], non_indexed_col='hello')
dt.add_row(indexed_col=[6, 7, 8, 9, 10], non_indexed_col='world')
- Slicing into this VectorIndex/VectorData object returns the data from an individual row.
dt['indexed_col'][0]
>> [1, 2, 3, 4, 5]
dt['non_indexed_col'][0]
>> 'hello'
- Currently, if the user wants to get the entire column they have to use the list-copy syntax.
dt['indexed_col'][:]
>> [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
dt['non_indexed_col'][:]
>> ['hello', 'world']
- Whereas, selecting a column using slice notation returns a VectorIndex object, or a VectorData object if the column is not indexed.
dt['indexed_col']
>> indexed_col_index <class 'pynwb.core.VectorIndex'>
dt['non_indexed_col']
>> non_indexed_col <class 'pynwb.core.VectorData'>
Proposed solution
It would be nice if selecting a column by name using slice notation returned the column's data instead of a VectorIndex/VectorData object. That way, working with DynamicTables would feel more intuitive and Pythonic to the end user.
dt['indexed_col']
>> [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
dt['non_indexed_col']
>> ['hello', 'world']
Alternatively, we could just convert all DynamicTables into Pandas DataFrames before beginning analysis, but that seems like it could lead to a lot of boilerplate.
@tjd2002
I agree! This 'leakage' of the DynamicTable's underlying VectorData internals was confusing to me, especially compared to DynamicTable columns that are not of type VectorData (these just behave like columns of a table).
Only downside I can see is that I suppose there could be some performance penalty for constructing the return value for very large VectorData objects.
@angevineMiller I agree that the usability is not great here, but I also agree with the concern @tjd2002 raised: if this means we read the entire column into memory for e.g. table['column_name'][0], it would be too much of a performance hit. I like using [:] to read all data because it has precedent in h5py. Maybe we can change the behavior of table['column_name'] to something more user-friendly that doesn't read all the data. Maybe an iterator object?
If this means we read the entire column into memory for e.g. table['column_name'][0], it would be too much of a performance hit
I don't think there's any reason we'd need to read the whole column in just to access one row. Indexing into a single 'cell' already works well, and should continue to use the VectorData/VectorIndex lookup machinery.
If the list comprehension for the whole column's worth of data is only run when the user explicitly requests the entire column of data, then I think that is reasonable.
As the examples above show, when you request either a complete row or a single row/column entry from the table, you currently get the actual data back in a list, no matter whether that column is stored as indexed VectorData, unindexed VectorData, or a regular column type under the hood. However, this breaks down when you index using just the column name: you get a VectorIndex object, a VectorData object, or the raw data column, respectively.
I think this is just a rough edge that needs to be rounded off.
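To make the single-cell argument concrete, here is a minimal sketch of a ragged column stored as flat data plus row end offsets, which is how I understand the VectorData/VectorIndex pairing to work. ToyVectorData, ToyVectorIndex, and the reads counter are hypothetical stand-ins written for this comment, not pynwb classes.

```python
class ToyVectorData:
    """Flat storage for a column (hypothetical stand-in for a dataset on disk)."""

    def __init__(self, values):
        self.values = list(values)
        self.reads = 0  # number of elements touched so far

    def read(self, start, stop):
        self.reads += stop - start
        return self.values[start:stop]


class ToyVectorIndex:
    """Row boundaries into a ToyVectorData: ends[i] is where row i stops."""

    def __init__(self, target, ends):
        self.target = target
        self.ends = list(ends)

    def __len__(self):
        return len(self.ends)

    def get_row(self, i):
        if not 0 <= i < len(self.ends):
            raise IndexError(i)
        start = self.ends[i - 1] if i > 0 else 0
        # Only the elements belonging to row i are read.
        return self.target.read(start, self.ends[i])

    def get_all(self):
        # The whole-column read happens only when explicitly requested.
        return [self.get_row(i) for i in range(len(self))]


data = ToyVectorData([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
index = ToyVectorIndex(data, ends=[5, 10])

first_row = index.get_row(0)
reads_after_one_row = data.reads
whole_column = index.get_all()
reads_after_full_column = data.reads
```

In this toy model, fetching one row touches only that row's elements, and the full-column cost is paid only by the explicit get_all() call.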
I agree with the concern over reading a huge column into memory, especially if that happened every time you indexed a single element of a column.
We can also use dt[:, 'column_name'] to access whole columns, which is syntactically similar to the Pandas-style table.loc[:, 'column_name']. This approach seems intuitive to me, and it naturally extends to getting single elements with dt[0, 'column_name'].
Under the hood, I think this approach is doing the same thing as dt['column_name'][0], but it doesn't involve a step that reveals the VectorData/VectorIndex internals to the user, who might think that removing the outer index will return the column.
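For illustration, here is a rough sketch of how (row, column) tuple keys can be dispatched in __getitem__. ToyTable is a hypothetical class backed by plain lists, not the DynamicTable implementation; in a real table the per-column store would be the VectorData/VectorIndex object.

```python
class ToyTable:
    """Hypothetical table: maps column name -> list of per-row values."""

    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, key):
        if isinstance(key, tuple):
            # dt[rows, 'name']: rows may be an int or a slice.
            rows, name = key
            return self.columns[name][rows]
        if isinstance(key, str):
            # dt['name']: return the column store itself, unchanged.
            return self.columns[key]
        raise TypeError(f"unsupported key: {key!r}")


dt = ToyTable({
    'indexed_col': [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
    'non_indexed_col': ['hello', 'world'],
})

whole_column = dt[:, 'indexed_col']
single_cell = dt[0, 'non_indexed_col']
second_row = dt[1, 'indexed_col']
```

Python passes dt[:, 'indexed_col'] to __getitem__ as the tuple (slice(None), 'indexed_col'), so the row selector and column name arrive together and the intermediate column object never has to be shown to the user.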
Oh, now I get Ben’s concern. Sorry I missed that. Yes, returning something like an iterable seems like a good solution.
Yes, I am happy with dt[:, 'column_name'], and I like that it matches pandas style. In the end I think dt['column_name'] for indexed vectors should probably return an object that signals to the user that it's a lazy read object, like h5py does. Even better if we can make that an iterator.
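One possible shape for such a lazy, iterable return value is sketched below. LazyColumn and its read_row callback are hypothetical names invented for this example; nothing here is an existing pynwb or h5py API, and the callback stands in for whatever per-row lookup the table already performs.

```python
class LazyColumn:
    """Hypothetical lazy view of one column; reads rows only on demand."""

    def __init__(self, name, n_rows, read_row):
        self.name = name
        self._n_rows = n_rows
        self._read_row = read_row  # callable: row number -> row data

    def __len__(self):
        return self._n_rows

    def __getitem__(self, key):
        if isinstance(key, slice):
            # col[:] reads every selected row, like [:] on an h5py dataset.
            return [self._read_row(i) for i in range(*key.indices(self._n_rows))]
        if key < 0:
            key += self._n_rows
        if not 0 <= key < self._n_rows:
            raise IndexError(key)
        return self._read_row(key)

    def __iter__(self):
        # Yield one row at a time so the full column is never held at once.
        for i in range(self._n_rows):
            yield self._read_row(i)

    def __repr__(self):
        return (f"<LazyColumn '{self.name}': {self._n_rows} rows, "
                "not loaded; use [:] to read all>")


rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
calls = []  # records which rows were actually read


def read_row(i):
    calls.append(i)
    return rows[i]


col = LazyColumn('indexed_col', len(rows), read_row)
description = repr(col)
calls_before_access = list(calls)
first = col[0]
everything = col[:]
streamed = [row for row in col]
```

The repr tells the user that nothing has been loaded yet, and no row is read until the object is indexed, sliced, or iterated.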