towhee
[DesignProposal]: Iterating `df` and `df._iterable` should return the same result
Background and Motivation
No response
Design
We might need to modify the `__repr__` and `__iter__` methods of the following classes:
- DataFrame
- Entity
- ChunkedTable
- WritableTable
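To make the `__repr__` part concrete, here is a minimal sketch of a representation that lists only the field names, in the same shape as the expected output in this proposal. `SketchEntity` is a hypothetical stand-in written for illustration; it is not towhee's `Entity` implementation, and the real class may build its repr differently.

```python
class SketchEntity:
    """Hypothetical stand-in for towhee's Entity; not the library's code."""

    def __init__(self, **kwargs):
        # Store each keyword argument as an attribute, e.g. SketchEntity(a=1, b=2).
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __repr__(self):
        # Show only the field names, in the '<Name dict_keys([...])>' form.
        return f"<{type(self).__name__} {self.__dict__.keys()}>"


entity = SketchEntity(a=0, b=0)
print(repr(entity))
```

Deriving the label from `type(self).__name__` means a subclass or view type would print its own name without overriding `__repr__`.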
Expected results:
- When iterating a row-based `DataFrame` and its `_iterable` (a list or a generator of `Entity`):
[<Entity dict_keys(['a', 'b', ...])>, <Entity dict_keys(['a', 'b', ...])>, ...]
- When iterating a column-based `DataFrame` and its `_iterable` (a `WritableTable`):
[<EntityView dict_keys(['a', 'b', ...])>, <EntityView dict_keys(['a', 'b', ...])>, ...]
- When iterating a chunked `DataFrame` and its `_iterable` (a `ChunkedTable` consisting of a series of `WritableTable`):
[<EntityView dict_keys(['a', 'b', ...])>, <EntityView dict_keys(['a', 'b', ...])>, ...]
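One way to get the behavior listed above is for the frame's `__iter__` to delegate directly to its underlying container, so both iterations go through the same code path. The sketch below assumes that approach; `SketchFrame`, `ColumnTable`, and `RowView` are hypothetical names invented for this example (loosely analogous to `DataFrame`, `WritableTable`, and `EntityView`) and do not reflect how towhee actually implements them.

```python
class RowView:
    """Hypothetical lightweight view of one row inside column storage."""

    def __init__(self, columns, index):
        self._columns = columns
        self._index = index

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; read from the columns.
        columns = self.__dict__["_columns"]
        if name not in columns:
            raise AttributeError(name)
        return columns[name][self.__dict__["_index"]]


class ColumnTable:
    """Hypothetical column-based container: a dict of equally long lists."""

    def __init__(self, columns):
        self._columns = columns

    def __iter__(self):
        # Yield one row view per index instead of yielding whole columns.
        length = len(next(iter(self._columns.values()), []))
        return (RowView(self._columns, i) for i in range(length))


class SketchFrame:
    """Hypothetical frame whose iteration is defined by its container."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        # Delegate, so iterating the frame and its _iterable cannot diverge.
        return iter(self._iterable)


frame = SketchFrame(ColumnTable({"a": [0, 1, 2], "b": [0, 1, 2]}))
from_frame = [(row.a, row.b) for row in frame]
from_iterable = [(row.a, row.b) for row in frame._iterable]
print(from_frame == from_iterable)
```

Note that if `_iterable` is a generator, the frame and its `_iterable` share one stream, so consuming either one also consumes the other.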
What is the expected behavior? I suggest that we have a detailed description of the API definition.
Check result:
- Row-based DataFrame
>>> from towhee import Entity, DataFrame
>>> e = [Entity(a=a, b=b) for a,b in zip(range(3), range(3))]
>>> df = DataFrame(e)
>>> df.to_list()
[<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>], [<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>])
- Column-based DataFrame
>>> df = df.to_column()
>>> df.to_list()
[<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>], [<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>])
- Chunked DataFrame
>>> df = DataFrame(e)
>>> df = df.set_chunksize(2)
>>> df.to_list()
[<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>], [<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>])
For a chunked DataFrame, chunks are the basic computational unit during execution, so we should iterate over `df._iterable.chunks()` instead of `df` itself:
>>> df._iterable.chunks()
[pyarrow.Table
a: int64
b: int64
----
a: [[0,1]]
b: [[0,1]], pyarrow.Table
a: int64
b: int64
----
a: [[2]]
b: [[2]]]
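The idea of treating chunks as the unit of computation can be sketched in plain Python, without pyarrow. The helpers `chunks` and `apply_per_chunk` below are hypothetical and written only for this illustration; they mimic the chunk boundaries shown above for a chunk size of 2 but are not towhee's `ChunkedTable` API.

```python
def chunks(columns, chunksize):
    """Yield consecutive column slices of at most `chunksize` rows each."""
    length = len(next(iter(columns.values()), []))
    for start in range(0, length, chunksize):
        yield {
            name: values[start:start + chunksize]
            for name, values in columns.items()
        }


def apply_per_chunk(columns, chunksize, fn):
    """Run `fn` once per chunk and concatenate the per-chunk results."""
    results = []
    for chunk in chunks(columns, chunksize):
        results.extend(fn(chunk))
    return results


data = {"a": [0, 1, 2], "b": [0, 1, 2]}
chunk_list = list(chunks(data, 2))

# A column-wise operation that sees a whole chunk at a time, not single rows.
sums = apply_per_chunk(
    data, 2, lambda c: [a + b for a, b in zip(c["a"], c["b"])]
)
print(chunk_list)
print(sums)
```

Row-by-row iteration stays available for users, while an executor built this way would call the operator once per chunk.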
As shown, all of the behaviors meet our expectations with https://github.com/towhee-io/towhee/pull/1510