towhee [DesignProposal]: Iterating `df` and `df._iterable` should return the same result

Open Chiiizzzy opened this issue 1 year ago • 3 comments

Background and Motivation

No response

Design

We might need to modify the __repr__ and __iter__ methods of the following classes:

DataFrame
Entity
ChunkedTable
WritableTable

Expected results:

When iterating a row-based DataFrame and its _iterable (a list or a generator of Entity):

[<Entity dict_keys(['a', 'b', ...])>, <Entity dict_keys(['a', 'b', ...])>, ...]

When iterating a column-based DataFrame and its _iterable (a WritableTable):

[<EntityView dict_keys(['a', 'b', ...])>, <EntityView dict_keys(['a', 'b', ...])>, ...]

When iterating a chunked DataFrame and its _iterable (a ChunkedTable consisting of a series of WritableTable objects):

[<EntityView dict_keys(['a', 'b', ...])>, <EntityView dict_keys(['a', 'b', ...])>, ...]
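
One way to get the row-based behavior described above is to have the frame's __iter__ delegate directly to _iterable. The sketch below is only an illustration: MiniEntity and MiniFrame are hypothetical stand-ins written for this comment, not towhee's actual Entity or DataFrame classes.

```python
class MiniEntity:
    """Hypothetical stand-in for towhee's Entity (not the library's class)."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __repr__(self):
        # Mirror the repr format shown in the expected results.
        return f"<Entity {self.__dict__.keys()}>"


class MiniFrame:
    """Hypothetical stand-in for a row-based DataFrame."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        # Delegate to the underlying iterable so that iterating the frame
        # and iterating frame._iterable yield exactly the same objects.
        return iter(self._iterable)


rows = [MiniEntity(a=i, b=i) for i in range(3)]
df = MiniFrame(rows)

# Both iterations walk the same objects in the same order.
same = all(x is y for x, y in zip(df, df._iterable))
```

Note that this only holds for a re-iterable _iterable such as a list. If _iterable is a generator, the first pass consumes it, so the two iterations cannot be compared side by side.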

Jul 07 '22 06:07 Chiiizzzy

What is the expected behavior? I suggest that we have a detailed description of the API definition.

Jul 07 '22 07:07 reiase

Check result:

Row-based DataFrame

>>> from towhee import Entity, DataFrame
>>> e = [Entity(a=a, b=b) for a,b in zip(range(3), range(3))]
>>> df = DataFrame(e)
>>> df.to_list()
[<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>], [<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>])

Column-based DataFrame

>>> df = df.to_column()
>>> df.to_list()
[<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>], [<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>])
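
To make the column-based result easier to picture, here is a minimal sketch of a column store whose iterator yields lightweight row views. It assumes a view is just a (table, offset) pair that looks values up on demand. MiniTable and MiniEntityView are hypothetical names for this illustration and are not towhee's WritableTable or EntityView.

```python
class MiniEntityView:
    """Hypothetical row view: a table reference plus a row offset."""

    def __init__(self, table, offset):
        self._table = table
        self._offset = offset

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        # Private names are never treated as columns, which avoids recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._table.columns[name][self._offset]
        except KeyError:
            raise AttributeError(name) from None

    def __repr__(self):
        return f"<EntityView {self._table.columns.keys()}>"


class MiniTable:
    """Hypothetical column store: one list of values per column name."""

    def __init__(self, columns):
        self.columns = columns

    def __len__(self):
        # All columns are assumed to have the same length.
        return len(next(iter(self.columns.values()), []))

    def __iter__(self):
        # Yield one view per row instead of materializing row objects.
        return (MiniEntityView(self, i) for i in range(len(self)))


table = MiniTable({"a": [0, 1, 2], "b": [0, 1, 2]})
views = list(table)
```

Under this assumption, a frame that delegates __iter__ to such a table would return views from both list(df) and list(df._iterable), which matches the output above.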

Chunked DataFrame

>>> df = DataFrame(e)
>>> df = df.set_chunksize(2)
>>> df.to_list()
[<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>], [<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>])

For a chunked DataFrame, execution uses chunks as the basic computational unit, so we should iterate over df._iterable.chunks() rather than over df itself:

>>> df._iterable.chunks()
[pyarrow.Table
a: int64
b: int64
----
a: [[0,1]]
b: [[0,1]], pyarrow.Table
a: int64
b: int64
----
a: [[2]]
b: [[2]]]
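
The relationship between chunk-level and row-level iteration can be sketched in plain Python. This is an assumption-laden illustration, not towhee's ChunkedTable: iter_chunks is a hypothetical helper, and plain lists stand in for the pyarrow tables shown above.

```python
from itertools import islice


def iter_chunks(rows, chunksize):
    """Hypothetical helper: yield successive lists of at most `chunksize` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


# Three rows with a chunk size of 2, as in the example above.
chunks = list(iter_chunks(range(3), 2))

# Row-level iteration is the chunks flattened in order.
flattened = [row for chunk in chunks for row in chunk]
```

Here chunks has the same shape as the output above (one chunk of two rows, then one chunk of one row), and flattening it recovers the row order that list(df) returns.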

As shown above, all of the behavior now meets our expectations with https://github.com/towhee-io/towhee/pull/1510

Jul 07 '22 09:07 Chiiizzzy

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Close the stale issues and pull requests after 7 days of inactivity. Reopen the issue with /reopen.

Aug 06 '22 11:08 stale[bot]
