
[DesignProposal]: Iterating `df` and `df._iterable` should return the same result

Open Chiiizzzy opened this issue 1 year ago • 3 comments

Background and Motivation

No response

Design

We might need to modify the __repr__ and __iter__ methods of the following classes:

  • DataFrame
  • Entity
  • ChunkedTable
  • WritableTable

Expected results:

  • When iterating a row-based DataFrame and its _iterable (a list or a generator of Entity):
[<Entity dict_keys(['a', 'b', ...])>, <Entity dict_keys(['a', 'b', ...])>, ...]
  • When iterating a column-based DataFrame and its _iterable (a WritableTable):
[<EntityView dict_keys(['a', 'b', ...])>, <EntityView dict_keys(['a', 'b', ...])>, ...]
  • When iterating a chunked DataFrame and its _iterable (a ChunkedTable consisting of a series of WritableTable objects):
[<EntityView dict_keys(['a', 'b', ...])>, <EntityView dict_keys(['a', 'b', ...])>, ...]
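
To make the expectation concrete, here is a minimal sketch of the invariant in plain Python. `MiniEntity` and `MiniFrame` are hypothetical stand-ins invented for illustration; they are not towhee's actual `Entity` or `DataFrame`, and the real classes may be implemented quite differently. The idea is only that `__iter__` delegates to `_iterable`, so both yield the same objects.

```python
class MiniEntity:
    """Hypothetical stand-in for an Entity; not towhee's real class."""

    def __init__(self, **kwargs):
        # Store each keyword argument as an attribute.
        self.__dict__.update(kwargs)

    def __repr__(self):
        # Show only the field names, mirroring the format proposed above.
        return f"<MiniEntity {self.__dict__.keys()}>"


class MiniFrame:
    """Hypothetical stand-in for a row-based DataFrame."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        # Delegate to the underlying container so that iterating the
        # frame and iterating frame._iterable give the same items.
        return iter(self._iterable)


rows = [MiniEntity(a=i, b=i) for i in range(3)]
frame = MiniFrame(rows)

# Pairwise identity check between the two iteration paths.
same = [x is y for x, y in zip(frame, frame._iterable)]
print(all(same), repr(rows[0]))
```

Note that this sketch uses a list for `_iterable`. With a generator, the second pass would find it already exhausted, which is one reason the API definition probably needs to say how generator-backed frames should behave.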

Chiiizzzy avatar Jul 07 '22 06:07 Chiiizzzy

What is the expected behavior? I suggest that we have a detailed description of the API definition.

reiase avatar Jul 07 '22 07:07 reiase

Check result:

  • Row-based DataFrame
>>> from towhee import Entity, DataFrame
>>> e = [Entity(a=a, b=b) for a,b in zip(range(3), range(3))]
>>> df = DataFrame(e)
>>> df.to_list()
[<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>], [<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>])
  • Column-based DataFrame
>>> df = df.to_column()
>>> df.to_list()
[<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>], [<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>])
  • Chunked DataFrame
>>> df = DataFrame(e)
>>> df = df.set_chunksize(2)
>>> df.to_list()
[<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>]
>>> list(df), list(df._iterable)
([<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>], [<EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>, <EntityView dict_keys(['a', 'b'])>])

For a chunked DataFrame, execution uses chunks as the basic computational unit, so we should iterate over df._iterable.chunks() instead of df itself:

>>> df._iterable.chunks()
[pyarrow.Table
a: int64
b: int64
----
a: [[0,1]]
b: [[0,1]], pyarrow.Table
a: int64
b: int64
----
a: [[2]]
b: [[2]]]
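
The difference between row-wise iteration and chunk-wise iteration can be sketched in plain Python. `MiniChunkedTable` is a hypothetical class written for this example; it uses lists of dicts rather than pyarrow tables and is not towhee's `ChunkedTable`. It assumes a chunk size of 2 over three rows, matching the session above.

```python
class MiniChunkedTable:
    """Hypothetical stand-in for a chunked table; not towhee's real class."""

    def __init__(self, rows, chunksize):
        # Split the rows into consecutive chunks of at most `chunksize` items.
        self._chunks = [
            rows[i:i + chunksize] for i in range(0, len(rows), chunksize)
        ]

    def chunks(self):
        # Chunk-wise view: the unit an executor would operate on.
        return self._chunks

    def __iter__(self):
        # Row-wise view: flatten the chunks back into individual rows.
        for chunk in self._chunks:
            yield from chunk


rows = [{"a": i, "b": i} for i in range(3)]
table = MiniChunkedTable(rows, chunksize=2)

chunk_sizes = [len(chunk) for chunk in table.chunks()]
flat = list(table)
print(chunk_sizes, flat == rows)
```

In this sketch, iterating the table yields the same rows as an unchunked container would, while `chunks()` exposes the batches of two and one rows for batch-oriented execution.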

As shown, all of this behavior now matches our expectations with https://github.com/towhee-io/towhee/pull/1510

Chiiizzzy avatar Jul 07 '22 09:07 Chiiizzzy

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Close the stale issues and pull requests after 7 days of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Aug 06 '22 11:08 stale[bot]