Ability to manipulate columns and fields

Open VitalyFedyunin opened this issue 1 year ago • 8 comments

🚀 The feature

[x] Add ability to drop specific column / field

For list and tuple

list(dp)  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
dp = dp.drop(1)
list(dp)  # [(0, 2), (3, 5), (6, 8)]

Similar for dict types

list(dp)  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
dp = dp.drop('a')
list(dp)  # [{'b': 2}, {'b': 4}, {'b': 6}]
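
For reference, the intended `drop` semantics can be sketched in plain Python, independent of any datapipe class. `drop_fields` is a hypothetical helper name used only for illustration; it is not an existing torchdata API.

```python
from typing import Any, Hashable, Iterable, Iterator


def drop_fields(source: Iterable[Any], *keys: Hashable) -> Iterator[Any]:
    """Yield each element of `source` with the given positions or keys removed.

    Hypothetical helper: a sketch of the proposed behavior, not library code.
    """
    to_drop = set(keys)
    for item in source:
        if isinstance(item, dict):
            # For dicts, `keys` are dictionary keys.
            yield {k: v for k, v in item.items() if k not in to_drop}
        elif isinstance(item, (list, tuple)):
            # For sequences, `keys` are integer positions.
            kept = [v for i, v in enumerate(item) if i not in to_drop]
            # Preserve the container type (list stays list, tuple stays tuple).
            yield type(item)(kept)
        else:
            raise TypeError(f"cannot drop fields from {type(item).__name__}")


rows = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
dropped_rows = list(drop_fields(rows, 1))

records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]
dropped_records = list(drop_fields(records, "a"))
```

Both calls reproduce the outputs listed in the examples above.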

[ ] Add ability to slice fields

For list and tuple

list(dp)  # [(0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8)]
dp = dp[1:-1]
list(dp)  # [(1, 2, 0, 1), (4, 5, 3, 4), (7, 8, 6, 7)]

For dict

list(dp)  # [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 3, 'b': 4, 'c': 3, 'd': 4}, {'a': 5, 'b': 6, 'c': 3, 'd': 4}]
dp = dp['a', 'b']
list(dp)  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
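
A rough sketch of what per-element slicing would compute, written against plain iterables. `slice_fields` is a hypothetical name; how this would actually hook into `__getitem__` on a datapipe is left open here.

```python
from typing import Any, Iterable, Iterator


def slice_fields(source: Iterable[Any], selector: Any) -> Iterator[Any]:
    """Apply `selector` to every element of `source`.

    Hypothetical helper. For dict elements, `selector` is a collection of keys
    to keep; for list/tuple elements, it is a `slice` object.
    """
    for item in source:
        if isinstance(item, dict):
            yield {k: item[k] for k in selector}
        else:
            yield item[selector]


rows = [(0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8)]
sliced_rows = list(slice_fields(rows, slice(1, -1)))

records = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 3, "b": 4, "c": 3, "d": 4},
    {"a": 5, "b": 6, "c": 3, "d": 4},
]
sliced_records = list(slice_fields(records, ("a", "b")))
```

Note that `dp['a', 'b']` reaches `__getitem__` as the tuple `('a', 'b')`, and `dp[1:-1]` as `slice(1, -1)`, which is why the sketch takes those two selector shapes.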

[ ] Add ability to flatten structures

list(dp)  # [(1, (2, 3), 4), (5, (6, 7), 8)]
dp = dp.flatten(1)
list(dp)  # [(1, 2, 3, 4), (5, 6, 7, 8)]

Note: A no-argument flatten() should cover the use case from #648.

Note: Raise an exception if the structures have different lengths.

list(dp)  # [{'a': 1, 'b': {'e': 2, 'f': 3}, 'c': 4}, {'a': 1, 'b': {'e': 2, 'f': 3}, 'c': 4}]
dp = dp.flatten('b')
list(dp)  # [{'a': 1, 'e': 2, 'f': 3, 'c': 4}, {'a': 1, 'e': 2, 'f': 3, 'c': 4}]

Note: Raise an exception if keys overlap.
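
A minimal sketch of single-field flattening for both element types, including the key-overlap error. `flatten_field` is a hypothetical name, and the choice of `ValueError` is an assumption; the no-argument form and the length check from the notes above are not covered, since their exact semantics are not pinned down in this issue.

```python
from typing import Any, Iterable, Iterator


def flatten_field(source: Iterable[Any], key: Any) -> Iterator[Any]:
    """Inline the nested structure stored at `key` into each element.

    Hypothetical helper sketching the proposed behavior.
    """
    for item in source:
        if isinstance(item, dict):
            nested = item[key]
            # Keys of the nested dict must not collide with the remaining outer keys.
            overlap = set(nested) & (set(item) - {key})
            if overlap:
                raise ValueError(f"overlapping keys when flattening: {sorted(overlap)}")
            flat = {}
            for k, v in item.items():
                if k == key:
                    flat.update(nested)  # nested keys take the place of `key`
                else:
                    flat[k] = v
            yield flat
        else:
            # list/tuple: splice the nested sequence in at position `key`.
            yield type(item)([*item[:key], *item[key], *item[key + 1:]])


nested_rows = [(1, (2, 3), 4), (5, (6, 7), 8)]
flat_rows = list(flatten_field(nested_rows, 1))

nested_records = [{"a": 1, "b": {"e": 2, "f": 3}, "c": 4}]
flat_records = list(flatten_field(nested_records, "b"))
```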

CC @NivekT @ejguan

Jul 14 '22 20:07 VitalyFedyunin

Each task can be separated as easy first issue PR.

Jul 26 '22 15:07 VitalyFedyunin

CC @dbish

Jul 26 '22 15:07 VitalyFedyunin

Slicing is the hardest one on the list.

Jul 26 '22 15:07 VitalyFedyunin

If we want to support slicing, don't you think indexing would be an additional feature?

For list and tuple

Single indexing

list(dp)  # [(0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8)]
dp = dp[1]
list(dp)  # [1, 4, 7]

Advanced indexing

list(dp)  # [(0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8)]
dp = dp[[1, 3]]
list(dp)  # [(1, 0), (4, 3), (7, 6)]
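
The two indexing modes could be sketched like this on plain iterables. `index_fields` is a hypothetical name used only to illustrate the expected results, not a proposed signature.

```python
from typing import Any, Iterable, Iterator, Sequence, Union


def index_fields(
    source: Iterable[Sequence[Any]], index: Union[int, Sequence[int]]
) -> Iterator[Any]:
    """Pick one position (int) or several positions (list of ints) per element.

    Hypothetical helper sketching the proposed behavior.
    """
    for item in source:
        if isinstance(index, int):
            # Single indexing unwraps the element to a bare value.
            yield item[index]
        else:
            # Advanced indexing keeps the container type and the requested order.
            yield type(item)(item[i] for i in index)


rows = [(0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8)]
single = list(index_fields(rows, 1))
advanced = list(index_fields(rows, [1, 3]))
```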

Jul 26 '22 16:07 ejguan

Agree that indexing could be a subclass of slicing.

Jul 26 '22 22:07 VitalyFedyunin

@VitalyFedyunin and @ejguan, a few questions about the requirements for slicing and indexing. Slicing/indexing at the datapipe level is already implemented for map datapipes through __getitem__(), so there is already expected behavior for this syntax. That behavior differs from the requirements here: it returns the element or elements at the referenced locations, as if you were indexing a list, rather than acting as a filter that applies the index to each element of a datapipe of iterables.

Should we override that functionality and give it the filter-like behavior described here, or should we (my recommendation) keep it and add a new datapipe that filters each element in the slicing or indexing way described? Then instead of dp[1:2] you could do something like dp.filter([1:2]) and get back a datapipe where this is run on each element. We could also make sure this works only for iterators that don't implement __getitem__() today, but then the functionality might be confusing if it does one thing on one set of datapipes and something else on another.
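
To make the "separate datapipe" option concrete, here is a minimal sketch using a plain iterable wrapper rather than a real datapipe class. `PerElementSlicer` is a hypothetical name. Since `[1:2]` cannot be written as a bare argument in Python, the sketch takes a `slice` object instead.

```python
from typing import Any, Iterable, Iterator, Sequence


class PerElementSlicer:
    """Wrap an iterable of sequences and apply one slice to every element.

    Hypothetical class: it leaves the wrapped object's own `__getitem__`
    untouched, so element-level indexing on the source keeps its meaning.
    """

    def __init__(self, source: Iterable[Sequence[Any]], selector: slice) -> None:
        self.source = source
        self.selector = selector

    def __iter__(self) -> Iterator[Sequence[Any]]:
        for item in self.source:
            yield item[self.selector]


rows = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
sliced = PerElementSlicer(rows, slice(1, 2))
per_element = list(sliced)
```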

Thoughts?

Aug 10 '22 22:08 dbish

> should we (my recommendation) keep that and add a new datapipe that filters each element in the slicing or indexing way described.

It sounds like a reasonable design to me.

Even for MapDataPipe, users might expect two different behaviors when passing slices into the __getitem__ function:

1. Get the corresponding elements defined by the slice.
2. Select the slice from each element of the MapDataPipe.
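
The two readings give different results for the same slice. In this small illustration a plain list stands in for a map-style datapipe (an assumption made to keep the snippet self-contained):

```python
# A plain list stands in for a map-style datapipe here (illustrative assumption).
rows = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]

# Behavior 1: the slice selects whole elements of the container.
element_level = rows[0:2]

# Behavior 2: the slice is applied inside every element.
field_level = [row[0:2] for row in rows]
```

Here `element_level` is `[(0, 1, 2), (3, 4, 5)]`, while `field_level` is `[(0, 1), (3, 4), (6, 7)]`.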

Aug 11 '22 01:08 ejguan

This was accidentally closed automatically when part 1 landed.

Aug 12 '22 16:08 dbish
