Ability to manipulate columns and fields
🚀 The feature
- [x] Add ability to drop specific column / field
For list and tuple:
list(dp) # [ (0, 1, 2), (3, 4, 5), (6, 7, 8) ]
dp = dp.drop(1)
list(dp) # [ (0, 2), (3, 5), (6, 8) ]
Similarly for dict types:
list(dp) # [ {'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6} ]
dp = dp.drop('a')
list(dp) # [ {'b': 2}, {'b': 4}, {'b': 6} ]
- [ ] Add ability to slice fields
For list and tuple:
list(dp) # [ (0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8) ]
dp = dp[1:-1]
list(dp) # [ (1, 2, 0, 1), (4, 5, 3, 4), (7, 8, 6, 7) ]
For dict:
list(dp) # [ {'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 3, 'b': 4, 'c': 3, 'd': 4}, {'a': 5, 'b': 6, 'c': 3, 'd': 4} ]
dp = dp['a', 'b']
list(dp) # [ {'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6} ]
- [ ] Add ability to flatten structures
list(dp) # [ (1, (2, 3), 4), (5, (6, 7), 8) ]
dp = dp.flatten(1)
list(dp) # [ (1, 2, 3, 4), (5, 6, 7, 8) ]
Note: flatten() with no arguments should cover the use case from #648.
Note: Raise an exception if the structures have different lengths.
list(dp) # [ {'a': 1, 'b': {'e': 2, 'f': 3}, 'c': 4}, {'a': 1, 'b': {'e': 2, 'f': 3}, 'c': 4} ]
dp = dp.flatten('b')
list(dp) # [ {'a': 1, 'e': 2, 'f': 3, 'c': 4}, {'a': 1, 'e': 2, 'f': 3, 'c': 4} ]
Note: Raise an exception if keys overlap.
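To make the intended per-element semantics concrete, here is a minimal sketch of drop and flatten written against plain Python iterables. The helper names `drop_field` and `flatten_field` are hypothetical and are not part of the torchdata API; the sketch assumes non-negative positional indices and only handles list, tuple, and dict elements.

```python
from typing import Any, Iterable, Iterator


def drop_field(rows: Iterable[Any], field: Any) -> Iterator[Any]:
    """Yield each row without the given position (list/tuple) or key (dict)."""
    for row in rows:
        if isinstance(row, dict):
            yield {k: v for k, v in row.items() if k != field}
        else:
            # Rebuild the same container type (list or tuple) without the field.
            yield type(row)(v for i, v in enumerate(row) if i != field)


def flatten_field(rows: Iterable[Any], field: Any) -> Iterator[Any]:
    """Yield each row with the nested structure at `field` merged in place."""
    for row in rows:
        if isinstance(row, dict):
            nested = row[field]
            # Keys of the nested dict must not collide with the remaining keys.
            overlap = set(nested) & (set(row) - {field})
            if overlap:
                raise ValueError(f"overlapping keys: {sorted(overlap)}")
            flat = {}
            for key, value in row.items():
                if key == field:
                    flat.update(nested)  # splice nested items at this position
                else:
                    flat[key] = value
            yield flat
        else:
            # Assumes a non-negative index into a list or tuple.
            flat_seq = [*row[:field], *row[field], *row[field + 1:]]
            yield type(row)(flat_seq)
```

For example, `list(flatten_field([(1, (2, 3), 4)], 1))` gives `[(1, 2, 3, 4)]`, matching the tuple example above.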
CC @NivekT @ejguan
Each task can be split out as a separate, easy first-issue PR.
CC @dbish
Slicing is the hardest one on the list.
If we want to support slicing, don't you think indexing would be an additional feature?
For list and tuple:
Single indexing
list(dp) # [ (0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8) ]
dp = dp[1]
list(dp) # [ 1, 4, 7 ]
Advanced indexing
list(dp) # [ (0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8) ]
dp = dp[[1, 3]]
list(dp) # [ (1, 0), (4, 3), (7, 6) ]
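A rough sketch of how single indexing, advanced indexing, and slicing could share one per-element code path, again over plain Python iterables. `select_fields` is a hypothetical helper, not an existing datapipe method; it assumes that a list or tuple selector means "pick these positions or keys" and that anything else is passed straight to the element's own `__getitem__`.

```python
from typing import Any, Iterable, Iterator


def select_fields(rows: Iterable[Any], selector: Any) -> Iterator[Any]:
    """Apply an int, slice, or list/tuple of indices or keys to every row."""
    for row in rows:
        multi = isinstance(selector, (list, tuple))
        if isinstance(row, dict):
            if multi:
                # Several keys: keep a dict restricted to those keys.
                yield {key: row[key] for key in selector}
            else:
                # Single key: yield the bare value.
                yield row[selector]
        elif multi:
            # Advanced indexing: pick positions, keep the container type.
            yield type(row)(row[i] for i in selector)
        else:
            # Single index or slice: defer to the sequence itself.
            yield row[selector]


rows = [(0, 1, 2, 0, 1, 2), (3, 4, 5, 3, 4, 5), (6, 7, 8, 6, 7, 8)]
single = list(select_fields(rows, 1))
advanced = list(select_fields(rows, [1, 3]))
sliced = list(select_fields(rows, slice(1, -1)))
```

With the rows above, `single` is `[1, 4, 7]` and `advanced` is `[(1, 0), (4, 3), (7, 6)]`, which matches the two examples in this comment.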
Agree that indexing could be a subclass of slicing.
@VitalyFedyunin and @ejguan, a few questions about the requirements for slicing and indexing. Slicing/indexing at the datapipe level is already implemented for map datapipes through __getitem__(), so there is already expected behavior for this syntax. It works differently from the requirements here: it returns the element or elements at the referenced positions, as if you were indexing a list, rather than acting as a filter applied to each element of a datapipe of iterables.
Should we override that behavior with the filter-like behavior described here, or should we (my recommendation) keep it and add a new datapipe that filters each element in the slicing or indexing way described? Then, instead of dp[1:2], you could write something like dp.filter([1:2]) and get back a datapipe where this is applied to each element. We could also make this work only for iterators, which don't implement __getitem__() today, but the behavior might be confusing if the same syntax does one thing on one set of datapipes and something else on another.
Thoughts?
> should we (my recommendation) keep that and add a new datapipe that filters each element in the slicing or indexing way described.
It sounds like a reasonable design to me.
Even for MapDataPipe, users might expect two different behaviors when passing slices into the __getitem__ function:
- Get the corresponding elements defined by the slices.
- Select a slice from each element of the MapDataPipe.
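The two readings can be shown side by side with a toy class. `TinyMapPipe` and its `slice_each` method are hypothetical stand-ins written for this comment, not the real MapDataPipe; the sketch only illustrates why keeping `__getitem__` for container-level access and exposing per-element slicing under a separate name avoids the ambiguity.

```python
from typing import Any, Iterable


class TinyMapPipe:
    """Hypothetical stand-in for a map-style datapipe (not the torchdata class)."""

    def __init__(self, items: Iterable[Any]) -> None:
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: Any) -> Any:
        # Behavior 1: index or slice the pipe itself, like indexing a list.
        return self._items[index]

    def slice_each(self, index: Any) -> "TinyMapPipe":
        # Behavior 2: apply the index or slice to every element instead.
        return TinyMapPipe(item[index] for item in self._items)


pipe = TinyMapPipe([(0, 1, 2), (3, 4, 5), (6, 7, 8)])
container_level = pipe[1:3]
per_element = pipe.slice_each(slice(1, 3))[0:3]
```

Here `container_level` is `[(3, 4, 5), (6, 7, 8)]`, while `per_element` is `[(1, 2), (4, 5), (7, 8)]`: the same slice gives different results depending on which behavior is meant.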
This was accidentally closed automatically when part 1 landed.