[BE] Unify Map style DataPipes to support input_col and output_col syntax
Doing it for flatmap for now, and I will attempt to convert several other inner DataPipes to this approach as well.
UPD: Temporarily made a copy of FlatMapper.
UPD: Rewrote JsonParser as an example.
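For readers unfamiliar with the template idea, here is a minimal, hypothetical sketch of what a Map-style template with `input_col`/`output_col` handling could look like. The names `_SimpleMapTemplate` and `_JsonParserSketch` are made up; this is not the actual `MapTemplateIterDataPipe` implementation in this PR.

```python
import json

from torch.utils.data import IterDataPipe


class _SimpleMapTemplate(IterDataPipe):
    """Hypothetical, simplified stand-in for the template idea: subclasses
    implement `_map` on a single value, the base handles column selection."""

    def __init__(self, source_dp, input_col=None, output_col=None):
        self.source_dp = source_dp
        self.input_col = input_col
        # When output_col is omitted, write the result back over input_col (in-place).
        self.output_col = input_col if output_col is None else output_col

    def _map(self, value):
        raise NotImplementedError

    def __iter__(self):
        for item in self.source_dp:
            if self.input_col is None:
                yield self._map(item)
            else:
                out = list(item)
                out[self.output_col] = self._map(item[self.input_col])
                yield tuple(out) if isinstance(item, tuple) else out


class _JsonParserSketch(_SimpleMapTemplate):
    """Roughly what rewriting JsonParser on top of such a template could look like."""

    def _map(self, stream):
        return json.loads(stream.read())
```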
I think it's better to move this class to PyTorch Core for `Mapper` and `Filter`.
To be honest I prefer to keep as much as possible inside this repo. This will make future deprecation easier.
Direct use of the abstract class leads to:

    torchdata/datapipes/iter/util/jsonparser.py:40:5: error: Signature of "_map" incompatible with supertype "MapTemplateIterDataPipe"  [override]
        def _map(self, stream: IO):
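For context, one common way to hit this `[override]` error is shown below with hypothetical class names, assuming the abstract `_map` is declared with a flexible signature; this is an illustration, not the torchdata code.

```python
from typing import IO, Any


class _MapTemplateSketch:
    # Hypothetical stand-in for the abstract template; a flexible signature here...
    def _map(self, *args: Any) -> Any:
        raise NotImplementedError


class _JsonParserNarrowed(_MapTemplateSketch):
    # ...makes mypy reject an override that fixes the parameters:
    # error: Signature of "_map" incompatible with supertype "_MapTemplateSketch"  [override]
    def _map(self, stream: IO) -> Any:
        ...


class _JsonParserMatching(_MapTemplateSketch):
    # Keeping the supertype's signature (and unpacking inside) avoids the error.
    def _map(self, *args: Any) -> Any:
        (stream,) = args
        return stream.read()
```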
@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
nit: This PR is fine (with nit comments), but there is something unsatisfying about how `input_col` and `output_col` work in the case where I want to select an element from a tuple/list. Note that we do not have to address the below within this PR.
A common return pattern that we have is `(Path, Stream)` from archives/streams/etc. Let's say I opened a file and now I have a DataPipe of tuples of `(Path, Stream)`, and I want to throw away the path and parse the stream. I will have to do something like this:

    datapipe = FileOpener(json_files, mode="b")
    datapipe = datapipe.parse_json_files()
    datapipe = datapipe.map(lambda x: x[1])  # Ideally this should not be needed, and handled by `input_col`, `output_col`
I feel like ideally the `input_col` and `output_col` arguments should allow for selection (unpack when the elements are a list of length 1 or a tuple of length 1). We likely also want to allow the case where the input is not a list/tuple:

    datapipe = IterableWrapper([stream1, stream2])
    datapipe = datapipe.parse_json_files()  # I think we should support this, but it currently doesn't work
What do you two think?
cc: @ejguan
I think we have discussed earlier the expected behavior of the combination of `input_col` and `output_col`. Currently, when `input_col` is specified but `output_col` is not, we will do an in-place operation. In order to support selection, I can't find a good value for `output_col` to indicate that we want to select a single column.

> datapipe = datapipe.parse_json_files()  # I think we should support this, but it currently doesn't work

Do you mean you want it to still work with a `stream` input rather than a tuple of `(pathname, stream)`?
> I think we have discussed earlier the expected behavior of the combination of `input_col` and `output_col`. Currently, when `input_col` is specified but `output_col` is not, we will do an in-place operation. In order to support selection, I can't find a good value for `output_col` to indicate that we want to select a single column.

One potential solution is to have `output_col=[]` or `output_col=tuple()` as selection. There may be side effects that I currently cannot think of.
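To make the sentinel idea concrete, here is a rough sketch of how such a value could be interpreted inside the mapping logic; `_apply_with_cols` is a made-up helper under those assumptions, not part of the PR.

```python
def _apply_with_cols(fn, item, input_col=None, output_col=None):
    """Hypothetical helper showing one possible semantics for the sentinel proposal."""
    if input_col is None:
        return fn(item)
    mapped = fn(item[input_col])
    if output_col == [] or output_col == ():
        # Proposed sentinel: return only the mapped column, dropping the rest.
        return mapped
    target = input_col if output_col is None else output_col
    out = list(item)
    out[target] = mapped  # in-place when output_col is omitted
    return tuple(out) if isinstance(item, tuple) else out


# e.g. _apply_with_cols(parse, ("a.json", stream), input_col=1, output_col=[])
# would yield just the parsed value, without the pathname.
```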
> Do you mean you want it to still work with a `stream` input rather than a tuple of `(pathname, stream)`?

Yes, it may require some adjustment to how `input_col` and `MapTemplate` work to handle non-list/tuple inputs.
Sorry I accidentally edited your comment while quoting it.
> Yes, it may require some adjustment to how `input_col` and `MapTemplate` work to handle non-list/tuple inputs.

I think it's doable after this PR. We can let `JsonParser` be a `MapTemplate` with `input_col=1` by default to prevent BC breaking. Users can specify `input_col=None` to make it work directly with a `stream`.
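A hedged sketch of what that default could look like from the user's side; the `input_col` keyword on `parse_json_files` is the proposal above, not the current API.

```python
from torchdata.datapipes.iter import FileOpener, IterableWrapper

# Existing pattern keeps working: (pathname, stream) tuples, with input_col=1 by default.
dp = FileOpener(IterableWrapper(["a.json"]), mode="b").parse_json_files()

# Proposed opt-out for raw streams (hypothetical; stream1/stream2 are open file objects):
# dp = IterableWrapper([stream1, stream2]).parse_json_files(input_col=None)
```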
Hi @VitalyFedyunin!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!