data icon indicating copy to clipboard operation
data copied to clipboard

[BE] Unify Map style DataPipes to support input_col and output_col syntax

Open VitalyFedyunin opened this issue 2 years ago • 9 comments

Stack from ghstack (oldest at bottom):

  • #652
  • -> #562

Differential Revision: D40146676

VitalyFedyunin avatar Jun 29 '22 20:06 VitalyFedyunin

Surely doing it for flatmap now, plus will attempt to convert several inner DataPipes to this approach.

VitalyFedyunin avatar Jun 29 '22 21:06 VitalyFedyunin

UPD: Temporary make a copy of FlatMapper UPD: Rewrote JsonParser as example.

VitalyFedyunin avatar Jun 30 '22 20:06 VitalyFedyunin

I think it's better to move this class to PyTorch Core for Mapper and Filter.

To be honest I prefer to keep as much as possible inside this repo. This will make future deprecation easier.

VitalyFedyunin avatar Jul 06 '22 20:07 VitalyFedyunin

Direct use of abstract leads to:

torchdata/datapipes/iter/util/jsonparser.py:40:5: error: Signature of "_map"
incompatible with supertype "MapTemplateIterDataPipe"  [override]
        def _map(self, stream: IO):

VitalyFedyunin avatar Oct 05 '22 19:10 VitalyFedyunin

@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

VitalyFedyunin avatar Oct 06 '22 15:10 VitalyFedyunin

nit: This PR is fine (with nit comments) but there is something unsatisfying about how input_col and output_col work in the case where I want to select an element from a tuple/list?

Note that we do not have to address the below within this PR.

A common return pattern that we have is (Path, Stream) from archives/streams/etc. Let's say I opened a file and now I have a DataPipe of tuples of (Path, Stream), and I want to throw away the path and parse the stream. I will have to do something like this

datapipe = FileOpener(json_files, mode="b")
datapipe = datapipe.parse_json_files()
datapipe = datapipe.map(lambda x: x[1])  # Ideally this should not be needed, and handled by `input_col`, `output_col`

I feel like ideally the input_col and output_col argument should allow for selection (unpack when the elements are list of length 1 or tuple of length 1).

We likely also want to allow the case where the input is not a list/tuple:

datapipe = IterableWrapper([stream1, stream2])
datapipe = datapipe.parse_json_files()  # I think we should support this, but it currently doesn't work

What do you two think?

cc: @ejguan

I think we have discussed about it earlier about the expected behavior of the combination of input_col and output_col. Currently, when input_col is specified buy output_col is not, we will do in-place operation. In order to support selection, i can't find a good value for output_col to indicate that we want to select a single column.

datapipe = datapipe.parse_json_files()  # I think we should support this, but it currently doesn't work

Do you mean you want it still works with stream input rather than a tuple of (pathname, stream)

ejguan avatar Oct 06 '22 20:10 ejguan

I think we have discussed about it earlier about the expected behavior of the combination of input_col and output_col. Currently, when input_col is specified buy output_col is not, we will do in-place operation. In order to support selection, i can't find a good value for output_col to indicate that we want to select a single column.

One potential solution is to have output_col=[] or output_col=tuple() as selection. There may be side effects that I currently cannot think of.

Do you mean you want it still works with stream input rather than a tuple of (pathname, stream)

Yes, it may require some adjustment to how input_col and MapTemplate works to handle non-list/tuple inputs.

NivekT avatar Oct 06 '22 20:10 NivekT

Sorry I accidentally edited your comment while quoting it.

NivekT avatar Oct 06 '22 20:10 NivekT

Yes, it may require some adjustment to how input_col and MapTemplate works to handle non-list/tuple inputs.

I think it's doable after this PR. We can let JsonParser be a MapTemplate with input_col=1 by default to prevent BC breaking. Users can specify input_col=None to make it directly working with stream.

ejguan avatar Oct 06 '22 20:10 ejguan

Hi @VitalyFedyunin!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Dec 02 '22 16:12 facebook-github-bot