Using datapipes on only parts of the input
🚀 The feature
I want to apply the functionality of an existing datapipe to only part of my input. Below I list some possible solutions; I would like to know which one is the "best" for this problem.
Motivation, pitch
Simple example:
I have a tuple containing URLs and additional information (e.g. a text, id, ...).
I want to use `HttpReader` to load the images behind the URLs.
Currently, `HttpReader` takes URLs and yields tuples of URLs and file streams.
Additional information in the datapipe is not permitted.
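To make the shape mismatch concrete, here is a pure-Python stand-in (not torchdata itself). `fake_http_reader` is a hypothetical function that only mimics the input/output shape described above: it consumes URL strings and yields `(url, stream)` pairs, using an in-memory `io.BytesIO` instead of a real network request. The record layout (URL, caption, id) is also just an assumption for illustration.

```python
import io
from typing import Iterable, Iterator, Tuple


def fake_http_reader(urls: Iterable[str]) -> Iterator[Tuple[str, io.BytesIO]]:
    """Hypothetical stand-in: mimics only the (url, stream) output shape."""
    for url in urls:
        # No network access: the "stream" simply wraps the URL bytes.
        yield url, io.BytesIO(url.encode())


# Records carry a URL plus extra information (here a caption and an id).
records = [
    ("https://example.com/a.png", "a cat", 1),
    ("https://example.com/b.png", "a dog", 2),
]

# Only the URL column can be fed in, so the caption and id are not
# part of what comes out of the reader.
outputs = list(fake_http_reader(url for url, _caption, _id in records))
```

Each element of `outputs` is a two-element tuple, so the extra columns would have to be re-attached in a separate step.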
Alternatives
- Use `Unzipper`, apply the datapipe you need, and then zip the datapipes back together using either `Zipper` or `IterKeyZipper`. Depending on your use case, you might also need `Forker`, or need to copy a column beforehand, so that your key is present in both datapipes.
- Use `Mapper` with `input_col` and a function that does what I need. If I also want to drop elements (e.g. by setting `skip_on_error=True` in `HttpReader`), I also need to add a filter. This leads to some code redundancy, as the functionality already exists as a datapipe. In addition, the datapipe is tested while my function is not.
- "Add a new datapipe which accepts `(source_data_datapipe, function_datapipe, input_selector, output_merge_fn)`. Each batch `b` of `source_data_datapipe` gets passed into `input_selector(b)` to get a processed input `b'`. Then `b'` is passed into `function_datapipe` for processing to get output `c`. Finally, `output_merge_fn` takes `b` and `c` and combine them into any output." While this works well with my simple example, many datapipes are not compatible with this approach. In addition, the `output_merge_fn` might get quite complicated depending on your use case. "One option may be to restrict it to only work for DataPipes that do not change the cardinality of the data." Credits to @NivekT for coming up with this solution.
- Add an `input_col` parameter to the datapipes where necessary/applicable (a lot of work and maintenance...).
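
The third alternative could look roughly like the sketch below. It is written in plain Python with generators rather than as a real torchdata datapipe, so everything here is an assumption for illustration: `apply_to_part` is a hypothetical name, `function_pipe` is any callable that maps an iterable to an iterable, and `fake_reader` is a stand-in that only imitates the `(url, stream)` output shape. The sketch assumes the wrapped step keeps order and cardinality (one output per input); it can only detect a count mismatch, not a reordering.

```python
from collections import deque
from typing import Any, Callable, Iterable, Iterator


def apply_to_part(
    source: Iterable[Any],
    function_pipe: Callable[[Iterable[Any]], Iterable[Any]],
    input_selector: Callable[[Any], Any],
    output_merge_fn: Callable[[Any, Any], Any],
) -> Iterator[Any]:
    """Run function_pipe on a selected part of each item, then merge back."""
    pending = deque()  # original items whose outputs have not arrived yet

    def selected_inputs() -> Iterator[Any]:
        for b in source:
            pending.append(b)
            yield input_selector(b)  # this is b'

    for c in function_pipe(selected_inputs()):
        if not pending:
            raise ValueError("function_pipe produced more items than it consumed")
        yield output_merge_fn(pending.popleft(), c)

    if pending:
        raise ValueError("function_pipe produced fewer items than it consumed")


def fake_reader(urls: Iterable[str]) -> Iterator[tuple]:
    """Hypothetical stand-in that yields (url, stream-like placeholder)."""
    for url in urls:
        yield url, f"<stream for {url}>"


records = [
    ("https://example.com/a.png", "a cat"),
    ("https://example.com/b.png", "a dog"),
]

merged = list(
    apply_to_part(
        records,
        fake_reader,
        input_selector=lambda b: b[0],                # pass only the URL
        output_merge_fn=lambda b, c: (c[1], *b[1:]),  # stream + extra columns
    )
)
```

A step that drops elements (such as skipping failed downloads) breaks the one-to-one pairing, which is why the sketch raises a `ValueError` in that case instead of silently misaligning the extra columns.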
Additional context
No response