
Using datapipes on only parts of the input

Open SvenDS9 opened this issue 2 years ago • 2 comments

🚀 The feature

I want to apply the functionality of an already existing datapipe to only part of my input. Below I list some possible solutions. I would like to know which of them is the "best" solution for this problem.

Motivation, pitch

Simple example: I have a tuple containing a URL and additional information (e.g. a text, an id, ...). I want to use the HttpReader to load the images behind the URLs. Currently the HttpReader takes URLs and yields tuples of URLs and file streams. It does not accept additional information alongside the URL in each element.

Alternatives

  • Use Unzipper, apply the datapipe you need, and then zip the datapipes back together using either Zipper or IterKeyZipper. Depending on your use-case you might also need Forker, or need to copy a column beforehand, so that your key is present in both datapipes.
  • Use Mapper with input_col and a function that does what I need. If I also want to drop elements (e.g. the equivalent of setting skip_on_error=True in HttpReader) I also need to add a Filter. This leads to some code redundancy, as the functionality already exists as a datapipe. In addition, the datapipe is tested while my function is not.
  • "Add a new datapipe which accepts (source_data_datapipe, function_datapipe, input_selector, output_merge_fn) Each batch b of source_data_datapipe gets passed into input_selector(b) to get a processed input b' . Then b' is passed into function_datapipe for processing to get output c . Finally, output_merge_fn takes b and c and combine them into any output." While this works well with my simple example, many datapipes are not compatible with this approach. In addition, the output_merge_fn might get quite complicated depending on your use-case. "One option may be to restrict it to only work for DataPipes that do not change the cardinality of the data." Credits to @NivekT for coming up with this solution.
  • Add an input_col parameter to the datapipes where necessary/applicable (a lot of work to implement and maintain...)
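
To make the third alternative more concrete, here is a rough sketch of what such a wrapper could look like. It is plain Python, not torchdata code: apply_to_part and fake_reader are hypothetical names, ordinary iterables stand in for datapipes, and fake_reader only imitates the (url, stream) output shape of HttpReader without doing any network access. It assumes the wrapped pipe yields exactly one output per input, in order.

```python
from itertools import tee
from typing import Any, Callable, Iterable, Iterator, Tuple


def apply_to_part(
    source: Iterable[Any],
    function_pipe: Callable[[Iterable[Any]], Iterable[Any]],
    input_selector: Callable[[Any], Any],
    output_merge_fn: Callable[[Any, Any], Any],
) -> Iterator[Any]:
    """Hypothetical helper: run function_pipe on a selected part of each element.

    Assumes function_pipe preserves cardinality and order (one output per
    input). If it drops or adds elements, the zip below pairs the wrong items.
    """
    # Duplicate the source: one copy is kept for merging, one feeds the pipe.
    kept, to_process = tee(source)
    processed = function_pipe(input_selector(b) for b in to_process)
    for b, c in zip(kept, processed):
        yield output_merge_fn(b, c)


def fake_reader(urls: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Stand-in for a reader datapipe: yields (url, content) pairs."""
    for url in urls:
        yield url, f"<content of {url}>"


samples = [("http://a", "cat", 1), ("http://b", "dog", 2)]

result = list(
    apply_to_part(
        samples,
        fake_reader,
        input_selector=lambda b: b[0],            # pass only the URL on
        output_merge_fn=lambda b, c: (*b, c[1]),  # append the loaded content
    )
)
```

Each element of result is the original tuple with the loaded content appended. The sketch also shows why restricting this to cardinality-preserving datapipes matters: as soon as the wrapped pipe skips an element (as skip_on_error would), the pairing in zip silently goes out of sync.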

Additional context

No response

SvenDS9 · Feb 17 '23 15:02