Add merge transform

Open roytman opened this issue 8 months ago • 7 comments
Why are these changes needed?

This transform merges two or more tables under the assumption that, although the tables have different sets of columns, their rows contain the same data. It facilitates embarrassingly parallel data processing: the same input can be processed independently, and the results are merged afterward.
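To illustrate the idea, here is a minimal sketch of a column-wise merge in plain Python. It is not the transform's implementation and does not use the data-prep-kit API. The `Table` type, the `merge_tables` helper, and the sample column names are hypothetical, and the sketch assumes that "the same data" means the rows appear in the same order in every table.

```python
from typing import Dict, List, Sequence

# Hypothetical in-memory table: column name -> list of column values.
Table = Dict[str, List[object]]


def merge_tables(tables: Sequence[Table]) -> Table:
    """Merge tables column-wise, assuming row i is the same record in each table."""
    merged: Table = {}
    num_rows = None
    for table in tables:
        for name, values in table.items():
            # Every column must have the same number of rows.
            if num_rows is None:
                num_rows = len(values)
            elif len(values) != num_rows:
                raise ValueError(
                    f"column {name!r} has {len(values)} rows, expected {num_rows}"
                )
            if name in merged:
                # A shared column must agree across tables; otherwise the
                # "same rows" assumption does not hold.
                if merged[name] != list(values):
                    raise ValueError(f"column {name!r} differs between tables")
            else:
                merged[name] = list(values)
    return merged


# Two results computed independently from the same input rows.
lang_result = {"doc_id": [1, 2], "contents": ["a", "b"], "lang": ["en", "fr"]}
quality_result = {"doc_id": [1, 2], "contents": ["a", "b"], "quality": [0.9, 0.4]}

merged = merge_tables([lang_result, quality_result])
print(list(merged))
```

The shared columns (`doc_id`, `contents`) act as a consistency check here. If the rows were not guaranteed to be in the same order, a join on a key column would be needed instead of a positional merge.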

roytman, Feb 25 '25 21:02

LGTM. btw you also need to update transforms/pyproject.toml

revit13, Feb 27 '25 06:02

@roytman and @revit13, I discussed this with @touma-I today. This is definitely a useful transform that we will merge. Can you please add at least one Python notebook and, if you can, a second Ray notebook to this PR? The notebooks are very simple, and you can model them on any of the similar ones we have for other language/universal transforms.

shahrokhDaijavad, Feb 28 '25 20:02

@shahrokhDaijavad, @touma-I, I checked the existing notebooks, and I have a question. The Python notebooks have a step called "Setup runtime parameters for this transform", and the next step is called "Use python runtime to invoke the transform".

In practice, the execution happens in the first step, and the second one is empty. Is this the desired behavior?

roytman, Mar 09 '25 12:03

@roytman Which notebook are you looking at? A typical one to look at is this one: https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/gneissweb_classification/gneissweb_classification.ipynb

shahrokhDaijavad, Mar 09 '25 15:03

For example:
https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/rep_removal/rep_removal.ipynb
https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/doc_id.ipynb

and others.

roytman, Mar 09 '25 16:03

You are right, @roytman. rep_removal is definitely not a good example, since it has no cell showing the table of input parameters. It was written under the time pressure of delivering the Gneissweb transforms. I will create an issue to improve it. doc_id is not too bad. In any case, I still suggest modeling your notebook on the one for gneissweb_classification.

shahrokhDaijavad, Mar 09 '25 17:03

Thank you, @shahrokhDaijavad.

roytman, Mar 09 '25 18:03