data-prep-kit
Add merge transform
Why are these changes needed?
This transform merges two or more tables that have different sets of columns but whose rows contain the same data. It facilitates embarrassingly parallel data processing: results produced in parallel can be merged afterward into a single table.
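To make the idea concrete, here is a minimal sketch of a column-wise merge in plain Python. It is not the code from this PR: the `merge_tables` function, the dict-of-columns `Table` representation, and the rule that a column shared by several tables must hold identical values are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical table representation: column name -> list of column values.
Table = Dict[str, List[object]]


def merge_tables(tables: List[Table]) -> Table:
    """Merge tables column-wise, assuming row i of every table describes the same record."""
    if not tables:
        raise ValueError("at least one table is required")
    merged: Table = {}
    expected_rows = None
    for table in tables:
        for name, values in table.items():
            # Every column in every table must have the same number of rows.
            if expected_rows is None:
                expected_rows = len(values)
            elif len(values) != expected_rows:
                raise ValueError(f"column {name!r} has {len(values)} rows, expected {expected_rows}")
            if name in merged:
                # Assumption: a column that appears in several tables must be identical.
                if merged[name] != values:
                    raise ValueError(f"conflicting values for shared column {name!r}")
                continue
            merged[name] = list(values)
    return merged


# Two jobs processed the same rows in parallel and each added its own column.
lang_result = {"id": [1, 2, 3], "lang": ["en", "fr", "de"]}
score_result = {"id": [1, 2, 3], "quality": [0.9, 0.4, 0.7]}
merged = merge_tables([lang_result, score_result])
print(list(merged))  # ['id', 'lang', 'quality']
```

The sketch relies on row order to align records, so it only holds under the stated assumption that all tables contain the same rows in the same order.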
LGTM. btw you also need to update transforms/pyproject.toml
@roytman and @revit13, I discussed this with @touma-I today. This is definitely a useful transform that we will merge. Can you please add at least one Python notebook and, if you can, a second Ray notebook to this PR? The notebooks are very simple, and you can use any of the similar ones we have for other language/universal transforms as a model.
@shahrokhDaijavad, @touma-I, I checked the existing notebooks, and I have some questions.
The Python notebooks have a step called "***** Setup runtime parameters for this transform",
and the next step is called "***** Use python runtime to invoke the transform".
However, the execution actually happens in the first step, and the second one is empty. Is this the desired behavior?
@roytman Which notebook are you looking at? A typical one to look at is this one: https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/gneissweb_classification/gneissweb_classification.ipynb
For example: https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/rep_removal/rep_removal.ipynb https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/doc_id.ipynb
and others
You are right, @roytman. rep_removal is definitely not a good example, because it lacks a cell that shows the table of input parameters. It was written under the time pressure of delivering the Gneissweb transforms. I will create an issue to improve it. doc_id is not too bad. In any case, I still suggest modeling your notebook on the one for gneissweb_classification.
Thank you, @shahrokhDaijavad.