
Add optional validation output to joins

Open zundertj opened this issue 3 years ago • 4 comments

(originally by https://github.com/pola-rs/polars/issues/2292#issuecomment-1007962185 )

@austospumanto: Separately, on the problem of two columns becoming one column in the join result: it would be great if polars could retain both columns in the join like pandas does (when the two columns have different names). This is useful for checking for nulls in non-inner joins to see which rows found matches, and also for situations like the one you stated. I find myself duplicating+suffixing columns before joining to get this behavior in polars.

Suggestion by me: it may be easier to offer an `indicator` as an optional output, as pandas does: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge

zundertj avatar Jan 14 '22 17:01 zundertj

The indicator seems like a good option to me. If we choose to make it a boolean, then it will also be very memory efficient.

ritchie46 avatar Jan 15 '22 07:01 ritchie46

If boolean, that would have to be two indicators, one for whether it occurs in the original left dataframe, and one for the right dataframe. I see Pandas opts for a categorical to cover left/right/both.

zundertj avatar Jan 16 '22 10:01 zundertj
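For reference, a small sketch of the pandas behaviour mentioned above. The example frames are invented; `indicator=True` is the documented `DataFrame.merge` parameter, and it adds a categorical `_merge` column with the values `left_only`, `right_only` and `both`.

```python
import pandas as pd

# Hypothetical example data, not taken from the issue.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [10, 20, 30]})

# indicator=True appends a categorical "_merge" column describing each row's origin.
merged = left.merge(right, on="key", how="outer", indicator=True)
print(merged[["key", "_merge"]])
```

A single categorical column covers all three cases, whereas boolean indicators need one column per side.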

Still would save 2/8s of RAM. Two indicators sounds good to me.

ritchie46 avatar Jan 16 '22 11:01 ritchie46

I think the `validate` option in pandas would also be good. (Originally suggested in #5883, which was closed as a dupe of this issue.)

eutwt avatar Dec 23 '22 14:12 eutwt

Note that the validate option has recently been added in https://github.com/pola-rs/polars/pull/9278.

zundertj avatar Jun 17 '23 13:06 zundertj

Hi. I'm not an expert on suggesting changes, so I apologize if this isn't the correct method. Although the `validate` option has been added, it would be very helpful to include an `indicator` option as well, similar to what is available in pandas. As far as I know, this feature hasn't been added yet, and the issue explicitly requesting this option ( #5983 ) has been closed. Thanks!

mauricioabur avatar Jan 23 '24 03:01 mauricioabur