orca icon indicating copy to clipboard operation
orca copied to clipboard

Broadcast erroneously claims cast_on and onto_on are equivalents of left_on and right_on parameters of pandas.merge.

Open toliwaga opened this issue 6 years ago • 5 comments

They used to be, since there were passed through as of at least orca 1.1.

But now, they are assumed to be type str (or None) whereas pandas merge accepts label or list, or array-like.

So the following, which worked under orca 1.1, no longer works:

orca.broadcast(
               cast='persons',
               onto='trips',
               cast_on=['hh_id', 'person_num'],
               onto_on=['hh_id', 'person_num'])

At least the documentation should be revise, whcih currently reads:

def broadcast(cast, onto, cast_on=None, onto_on=None,
              cast_index=False, onto_index=False):
    """
    Register a rule for merging two tables by broadcasting one onto
    the other.

    Parameters
    ----------
    cast, onto : str
        Names of registered tables.
    cast_on, onto_on : str, optional
        Column names used for merge, equivalent of ``left_on``/``right_on``
        parameters of pandas.merge.
    """

toliwaga avatar Apr 26 '18 02:04 toliwaga

Are you getting a specific error?

The cast_on and onto_on values are passed straight through to pandas.merge here:

onto_table = pd.merge(
    onto_table, cast_table,
    suffixes=['_'+onto, '_'+cast],
    left_on=bc.onto_on, right_on=bc.cast_on,
    left_index=bc.onto_index, right_index=bc.cast_index)

However, using sequences may break other parts of the code, for example this looks like it might not properly handle lists of column names.

jiffyclub avatar Apr 26 '18 20:04 jiffyclub

The problem is on line 1808. I don't recall the exact error, but this is where things go wrong:

                # intersection is ok if it's the join key
                intersection.discard(bc.onto_on)
                intersection.discard(bc.cast_on)

The comment for broadcast pretty much says it all:

    cast_on, onto_on : str, optional
        Column names used for merge, equivalent of ``left_on``/``right_on``
        parameters of pandas.merge.

It wants expects that cast_on and onto_on are strings, but pandas.merge also accepts lists.

If you want the exact error, I can easily generate it.

toliwaga avatar Apr 26 '18 23:04 toliwaga

The error is:

  File ".../orca/orca.py", line 1803, in merge_tables
    intersection.discard(bc.onto_on)
TypeError: unhashable type: 'list'

generated by passing lists of strs to broadcast:

orca.broadcast(cast='persons_merged',
               onto='disaggregate_trips',
               cast_on=['hh_id', 'person_idx'],
               onto_on=['hh_id', 'person_idx'])

toliwaga avatar Apr 26 '18 23:04 toliwaga

Thanks, that's helpful. Looks like that code was added in e52fe6ccc as part of #20. That and the code I mentioned above both look like they wouldn't properly handle lists as cast_on/onto_on arguments.

jiffyclub avatar Apr 26 '18 23:04 jiffyclub

Yes.

toliwaga avatar Apr 26 '18 23:04 toliwaga