signac icon indicating copy to clipboard operation
signac copied to clipboard

No more KeyError after changing groupbydoc to groupby

Open cbkerr opened this issue 3 years ago • 3 comments

Description

When I was adding tests for fixing the deprecation of groupbydoc, I noticed that there is a test that looks for a KeyError that doesn't raise one with the new doc query namespace. This leads me to wonder if the sp query namespace should also raise a KeyError

Additionally, should an empty project.groupby() raise an error?

To reproduce

import signac

project = signac.init_project("project")

for i in range(3):
    job = project.open_job(
        dict(
            a=1,
            b=i
        )
    ).init()
    job.doc = dict(
        a=1,
        b=2,
        c=i
    )


# This raise a KeyError
for k,g in project.groupbydoc('ff'):
    print(k)


# These all don't raise errors
for k,g in project.groupby('doc.ff'):
    print(k, list(g))

for k, g in project.groupby('sp.ff'):
    print(k, list(g))

for k,g in project.groupby('ff'):
    print(k, list(g))
    
for k,g in project.groupbydoc():
    print(k, list(g))

for k,g in project.groupby():
    print(k, list(g))

cbkerr avatar Apr 20 '22 02:04 cbkerr

I think the behavior is fine, and it's more consistent with the removal of groupbydoc. We document this behavior:

If key is None, jobs are grouped by id, placing one job into each group.

https://docs.signac.io/projects/core/en/latest/api.html#signac.contrib.project.JobsCursor.groupby

bdice avatar Apr 20 '22 19:04 bdice

I think it's fine for an empty filter to not raise any error. As @bdice pointed out, that is the expected and documented behavior. I think the reason it is potentially confusing is that this is not how pandas works, and pandas groupby functionality is what a lot of users might be most familiar with. In pandas you have to provide a nonempty groupby key. This contrasts with our underlying implementation, which is itertools.groupby and has the behavior that we have. I don't think that there's much we can do here other than make the documentation even clearer, but so far we haven't received many questions about this behavior and I don't expect to, so I'm not worried.

The first three cases do look erroneous though. They should raise a KeyError in my opinion. Inside Project.groupby we call JobsCursor.groupby on a JobsCursor that is constructed here with a filter constructed here. My best guess for why these cases don't raise exceptions is that the filter is prefiltering out jobs that don't have the keys, so the keyfunction defined never sees a job where it is trying to access a non-existent key. I do think that an error would be better to prevent users from accidentally thinking they have a valid groupby. There's also no way that project.groupbydoc('ff') and project.groupbydoc('sp.ff') should behave differently.

vyasr avatar Apr 21 '22 05:04 vyasr

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 31 '22 01:07 stale[bot]

@cbkerr @bdice can one of you take another quick look at this?

vyasr avatar Sep 18 '22 16:09 vyasr

The most obvious behavioral difference I see is that groupbydoc("nonexistent_key") raises but groupby("nonexistent_key") does not raise, because signac filters to only jobs where the key exists in groupby but not groupbydoc. https://github.com/glotzerlab/signac/blob/3ffa9609a78ae1eaab0f33c51aa885e107fcd12c/signac/contrib/project.py#L2732-L2734

That is, groupby("nonexistent_key") returns zero groups in the absence of a default value (or if default=None).

I don't really think we should worry about this unless @cbkerr wants to file a PR to make groupbydoc act the same as groupby in 1.x. Otherwise we have a consistent behavior once the groupbydoc feature is removed in 2.0. If everything goes through groupby (including doc.key syntax) then this difference in behavior is resolved.

bdice avatar Sep 19 '22 14:09 bdice

To concretely answer @cbkerr's questions above (partially repeating previous comments for clarity):

This leads me to wonder if the sp query namespace should also raise a KeyError

I don't think the behavior should change. We can add a line to the docs. See https://github.com/glotzerlab/signac/pull/817

Additionally, should an empty project.groupby() raise an error?

Resolved above - @bdice and @vyasr agree the behavior is expected and documented.

I vote to close this with #817.

bdice avatar Sep 19 '22 14:09 bdice

Works for me. I'm fine not raising on nonexistent keys.

vyasr avatar Sep 19 '22 14:09 vyasr

I agree that removing groupbydoc will fix this as long as groupby is documented.

cbkerr avatar Sep 19 '22 15:09 cbkerr