signac
signac copied to clipboard
No more KeyError after changing groupbydoc to groupby
Description
When I was adding tests for fixing the deprecation of groupbydoc, I noticed that there is a test that looks for a KeyError that doesn't raise one with the new doc query namespace. This leads me to wonder if the sp query namespace should also raise a KeyError
Additionally, should an empty project.groupby() raise an error?
To reproduce
import signac
project = signac.init_project("project")
for i in range(3):
job = project.open_job(
dict(
a=1,
b=i
)
).init()
job.doc = dict(
a=1,
b=2,
c=i
)
# This raise a KeyError
for k,g in project.groupbydoc('ff'):
print(k)
# These all don't raise errors
for k,g in project.groupby('doc.ff'):
print(k, list(g))
for k, g in project.groupby('sp.ff'):
print(k, list(g))
for k,g in project.groupby('ff'):
print(k, list(g))
for k,g in project.groupbydoc():
print(k, list(g))
for k,g in project.groupby():
print(k, list(g))
I think the behavior is fine, and it's more consistent with the removal of groupbydoc. We document this behavior:
If key is None, jobs are grouped by id, placing one job into each group.
https://docs.signac.io/projects/core/en/latest/api.html#signac.contrib.project.JobsCursor.groupby
I think it's fine for an empty filter to not raise any error. As @bdice pointed out, that is the expected and documented behavior. I think the reason it is potentially confusing is that this is not how pandas works, and pandas groupby functionality is what a lot of users might be most familiar with. In pandas you have to provide a nonempty groupby key. This contrasts with our underlying implementation, which is itertools.groupby and has the behavior that we have. I don't think that there's much we can do here other than make the documentation even clearer, but so far we haven't received many questions about this behavior and I don't expect to, so I'm not worried.
The first three cases do look erroneous though. They should raise a KeyError in my opinion. Inside Project.groupby we call JobsCursor.groupby on a JobsCursor that is constructed here with a filter constructed here. My best guess for why these cases don't raise exceptions is that the filter is prefiltering out jobs that don't have the keys, so the keyfunction defined never sees a job where it is trying to access a non-existent key. I do think that an error would be better to prevent users from accidentally thinking they have a valid groupby. There's also no way that project.groupbydoc('ff') and project.groupbydoc('sp.ff') should behave differently.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@cbkerr @bdice can one of you take another quick look at this?
The most obvious behavioral difference I see is that groupbydoc("nonexistent_key") raises but groupby("nonexistent_key") does not raise, because signac filters to only jobs where the key exists in groupby but not groupbydoc. https://github.com/glotzerlab/signac/blob/3ffa9609a78ae1eaab0f33c51aa885e107fcd12c/signac/contrib/project.py#L2732-L2734
That is, groupby("nonexistent_key") returns zero groups in the absence of a default value (or if default=None).
I don't really think we should worry about this unless @cbkerr wants to file a PR to make groupbydoc act the same as groupby in 1.x. Otherwise we have a consistent behavior once the groupbydoc feature is removed in 2.0. If everything goes through groupby (including doc.key syntax) then this difference in behavior is resolved.
To concretely answer @cbkerr's questions above (partially repeating previous comments for clarity):
This leads me to wonder if the sp query namespace should also raise a KeyError
I don't think the behavior should change. We can add a line to the docs. See https://github.com/glotzerlab/signac/pull/817
Additionally, should an empty project.groupby() raise an error?
Resolved above - @bdice and @vyasr agree the behavior is expected and documented.
I vote to close this with #817.
Works for me. I'm fine not raising on nonexistent keys.
I agree that removing groupbydoc will fix this as long as groupby is documented.