pandas icon indicating copy to clipboard operation
pandas copied to clipboard

DOC: `value_counts` description doesn't match code logic

Open maciejskorski opened this issue 2 years ago • 8 comments

Pandas version checks

  • [X] I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

dev docs

last release docs

Documentation problem

value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations".

In reality, it only does groupby followed by size, no nunique:

https://github.com/pandas-dev/pandas/blob/ca60aab7340d9989d9428e11a51467658190bb6b/pandas/core/frame.py#L6468-L6592

Suggested fix for documentation

Align the doc string to match the code logic.

I suggest not to mention "unique rows" or "unique combinations" and maybe mention the equivalence to groupby+size with the optional normalization.

Will be happy to submit a PR for that, or share my two cents in the discussion.

maciejskorski avatar Sep 19 '22 08:09 maciejskorski

Let me work on this issue.

WooilKim avatar Sep 19 '22 08:09 WooilKim

Cool. I am reporting this as I have seen quite a few people confused by the behaviour. Let me know if I can be of any help (review, discuss).

maciejskorski avatar Sep 19 '22 08:09 maciejskorski

@maciejskorski Thanks. I'll share the progress soon.

WooilKim avatar Sep 19 '22 09:09 WooilKim

I don't understand the issue here. Why do you think that value_counts does not count unique rows? That this is done via groupby is an implementation detail that should not be visible in the docs

phofl avatar Sep 19 '22 20:09 phofl

Thanks for the report! I think we should stick to a description that doesn't rely on other parts of pandas (e.g. groupby) as much as possible. Otherwise it may be difficult for new users to understand what this method should do.

value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations".

This is not how I read the docstring, but I can see that it can be interpreted that way! To me, Return a Series containing counts of unique rows in the DataFrame. means "For each unique row in the DataFrame, report the number of times it occurs".

What about something like:

Return a Series containing the number of times each unique row occurs in the DataFrame.

rhshadrach avatar Sep 19 '22 20:09 rhshadrach

Thanks for the report! I think we should stick to a description that doesn't rely on other parts of pandas (e.g. groupby) as much as possible. Otherwise it may be difficult for new users to understand what this method should do.

value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations".

This is not how I read the docstring, but I can see that it can be interpreted that way! To me, Return a Series containing counts of unique rows in the DataFrame. means "For each unique row in the DataFrame, report the number of times it occurs".

What about something like:

Return a Series containing the number of times each unique row occurs in the DataFrame.

Better to my taste! This wording better separates “unique row” from “occurrences”. Otherwise, it was too close to “unique occurrences” aka “count distinct”.

Maybe even "unique row -> unique row value". To emphasize counting of row values (tuples), rather than rows themselves (row can be seen unique merely by indexing). Also, row value is a well-known term (from SQL standard).

maciejskorski avatar Sep 19 '22 21:09 maciejskorski

Also, row value is a well-known term (from SQL standard).

But I don't believe it is used much in pandas - the two occurrences I see in the docs both explain the term. It seems better to me to indicate the index is ignored rather than using this terminology.

rhshadrach avatar Sep 20 '22 03:09 rhshadrach

I don't insist. I acknowledge that pandas is used by people with different backgrounds.

So please re-read my request as asking for decoupling “unique” and “count”, like proposed above, that would do the job.

Sorry for that verbose communication, but I come with research background where the choice of words means a lot :-)

maciejskorski avatar Sep 20 '22 08:09 maciejskorski

I would like to take on this issue

adejumoridwan avatar Jun 19 '23 13:06 adejumoridwan

Take, if still available

dbalagula avatar Jul 18 '23 15:07 dbalagula

Take

adejumoridwan avatar Jul 18 '23 15:07 adejumoridwan