vaex
[BUG-REPORT] value_counts() on string column returns wrong values
Description
I have a string column with a large number of unique values (~20% of the number of rows). When I call value_counts() on this column, it returns a wrong result, counting some values only once.
import numpy as np
import vaex
rng = np.random.default_rng(42)
df = vaex.from_arrays(
x=[str(x) for x in rng.integers(low=0, high=10_000, size=100_000)]
)
vc1 = df["x"].value_counts()
print(vc1.sum())
This prints 95790 (or another number below 100 000), although the counts should sum to the number of rows, 100 000.
It happens only with string columns, and only if the number of unique values is large enough (the given code prints 100 000 if we change high to 1_000).
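For comparison, here is a minimal reference sketch that does not use vaex. It uses only the Python standard library (an assumption made for portability: random replaces the NumPy generator, so the individual values differ from the report) and shows the invariant the reported result violates: the per-value counts must sum to the number of rows.

```python
import random
from collections import Counter

# Arbitrary seed; it is not meant to reproduce the NumPy stream from the report.
random.seed(42)

n_rows = 100_000
# Same shape of data as in the report: stringified integers in [0, 10_000).
values = [str(random.randrange(10_000)) for _ in range(n_rows)]

# Reference value counts computed in pure Python.
reference_counts = Counter(values)

# Every row contributes to exactly one count, so the counts sum to n_rows.
total = sum(reference_counts.values())
print(total)  # 100000
```

A per-value comparison against such a reference (for example, checking which keys have a count of 1 in the vaex result but a larger count here) may help narrow down which values are miscounted.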
Software information
- Vaex version (import vaex; vaex.__version__): {'vaex': '4.11.1', 'vaex-core': '4.11.1', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}
- Vaex was installed via: pip
- OS: Ubuntu 22.04
Additional information
groupby with agg='count' works fine.
Hi @adolganov
Thank you for reporting this, and for the clean example! Much appreciated. Will try to fix this soon!