
[BUG-REPORT] value_counts() on string column returns wrong values

Open vonglod opened this issue 2 years ago • 1 comments

Description

I have a string column with a large number of unique values (~20% of the number of rows). When I call value_counts() on this column, it returns a wrong result: some values are counted only once.

import numpy as np
import vaex

rng = np.random.default_rng(42)
df = vaex.from_arrays(
    x=[str(x) for x in rng.integers(low=0, high=10_000, size=100_000)]
)

vc1 = df["x"].value_counts()
print(vc1.sum())

This prints 95790 (or another number less than 100 000), although the counts should sum to the number of rows, 100 000.

It happens only with string columns, and only if the number of unique values is large enough (the code above returns 100 000 if we change high to 1_000).
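
For comparison, the expected total can be checked without vaex at all. The following is a minimal sketch, not part of the original report: `reference_value_counts` is a hypothetical helper name, and the data is generated with the standard-library `random` module rather than numpy, so the individual values differ from those in the snippet above. It only illustrates the invariant that value counts over a column must sum to the number of rows, regardless of how many unique values there are.

```python
from collections import Counter
import random


def reference_value_counts(values):
    """Count occurrences of each value in plain Python (hypothetical helper, no vaex)."""
    return Counter(values)


# Synthetic data with the same shape as in the report:
# 100_000 strings drawn from at most 10_000 distinct values.
# Assumption: stdlib `random` is used here instead of numpy's default_rng.
random.seed(42)
values = [str(random.randrange(10_000)) for _ in range(100_000)]

counts = reference_value_counts(values)
total = sum(counts.values())

# Every row contributes exactly one count, so the total equals the row count.
print(total)  # 100000
```

Any correct value_counts() implementation should satisfy the same property, so comparing its sum against the row count is a quick way to detect the problem.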

Software information

  • Vaex version (import vaex; vaex.__version__): {'vaex': '4.11.1', 'vaex-core': '4.11.1', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}
  • Vaex was installed via: pip
  • OS: Ubuntu 22.04

Additional information

groupby with agg='count' works fine.

vonglod, Aug 03 '22 11:08

Hi @adolganov

Thank you for reporting this, and for the clean example! Much appreciated. Will try to fix this soon!

JovanVeljanoski, Aug 03 '22 20:08