Encoding an empty unicode array would produce an array of the wrong dtype

Open will133 opened this issue 5 years ago • 5 comments

Calling numpy.char.encode on an empty unicode array creates a float64 array instead of an array of S dtype.

Reproducing code example:

import numpy
print(numpy.char.encode(numpy.array([], 'U'), 'utf8').dtype)
# This would output:
# float64

I would expect an empty S1 array.

Error message:

The dtype returned seems wrong.

Numpy/Python version information:

>>> import sys, numpy; print(numpy.__version__, sys.version)
1.16.2 3.7.2 (default, Dec 29 2018, 06:19:36)
[GCC 7.3.0]

This was run in a conda environment (I just did a "conda create -n test_numpy python=3.7 numpy"). The problem seems to exist in earlier numpy versions as well (1.15).

will133, Mar 18 '19 21:03

The shape also seems to get messed up. I.e.:

print(numpy.char.encode(numpy.array([], 'U').reshape((1, 0, 1)), 'utf8').shape)

Prints (1, 0) instead of the original shape, (1, 0, 1).
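
One plausible way to lose both the dtype and the trailing axis is a round trip through nested Python lists. The snippet below is only an illustration of that mechanism: the object-dtype `intermediate` array is a stand-in I made up, and whether numpy.char.encode actually takes this path internally is an assumption.

```python
import numpy as np

# Hypothetical stand-in for an intermediate result with the shape from
# the example above; this is an assumption, not NumPy's actual internals.
intermediate = np.empty((1, 0, 1), dtype=object)

# A nested list cannot represent an axis that follows a zero-length axis,
# so the trailing dimension disappears here.
as_list = intermediate.tolist()
print(as_list)        # [[]]

# With no elements to inspect, NumPy falls back to its default dtype.
rebuilt = np.asarray(as_list)
print(rebuilt.dtype)  # float64
print(rebuilt.shape)  # (1, 0)
```

If that is what happens, the dtype and the shape problems would share one cause.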

newt0311, Nov 08 '19 14:11

Decode is also affected by this bug btw.

newt0311, Dec 05 '19 13:12

The bug is in _to_string_or_unicode_array, which impacts all of:

  • mod
  • decode
  • encode
  • expandtabs
  • join
  • partition
  • replace
  • rpartition

The fix is probably to work out the correct type ahead of time, rather than guessing from the array contents.
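
A minimal sketch of what "working out the type ahead of time" could look like, assuming the helper rebuilds its result from a Python list. The function name `to_bytes_or_str_array` and its `kind` parameter are hypothetical and chosen for illustration; this is not the actual NumPy code or the eventual patch.

```python
import numpy as np

def to_bytes_or_str_array(result, kind):
    """Hypothetical helper: rebuild `result` as a bytes or str array.

    `kind` is "S" (bytes) or "U" (str). The caller supplies it, so the
    dtype never has to be inferred from the (possibly empty) contents.
    """
    result = np.asarray(result)
    # Passing the dtype explicitly avoids the float64 fallback for empty
    # input; the item size is still chosen from the data when there is any.
    rebuilt = np.array(result.tolist(), dtype=kind)
    # Restore the original shape, since the list round trip can drop axes
    # that follow a zero-length axis.
    return rebuilt.reshape(result.shape)

empty = to_bytes_or_str_array(np.empty((1, 0, 1), dtype=object), "S")
print(empty.dtype.kind, empty.shape)

full = to_bytes_or_str_array(np.array([b"a", b"bc"], dtype=object), "S")
print(full)
```

Under this approach encode would pass "S" and decode would pass "U", because each function knows its output kind regardless of the input contents.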

eric-wieser, Dec 05 '19 14:12

This Stack Overflow question is another report of the bug: "Why does numpy's np.char.encode turn an empty unicode array into an empty float64 array?"

WarrenWeckesser, Jul 19 '22 14:07

Here's an older issue that reports the same problem: https://github.com/numpy/numpy/issues/7371

WarrenWeckesser, Jul 19 '22 14:07