
[BUG] Dataset.Metadata mangles bytes as str with unknown encoding

Open saamc opened this issue 5 months ago • 1 comments

Severity

P1 - Major feature malfunctioning

Current Behavior

Assigning bytes to Dataset.Metadata or Column.Metadata stores the byte sequence as a str, not as bytes. For bytes consisting only of ASCII characters the stored string should at least be legible, but it is not, and it is not straightforward to recover the original bytes from the str object.

Steps to Reproduce

import deeplake

# In-memory dataset, so nothing is written to disk
ds = deeplake.create("mem://temp")

byte_sequence = b"This should be legible"
ds.metadata["bytes"] = byte_sequence

# Read the value back once; avoiding nested double quotes inside the
# f-strings also keeps the script runnable on Python versions before 3.12
stored = ds.metadata["bytes"]
eq_or_ne = "=" if stored == byte_sequence else "!"
print(f"byte_sequence '{byte_sequence}' {eq_or_ne}= ds.metadata['bytes'] '{stored}'")
print(f"byte_sequence '{type(byte_sequence)}' {eq_or_ne}= ds.metadata['bytes'] '{type(stored)}'")

Save the script as issue.py, run it, and compare the output:

$ python issue.py 
byte_sequence 'b'This should be legible'' != ds.metadata['bytes'] 'VGhpcyBzaG91bGQgYmUgbGVnaWJsZQ=='
byte_sequence '<class 'bytes'>' != ds.metadata['bytes'] '<class 'str'>'
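
The string in the output above appears to be the standard Base64 encoding of the original bytes. Assuming deeplake applies that encoding consistently (this report does not confirm it, and it may differ between versions), the bytes can be recovered by hand as a stopgap. The sketch below uses only the standard library; `recover_bytes` is a hypothetical helper name, not part of the deeplake API.

```python
import base64
import binascii


def recover_bytes(stored: str) -> bytes:
    """Decode a str that is assumed to hold Base64-encoded bytes."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        return base64.b64decode(stored, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not valid Base64: {stored!r}") from exc


# The value printed by issue.py for ds.metadata["bytes"]
stored = "VGhpcyBzaG91bGQgYmUgbGVnaWJsZQ=="
recovered = recover_bytes(stored)
print(recovered)  # b'This should be legible'
```

This does not remove the underlying problem: a metadata value that was assigned as a genuine str and happens to be valid Base64 cannot be told apart from mangled bytes, so the caller still has to know which keys held bytes.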

Expected/Desired Behavior

The value returned from a metadata property should retain the type and content that was assigned to it. The output should read:

$ python issue.py
byte_sequence 'b'This should be legible'' == ds.metadata['bytes'] 'b'This should be legible''
byte_sequence '<class 'bytes'>' == ds.metadata['bytes'] '<class 'bytes'>'

This matters in particular because issue #3061 has a workaround in which sequence[text] is replaced by sequence[bytes] (using str.encode). With that workaround in place, it would be handy to store the list of unique tokens, collected from all sequence[text] values across the records of a dataset, in the metadata of the column that holds them. Assigning a list[str] to the metadata works, but a list[bytes] will be garbled.
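
Until bytes round-trip natively, one possible way to keep such a token list in column metadata is to convert it to list[str] explicitly before assigning it, since list[str] is reported to work. The sketch below is an illustration under that assumption, not a deeplake feature: `pack_tokens` and `unpack_tokens` are hypothetical helpers, and a plain dict stands in for the column's metadata object.

```python
import base64


def pack_tokens(tokens: list[bytes]) -> list[str]:
    """Turn byte tokens into ASCII strings that can be stored as list[str]."""
    return [base64.b64encode(t).decode("ascii") for t in tokens]


def unpack_tokens(packed: list[str]) -> list[bytes]:
    """Invert pack_tokens."""
    return [base64.b64decode(p, validate=True) for p in packed]


# A plain dict stands in for a column's metadata here (assumption); with
# deeplake the assignment target would be the column's metadata object.
metadata: dict[str, list[str]] = {}

tokens = [b"hello", b"world", b"\xff\x00"]  # last token is not valid UTF-8
metadata["tokens"] = pack_tokens(tokens)
restored = unpack_tokens(metadata["tokens"])
print(restored == tokens)  # True
```

Base64 is used here rather than bytes.decode("utf-8") so that tokens which are not valid UTF-8 also survive the round trip. The cost is that the stored strings are not human-readable.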

Python Version

python 3.12.0 hab00c5b_0_cpython conda-forge

OS

Ubuntu 24.04.2 LTS

IDE

VS-Code

Packages

deeplake==4.2.14 numpy==2.3.1 pip==25.1.1 setuptools==80.9.0 wheel==0.45.1

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR (Thank you!)

saamc, Jul 15 '25 18:07

@saamc, thanks for reporting the issue. We are looking into the problem. I've changed the severity from P0 to P1.

activesoull, Jul 16 '25 03:07