pygit2 Consistent interface to get text and bytes

trafficstars

Follow up from #610 #790 and #893

General policy:

Do not implement str()
To get unicode string use .text
In .text use UTF-8 and replace (the rationale for replace is explained in https://github.com/libgit2/pygit2/pull/790#issuecomment-385906316)
To get the byte string use .data or .raw (this is to be decided)
For attributes the name of the attribute returns text, prefix with raw_ to get bytes. For instance Signature.name and Signature.raw_name
Implement the buffer protocol, bytes(..) where appropriate

Open for discussion.

TODO:

[x] Replace TreeEntry._name by .raw_name
[ ] Replace DiffLine.content by .text
[ ] Inventory all the places where we get bytes, text, or the buffer protocol
[ ] Settle on .data or .raw
[ ] Replace DiffLine.raw_content by .data or .raw
[ ] Replace Object.read_raw() by .data (or .raw), then remove Blob.data (it will inherit from Object)
[ ] Settle on str() bytes() and the buffer protocol

The case of Oid, what we've now:

oid.raw returns the byte string (that's good, unless we decide to settle on .data)
str(oid) and oid.hex both return the hex representation, always <str> (bytes in Python 2 and text in Python 3)
Oid is the only place where we implement str(...)
Object.hex and TreeEntry.hex behave the same, they return always <str>. Apparently these are the only places where we always return <str>.

Mar 31 '19 11:03 jdavid

If I'm not missing anything..

These are the places where we use .data:

Blob.data
Patch.data

These are the places where we use .raw:

Oid.raw

With read_raw():

Object.read_raw()

Mar 31 '19 11:03 jdavid

Just curious, are you wanting to get all this done before doing another release? Also, how do you decide when to do a release?

Mar 31 '19 21:03 jnrbsn

I prefer data as an attribute name, because it's a noun, and it straightforwardly tells you what you're getting. raw is an adjective. Raw what? I think having attributes like raw_name makes sense, but the word "raw" by itself makes less sense. Also, since the word "data" is already kind of generic, it already kind of implies that no assumptions are being made about the structure of the data, and so it already implies "raw".

Mar 31 '19 22:03 jnrbsn

Good point.

No, this doesn't block the release. What I'd like to do before the next release is handling the pycparser issue, #846

The criteria for new releases is: not too big, not too small (but it may be small if it's important). Looking back usually we've a release every 1 to 3 months.

Apr 01 '19 07:04 jdavid

Given that Python 2 will be EOL on January 1 2020, do you want to consider making this a Python 3-only module? That would drastically simplify string & bytes handling.

May 07 '19 21:05 dsully

I'm an engineer on the Bitbucket team, and I just want to point out that we absolutely need an option to get raw bytes version of pretty much everything in the repo. It's not a question of Python2 vs. 3. pygit2's strategy of assuming UTF-8 and failing otherwise doesn't work for us all the time. We host millions of repos, and almost all of that repo data is created outside of Bitbucket and pushed to us. But we still need to display it correctly on our website. In the past, we've encountered issues in pygit2 where we simply have no way of reading a piece of data using pygit2 because it's not UTF-8 encoded, and pygit2 just raises an exception.

Nov 12 '19 17:11 jnrbsn

@jnrbsn Okay, noted. The fastest way to get something done is to make a PR.

Nov 16 '19 08:11 jdavid

@jdavid I know. I wasn't necessarily saying it was urgent. I just wanted to point out that dropping support for Python 2 doesn't mean we no longer need this.

Nov 16 '19 19:11 jnrbsn

pygit2 pygit2 copied to clipboard

Consistent interface to get text and bytes

pygit2
pygit2 copied to clipboard