pygit2
pygit2 copied to clipboard
Consistent interface to get text and bytes
Follow up from #610 #790 and #893
General policy:
- Do not implement
str()
- To get unicode string use
.text
- In
.text
use UTF-8 and replace (the rationale for replace is explained in https://github.com/libgit2/pygit2/pull/790#issuecomment-385906316) - To get the byte string use
.data
or.raw
(this is to be decided) - For attributes the name of the attribute returns text, prefix with
raw_
to get bytes. For instanceSignature.name
andSignature.raw_name
- Implement the buffer protocol,
bytes(..)
where appropriate
Open for discussion.
TODO:
- [x] Replace
TreeEntry._name
by.raw_name
- [ ] Replace
DiffLine.content
by.text
- [ ] Inventory all the places where we get bytes, text, or the buffer protocol
- [ ] Settle on
.data
or.raw
- [ ] Replace
DiffLine.raw_content
by.data
or.raw
- [ ] Replace
Object.read_raw()
by.data
(or.raw
), then removeBlob.data
(it will inherit fromObject
) - [ ] Settle on
str()
bytes()
and the buffer protocol
The case of Oid, what we've now:
-
oid.raw
returns the byte string (that's good, unless we decide to settle on.data
) -
str(oid)
andoid.hex
both return the hex representation, always<str>
(bytes in Python 2 and text in Python 3) - Oid is the only place where we implement
str(...)
-
Object.hex
andTreeEntry.hex
behave the same, they return always<str>
. Apparently these are the only places where we always return<str>
.
If I'm not missing anything..
These are the places where we use .data
:
-
Blob.data
-
Patch.data
These are the places where we use .raw
:
-
Oid.raw
With read_raw()
:
-
Object.read_raw()
Just curious, are you wanting to get all this done before doing another release? Also, how do you decide when to do a release?
I prefer data
as an attribute name, because it's a noun, and it straightforwardly tells you what you're getting. raw
is an adjective. Raw what? I think having attributes like raw_name
makes sense, but the word "raw" by itself makes less sense. Also, since the word "data" is already kind of generic, it already kind of implies that no assumptions are being made about the structure of the data, and so it already implies "raw".
Good point.
No, this doesn't block the release. What I'd like to do before the next release is handling the pycparser issue, #846
The criteria for new releases is: not too big, not too small (but it may be small if it's important). Looking back usually we've a release every 1 to 3 months.
Given that Python 2 will be EOL on January 1 2020, do you want to consider making this a Python 3-only module? That would drastically simplify string & bytes handling.
I'm an engineer on the Bitbucket team, and I just want to point out that we absolutely need an option to get raw bytes version of pretty much everything in the repo. It's not a question of Python2 vs. 3. pygit2's strategy of assuming UTF-8 and failing otherwise doesn't work for us all the time. We host millions of repos, and almost all of that repo data is created outside of Bitbucket and pushed to us. But we still need to display it correctly on our website. In the past, we've encountered issues in pygit2 where we simply have no way of reading a piece of data using pygit2 because it's not UTF-8 encoded, and pygit2 just raises an exception.
@jnrbsn Okay, noted. The fastest way to get something done is to make a PR.
@jdavid I know. I wasn't necessarily saying it was urgent. I just wanted to point out that dropping support for Python 2 doesn't mean we no longer need this.