cpython gh-90533: Implement BytesIO.peek()

Closes gh-90533

Jan 22 '22 22:01 marcelm

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@marcelm

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

Jan 22 '22 22:01 the-knights-who-say-ni

All commit authors signed the Contributor License Agreement.

Apr 10 '22 20:04 cpython-cla-bot[bot]

The "CLA not signed" badge above isn’t correct: I have signed the CLA.

Apr 14 '22 08:04 marcelm

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

Aug 05 '22 22:08 bedevere-bot

@AlexWaygood You’ve been the only human to interact with this PR so far, do you possibly have any advice on how to move this forward?

Nov 09 '22 13:11 marcelm

Hi @marcelm — sorry for the delay in anybody looking at this. I haven't studied your PR in detail (or thought about whether the proposal is a good idea), but it looks well put together at first glance.

I'll try to take a look soon. I won't be able to review the C code, but I can comment on whether the proposal seems like a good idea, and I can review the Python implementation and the tests.

Note that if this proposal is accepted, it will also need:

Updates to the docs for BytesIO (Doc/library/io.rst)
An entry in "What's new in Python 3.12" (Doc/whatsnew/3.12.rst)

You can also add yourself to Misc/ACKS as part of this PR, and give yourself credit in the NEWS entry, if you like. (Neither is compulsory.)

Nov 09 '22 13:11 AlexWaygood

Thank you! I pushed a documentation update and will add an entry to the What’s new document in case the PR is reviewed favorably.

Nov 09 '22 21:11 marcelm

I'll defer to @benjaminp's judgement on this one -- I'm really not the right reviewer for this, unfortunately :(

Please ping me again if you still haven't had a review in a few weeks.

Nov 28 '22 12:11 AlexWaygood

Thank you! I appreciate that you took the time.

Nov 28 '22 12:11 marcelm

Please ping me again if you still haven't had a review in a few weeks.

@AlexWaygood Hi! Been 8 months, no review so far :) How can we take this forward? I'm not the original author of the PR but I also have a use-case for this API and would like to see this change land, happy to help progress this.

Jul 08 '23 09:07 awalgarg

Thanks for your interest! I have updated the PR to fix the merge conflict and to reflect that it now needs to target 3.13.

Jul 08 '23 13:07 marcelm

@marcelm That was quick, thank you so much!

Jul 08 '23 13:07 awalgarg

cc: @erlend-aasland @vstinner @benjaminp for the io C code.

A

Sep 21 '23 03:09 AA-Turner

Thank you for the review! I addressed the comments.

I made one other change while revisiting the code: The default value for the size argument is now 1. Previously, the default was to return a slice of the entire buffer from the current position until the end, but since the memory is copied, this can result in large objects and would be wasteful. This makes the function signature different from the one for BufferedReader.peek(), but note that BufferedReader.peek() ignores the size argument anyway (at least in the C version).

All existing peek() functions (in BufferedReader, GzipFile, LZMAFile, BZ2File) have as the only guarantee that they return at least one byte unless EOF is reached (no matter what the size argument is). So because that is the only thing one can rely on as a user of peek(), I think it’s fine for BytesIO.peek() to just return a single byte by default.

Sep 22 '23 08:09 marcelm

Not sure, but the CI test failure on Windows seems to be unrelated.

Sep 22 '23 12:09 marcelm

Not sure, but the CI test failure on Windows seems to be unrelated.

Result: FAILURE then SUCCESS is an unstable test: test_asyncio.test_windows_events failed and then passed when re-run, you can safely ignore this error.

Sep 22 '23 16:09 vstinner

cc @pitrou who converted the Python io implementation to the C io implementation in Python 3.1.

Sep 28 '23 23:09 vstinner

Thanks again for the review! I’ll address your most recent comments (regarding documentation and failing PyBytes_FromStringAndSize) tomorrow.

Sep 28 '23 23:09 marcelm

@marcelm: It seems the CLA is not signed. Can you look into it?

Sep 29 '23 07:09 erlend-aasland

@marcelm: It seems the CLA is not signed. Can you look into it?

That was a false negative (reported here). But I just re-did it and that seems to have fixed it.

Sep 29 '23 07:09 marcelm

That was a false negative (reported here). But I just re-did it and that seems to have fixed it.

Thanks.

Sep 29 '23 07:09 vstinner

BytesIO.peek() has a different semantics than BufferedReader.peek(). It's kind of surprising.

I like how bz2.BZ2File.peek is documented:

Return buffered data without advancing the file position. At least one byte of data will be returned (unless at EOF). The exact number of bytes returned is unspecified.

This appears to me to describe quite well what is common to all the peek functions. Would the following work for BufferedReader.peek()?

      Return bytes from the current position onwards without advancing the position.
      At least one byte of data is returned if not at EOF.
      Return an empty :class:`bytes` object at EOF.
      At most one single read on the underlying raw stream is done to satisfy the call.
      The exact number of bytes returned is unspecified
      (*size* is ignored).

I haven’t turned this into a commit because this change is not quite what you suggested.

Sep 29 '23 08:09 marcelm

To be honest, the contract and motivation for peek() were always a bit fuzzy, so it's a bit difficult to choose a given semantics. I think the current proposed implementation is ok.

cc @gvanrossum @DanielStutzbach

Sep 29 '23 08:09 pitrou

Also, if we were to do it again, we would certainly have peek() return a memoryview rather than a bytes object...

Sep 29 '23 08:09 pitrou

Also, if we were to do it again, we would certainly have peek() return a memoryview rather than a bytes object...

For a new function like BytesIO.peek(), would it make sense to do it?

Sep 29 '23 09:09 vstinner

The complicated part of a memoryview is that BytesIO can be read and written. How do you guarantee that the view is not going to change after peek() if a write() is done later?

Sep 29 '23 09:09 vstinner

For a new function like BytesIO.peek(), would it make sense to do it?

Why not, but it would also deviate from the BufferedReader behaviour, which I think defeats the point of this proposal?

The complicated part of a memoryview is that BytesIO can be read and written. How do you guarantee that the view is not going to change after peek() if a write() is done later?

You wouldn't need to guarantee anything IMHO.

Sep 29 '23 09:09 pitrou

You wouldn't need to guarantee anything IMHO.

I prefer that when I read data form the filesystem, the data doesn't change, even if I use it 10 seconds later, or 1 day later. Or for my needs, apparently, bytes is the good type. It's less efficient, but it works as expected.

Maybe a new API is needed for memoryview which has a different semantics.

Sep 29 '23 11:09 vstinner

That was just a random suggestion, no need to act on it here.

Sep 29 '23 13:09 pitrou

I’ve now taken the liberty of changing the BufferedReader.peek() documentation as I suggested above. I think I’ve addressed all comments.

Sep 29 '23 14:09 marcelm

cpython cpython copied to clipboard

gh-90533: Implement BytesIO.peek()

CLA Missing

cpython
cpython copied to clipboard