parquet-go icon indicating copy to clipboard operation
parquet-go copied to clipboard

Add "WithError" versions of NewReader/NewGenericReader

Open SgtCoDFish opened this issue 2 years ago • 8 comments

It's currently difficult to use NewReader / NewGenericReader in a situation where an invalid file might be read, since they can panic when trying to read an invalid file.

This PR adds WithError versions of NewReader and NewGenericReader, which return an error if something goes wrong rather than panicing.

This PR also adds a testdata file which causes a New(Generic)Reader to panic, and adds tests ensuring that NewReaderWithError and NewGenericReaderWithError return an error when trying to read that file. It was that file which alerted me to this issue!

I figured it was easier to raise a PR for discussion than to create an issue for this - if this needs any work or if this isn't something you want to add to the library, please let me know!

Alternative Approach

Another approach would be to break the API and change the signatures to e.g. func NewReader(...) (*Reader, error) - given that this library isn't at v1 yet, that might be an option which some libraries would take.

I assumed that breaking existing code wouldn't be desirable here, though - if this library ever publishes a v2, that might be a thing to do then.

That said, if you'd be happy to break the API I'd gladly change this PR to do that instead!

SgtCoDFish avatar Aug 11 '22 16:08 SgtCoDFish

Hi @SgtCoDFish thanks for your PR! This is certainly a change we can accept. I agree that it's better to not break the API just yet.

Something I'm not sure tho is the name of the function. Usually (in the stdlib for example) functions with WithError will expect the user to pass an error in parameter. Example: https://github.com/segmentio/parquet-go/pull/313.

Wondering if we should use something else like NewReaderOrError. What do you think ? Have you seen WithError used in other package already ?

Pryz avatar Aug 19 '22 00:08 Pryz

Wondering if we should use something else like NewReaderOrError. What do you think ? Have you seen WithError used in other package already ?

Ah I much prefer OrError, yeah!

I thought about the name a bit and ended up just going with my first idea; I'm definitely happy to change it. I've made that change :grin:

SgtCoDFish avatar Aug 19 '22 08:08 SgtCoDFish

Something I had documented (but maybe not highlighted enough) is the pattern which is currently supported to handle configuration errors; here is an example https://pkg.go.dev/github.com/segmentio/parquet-go#NewBuffer

I wonder if this would be an acceptable approach, or whether we feel like the new APIs are worth introducing still?

achille-roussel avatar Aug 23 '22 21:08 achille-roussel

I wonder if this would be an acceptable approach, or whether we feel like the new APIs are worth introducing still?

I think if configuration errors were the only way that creating a (Generic)Reader could fail that would be an acceptable tradeoff - the issue I encountered though was that creating the reader will panic on an invalid parquet file, which I think is demonstrated by the test that I added (not_enough_columns.invalid.parquet). I don't think the configuration approach you've shown there would solve that issue?

SgtCoDFish avatar Aug 23 '22 21:08 SgtCoDFish

Thanks for sharing these details.

You could open the file ahead of time to handle errors caused by invalid files, for example:

f, err := parquet.OpenFile(...)
if err != nil {
    ...
}
r := parquet.NewGenericReader[Row](f)

Then you have the guarantee that the reader constructor won't panic.

Let me know if this is an acceptable solution for your use case.

achille-roussel avatar Aug 29 '22 18:08 achille-roussel

You could open the file ahead of time to handle errors caused by invalid files, for example: Let me know if this is an acceptable solution for your use case.

This would probably work in our use case in the short term but it's quite verbose and fiddly. Using OpenFile means I need to manually calculate the size of my file, and it's just more boilerplate overall.

I like that NewGenericReader does all of this for me, so it would still be nice to be able to simply use NewGenericReaderOrError instead.

Another issue with panics is that there's no easy way for me to know whether a new failure mode which could cause a panic might be added to NewGenericReader in the future, which could cause my application to start to panic after I upgrade to a newer version of parquet-go. If it instead returns errors, my existing error handling code will handle those gracefully.

SgtCoDFish avatar Aug 30 '22 13:08 SgtCoDFish

@achille-roussel gentle bump - does my reasoning above seem reasonable?

SgtCoDFish avatar Sep 22 '22 09:09 SgtCoDFish

@Pryz @achille-roussel Hi, another gentle bump - does this look like it might be acceptable? Just looking to tie off the open PRs I have but of course I recognise that people are busy :grin:

SgtCoDFish avatar Nov 29 '22 15:11 SgtCoDFish