parquet-go
parquet-go copied to clipboard
Add "WithError" versions of NewReader/NewGenericReader
It's currently difficult to use NewReader
/ NewGenericReader
in a situation where an invalid file might be read, since they can panic when trying to read an invalid file.
This PR adds WithError versions of NewReader
and NewGenericReader
, which return an error if something goes wrong rather than panic
ing.
This PR also adds a testdata file which causes a New(Generic)Reader
to panic, and adds tests ensuring that NewReaderWithError
and NewGenericReaderWithError
return an error when trying to read that file. It was that file which alerted me to this issue!
I figured it was easier to raise a PR for discussion than to create an issue for this - if this needs any work or if this isn't something you want to add to the library, please let me know!
Alternative Approach
Another approach would be to break the API and change the signatures to e.g. func NewReader(...) (*Reader, error)
- given that this library isn't at v1 yet, that might be an option which some libraries would take.
I assumed that breaking existing code wouldn't be desirable here, though - if this library ever publishes a v2
, that might be a thing to do then.
That said, if you'd be happy to break the API I'd gladly change this PR to do that instead!
Hi @SgtCoDFish thanks for your PR! This is certainly a change we can accept. I agree that it's better to not break the API just yet.
Something I'm not sure tho is the name of the function. Usually (in the stdlib for example) functions with WithError
will expect the user to pass an error in parameter. Example: https://github.com/segmentio/parquet-go/pull/313.
Wondering if we should use something else like NewReaderOrError
. What do you think ? Have you seen WithError
used in other package already ?
Wondering if we should use something else like NewReaderOrError. What do you think ? Have you seen WithError used in other package already ?
Ah I much prefer OrError
, yeah!
I thought about the name a bit and ended up just going with my first idea; I'm definitely happy to change it. I've made that change :grin:
Something I had documented (but maybe not highlighted enough) is the pattern which is currently supported to handle configuration errors; here is an example https://pkg.go.dev/github.com/segmentio/parquet-go#NewBuffer
I wonder if this would be an acceptable approach, or whether we feel like the new APIs are worth introducing still?
I wonder if this would be an acceptable approach, or whether we feel like the new APIs are worth introducing still?
I think if configuration errors were the only way that creating a (Generic)Reader
could fail that would be an acceptable tradeoff - the issue I encountered though was that creating the reader will panic on an invalid parquet file, which I think is demonstrated by the test that I added (not_enough_columns.invalid.parquet
). I don't think the configuration approach you've shown there would solve that issue?
Thanks for sharing these details.
You could open the file ahead of time to handle errors caused by invalid files, for example:
f, err := parquet.OpenFile(...)
if err != nil {
...
}
r := parquet.NewGenericReader[Row](f)
Then you have the guarantee that the reader constructor won't panic.
Let me know if this is an acceptable solution for your use case.
You could open the file ahead of time to handle errors caused by invalid files, for example: Let me know if this is an acceptable solution for your use case.
This would probably work in our use case in the short term but it's quite verbose and fiddly. Using OpenFile
means I need to manually calculate the size of my file, and it's just more boilerplate overall.
I like that NewGenericReader
does all of this for me, so it would still be nice to be able to simply use NewGenericReaderOrError
instead.
Another issue with panics is that there's no easy way for me to know whether a new failure mode which could cause a panic might be added to NewGenericReader
in the future, which could cause my application to start to panic after I upgrade to a newer version of parquet-go
. If it instead returns errors, my existing error handling code will handle those gracefully.
@achille-roussel gentle bump - does my reasoning above seem reasonable?
@Pryz @achille-roussel Hi, another gentle bump - does this look like it might be acceptable? Just looking to tie off the open PRs I have but of course I recognise that people are busy :grin: