Support uploading Parquet files

Open calpaterson opened this issue 1 year ago • 8 comments

Brief overview

AS A user of a dataframe library like Pandas/Polars/etc

I WANT to be able to upload my Parquet dataset

SO THAT I can skip all the nonsense around csv inference

Additional details

To support this, we need an analogous type for each type in the Parquet format. Some notable ones currently missing:

  • [ ] Datetime
  • [ ] Bytes/bytearrays - necessary for UUID as well
  • [ ] Enum (we're allowed to parse as strings if necessary)
  • [ ] Time
  • [ ] Interval

Perhaps we could get away without supporting everything to start with, but without at least datetime, and probably bytes, there would be no real benefit to claiming any kind of support.

calpaterson avatar Jun 11 '24 09:06 calpaterson
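
The checklist above can be read as a gate on uploads: a file is only cleanly importable if none of its columns use a type on the missing list. The following is a minimal sketch of that check, not csvbase code. The type labels, `MISSING_TYPES`, and `unsupported_columns` are all hypothetical names made up for illustration, and the schema is a plain dict rather than a real Parquet schema object.

```python
# Types the issue lists as currently missing. These are illustrative labels,
# not identifiers from csvbase or from the Parquet format itself.
MISSING_TYPES = frozenset({"datetime", "bytes", "enum", "time", "interval"})


def unsupported_columns(schema: dict[str, str]) -> dict[str, str]:
    """Return the columns whose type is on the missing list.

    `schema` maps column name to a type label (hypothetical representation).
    """
    return {
        name: type_name
        for name, type_name in schema.items()
        if type_name.lower() in MISSING_TYPES
    }


# Hypothetical schema for a table with a UUID-like key and a timestamp.
example_schema = {
    "id": "bytes",
    "name": "string",
    "created": "datetime",
    "score": "float",
}
blocked = unsupported_columns(example_schema)
```

Here `blocked` holds the `id` and `created` columns, which matches the point made above: without datetime and bytes, even a fairly ordinary table would not round-trip.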

Dupe of #99 ?

thedatadavis avatar Jul 09 '24 11:07 thedatadavis

> Dupe of #99 ?

It absolutely is, yes - my mistake. :)

I've closed the other issue as this has slightly more detail on the types csvbase is missing.

calpaterson avatar Jul 09 '24 11:07 calpaterson

What about converting unsupported datatypes to STRING, until they are all supported?

Max1Truc avatar Sep 13 '24 12:09 Max1Truc

> What about converting unsupported datatypes to STRING, until they are all supported?

That's actually not a bad idea. We could mark it as experimental or something meanwhile.

Would that help you use csvbase for your use case?

calpaterson avatar Sep 13 '24 13:09 calpaterson
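
The string fallback suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: the type labels, `UNSUPPORTED`, `stringify`, and `coerce_column` are hypothetical names, columns are modelled as plain Python lists, and the chosen encodings (ISO 8601 for datetimes and times, hex for bytes) are one option among several that would need a real design decision.

```python
from datetime import datetime, time, timedelta
from enum import Enum

# Hypothetical labels for the types the issue lists as missing.
UNSUPPORTED = frozenset({"datetime", "bytes", "enum", "time", "interval"})


def stringify(value):
    """Render a single value as text, keeping nulls as None."""
    if value is None:
        return None
    if isinstance(value, (datetime, time)):
        return value.isoformat()
    if isinstance(value, (bytes, bytearray)):
        # Hex is an arbitrary choice here; base64 would work equally well.
        return bytes(value).hex()
    if isinstance(value, Enum):
        return value.name
    if isinstance(value, timedelta):
        return str(value)
    return str(value)


def coerce_column(type_name, values):
    """Downgrade an unsupported column to strings; pass others through."""
    if type_name in UNSUPPORTED:
        return "string", [stringify(v) for v in values]
    return type_name, values


new_type, new_values = coerce_column(
    "datetime", [datetime(2024, 9, 13, 12, 9), None]
)
```

A fallback like this is lossy in one direction only: the text can still be parsed back later, but the column's original type is no longer recorded, which is one reason to flag the feature as experimental.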

Sure, as my data source converts everything to strings anyway :P

Max1Truc avatar Sep 13 '24 14:09 Max1Truc

OK, I think this can be moved up then. I'll try to have a go next week.

calpaterson avatar Sep 13 '24 17:09 calpaterson

Would you consider it a good first issue for new contributors?

If I do not need to know too much about the codebase to make the change I would gladly have a shot at it.

Max1Truc avatar Sep 13 '24 19:09 Max1Truc

> Would you consider it a good first issue for new contributors?
>
> If I do not need to know too much about the codebase to make the change I would gladly have a shot at it.

Hmm, probably not, as it both requires a fair amount of knowledge of the codebase and involves making a load of design decisions.

Probably the best first changes are ones related to getting it working locally for you. Many people use Docker (but I don't, so I don't discover the problems there). Does the Docker container work for you? Can you think of any ways to improve it? Can you remove tini and thereby possibly resolve https://github.com/calpaterson/csvbase/issues/126?

calpaterson avatar Sep 14 '24 06:09 calpaterson