Brief overview

AS A user of a dataframe library like Pandas/Polars/etc

I WANT to be able to upload my Parquet dataset

SO THAT I can skip all the nonsense around CSV inference

Additional details

To support this, we need an analogous type for each type in the Parquet format. Some notable types are currently missing:

[ ] Datetime
[ ] Bytes/bytearrays - necessary for UUID as well
[ ] Enum (we're allowed to parse as strings if necessary)
[ ] Time
[ ] Interval

Perhaps we could get away without supporting everything to start with, but without at least datetime and probably bytes there would be no real benefit in claiming any kind of support.
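As a rough illustration of the gap described above, here is a minimal sketch that reports which columns of a dataset would be blocked by the missing types. Everything in it is an assumption for illustration: the type labels are just the checklist entries spelled in lowercase (not Parquet's own annotation names), and `blocking_columns` is a hypothetical helper, not part of csvbase.

```python
# Labels taken from the checklist above; the spellings are illustrative only.
MISSING_TYPES = {"datetime", "bytes", "enum", "time", "interval"}

# The two types named above as the minimum for any useful support.
MUST_HAVE = {"datetime", "bytes"}


def blocking_columns(schema, implemented=frozenset()):
    """Return {column: type} for columns whose type is still unhandled.

    `schema` maps column names to type labels (hypothetical representation).
    `implemented` is the subset of MISSING_TYPES that has since been added.
    """
    still_missing = MISSING_TYPES - set(implemented)
    return {name: t for name, t in schema.items() if t in still_missing}


# A made-up schema for a dataset with a UUID key and a creation timestamp.
schema = {"id": "bytes", "created": "datetime", "name": "string", "status": "enum"}

print(blocking_columns(schema))                         # nothing implemented yet
print(blocking_columns(schema, implemented=MUST_HAVE))  # after datetime + bytes
```

With only datetime and bytes implemented, the sample schema is left with a single blocked column (the enum), which the checklist already allows to be parsed as a string.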

Jun 11 '24 09:06 calpaterson

Dupe of #99 ?

Jul 09 '24 11:07 thedatadavis

> Dupe of #99 ?

It absolutely is, yes - my mistake. :)

I've closed the other issue as this has slightly more detail on the types csvbase is missing.

Jul 09 '24 11:07 calpaterson

What about converting unsupported datatypes to STRING, until they are all supported?
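One way to picture this suggestion is a per-value fallback: values of types the store already handles pass through, and everything else is rendered as a string. This is only a sketch under assumptions. `NATIVE_TYPES` is a guess at what is handled natively, `to_storable` is a hypothetical name, and hex encoding for bytes is one arbitrary choice among several.

```python
import datetime as dt

# Assumption: the value types the table store can already hold natively.
NATIVE_TYPES = (str, int, float, bool, dt.date)


def to_storable(value):
    """Pass native values through; fall back to a string for anything else."""
    # Exact type check on purpose: datetime is a subclass of date, and it
    # must NOT be treated as natively supported here.
    if value is None or type(value) in NATIVE_TYPES:
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).hex()   # e.g. raw UUID bytes become hex text
    if isinstance(value, (dt.datetime, dt.time)):
        return value.isoformat()    # ISO 8601 keeps the text sortable
    return str(value)               # enums, intervals, anything unexpected


row = [42, dt.datetime(2024, 9, 13, 12, 9), b"\x00\xff", dt.timedelta(hours=1)]
print([to_storable(v) for v in row])
```

The conversion is lossy in the sense that the column comes back as text, which is why marking it as experimental until proper types exist seems sensible.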

Sep 13 '24 12:09 Max1Truc

> What about converting unsupported datatypes to STRING, until they are all supported?

That's actually not a bad idea. We could mark it as experimental or something meanwhile.

Would that help you use csvbase for your use case?

Sep 13 '24 13:09 calpaterson

Sure, as my data source converts everything to strings anyway :P

Sep 13 '24 14:09 Max1Truc

OK, I think this can be moved up then. I'll try to have a go next week.

Sep 13 '24 17:09 calpaterson

Would you consider it a good first issue for new contributors?

If I do not need to know too much about the codebase to make the change I would gladly have a shot at it.

Sep 13 '24 19:09 Max1Truc

> Would you consider it a good first issue for new contributors?
>
> If I do not need to know too much about the codebase to make the change I would gladly have a shot at it.

Hmm, probably not, as it requires a fair amount of knowledge and also involves making a load of design decisions.

Probably the best first changes are things related to getting it working locally for you. Many people use Docker (but I don't, so I don't discover the problems there). Does the Docker container work for you? Can you think of any ways to improve it? Can you remove tini and thereby possibly resolve https://github.com/calpaterson/csvbase/issues/126?

Sep 14 '24 06:09 calpaterson

Support uploading Parquet files
