parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

PARQUET-677: Quoted identifiers in column names.

Open ptkool opened this issue 9 years ago • 14 comments

ptkool avatar Aug 22 '16 20:08 ptkool

Hi @ptkool, thanks for your contribution. Can you open a Parquet jira for this? (https://issues.apache.org/jira/browse/PARQUET) then prefix the title of your PR with "PARQUET-XXX: " Thank you!

julienledem avatar Aug 23 '16 16:08 julienledem

Thanks for posting a patch @ptkool!

Is this preventing Parquet from being used for some cases? I want to make sure the trade-off is worth including this. Right now, we don't have to worry about column name compatibility across object models. Avro, for example, will generate Java and C code for schemas so it only allows column names that are alphanumeric (plus '_'). This would break compatibility with those Avro fields and probably with Thrift and some Protobuf as well. Is that worth fixing strange Hive columns?

I'm wondering if we shouldn't solve this problem by using field IDs or string aliases for column names that are alphanumeric to be consistent across object models.

rdblue avatar Sep 08 '16 16:09 rdblue

@ptkool did you want to update this? I still think this would be a useful addition. @rdblue I think it is compatible as long as we don't quote names that did not need quoting before.

julienledem avatar Feb 02 '17 00:02 julienledem

@julienledem, the problem is that this allows names in Parquet files that will break some object models.

rdblue avatar Feb 02 '17 01:02 rdblue

@rdblue But typically object model integration (Avro, Thrift, Protobuf) is used to import data into Parquet or read data imported in the same model. Since people have existing schemas, it makes sense to me to allow escaping special characters rather than forcing renaming individual fields which is painful and brittle. Possibly we could add a flag on write to enable this. That would force users to acknowledge that they may not be able to read this with Avro (for example). But even then something could be done at that layer to convert to an accepted field name. There are probably a few places that rely on names not containing "." for example for path representation that would need to be fixed similarly.

julienledem avatar Feb 02 '17 07:02 julienledem

@julienledem Yes, I will update this.

ptkool avatar Feb 04 '17 11:02 ptkool

I'm still -1 on this because it is allowing names that we know will break object models. I wouldn't say that Thrift, Protobuf, and Avro are only used to import data that will be read by other models after that. Avro originally only read files where there was an Avro schema, but we quickly had requests to be able to read non-Avro files (those written by Impala) with Avro.

rdblue avatar Feb 21 '17 22:02 rdblue

@rdblue I agree with your statement that we should not restrict reading a parquet file to only the model that wrote it however I don't think that this PR forces that. Each model has different rules about what is a valid name and what is not. By restricting the names in a parquet schema we force mangling them while writing as well as when reading them back. Often schemas already exist with the name as is.

To read back from a model that has different naming restrictions I think we should deal with that on read by having a standard rule per model on how to transform a name into one that is valid in that model on read. All we need is a way to map that name on read.

There are already valid Parquet field that are considered keywords in thrift for example.

julienledem avatar Feb 22 '17 00:02 julienledem

@julienledem Any more thoughts on this?

ptkool avatar Apr 30 '17 22:04 ptkool

This is currently blocking us from using Parquet as we are having arbitrary field names in our (so far json-backed) schema. While it is possible to escape/translate them to conform to Parquet limitations, it would be way preferable if Parquet could use arbitrary field names.

MatthiasEgli avatar Aug 31 '17 21:08 MatthiasEgli

Any news about this PR? It would be very useful IMHO

fpompermaier avatar Jan 31 '19 10:01 fpompermaier

Any update on this issue? I am facing this issue in one of my project with parquet format and its pissing me off. I don't want to alter the dataset by replacing space with _ OR removing it altogether.

airshad avatar Sep 11 '20 11:09 airshad