presto
Add file create config to hive session properties
This session property allows file creation options to be set on a per-query basis. This functionality is for Presto native only.
== NO RELEASE NOTE ==
I have some basic questions, if you could please take a moment to help me understand.
Can you give some examples of why we need to set file creation options on a per-query basis?
This is a session property: when do we expect the user of Presto to supply a value? And what does the value look like? Is it a path to a config file local to the worker?
@tdcmeehan Hi Tim, this is for native. In Meta internal testing, the file encoding (e.g. the number of replicas and the physical file block size) might affect performance and memory efficiency. We add this session property to allow users to specify these options per query. Thanks!
It will be a free-form config intended to give the underlying file system the options it needs to create a file. This is implementation dependent, so different file systems might have their own sets of configs (and that's why it's free-form). For example, you can specify the block size to write or the encodings to apply to the files. This property will be passed down to the underlying file system implementation, which is responsible for parsing the configs.
As an example, suppose my file system implementation is HDFS. Does this mean that I will pass a config file path local to the worker to a .properties file that specifies individual HDFS properties?
Not really. You do not supply a path. An example could be file_create_config="hdfs.write-file-block-size=10MB;hdfs.write-file-buffer=1MB"
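To make the shape of that value concrete, here is a minimal Python sketch of how a file system implementation might split such a string into key/value pairs. The helper name parse_file_create_config is hypothetical, and the ; and = delimiters are only the convention used in the example above; the actual parsing is left to whichever file system implementation receives the value.

```python
def parse_file_create_config(raw):
    """Split a 'k1=v1;k2=v2' string into a dict (hypothetical helper)."""
    options = {}
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing ';'
        key, sep, value = entry.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed entry: {entry!r}")
        options[key.strip()] = value.strip()
    return options


# The example value quoted in this thread.
config = parse_file_create_config(
    "hdfs.write-file-block-size=10MB;hdfs.write-file-buffer=1MB"
)
```

Values stay as strings here; interpreting something like 10MB is a separate step that each file system would define for itself.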
Got it. And basically each ;-separated value is expected to be a connector config?
Yeah, that is just an example. Again, the form can be freely defined by the underlying plugged-in file system. You can even use JSON to pass the properties in if you want to implement your file system to parse it that way.
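For comparison, a JSON-encoded value could be handled along these lines. This is only a sketch under assumptions: the helper name parse_json_file_create_config is hypothetical, and the option names are borrowed from the earlier example in this thread rather than from any real file system.

```python
import json


def parse_json_file_create_config(raw):
    """Parse a JSON object of option names to values (hypothetical helper)."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object of option names to values")
    # Normalize every value to a string so callers see one uniform type.
    return {key: str(value) for key, value in parsed.items()}


config = parse_json_file_create_config(
    '{"hdfs.write-file-block-size": "10MB", "hdfs.write-file-buffer": "1MB"}'
)
```

Either encoding carries the same information; the choice only changes what the receiving file system has to parse.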
Have you considered putting this session property in the PrismConnector?
The config value is free-form so that any Hive storage backend can leverage it to optimize physical data storage. Thanks!
Typically, connectors don't take in freeform configs, because such configs may be validated inconsistently, and are often undocumented. Presto already has a plethora of undocumented and obscure session properties that most people don't understand. If the idea is to allow a "generic" config to alter the behavior of the operations on the underlying filesystem, I think it's best to actually explicitly enumerate in the connector which configs the connector can take in to affect filesystem properties. Imagining this from a user's perspective, I find it difficult to understand how a user would set the session without deep experience and understanding of their underlying deployment, and also knowing the nuances of how the filesystem would choose to parse this particular config. Think about how we could document such a session property.
If you really think this is a good idea and the goal is to make it easier for Meta deployments to set custom session properties to tweak filesystem behavior on a per-query basis, then perhaps give the PrismConnector a try, since this is an internal Hive-compliant connector and accomplishes this goal in a way that unblocks your production support needs.
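As an illustration of the explicit enumeration suggested above, a connector could check incoming options against a fixed, documented set before anything reaches the file system. The sketch below is hypothetical throughout: the option names come from the earlier example in this thread, and the allowlist, the size format, and the function names are assumptions, not existing Presto APIs.

```python
def _parse_size_mb(value):
    """Accept values like '10MB' and return the number of megabytes."""
    if not value.endswith("MB") or not value[:-2].isdigit():
        raise ValueError(f"expected a size such as '10MB', got {value!r}")
    return int(value[:-2])


# Hypothetical allowlist: each supported option maps to its validator.
ALLOWED_OPTIONS = {
    "hdfs.write-file-block-size": _parse_size_mb,
    "hdfs.write-file-buffer": _parse_size_mb,
}


def validate_file_create_options(options):
    """Reject unknown keys and malformed values instead of passing them on."""
    validated = {}
    for key, value in options.items():
        validator = ALLOWED_OPTIONS.get(key)
        if validator is None:
            raise ValueError(f"unknown file creation option: {key!r}")
        validated[key] = validator(value)
    return validated


validated = validate_file_create_options({"hdfs.write-file-block-size": "10MB"})
```

With an enumerated set like this, every accepted option can be documented and validated consistently, which is the concern raised about free-form configs.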