glaredb icon indicating copy to clipboard operation
glaredb copied to clipboard

create external table inconsistencies

Open universalmind303 opened this issue 1 year ago • 1 comments

Description

there are some inconsistencies when using table options, especially in combination with create external table.

For many of them, they use the same object_store options, which allows for some rather inconsistent syntax

create external table test from bson
options (
  location = './external_schema.bson',
  file_type = 'bson'
)

create external table multi_report from excel 
options (
  location='./testdata/xlsx/multiple_sheets.xlsx', 
  file_type = 'xlsx', 
  sheet_name = 'other', 
  has_header = false
);

Additionally, the syntax can often be reversed for object stores (which is confusing)

create external table delta_azure_opts
from delta
options (
  location='azure://glaredb-test/ext-table*'
  account_name= 'AZURE_ACCOUNT',
  access_key='AZURE_ACCESS_KEY'
);

create external table ext_table_1 
from azure
credentials azure_creds 
options (
  location='azure://glaredb-test/ext-table*'
);

I don't think that both syntaxes should be supported, and i think we should drop support for one of them

Suggestion

  • Drop support for create external table t from <file_type> options (location = <object_store>)
  • make create external table t from <object_store> options (file_type = <file_type>) more explicit about only allowing object stores and file types. [1]

for example from delta|lance|iceberg options (...) should not use the object store options, but it's own dedicated options, as they are their own storage formats, and don't integrate into object store nicely like file formats do. It also makes the distinction between the two a lot clearer.

Additionally, create external table t from <file_type> should probably never be valid. If you want to do that, you should probably be using the table function, or the inference from './path/to/file.<file_type>' instead.


For brevity, for the remainder of the text, from ... will refer to create external table t from .....

(create external table t from t -> from t)

in the above examples, I suggest they be rewritten to use from local instead of from <file_type | table_type>.

  • from bson options (location=location...) -> from local options (location=location, file_type = 'bson')
  • from azure credentials azure_creds options (location='azure://'); -> from delta options(location='azure://') credentials azure_creds

So just to recap, all File types (bson, json, csv, parquet, avro, ...) should use the object store as the from qualifier

from <local | s3 | azure | gcs> options (object_store_opts)

  • from local options (file_type = 'csv', ...object_store_opts);
  • from s3 options (file_type = 'bson', ...object_store_opts);
  • from azure options (file_type = 'parquet', ...object_store_opts);

all storage formats should use the storage format as the from qualifier (delta, lance, iceberg)

from <delta | lance | iceberg | excel | ...> options (T_opts)

  • from delta options (location = 's3://', ...other delta specific options)
  • from excel options (location = 's3://', ...other excel specific options)

[1] I define a file format as a format that is used to represent a single table usually in a single file. (there is no notion of a table, and they are generally quite portable (csv, json, parquet, ...)).

A table format (AKA storage format) is any format that supports multiple tables (delta, lance, excel, iceberg).

universalmind303 avatar Mar 08 '24 23:03 universalmind303

I pretty strongly believe that we should hide FROM <storage location/service> from users (because that's all information that we can (and do, in many cases) derive from the path, and using FROM exclusively for "format" rather than "location" simplifies this. The inverse (location and not format) would mean that we would either need to deduce format, or add additional syntax.

tychoish avatar Mar 11 '24 13:03 tychoish