create external table inconsistencies
Description
there are some inconsistencies when using table options, especially in combination with create external table.
For many of them, they use the same object_store options, which allows for some rather inconsistent syntax
create external table test from bson
options (
location = './external_schema.bson',
file_type = 'bson'
)
create external table multi_report from excel
options (
location='./testdata/xlsx/multiple_sheets.xlsx',
file_type = 'xlsx',
sheet_name = 'other',
has_header = false
);
Additionally, the syntax can often be reversed for object stores (which is confusing)
create external table delta_azure_opts
from delta
options (
location='azure://glaredb-test/ext-table*'
account_name= 'AZURE_ACCOUNT',
access_key='AZURE_ACCESS_KEY'
);
create external table ext_table_1
from azure
credentials azure_creds
options (
location='azure://glaredb-test/ext-table*'
);
I don't think that both syntaxes should be supported, and i think we should drop support for one of them
Suggestion
- Drop support for
create external table t from <file_type> options (location = <object_store>) - make
create external table t from <object_store> options (file_type = <file_type>)more explicit about only allowing object stores and file types. [1]
for example from delta|lance|iceberg options (...) should not use the object store options, but it's own dedicated options, as they are their own storage formats, and don't integrate into object store nicely like file formats do. It also makes the distinction between the two a lot clearer.
Additionally, create external table t from <file_type> should probably never be valid. If you want to do that, you should probably be using the table function, or the inference from './path/to/file.<file_type>' instead.
For brevity, for the remainder of the text, from ... will refer to create external table t from .....
(create external table t from t -> from t)
in the above examples, I suggest they be rewritten to use from local instead of from <file_type | table_type>.
from bson options (location=location...)->from local options (location=location, file_type = 'bson')from azure credentials azure_creds options (location='azure://');->from delta options(location='azure://') credentials azure_creds
So just to recap, all File types (bson, json, csv, parquet, avro, ...) should use the object store as the from qualifier
from <local | s3 | azure | gcs> options (object_store_opts)
from local options (file_type = 'csv', ...object_store_opts);from s3 options (file_type = 'bson', ...object_store_opts);from azure options (file_type = 'parquet', ...object_store_opts);
all storage formats should use the storage format as the from qualifier (delta, lance, iceberg)
from <delta | lance | iceberg | excel | ...> options (T_opts)
from delta options (location = 's3://', ...other delta specific options)from excel options (location = 's3://', ...other excel specific options)
[1] I define a file format as a format that is used to represent a single table usually in a single file. (there is no notion of a table, and they are generally quite portable (csv, json, parquet, ...)).
A table format (AKA storage format) is any format that supports multiple tables (delta, lance, excel, iceberg).
I pretty strongly believe that we should hide FROM <storage location/service> from users (because that's all information that we can (and do, in many cases) derive from the path, and using FROM exclusively for "format" rather than "location" simplifies this. The inverse (location and not format) would mean that we would either need to deduce format, or add additional syntax.