Add documentation for other tabular files (`.xls*` and `.sqlite`)
There's plenty of documentation about how to build RecordSets and Fields from a CSV via source.extract.column, but there isn't any for .xls* or .sqlite files. These also house tabular info that should be representable as RecordSets with Fields. How should those columns be extracted from the sheets (for .xls*) and tables (for .sqlite)?
From the user perspective:
- In order to read from Excel
- Name of the sheet. Croissant: this could be
RecordSet.extract.sheet_name(new). - Subset of columns to use in the sheet (optional). For example,
E:Fif only columns E and F are of interest in the sheet. Croissant: this could beRecordSet.extract.column(already exists).
- Name of the sheet. Croissant: this could be
- In order to read from sqlite
- Link to the database to connect to. Croissant: this could be
FileObject.contentUrl(already exists) with"encodingFormat": "application/x-sqlite3"(new). - SQL command to execute. Croissant: this could be
RecordSet.extract.sql(new).
- Link to the database to connect to. Croissant: this could be
@goeffthomas Do you see more parameters?
@benjelloun What do you think of the proposed Croissant equivalents fo each parameter?
Name of the sheet. Croissant: this could be RecordSet.extract.sheet_name (new).
I wonder if we could use less Excel-specific language? Like, would RecordSet.extract.table be more general and achieve the same purpose? In my mind, Excel is just a file that support multiple tables, but maybe that's oversimplifying. Maybe the sqlite handling would benefit from this as well in some way?
For example,
E:Fif only columns E and F are of interest in the sheet
This prompted a couple thoughts:
- I thought
extractwas at theFieldlevel. If not, and you can use this at theRecordSetlevel, would you forego documentingFields because you're gathering many columns in a single extraction? - Adding targeting by column lettering is pretty handy especially if there's no header. That got me thinking, how does our current handling of CSVs account for data without headers? Can you specify a column by 0-based index of the column?
- I'd also be okay just not implementing this kind of Excel-specific targeting unless it's asked for later, so basically just treat each sheet like its own CSV
with
"encodingFormat": "application/x-sqlite3"
Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463
SQL command to execute. Croissant: this could be
RecordSet.extract.sql
Similar to the some themes from the Excel comments above:
- What about a
RecordSet.extract.commandas a more generic term that might apply to other formats? - What are your thoughts on supporting at the
Fieldlevel? So, similar to Excel if we used a more genericextract.table, you could useextract.tableandextract.columnin tandem. Maybe a SQL command could be constructed from this if theRecordSet.fields all have the sameextract.table.
Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463
Actually, I've just uncovered that our implementation of what becomes encodingFormat doesn't account for .sqlite files. All of them end up as the default application/octet-stream. Though it may be nicer to key on encodingFormat alone for determining how to handle the file, we may want a fallback that checks by file extension when it's application/octet-stream.
@goeffthomas Can you say more on why you need this for the Kaggle implementation? My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?
My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?
@benjelloun No, that's not the case. Our Data Explorer presents all of these tabular files in a viewer that may make it seem that way. But all of the files remain intact as xlsx or sqlite in the dataset.
@benjelloun Could we try to get this into the next Croissant version?
@goeffthomas That's a good idea. The directions discussed here make sense to me overall. Do you and @marcenacp want to put together a short proposal for a spec change, and we can use that to iron out the details?
A couple comments on the earlier discussion:
- There is a need to specify how to describe a "database connection" in a way that works for SQL Lite and other systems as well. Hopefully we can find a standard approach we just adopt.
- We need a mechanism to deal with multiple tabs in a spreadsheet.
- We need a mechanism to address nameless columns, e.g., using numbering. We could change the current extract column to be column_name or column_number.
- Fancy: Support SQL queries for extraction. This can be useful, but means that we have a way to evaluate that query (e.g., by sending it to a database server)