croissant Add documentation for other tabular files (`.xls*` and `.sqlite`)

There's plenty of documentation about how to build RecordSets and Fields from a CSV via source.extract.column, but there isn't any for .xls* or .sqlite files. These also house tabular info that should be representable as RecordSets with Fields. How should those columns be extracted from the sheets (for .xls*) and tables (for .sqlite)?

Oct 19 '23 00:10 goeffthomas

From the user perspective:

In order to read from Excel
- Name of the sheet. Croissant: this could be RecordSet.extract.sheet_name (new).
- Subset of columns to use in the sheet (optional). For example, E:F if only columns E and F are of interest in the sheet. Croissant: this could be RecordSet.extract.column (already exists).
In order to read from sqlite
- Link to the database to connect to. Croissant: this could be FileObject.contentUrl (already exists) with "encodingFormat": "application/x-sqlite3" (new).
- SQL command to execute. Croissant: this could be RecordSet.extract.sql (new).

@goeffthomas Do you see more parameters?

@benjelloun What do you think of the proposed Croissant equivalents fo each parameter?

Nov 01 '23 11:11 marcenacp

Name of the sheet. Croissant: this could be RecordSet.extract.sheet_name (new).

I wonder if we could use less Excel-specific language? Like, would RecordSet.extract.table be more general and achieve the same purpose? In my mind, Excel is just a file that support multiple tables, but maybe that's oversimplifying. Maybe the sqlite handling would benefit from this as well in some way?

For example, E:F if only columns E and F are of interest in the sheet

This prompted a couple thoughts:

I thought extract was at the Field level. If not, and you can use this at the RecordSet level, would you forego documenting Fields because you're gathering many columns in a single extraction?
Adding targeting by column lettering is pretty handy especially if there's no header. That got me thinking, how does our current handling of CSVs account for data without headers? Can you specify a column by 0-based index of the column?
I'd also be okay just not implementing this kind of Excel-specific targeting unless it's asked for later, so basically just treat each sheet like its own CSV

with "encodingFormat": "application/x-sqlite3"

Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463

SQL command to execute. Croissant: this could be RecordSet.extract.sql

Similar to the some themes from the Excel comments above:

What about a RecordSet.extract.command as a more generic term that might apply to other formats?
What are your thoughts on supporting at the Field level? So, similar to Excel if we used a more generic extract.table, you could use extract.table and extract.column in tandem. Maybe a SQL command could be constructed from this if the RecordSet.fields all have the same extract.table.

Nov 01 '23 18:11 goeffthomas

Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463

Actually, I've just uncovered that our implementation of what becomes encodingFormat doesn't account for .sqlite files. All of them end up as the default application/octet-stream. Though it may be nicer to key on encodingFormat alone for determining how to handle the file, we may want a fallback that checks by file extension when it's application/octet-stream.

Nov 03 '23 01:11 goeffthomas

@goeffthomas Can you say more on why you need this for the Kaggle implementation? My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?

Nov 03 '23 09:11 benjelloun

My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?

@benjelloun No, that's not the case. Our Data Explorer presents all of these tabular files in a viewer that may make it seem that way. But all of the files remain intact as xlsx or sqlite in the dataset.

Nov 04 '23 03:11 goeffthomas

@benjelloun Could we try to get this into the next Croissant version?

Oct 31 '24 17:10 goeffthomas

@goeffthomas That's a good idea. The directions discussed here make sense to me overall. Do you and @marcenacp want to put together a short proposal for a spec change, and we can use that to iron out the details?

A couple comments on the earlier discussion:

There is a need to specify how to describe a "database connection" in a way that works for SQL Lite and other systems as well. Hopefully we can find a standard approach we just adopt.
We need a mechanism to deal with multiple tabs in a spreadsheet.
We need a mechanism to address nameless columns, e.g., using numbering. We could change the current extract column to be column_name or column_number.
Fancy: Support SQL queries for extraction. This can be useful, but means that we have a way to evaluate that query (e.g., by sending it to a database server)

Nov 04 '24 14:11 benjelloun