specs
specs copied to clipboard
Table Dialect Spec
Overview
For now, we have:
- a resource describes a data entity
- a resource can be tabular
- a tabular resource has
schema
property that must beTabular Schema
- a tabular resource has
dialect
property that must beCSV Dialect
- a tabular resource has
It means that we have only two mechanisms to add tabular information to the resource: schema
and dialect
properties:
-
schema
: what is the data -
dialect
: how to extract the data
Maybe at some point this list can be extended e.g. providing table filtering ability etc but, as for now, I think we definitely can generalize the dialect
property. Instead of having it csv-only
we can have a general Table Dialect
spec helping describe any tabular format details.
The proposed Table Dialect
spec will create a nice symmetry with already existent Table Schema
spec. Here is a quick overview of the proposal. The spec is hierarchical so e.g. Csv Table Dialect
inherits all the props from Table Dialect
.
Table Dialect
Core Table Dialect
spec will handle header
management.
header (bool)
default: true
Whether the table has a header row(s)
headerRows (int[])
default: [1]
An array of header row numbers. Can describe a multiline header.
headerJoin (str)
default: ' ' (one space)
A string to concatenate a multiline header. Has no effect for a single row header.
Csv Table Dialect
@amercader hints that we also need to re-review the CSVW spec in case we miss something - https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dialect-descriptions
It will support all the header
options and the options below which is standard for csv
.
delimiter (str)
default: ,
lineTerminator (str)
default: \r\n
quoteChar (str)
default: ""
doubleQuote (bool)
default: true
escapeChar (str)
default: not set
nullSequence (str)
default: not set
skipInitialSpace (bool)
default: false
I propose the following changes to the current Csv Dialect spec:
- make
skipInitialSpace=False
by default to sync with Python/Pandas/JS/etc behaviour - remove
caseSensitiveHeader
as I guess it should be an option for someinfer
function but for general data description I'm not sure what it does - review
commentChar
option as partially its role will be handled byheaderRows
and, at the same time, there is more functionalskipRows
supported by the software. In software, I've moved all theskip/pick/limit/offset_fields/rows
functionality to a separate group calledTable Query
(orTable Discovery
previously) which should probably exist only in software because we don't want to make ETL from the specs, although I think there are options to consider.
Excel Table Dialect
It will support all the header
options and:
sheet (str|int)
default: 1
String or integer to address an excel sheet e.g. 2
or Sheet 2
.
Options to consider:
- fillMergedCells
- preserveFormatting
- adjustFloatingPointError
Json Table Dialect
It will support all the header
options and:
keyed (bool)
default: false
Whether a source is keyed i.e. an array of dictionaries instead of an array of arrays.
keys (str[])
default: not set
For a keyed source, an array of keys to use as a header row.
Options to consider:
- property (path to the data within json e.g.
dogs/data
)
In conclusion, the idea is:
- csv is not the only tabular format; let's describe others, the most importantly Excel
- to have one hierarchical spec which will help standardize different formats' dialects
- new formats and properties addition should be considered based on users' demand and should happen gradually
Very interesting proposal and generally big 👍
-
headerRows
- we def want something like this - i believe #681 is also about this - Not sure about
headerJoin
- can you provide more detail? - Like the idea of specializing the dialect to other cases
Probably I should have put headerJoin
into the Options to consider
category as it's a very minor and rare case. At the same time, people asked for this option for tabulator
many times including for pilots as there are a lot Excel files with "fancy" multiline header like:
So it's basically the way we join a multiline header row:
dialect = ExcelDialect(header_rows=[7, 8, 9], header_join='/')
with Table('excel.xlsx', dialect=dialect) as table:
print(table.headers)
# ['Current/Phase1/#', ...]
OK, understood and very good to have concrete examples.
*Datopian *| https://datopian.com | Open solutions for a data driven world DataHub | https://datahub.io | GitHub for data *CKAN * | http://ckan.org https://ckan.org/ | The world's leading data portal solution
President - +44 7795176976 - @rufuspollock
On Mon, Jul 27, 2020 at 12:00 PM roll [email protected] wrote:
Probably I should have put headerJoin into the Options to consider category as it's a very minor and rare case. At the same time, people asked for this option for tabulator many times including for pilots as there are a lot Excel files with "fancy" multiline header like:
[image: excel] https://user-images.githubusercontent.com/557395/88529413-a54faf00-d008-11ea-90b2-1edb1b9b72a5.png
So it's basically the way we join a multiline header row:
dialect = ExcelDialect(headerRows=[7, 8, 9], header_join='/') with Table('excel.xlsx', dialect=dialect) as table: print(table.headers) # ['Current/Phase1/#', ...]
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/specs/issues/697#issuecomment-664252911, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABMDMWRQIS7TOVVBFH2K5LR5VF3FANCNFSM4PIQDXOQ .
@roll do you want to start with a pull request to add this as a pattern?
Sure. I'll PR (can take some time though)
Json Table Dialect requires further discussion as many ways exist to encode tabular data in JSON.
- array of arrays
- array of objects (aka keyed) so order of columns is unknown
- object with property for rows and property for header
See https://www.w3.org/TR/csv2json/ (CSVW) for an example of a specification that supports 1 (simple) and 3 (slightly complicated). A simplified form of this uses an object with property rows
for rows and property columns
with an array of objects, each having property label
at least.
So keyed
is probably fine but keys
is more complex.
Moreover cells in JSON Tables and Excel Tables can have data types other than plain strings.
- Excel: text, number, logical, error.
- JSON: any JSON data type except object and possibly array (string, number, boolean, null)
Datatypes can be defined with columns as done in CSVW but less complex (e.g. only string, number, logical).
Thanks, @nichtich!
I think it should not be a blocker as in specs like this we have a privilege to start from a small core and extend once other properties are discussed and justified