ontology Add subclasses of `data format` and add `file`

Description of the issue

Issue #859 has shown that subclasses of data format are needed. A class file should also be added so that we can clearly distinguish between a data format and a file.

Ideas of solution

If you already have ideas for the solution describe them here

Workflow checklist

[ ] I discussed the issue with someone else than me before working on a solution
[ ] I already read the latest version of the workflow for this repository
[ ] The goal of this ontology is clear to me

I am aware that

[ ] every entry in the ontology should have a definition
[ ] classes should arise from concepts rather than from words

May 12 '22 06:05 l-emele

When thinking about the data formats, I am asking myself whether what we have here is rather a subclass hierarchy. I also think we have to distinguish between a data format and a file, and then have something like file 'has data format' some 'data format' and 'csv file' 'has data format' some 'csv file format'. What about introducing the following subclass structure:

data format: A data format is a data descriptor that describes in which format the data is encoded. (As it is currently implemented.)
- file format: A file format is a data format that describes in which format data is encoded in a file.
  - text file format: A text file format is a file format that is structured as a sequence of lines of electronic text.
    - delimiter separated file format: A delimiter separated file format is a text file format that uses delimiter-separated values (also DSV) to store two-dimensional arrays of data by separating the values in each row with specific delimiter characters.
    - comma separated file format: A comma separated file format is a delimiter separated file format that uses comma (,) as delimiter.
  - binary file format: A binary file format is a file format that is not a text format. [^1]
    - GAMS data exchange format: A GAMS data exchange format is a binary file format used by General Algebraic Modeling System (GAMS).
  - microsoft excel workbook (xls): tbd
  - microsoft excel workbook (xlsx): tbd

The file classes can then be implemented as equivalent classes, e.g. A character separated value file is a file that has a character separated file format with the axiom: 'comma separated value file' 'Equivalent To' (file and 'has data format' some 'comma separated file format'). However, for that we need to define or import a general file class. Additionally, I suggest csv file as alternative term to comma separated file and csv as alternative term to both comma separated file and comma separated file format.

[^1]: Derived from https://en.wikipedia.org/wiki/Binary_file
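To illustrate the idea of the equivalent class, here is a minimal Python sketch, not an OWL implementation. The dictionary `PARENT` and both function names are hypothetical; the nesting follows the definitions given above (a comma separated file format is a delimiter separated file format), and the two Excel entries are left out because their definitions are still tbd.

```python
# Hypothetical in-memory model of the proposed hierarchy (child -> parent).
PARENT = {
    "file format": "data format",
    "text file format": "file format",
    "delimiter separated file format": "text file format",
    "comma separated file format": "delimiter separated file format",
    "binary file format": "file format",
    "GAMS data exchange format": "binary file format",
}


def is_subclass_of(cls: str, ancestor: str) -> bool:
    """Return True if `cls` equals `ancestor` or lies below it in PARENT."""
    current = cls
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def is_comma_separated_value_file(data_formats: list[str]) -> bool:
    """Mimic: file and 'has data format' some 'comma separated file format'."""
    return any(
        is_subclass_of(fmt, "comma separated file format") for fmt in data_formats
    )
```

A file whose data formats include 'comma separated file format' would then be classified as a comma separated value file, while a file with only a binary file format would not.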

Originally posted by @l-emele in https://github.com/OpenEnergyPlatform/ontology/issues/859#issuecomment-1123328815

May 12 '22 06:05 l-emele

I like your proposed structure. I'm not sure how to extend this for other file formats like scripts (.py) or images (.png). Or do they count as data?

I think this part could be found in some other domain ontology. @OpenEnergyPlatform/oeo-general-expert-formal-ontology

May 12 '22 20:05 Ludee

I think it is useful to add a further file format:

source code file format: A source code file format is a text file format that is used to store source code in a programming language.

May 13 '22 07:05 l-emele

OEO dev meeting 41: Implementing
- change data format: A data format is a data descriptor that specifies the structure in which the data item is encoded.
- add file format: A file format is a data format that describes how information is structured and encoded in a file.
- add file encoding: A file format is a data format that describes how information is structured and encoded in a file.

Jul 14 '22 09:07 stap-m

Update to the earlier post by @stap-m based on the "how to implement" session during the 41st OEO meeting.

- add data file format: A file format is a data format that describes how data is structured in a file.
- add file encoding: We identified that encoding will be necessary. We did not have time to come up with a definition.

Jul 14 '22 10:07 MGlauer

Suggestion for file encoding: A file encoding is a data format that specifies the character set used to encode the data in a file.
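A small Python sketch of what this definition distinguishes: the same text yields different bytes depending on the character set used. The sample string is only an illustration.

```python
text = "Größe"  # sample string with non-ASCII characters

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Same characters, different byte sequences in the file.
print(utf8_bytes != latin1_bytes)

# Decoding with the matching encoding restores the original text.
print(utf8_bytes.decode("utf-8") == text)
print(latin1_bytes.decode("latin-1") == text)
```

So two files with the same data file format (e.g. csv) can still differ in their file encoding.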

Jul 20 '22 08:07 rue-l

And a suggestion for the structure in order to sort in most of the individuals listed in #1149

data format
- data file format
  - text file format
  - comma separated file format
  - delimiter separated file format
  - binary file format
  - structured file format: Individuals: xml, json
- data file encoding
- in-memory data format: Individual: data frame
- data base format

Jul 21 '22 13:07 rue-l

This issue seems close to implementation. Summary:

Agreement seems to be reached about:

data format: A data format is a data descriptor that specifies the structure in which the data item is encoded.
- data file format: A file format is a data format that describes how data is structured in a file.
  - text file format: A text file format is a file format that is structured as a sequence of lines of electronic text.
  - comma separated file format: A comma separated file format is a delimiter separated file format that uses comma (,) as delimiter.
  - delimiter separated file format: A delimiter separated file format is a text file format that uses delimiter-separated values (also DSV) to store two-dimensional arrays of data by separating the values in each row with specific delimiter characters.
  - binary file format: A binary file format is a file format that is not a text format. [^1]

Still open:

- structured file format (definition missing): Individuals: xml, json
- data file encoding vs file encoding: A file encoding is a data format that specifies the character set used to encode the data in a file.
- in-memory data format (definition missing): Individual: data frame
- data base format (definition missing)

Sep 19 '22 16:09 chrwm

Is there a reason why CSV and DSV are not a subclass of text file?

Sep 19 '22 20:09 MGlauer

Is there a reason why CSV and DSV are not a subclass of text file?

DSV is a text file format (A delimiter separated file format is a text file format that uses delimiter-separated values (also DSV) to store two-dimensional arrays of data by separating the values in each row with specific delimiter characters.) and CSV is a DSV (A comma separated file format is a delimiter separated file format that uses comma (,) as delimiter.) You were probably misled, as the indentation in @chrwm's summary did not reflect this correctly.
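The relation can also be sketched with Python's standard `csv` module, where CSV is simply the DSV case with a comma as delimiter. The function names `read_dsv` and `read_csv` are hypothetical helpers for this illustration.

```python
import csv
import io


def read_dsv(text: str, delimiter: str) -> list[list[str]]:
    """Parse delimiter-separated text into rows of string values."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def read_csv(text: str) -> list[list[str]]:
    """CSV as the special case of DSV whose delimiter is a comma."""
    return read_dsv(text, delimiter=",")
```

Calling `read_dsv("a;b\n1;2\n", ";")` and `read_csv("a,b\n1,2\n")` gives the same two-dimensional array of values; only the delimiter differs.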

source code file format: A source code file format is a text file format that is used to store source code in a programming language.

As no one has objected to this proposal for months, I interpret that as agreement as well.

I suggest implementing what has already been agreed upon and discussing the remaining parts afterwards.

Sep 20 '22 08:09 l-emele

At the last dev meeting we decided that I am going to implement the terms already agreed upon.

Oct 07 '22 09:10 areleu

Should I make #1326 close this issue so we can push the remaining terms to the next release?

Oct 07 '22 09:10 areleu

If not everything is solved with PR #1326, then please leave this issue open and move the milestone. Also a list of open points after the PR would be nice.

Oct 07 '22 09:10 l-emele

Here is a list of classes that are not yet implemented:

- GAMS data exchange format: A GAMS data exchange format is a binary file format used by General Algebraic Modeling System (GAMS).
- microsoft excel workbook (xls)
- microsoft excel workbook (xlsx)
- structured file format with subclasses xml and json
- file encoding: A file encoding is a data format that specifies the character set used to encode the data in a file.
- data file encoding
- in-memory data format with subclass data frame
- data base format

Oct 25 '22 09:10 l-emele

I just realised that in #1326, classes were added, but the respective individuals were not deleted. This is the current list of data format individuals: [screenshot]

Instead of the individual csv (OEO:00000116) we now have the class comma separated file format (OEO:00280003) and instead of the individual txt (OEO:00000426) we now have the class text file format (OEO:00280001). We should delete these two individuals. Further, we should think about whether the classes comma separated file format and text file format should get the IDs and the IRIs of the old individuals. The advantage would be that users of the OEO get the refined concept under the same IRI.

Implementation of that would go in two steps:

Delete the individuals OEO:00000116 and OEO:00000426 using Protégé.
Replace all instances of OEO:00000116 with OEO:00280003 and of OEO:00000426 with OEO:00280001 in all ontology files using a text editor.
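As an alternative to a manual search and replace, the second step could be scripted. The following is only a sketch: the mapping is read from the step above as "individual ID to class ID", the function name `replace_ids` is hypothetical, and the IDs are matched in the `OEO:` notation used in this thread, which may differ from the notation in the actual ontology files.

```python
import re

# Assumed mapping, as read from the step above: individual ID -> class ID.
ID_MAP = {
    "OEO:00000116": "OEO:00280003",
    "OEO:00000426": "OEO:00280001",
}


def replace_ids(text: str, id_map: dict[str, str]) -> str:
    """Replace each whole identifier in `text` according to `id_map`."""
    # The negative lookahead avoids touching longer IDs that merely
    # start with one of the listed identifiers.
    pattern = re.compile(
        "|".join(re.escape(old) + r"(?!\d)" for old in id_map)
    )
    return pattern.sub(lambda match: id_map[match.group(0)], text)
```

The function works on a string, so it can be applied to the content of each ontology file in turn; reading and writing the files is left out here on purpose.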

This is a decision we should ideally make before the next release. Therefore I move this issue back to milestone 1.12.0. @OpenEnergyPlatform/oeo-release-team : Any opinions here?

Oct 27 '22 10:10 l-emele

I agree. The individuals are not used anywhere in the OEO currently. I'll delete the individuals.

Oct 31 '22 19:10 stap-m

I'm fine with not recycling the old IRIs and IDs. New OEO versions lack backward compatibility so often that the benefit of not breaking these two concepts does not seem worth the effort. I personally prefer the new IDs and hence argue not to re-use the old IDs and IRIs.

Nov 01 '22 15:11 chrwm

From the dev meeting: According to OBO best practices, classes/individuals are declared obsolete. This should have been done instead of deleting them. We didn't know what the decision in the OEO regarding this is, hence we postpone the issue to the next release milestone.

Nov 02 '22 08:11 chrwm

From the dev meeting: According to OBO best practices, classes/individuals are declared obsolete. This should have been done instead of deleting them. We didn't know what the decision in the OEO regarding this is, hence we postpone the issue to the next release milestone.

Thanks, I wasn't aware of it, since it wasn't commented here. At which dev meeting did we talk about that (number)?

Nov 02 '22 08:11 stap-m

Sorry I meant the release meeting today! We just talked about it before starting the release.

Nov 02 '22 10:11 chrwm

This issue has not seen any progress in the last two months and it is therefore unlikely that we will solve it before the next release. I thus move the milestone.

Jan 26 '23 13:01 l-emele

I transferred everything that is left from this issue to a new issue #1518 and thus close this issue here. If I missed something, feel free to re-open this issue or open a further issue.

Apr 21 '23 06:04 l-emele
