Add subclasses of `data format` and add `file`

Open l-emele opened this issue 2 years ago • 14 comments

Description of the issue

Issue #859 has shown that subclasses of data format are needed. Also, a class file should be added so that we can clearly distinguish between a data format and a file.

Ideas of solution

If you already have ideas for the solution describe them here

Workflow checklist

  • [ ] I discussed the issue with someone else than me before working on a solution
  • [ ] I already read the latest version of the workflow for this repository
  • [ ] The goal of this ontology is clear to me

I am aware that

  • [ ] every entry in the ontology should have a definition
  • [ ] classes should arise from concepts rather than from words

l-emele avatar May 12 '22 06:05 l-emele

When thinking about the data formats, I am asking myself whether what we have here is more like a subclass hierarchy. I also think we have to distinguish between a data format and a file, and then have something like file 'has data format' some 'data format' and 'csv file' 'has data format' some 'csv file format'. What about introducing the following subclass structure:

  • data format: A data format is a data descriptor that describes in which format the data is encoded. (As it is currently implemented.)
    • file format: A file format is a data format that describes in which format data is encoded in a file.
      • text file format: A text file format is a file format that is structured as a sequence of lines of electronic text.
        • delimiter separated file format: A delimiter separated file format is a text file format that uses delimiter-separated values (also DSV) to store two-dimensional arrays of data by separating the values in each row with specific delimiter characters.
        • comma separated file format: A comma separated file format is a delimiter separated file format that uses comma (,) as delimiter.
      • binary file format: A binary file format is a file format that is not a text format. [^1]
        • GAMS data exchange format: A GAMS data exchange format is a binary file format used by General Algebraic Modeling System (GAMS).
      • microsoft excel workbook (xls): tbd
      • microsoft excel workbook (xlsx): tbd

The file classes can then be implemented as equivalent classes, e.g. A comma separated value file is a file that has a comma separated file format, with the axiom: 'comma separated value file' 'Equivalent To' (file and 'has data format' some 'comma separated file format'). However, for that we need to define or import a general file class. Additionally, I suggest csv file as an alternative term for comma separated file, and csv as an alternative term for both comma separated file and comma separated file format.

[^1]: Derived from https://en.wikipedia.org/wiki/Binary_file
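
To make the intent of the proposed hierarchy and the equivalent-class idea more concrete, here is a minimal sketch in plain Python. It is not OWL reasoning and not how the OEO is implemented; the dictionary `PARENT` and the helper names `is_a` and `classify_file` are hypothetical and only mirror the labels proposed above.

```python
# Child -> parent links for the proposed subclass structure (labels from the proposal).
PARENT = {
    "file format": "data format",
    "text file format": "file format",
    "delimiter separated file format": "text file format",
    "comma separated file format": "delimiter separated file format",
    "binary file format": "file format",
    "GAMS data exchange format": "binary file format",
}


def is_a(format_name, ancestor):
    """Return True if format_name equals ancestor or is a transitive subclass of it."""
    current = format_name
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)  # None once the root is passed
    return False


def classify_file(data_format):
    """Mimic the equivalent-class axiom: a file is a 'comma separated value file'
    exactly when its data format is a 'comma separated file format'."""
    if is_a(data_format, "comma separated file format"):
        return "comma separated value file"
    return "file"


print(is_a("comma separated file format", "text file format"))
print(classify_file("binary file format"))
```

In an actual ontology a reasoner would infer this classification from the axiom; the sketch only shows why a general file class and the 'has data format' relation are both needed.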

Originally posted by @l-emele in https://github.com/OpenEnergyPlatform/ontology/issues/859#issuecomment-1123328815

l-emele avatar May 12 '22 06:05 l-emele

I like your proposed structure. I'm not sure how to extend this to other file formats like scripts (.py) or images (.png). Or do they count as data?

I think this part could be found in some other domain ontology. @OpenEnergyPlatform/oeo-general-expert-formal-ontology

Ludee avatar May 12 '22 20:05 Ludee

I think it is useful to add a further file format:

  • source code file format: A source code file format is a text file format that is used for source code in a programming language.

l-emele avatar May 13 '22 07:05 l-emele

OEO dev meeting 41: Implementing

  • change data format: A data format is a data descriptor that specifies the structure in which the data item is encoded.
  • add file format: A file format is a data format that describes how information is structured and encoded in a file.
  • add file encoding: A file format is a data format that describes how information is structured and encoded in a file.

stap-m avatar Jul 14 '22 09:07 stap-m

Update to the earlier post by @stap-m based on the "how to implement" session during the 41st OEO meeting.

  • add data file format: A file format is a data format that describes how data is structured in a file.
  • add file encoding: We identified that encoding will be necessary. We did not have time to come up with a definition.

MGlauer avatar Jul 14 '22 10:07 MGlauer

Suggestion for file encoding: A file encoding is a data format that specifies the character set used to encode the data in a file.
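
As an illustration of why the encoding is a separate concern from the file format, here is a small sketch: the same CSV-like text yields different bytes depending on the character set used. The sample string is made up for this example.

```python
# The same text, written with two different character sets.
text = "Größe;Wert"  # hypothetical header line of a delimiter separated file

utf8_bytes = text.encode("utf-8")      # 'ö' and 'ß' take two bytes each
latin1_bytes = text.encode("latin-1")  # every character takes one byte

print(len(utf8_bytes), len(latin1_bytes))

# Reading the bytes back only works with the encoding they were written in.
print(utf8_bytes.decode("utf-8") == text)
print(utf8_bytes.decode("latin-1") == text)
```

Both byte sequences follow the same delimiter separated layout, so a format description alone does not tell a reader how to decode the file.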

rue-l avatar Jul 20 '22 08:07 rue-l

And a suggestion for the structure in order to sort in most of the individuals listed in #1149

  • data format
    • data file format
      • text file format
      • comma separated file format
      • delimiter separated file format
      • binary file format
      • structured file format : Individuals: xml, json,
    • data file encoding
    • in-memory data format: Individual: data frame
    • data base format

rue-l avatar Jul 21 '22 13:07 rue-l

This issue seems close to implementation. Summary:

Agreement seems to be reached about:

  • data format: A data format is a data descriptor that specifies the structure in which the data item is encoded.
    • data file format: A file format is a data format that describes how data is structured in a file.
      • text file format: A text file format is a file format that is structured as a sequence of lines of electronic text.
      • comma separated file format: A comma separated file format is a delimiter separated file format that uses comma (,) as delimiter.
      • delimiter separated file format: A delimiter separated file format is a text file format that uses delimiter-separated values (also DSV) to store two-dimensional arrays of data by separating the values in each row with specific delimiter characters.
      • binary file format: A binary file format is a file format that is not a text format. [^1]

Still open:

  • structured file format (definition missing): Individuals: xml, json
  • data file encoding vs file encoding: A file encoding is a data format that specifies the character set used to encode the data in a file.
  • in-memory data format (definition missing): Individual: data frame
  • data base format (definition missing)

chrwm avatar Sep 19 '22 16:09 chrwm

Is there a reason why CSV and DSV are not a subclass of text file?

MGlauer avatar Sep 19 '22 20:09 MGlauer

> Is there a reason why CSV and DSV are not a subclass of text file?

DSV is a text file format (A delimiter separated file format is a text file format that uses delimiter-separated values (also DSV) to store two-dimensional arrays of data by separating the values in each row with specific delimiter characters.) and CSV is a DSV (A comma separated file format is a delimiter separated file format that uses comma (,) as delimiter.). You were probably misled, as the indentation in @chrwm's summary did not reflect this correctly.

  • source code file format: A source code file format is a text file format that is used for source code in a programming language.

As no one has objected to this proposal for months, I interpret that as agreement as well.

I suggest implementing what has already been agreed upon and discussing the remaining parts afterwards.

l-emele avatar Sep 20 '22 08:09 l-emele

At the last dev meeting we decided that I will implement the terms already agreed upon.

areleu avatar Oct 07 '22 09:10 areleu

Should I make #1326 close this issue so we can push the remaining terms to the next release?

areleu avatar Oct 07 '22 09:10 areleu

If not everything is solved with PR #1326, then please leave this issue open and move the milestone. Also a list of open points after the PR would be nice.

l-emele avatar Oct 07 '22 09:10 l-emele

Here is a list of classes that are not yet implemented:

  • GAMS data exchange format: A GAMS data exchange format is a binary file format used by General Algebraic Modeling System (GAMS).
  • microsoft excel workbook (xls)
  • microsoft excel workbook (xlsx)
  • structured file format with subclasses xml and json
  • file encoding: A file encoding is a data format that specifies the character set used to encode the data in a file.
  • data file encoding
  • in-memory data format with subclass data frame
  • data base format

l-emele avatar Oct 25 '22 09:10 l-emele

I just realised that in #1326, classes were added, but the respective individuals were not deleted. This is the current list of data format individuals: [screenshot]

Instead of the individual csv (OEO:00000116) we now have the class comma separated file format (OEO:00280003), and instead of the individual txt (OEO:00000426) we now have the class text file format (OEO:00280001). We should delete these two individuals. Further, we should think about whether the classes comma separated file format and text file format should get the IDs and the IRIs of the old individuals. The advantage would be that users of the OEO get the refined concept with the same IRI.

Implementation of that would go in two steps:

  1. Delete the individuals OEO:00000116 and OEO:00000426 using Protégé.
  2. Replace all instances of OEO:00000116 with OEO:00280003 and of OEO:00000426 with OEO:00280001 in all ontology files using a text editor.
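
Step 2 could also be scripted rather than done by hand. The following is a minimal sketch, not a tested migration tool: it assumes the IDs appear in the files as `OEO:` or `OEO_` followed by eight digits, and it uses the replacement direction exactly as written in step 2. The names `ID_MAP` and `replace_ids` are hypothetical.

```python
import re

# Mapping as written in step 2: ID of the old individual -> ID of the new class.
ID_MAP = {
    "00000116": "00280003",  # csv -> comma separated file format
    "00000426": "00280001",  # txt -> text file format
}

# Assumption: IDs are written as "OEO:<8 digits>" or "OEO_<8 digits>".
_ID_PATTERN = re.compile(r"\bOEO([:_])(\d{8})\b")


def replace_ids(text, id_map=None):
    """Return text with every mapped OEO ID replaced; other IDs stay untouched."""
    mapping = ID_MAP if id_map is None else id_map

    def _swap(match):
        separator, number = match.group(1), match.group(2)
        return "OEO" + separator + mapping.get(number, number)

    return _ID_PATTERN.sub(_swap, text)


print(replace_ids("OEO:00000116 and OEO_00000426"))
```

Matching the full eight-digit ID with word boundaries avoids touching longer numbers that merely contain one of the old IDs, which a naive find-and-replace in a text editor could do.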

This is a decision we should ideally make before the next release. Therefore I move this issue back to milestone 1.12.0. @OpenEnergyPlatform/oeo-release-team : Any opinions here?

l-emele avatar Oct 27 '22 10:10 l-emele

I agree. The individuals are not used anywhere in the OEO currently. I'll delete the individuals.

stap-m avatar Oct 31 '22 19:10 stap-m

I'm fine with not recycling the old IRIs and IDs. New OEO versions lack backward compatibility so often that the effort of not breaking these two concepts appears disproportionate to the benefit. I personally prefer the new IDs and hence argue against re-using the old IDs and IRIs.

chrwm avatar Nov 01 '22 15:11 chrwm

From the dev meeting: According to OBO best practices, classes/individuals are declared obsolete rather than deleted. This should have been done here instead of deleting them. We didn't know what the decision in the OEO regarding this is, hence we postpone the issue to the next release milestone.
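
For illustration, a minimal sketch of what declaring a term obsolete typically involves under OBO conventions: the term stays in the ontology, `owl:deprecated` is set to true, the label is prefixed with "obsolete", and optionally a 'term replaced by' annotation (IAO:0100001) points to the successor. The dictionary layout and the function name `mark_obsolete` are hypothetical; in practice this would be done in Protégé or directly in the OWL files, and whether the OEO follows this convention was still undecided at this point.

```python
def mark_obsolete(term, replaced_by=None):
    """Return a copy of the term record flagged as obsolete instead of deleting it."""
    updated = dict(term)  # leave the original record untouched
    if not updated["label"].startswith("obsolete "):
        updated["label"] = "obsolete " + updated["label"]
    updated["owl:deprecated"] = True
    if replaced_by is not None:
        updated["IAO:0100001"] = replaced_by  # 'term replaced by'
    return updated


csv_individual = {"id": "OEO:00000116", "label": "csv"}
obsolete_csv = mark_obsolete(csv_individual, replaced_by="OEO:00280003")
print(obsolete_csv)
```

Keeping the old IRI resolvable this way lets existing users find the replacement class instead of hitting a missing term.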

chrwm avatar Nov 02 '22 08:11 chrwm

> From the dev meeting: According to OBO best practices, classes/individuals are declared obsolete rather than deleted. This should have been done here instead of deleting them. We didn't know what the decision in the OEO regarding this is, hence we postpone the issue to the next release milestone.

Thanks, I wasn't aware of it, since it wasn't commented here. At which dev meeting did we talk about that (number)?

stap-m avatar Nov 02 '22 08:11 stap-m

Sorry, I meant the release meeting today! We just talked about it before starting the release.

chrwm avatar Nov 02 '22 10:11 chrwm

This issue has not seen any progress in the last two months and it is therefore unlikely that we will resolve it before the next release. I thus move the milestone.

l-emele avatar Jan 26 '23 13:01 l-emele

I transferred everything that is left from this issue to a new issue #1518 and thus close this issue here. If I missed something, feel free to re-open this issue or open a further issue.

l-emele avatar Apr 21 '23 06:04 l-emele