check dataset element description

Open stevenchong opened this issue 6 years ago • 8 comments

Following a conversation with @mpsaloha and @gothub, we wanted to get clarification on the definition of a "dataset" that appears in the dataset element description:

The dataset field encompasses all information about a single dataset.  A dataset is 
defined as all of the information describing a data collection event.  This event may 
take place over some period of time and include many actual collections (a time 
series or remote sensing application) or it could be just one actual collection (a day 
in the field).

The second sentence caught our attention and sounds more relevant to the metadata about a dataset, rather than to a dataset itself.

If this description gets edited, note that it also appears in the DatasetType description.

stevenchong avatar Jan 07 '19 23:01 stevenchong

Good catch. I think the word 'dataset' is being used in two different ways in this description: first, to describe the scientific concept of a dataset and, second, to describe what an EML dataset is. I think sentence two holds if you use the second definition but not the first. I actually think that might have been the original intent of the wording.

Did you and the others think up any alternatives, or would you like to have a try at it if you think it still needs tweaking?

amoeba avatar Jan 08 '19 20:01 amoeba

Hi folks,

Bryce-- thanks for the thoughts. When the documentation says:

A dataset is defined as all of the information describing a data collection event.

...I think rather than a "dataset" per se, this is referring to the eml-dataset module's dataset element. This could be grokked from context, but as worded, struck us as a bit confusing.

I think a clearer, more accurate description for the eml-dataset "dataset" field would be as follows:

The EML dataset element is the top-level "container" organizing the information describing aspects of the collection event that produces the dataset.

Perhaps this could be wordsmithed a bit more, but I think it conveys what we are trying to describe in the context of the EML documentation...

cheers,

Mark

mpsaloha avatar Jan 08 '19 22:01 mpsaloha

Thanks @mpsaloha, that looks pretty good.

I put together a version with minimal modification to increase clarity:

DatasetType is the base type for the dataset element. The dataset element is a container for the information describing a data collection event. This event may take place over some period of time and include many actual collections (a time series or remote sensing application) or it could be just one actual collection (a day in the field).

What do you think? If you like your version better, I'd be fine with that. I'll send this over to the #eml channel in case anyone else has thoughts.

amoeba avatar Jan 09 '19 20:01 amoeba

Hi @amoeba - Interesting discussion. I am surely overthinking this but I find 'event' a bit misleading and, maybe, constraining as it conveys the sense that a dataset results only from going into the field. I wonder if the language could be a bit more encompassing to make it read less "fieldy" and reflect that a dataset could in fact describe the output of extensive research. I played around a bit focusing on research effort as a substitute.

DatasetType is the base type for the dataset element. The dataset element is a container for the information describing the features and products of a research effort. The research effort described may be expansive, taking place over an extended period of time and include many unique data products (e.g., tabular data tables, shapefiles), such as resulting from a thesis, or could be a short, focused effort resulting in one or more data products.

srearl avatar Jan 09 '19 21:01 srearl

Thanks for chiming in, @srearl! I take your point about the constrained scope of the current wording. We certainly do use EML dataset to document resources where concepts like temporal/spatial coverage do not apply (e.g., the derived output of running a physical simulation model).

amoeba avatar Jan 09 '19 22:01 amoeba

I'm fine with Stevan's suggestion, but still actually prefer event over effort. Here is why--

While the term "event" is indeed vague, for me it simply connotes any identified process/es that occurred in some place/time, but not necessarily "about" some place and time (though I'm not sure this distinction is readily clarified in EML). So I don't see it as too "fieldy"-- something has to have happened for data to be collected or created, and that something could commonly be called an "event". I might suggest that "efforts" result in "events", that is, "efforts" connote a more project-oriented view, but again usages are varied. Is a model execution that generates simulation data, an effort or an event? For me it is more naturally described by the latter. But I can't think of any situations where replacing "event" with "effort" is going to be too misleading...

I am of the old school, however, that prefers constraining the use of "dataset" more tightly, so that it pertains to a distinct data object (e.g. a table or image), rather than broadening it. I am a bit sad to see "dataset" become whatever arbitrary circumscription someone wants to apply to a set of digital objects-- where I vastly prefer the less well-established use of the term "data package" as in DataONE (though terminological usages also vary somewhat there).

In our EML documentation we suggested that a dataset could be used to describe multiple "tables", but our example was referring to highly inter-related if not co-dependent objects (e.g. a set of relational tables with integrity constraints in a RDBMS).

Just my thoughts...

cheers, Mark

mpsaloha avatar Jan 10 '19 00:01 mpsaloha

I'll just point out that the eml-dataset module has this sentence that (sort of) defines a dataset:

"A dataset can be (and often is) composed of a series of data entities (tables) that are linked together by particular integrity constraints."

I don't see that text appearing in any of the field descriptions. Perhaps it's worth adding to the "dataset" field description.

Original context: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-modules-resources.md#the-eml-dataset-module---dataset-specific-information
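
For illustration only, here is a minimal Python sketch of a dataset element that contains more than one data entity. The document is a hypothetical, heavily trimmed example (the title, packageId, and table names are made up, and a schema-valid EML record needs more, such as creator and contact); it is not taken from the EML documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML document. A schema-valid record needs more
# (for example creator and contact, plus an attributeList for each table).
EML_SKETCH = """\
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         packageId="example.1.1" system="example">
  <dataset>
    <title>Example stream survey</title>
    <dataTable><entityName>sites</entityName></dataTable>
    <dataTable><entityName>samples</entityName></dataTable>
  </dataset>
</eml:eml>
"""


def entity_names(eml_text):
    """Return the entityName of each dataTable inside the dataset element."""
    root = ET.fromstring(eml_text)
    # Only the root element is namespace-qualified here; its children are not.
    dataset = root.find("dataset")
    return [table.findtext("entityName") for table in dataset.findall("dataTable")]


print(entity_names(EML_SKETCH))
```

Here one dataset element groups two related tables, which is the narrower reading of "dataset" that the quoted sentence suggests.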

stevenchong avatar Jan 10 '19 18:01 stevenchong

Hi Steven,

Yes- that quote is what I was alluding to in the penultimate sentence of my prior email. It might be good to reiterate here, as you suggest.

Note that this encourages a narrower usage of the term "dataset" than saying that a dataset can be whatever someone or some project wants to (somehow) group together. I think this latter usage has the inherent danger of encouraging the lumping together of only loosely related digital objects as a "dataset", although it does make it easier to provide collective metadata at a less detailed level (which I consider problematic).

But again, these terms are ambiguous and subject to semantic drift. So I'm always happy to nudge towards what may be the more traditional use.

Although I've been teased several times for bringing up this reference, it is still the case that Wikipedia says:

A data set (or dataset) is a collection of data https://en.wikipedia.org/wiki/Data. Most commonly a data set corresponds to the contents of a single database table https://en.wikipedia.org/wiki/Table_(database), or a single statistical data matrix https://en.wikipedia.org/wiki/Data_matrix_(multivariate_statistics), where every column https://en.wikipedia.org/wiki/Column_(database) of the table represents a particular variable, and each row https://en.wikipedia.org/wiki/Row_(database) corresponds to a given member of the data set in question.

I concur with Wikipedia about this, based on my experience with its typical usage among scientists, at least in the social science, ecology and biodiversity domains. Wikipedia also goes on to say other stuff, but still mentions these items would be "closely related" through some particular experiment or event.

cheers, Mark

mpsaloha avatar Jan 10 '19 19:01 mpsaloha