validating EML 2.2/schema location
I'm having trouble generating/validating EML 2.2. I think the problem is that the second part of the schema location is just eml.xsd rather than the full path, but I'm unsure. Reproducible example below, with comments.
library(EML)
me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
my_eml <- list(dataset = list(
title = "A Minimal Valid EML Dataset",
creator = me,
contact = me)
)
#validates in R but not in oxygen?
write_eml(my_eml, "ex.xml")
eml_validate("ex.xml")
#add in the correct schema location
my_eml$schemaLocation <- "https://eml.ecoinformatics.org/eml-2.2.0 https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/eml.xsd"
write_eml(my_eml, "ex_2.xml")
#after this validates when read from file
eml_validate("ex_2.xml")
#but not when working on the not-yet-written-out eml object in R?
eml_validate(my_eml)
Hi @scelmendorf, thanks for filing an issue. It looks like there are a few things going on.
Re:
#validates in R but not in oxygen?
What error(s) does Oxygen report? I don't have a license over here to test with.
Re:
#add in the correct schema location
XML validation is a pitfall-laden part of working with XML. The EML package does set the xsi:schemaLocation for the eml namespace by default to a local path, which really only works under certain circumstances. My understanding of this part of the XML spec is that schemaLocation is merely a hint and whatever's doing the validating may use it or not. So no value is "incorrect", per se. That said, others here might think we should change our default schemaLocation to something web-resolvable rather than a local path.
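To make the "second part" of the schema location concrete: the xsi:schemaLocation value is a whitespace-separated list of namespace/location pairs. Here is a minimal base R sketch that splits such a value and flags whether each location is an HTTP(S) URL. The helper name parse_schema_location is hypothetical (it is not part of the EML package), and the first example value is assumed from the description earlier in this thread (the eml namespace followed by a bare eml.xsd).

```r
# Hypothetical helper, not part of the EML package: split an
# xsi:schemaLocation value into its namespace/location pairs.
parse_schema_location <- function(x) {
  parts <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  if (length(parts) %% 2 != 0) {
    stop("schemaLocation must contain namespace/location pairs", call. = FALSE)
  }
  # Odd positions are namespaces, even positions are schema locations.
  is_ns <- rep(c(TRUE, FALSE), length.out = length(parts))
  data.frame(
    namespace = parts[is_ns],
    location = parts[!is_ns],
    # TRUE when the location hint is an http(s) URL rather than a local path
    web_resolvable = grepl("^https?://", parts[!is_ns]),
    stringsAsFactors = FALSE
  )
}

# Assumed default (namespace plus a bare file name), as described in this thread
local_hint <- parse_schema_location(
  "https://eml.ecoinformatics.org/eml-2.2.0 eml.xsd"
)
# A web-resolvable location, using the URL from the original post
remote_hint <- parse_schema_location(
  "https://eml.ecoinformatics.org/eml-2.2.0 https://nis.lternet.edu/schemas/EML/eml-2.2.0/xsd/eml.xsd"
)
print(local_hint)
print(remote_hint)
```

A tool that takes the hint literally would look for eml.xsd relative to the document in the first case, which may be why an external editor cannot find the element declaration.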
Re:
#but not when working on the not-yet-written-out eml object in R?
This looks like a bug. @cboettig it looks like eml_validate(foo) (validating an in-memory object) isn't equivalent to write_eml(foo, bar) -> eml_validate(bar) (writing to disk, then validating an on-disk doc), though I think they oughta be. What do you think? Here's an MRE (copied from @scelmendorf):
> library(EML)
> me <- list(individualName = list(givenName = "Carl", surName = "Boettiger"))
> my_eml <- list(dataset = list(
+ title = "A Minimal Valid EML Dataset",
+ creator = me,
+ contact = me)
+ )
> eml_validate(my_eml)
[1] FALSE
attr(,"errors")
[1] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'packageId' is required but missing."
[2] "Element '{https://eml.ecoinformatics.org/eml-2.2.0}eml': The attribute 'system' is required but missing."
@amoeba - Oxygen error is (on line 2): "Cannot find the declaration of element 'eml:eml'."
@amoeba Thanks!
That behavior was intentional, though maybe misguided. write_eml() adds a packageId using a UUID if no packageId has been assigned (and of course system refers to the packageId). I did this intentionally so that a user could create a minimal EML file like the one above, where all of the elements are intuitive. Asking a new user to create a packageId (and the corresponding system!) is, I think, way less intuitive, so I thought being able to generate one on the fly made sense (and it matches the behavior of earlier versions of EML). But perhaps that was a mistake and users should be forced to set that manually.
Clearly the list constructor has no mechanism to automatically generate a packageId, and I think it would be weird if it did, since list constructors are designed so that you can build things up piecewise, so there's no expectation that they should be valid.
We could allow eml_validate to call write_eml first, where it would automatically add the packageId if missing, but I think that is also misguided. IMO the above eml fragment is really a fragment and it shouldn't validate.
I agree it's a bit weird that it validates when you call write_eml. I think the best fix would be for write_eml to throw a warning if no packageId is found and explain that it is adding an ID automatically. I'd love a PR for that if we have consensus on that.
Also, I know we've discussed the schemaLocation issue before and the potential security risk of using a resolvable URL as the schemaLocation instead of using the locally installed copy of the schema, but if it makes the EML we generate more compatible with other tools, perhaps we should use that as our default schemaLocation instead. IMO the security issue is more the responsibility of the user and the other external tools. Open to discussion as to whether this would mean that the R package uses the local copy or the online copy to validate (we could at least check hashes or something, though it's nice to be able to validate offline!)
There are a couple of topics coming up here, so I will try to address them individually. Note: comments based on my experience with a subset of EML-builders who work with LTER and EDI. Those users typically pre-assign the packageId, and the PASTA system also checks schema validity. Sometimes we suggest commercial XML editors like Oxygen.
Re validation within R in general, assigning packageIds:
IMO, it’s a great idea to validate the EML as you go! And as long as EML-builders understand those two errors (packageId missing, system missing), they can do that in R - as is - with eml_validate(my_eml).
EML-builders should see the error; so I agree, this is misguided:
We could allow eml_validate to call write_eml first, where it would automatically add the packageId if missing, but I think that is also misguided.
The temporary packageId and system are fine (as inserted by write_eml). However, users should be aware that this is happening, so I recommend a message to the screen if write_eml inserts these. For those who control their packageIds, the msg will help remind them to assign it.
Re schemaLocation:
Does the R-EML package always use its own internal version? I.e., never the contents of the schemaLocation attribute? (I confess to not reading all the documentation).
IMO current behavior is fine: the simple, local filename is a good choice for a default schemaLocation, as it can always be overwritten and it addresses the concerns already outlined by others.
Some EML-builders will want to (a) use the OxygenXML editor or (b) point schemaLocation to a URL. They may need to learn how to set the schemaLocation attribute, and to validate with several tools. However, teaching them is the responsibility of their communities (e.g., EDI, LTER), not the R code.
So bottom line(s):
- no changes to code behavior, but consider adding a msg if a placeholder packageId was added.
- Communities who promote certain EML/XML tools should show EML-builders how to validate (code can’t anticipate every possibility).
Thanks @cboettig, @mobb.
@scelmendorf we're discussing a number of things now that don't directly help you out so I want to address your original issues first. To avoid the errors about packageId and system you get before you write to disk, you'll need to set them ahead of time like:
my_eml <- list(packageId = "mypackage",
system = "mysystem",
dataset = list( # continued...
And, as you found out, if you write_eml first, these two values are filled in automatically. The why of the errors is that those are required attributes on the root eml element, as defined in the schema. There is some info there about how to use those attributes but feel free to ask about them.
As for the second part of your issue, Oxygen not being able to validate your documents without modification: your guess is probably right on, and editing the schemaLocation is probably the best fix. Some XML validation tools, Oxygen included, support defining a catalog of local schemas, which will also work. This is a thing you'd have to deal with in other tools, not just Oxygen; it's a general XML problem and not something specific to EML or this package.
Does that help you out enough? It's a bit of a half fix, though it looks like we're keen to discuss some quality of life improvements that'd make what you experienced less painful.
@cboettig, @mobb, mind if I open two new issues to discuss (1) the possibility of adding warnings/messages when write_eml fills in packageId and system on behalf of the user and (2) changing the default value of schemaLocation to an HTTP-resolvable copy of the root schema?
@mobb, Re:
Does the R-EML package always use its own internal version?
Looks like it does always try a local copy stored with the package, though this package uses the xml2 package which may have other behavior.
Thanks @amoeba. I can add the schema location (that was my original solution - I just didn't know if that was intentional), and have been putting in the packageID; I just came across the write_eml issues when trying to make a reproducible example for the schemaLocation. My 2 cents is that it would be easier for most users if the schemaLocation default were HTTP resolvable and yes putting in warnings for filling in default packageId and system would be useful.
Thanks all! yes, new issues / PRs would be great for both the warning message and a remote schemaLocation default value. (or I'll get around to that sooner-or-later!)