Support localized metadata on ERDDAP HTML pages
Hello fellow ERDDAP folks!
Recently, I've been championing an initiative over with the CF folks on getting a standard for localized metadata into CF (see https://github.com/cf-convention/discuss/issues/244) which has been paired with a discussion on expanding attribute and variable names to allow for a full (or greatly expanded) Unicode character set (https://github.com/cf-convention/cf-conventions/issues/237). Of note, the latter discussion is simply opening the CF conventions to use what NetCDF already allows, as NetCDF files are allowed to have attributes with any Unicode characters already.
As I also work on our ERDDAP server a lot, I wanted to draw your attention to these discussions because I noted that the current ERDDAP configuration only allows attribute names containing [A-Za-z0-9_] characters (I get a RuntimeException: [variable] isn't variableNameSafe when I put square brackets in for example). I recognize ERDDAP doesn't only work with NetCDF files and so there may be other restrictions than what NetCDF/CF will allow, but with the CF conventions moving towards allowing a full Unicode set (and ERDDAP's metadata is based on what CF/ACDD define) I thought it would be worth having a discussion on expanding the character set allowed in attribute and variable names and that some of you folks might want to weigh in on the CF discussion before it is finalized.
Part of why I have been championing that work is that I would love to see ERDDAP able to take localized metadata from a dataset and integrate it into the translation mechanism. Right now, there isn't a way to display a French title for a dataset when browsing the website in French (something that Canadian laws require for us to be able to use ERDDAP at the federal government level). I've made a hacky solution in Javascript that got me past the requirement, but having a proper internationalization solution for datasets in ERDDAP would be highly useful for me and probably others. I see the CF work as setting the foundation for this by defining a standard for encoding the different titles and such into the files themselves and I hope ERDDAP will pick that up in a future release (and would be happy to contribute myself to it).
From my experience the main hangup was downloading the .mat file for Matlab. Matlab has very specific requirements for variable names that make this a difficult ask. See https://www.mathworks.com/help/matlab/matlab_prog/variable-names.html
Oh MATLAB :(.
That said, do MATLAB files have attributes? Could we relax the restrictions on attribute names and source names while keeping them on destination names and enforcing that if a source name contains an invalid character, a destination name must be provided? Or maybe automatically create one by removing invalid characters or replacing them with "_"?
I did some research on other file formats, here's what I found:
DAP2: allows [0-9A-Za-z_!~*'"-] and other US-ASCII if URL-escaped; Special Characters: =<>!+-/\*~%.[]
DAP4: UTF-8 characters (escaped if not US-ASCII); Special characters: /
HDF5: UTF-8 supported, ASCII default
ASCII, CSV, TSV: character-encoding dependent but all valid characters allowed (with proper escaping)
KML: depends on coding, <& and either ' or " must be escaped and non-printable control characters and compatibility characters are discouraged: https://www.w3.org/TR/xml/#NT-Char
ESRI: Strongly recommended [A-Za-z0-9_-], explicitly not allowed: +*/!^%()[]{},~'":;><&|\=@#$
So ESRI CSV might also be an issue with the variable names but it is solved by using a similar approach to MATLAB
EDIT: Fixed escaping of formatting characters
Or maybe automatically create one by removing invalid characters or replacing them with "_"?
I think GenerateDatasetsXml has some similar logic in it. But we're reaching the limits of my knowledge.
https://github.com/ERDDAP/erddap/blob/468e2b85d2c2484024f1418619f35bbe01b27a94/WEB-INF/classes/com/cohort/util/ScriptString2.java#L704
Wow! There are a couple of big topics there! I'll try to deal with them separately below...
First: I am basically in support of Unicode/localization. The question is how to get what you want and how to make the usage clear to users.
Unicode Attribute Values - Note that ERDDAP already supports Unicode attribute values as much as it can (e.g., some outgoing file types don't support Unicode).
Unicode Attribute Names (Identifiers) - What you are asking for, notably in your example, is not just Unicode letters, but Unicode punctuation. Yes, nc4 files support this, but that is a special case. I said to Unidata at the time that I thought this was a bad idea. They may get away with it in nc4 files because the special characters have no meaning in the file (although what if there is a slash in an attribute name which is in a group?). The problem is that things get super complicated and cause problems when you allow punctuation and when you go outside of nc4 files. The main question is: what characters have special meanings in which situations? For example, if you want colons to indicate a namespace prefix is being used, then how do you deal with names that don't have a namespace but do have a colon in the name? Or, if you want slashes to be separators for groups (which is allowed in CF now), how do you deal with slashes in a group name? And you can probably imagine that commas, spaces, newlines, #0, tabs, and undefined characters will cause problems in various situations (CSV files, TSV files, Matlab files, etc.). As soon as you say all Unicode characters are allowed (maybe with specific exemptions like colon, slash, comma, newline, ...), then you are saying there can never be any addition to the special characters in the future, because there will be names in existing datasets which already use those special characters. That would be bad.
That said, I am much more open to allowing some subset of Unicode characters corresponding to characters that are letters (or ideograms which are words). That would let you have, e.g., French letters or Chinese words in names, but not punctuation. But Unicode is huge. There would have to be some simple, standard way for CF and software like ERDDAP to easily identify these valid characters. Is there such a standard? I am basically fine with that compromise if the problems can be worked out. (Can OPeNDAP be made to handle this? I think not. I made suggestions for a DAP 2.1 (e.g., Unicode attribute values, long ints, and unsigned integer types) but was firmly told "no!")
Note that it is a big project because of all the situations where ERDDAP publishes metadata, e.g., web pages, file formats (e.g., .das), other software (e.g., DAP libraries), and outgoing file types, many of which don't support Unicode or punctuation. But it may be possible (but maybe it isn't). What do you think?
I'll add to the above: as soon as you allow punctuation characters in names/identifiers, you open up security concerns and it is very difficult to foresee all of them (it was beyond the capabilities of all of the computer security people who got it wrong for so many years). Things which seem so simple (e.g., that all you need to percent-encode in a URL are a few special characters (&"#')) can be horrible security problems. There are reasons why computer languages have strict requirements for the valid characters (e.g., _a-zA-Z) in identifiers.
Localization (text appearing in different languages in different situations, notably on lang=FR web pages or for data requests which specify a language) - This is a huge/complicated project. What if the requested language isn't available? What if the requested file format doesn't support Unicode? And for standardized attribute names (title, summary, infoUrl, etc.), there would need to be official translations of each of the names for each possible language. You'll have to get the standards organizations to do that (good luck!) and even then it just makes things very complicated (or not work), e.g., for software like ERDDAP that looks for those standardized names (e.g., title, summary, etc.) to extract specific information about the dataset. Eek! That is a messy, difficult/impossible project.
Let me suggest an alternative that you can do right now in ERDDAP: make variants of datasets (one for each language, e.g., MUR41_en, MUR41_fr) which use the standard attribute names, but have translated versions of the attribute values. Note that ERDDAP's datasets.xml lets you redefine all of the metadata used by a dataset, so you can use the same underlying data source (e.g., files), but clearly and simply change all of the metadata values for each variant. (Even better, if you use dataset type=EDDGridFromDap or EDDTableFromDap for all of the language variants, then ERDDAP will handle this very efficiently because it only needs to, e.g., read the data files, for the original dataset.) Then a user can make a request to MUR41_fr and they will get the French version of the metadata and the metadata will be CF and ACDD compliant (i.e., with the English attribute names). And users can easily find out which language variants exist by the existence/absence of a dataset with the appropriate name.
This requires no changes to ERDDAP, CF, ACDD, or any other standard or file type. And it is super clear to users what language they can use. And it is super clear when a user says "I worked with MUR41_fr" that they worked with the French version of the dataset. This gets you 95% of what you want (just no translated attribute names). I think this is a vastly better approach than trying to get CF, ACDD, other software, etc. to support official translations of the defined attribute names, and trying to change ERDDAP to support different languages in different situations. And you can do it right now. Your thoughts?
I hope that addresses all of the big topics. If I missed something or you want to redirect me, please let me know.
Erin, regarding your subsequent comments:
Yes. As you note, Matlab isn't the only troublesome outgoing file type. This is a troublesome problem because so many file types, standards, software, etc. were created before Unicode was widely supported. We can change ERDDAP (perhaps at great cost), but I/Chris can't change all the other standards, file types, software, etc. I had already made the changes that could be done with moderate effort and acceptable consequences (i.e., allowing Unicode in attribute values).
ERDDAP already allows any Unicode character in source attribute names. It is just destination attribute names which have the stricter requirements. GenerateDatasetsXml already has code to check for not-allowed characters and automatically generate a destination attribute name which is valid. Note that further changes are a very complicated issue because the methods that make identifiers "safe" are used in lots of situations in the code, so simply changing those methods would lead to all kinds of problems. You have to make changes in a way that only affects what you want to change and you have to know/understand/check all of the ramifications (e.g., on all web pages and outgoing file types).
The best solutions (I think) are the ones I proposed in my first email. Well, in some ways, the best solution is no standards or software changes at all. You can do much of what you want (e.g., full Unicode support in attribute values, and localization via different datasetIDs) right now with no changes to any standards or software.
Regarding localized names/identifiers in general: I'll point out that identifier names in e.g., ISO 19115/19139, only exist in one language. That is true of all other XML schemas that I know of. Further, there are strong limitations on the characters allowed in identifiers in XML in general (I'm pretty sure). The same with computer languages (there isn't a French version of C++, Java, Python, or any computer language). Yes, it is a different issue (or at least a different realm), but the point is the same. Sometimes it is best to just pick a language for a task (e.g., names for CF attributes) and stick with it. (I know, easy for me, an English speaker to say.) But a fully localized world (3000 language variants of all software, software languages, standards, etc) just isn't feasible. It is a slippery slope as soon as you allow a second language.
I'll add another compromise to consider vs allowing all Unicode characters in attribute identifiers/names: allow all of the letter characters between 128 and 255 in ISO-8859-1, which is the single-byte character set that has all of the European accented characters (and some other characters) in positions 128-255. Those characters (and their numbers) are consistent with the first code page (0 - 255) of Unicode. Several places in ERDDAP use this encoding when the original specification says "ASCII" (which just defines characters 0-127) because it rarely causes problems and allows support for all of the European languages. It is obviously an imperfect solution (it doesn't support all of the languages which use other characters), but it is easier to implement (e.g., it would be easy to identify all of the allowed characters in this range and document them in the ERDDAP documentation) and provides some benefit.
I'll have to understand more about Unicode to fully follow this, but for R variable names:
- A variable name must start with a letter and can be a combination of letters, digits, period(.) and underscore(_). If it starts with period(.), it cannot be followed by a digit.
- A variable name cannot start with a number or underscore (_)
- Variable names are case-sensitive (age, Age and AGE are three different variables)
- Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
So does this mean the proposal would break in R? What about other languages? I thought I saw that it would almost work in Python, but almost sort of doesn't do it when a user can't read a file. How about Javascript?
I would add that I fully understand the underlying rationale for the proposal. Besides the work that would be involved as well as the security concerns Bob raises, I would want a better understanding of what would break on the user end of things.
Roy, you bring up a few slightly different issues.
I was talking about attribute names. You and Erin are also talking about variable names. The ideas are basically the same, but variable names are more likely to become identifiers in R, Python, and other programming languages when a chunk of data is read in.
You are correct that R allows "letters" to be in variable names, but the definition of "what is a letter" in R varies based on locale(!). The documentation says "The definition of a letter depends on the current locale: the precise set of characters allowed is given by the C expression (isalnum(c) || c == ‘.’ || c == ‘_’) and will include accented letters in many Western European locales." (Talk about not very helpful! This means you can't know which letters are allowed ahead of time unless you know the locale of the user's R installation!) But this is like I was saying: if you want to allow all Unicode letters (and some other characters like ideograms that are words), you have to have some clear definition of which characters you are allowing.
This highlights another complication of this proposed project: ERDDAP may know which Unicode characters are allowed in a given file type (and thus could perhaps sanitize the name), but ERDDAP won't know that a given file is going to be processed later by some language (like R or Python or ...), doesn't know when a given attribute or variable name is going to be auto-converted into an identifier in that language, and so doesn't know which characters will be valid there. That's why the letters a-zA-Z is the only safe set of characters: they are (I think) always supported (there are probably exceptions). Even the ISO-8859-1 letters will cause problems in some places (so I probably shouldn't have proposed that as an option). Punctuation characters (like the square brackets that Erin wanted) are almost always trouble (e.g., '[' and ']' are interpreted as identifying array subscripts in almost all computer languages).
I should also have said earlier: there are many versions of Unicode. Java supports one of the versions of the UCS-2 (a group of characters identified by 2-bytes). So a proposal to CF would need to include a definition of "Unicode". But as I've said, I think allowing all Unicode characters (as netcdf does for nc4) is a bad idea.
I'll add/emphasize: these issues are generally not a problem for attribute values, which get read into R, Python, etc as Strings. Most computer languages now support at least UCS-2 versions of Unicode as valid characters in Strings. And punctuation characters don't cause problems in Strings. Thus attribute values using Unicode characters, which are already supported by ERDDAP, are generally not trouble when the file is used subsequently. There are exceptions, e.g., DAP, but ERDDAP generally deals with them by converting the string to the ASCII version of the string with characters 128+ rendered as \u plus their 4-digit hexadecimal Unicode number as they would in Json, e.g., the Euro character appears as \u20ac.
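To make the \u rendering described above concrete, here is a minimal Python sketch of that kind of conversion. It is not ERDDAP's actual code; the function name to_ascii_with_escapes is made up for illustration, and the handling of characters beyond U+FFFF (as JSON-style surrogate pairs) is an assumption about what "as they would in Json" implies.

```python
def to_ascii_with_escapes(text: str) -> str:
    """Return text with every character above 127 written as \\uXXXX.

    Hypothetical helper for illustration only; not ERDDAP's implementation.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Characters in the Basic Multilingual Plane: one 4-digit escape.
            out.append("\\u%04x" % cp)
        else:
            # Assumption: higher code points become a surrogate pair, as in JSON.
            cp -= 0x10000
            out.append("\\u%04x\\u%04x" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
    return "".join(out)


print(to_ascii_with_escapes("Price: 5 €"))  # Price: 5 \u20ac
```

The point of the sketch is that attribute values survive an ASCII-only protocol without losing information, which is why values are so much easier to support than names.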
As I said, the current Unicode support in ERDDAP gets you most of what you want. Additional Unicode support (e.g., in attribute and variable names) leads to lots of difficult/intractable problems.
Thanks. I should have made clear that the proposal at the present ongoing CF meeting was for variable names also. I haven't thought it through, but it is one thing when you open a file on your desktop that has variable names with all the possible Unicode characters, another when that is put into a URL. The rationale I believe was to have localized versions of the conventions.
I should also add the obvious that an enforced subset of CF is still CF compliant, and ERDDAP has more localizations than most things (how many other web pages have drop-down menus where you can change the language of the pages?). So a lot will depend on whether CF requires that support in variable names as opposed to allowing it.
I do like your suggestion for the French and English versions of datasets. You take the existing xml snippet, change the datasetid, use xml to change the variable names and attributes, et voila you have French and English versions.
J'aime votre suggestion pour les versions française et anglaise des ensembles de données, vous prenez l'extrait XML existant, modifiez le datasetid, utilisez XML pour modifier les noms et attributs des variables, et voilà, vous avez des versions française et anglaise.
I hope you like what I did there.
I hope CF doesn't go with full Unicode support for attribute and variable names. It is the kind of proposal that looks good in isolation (e.g., a cdm representation of a dataset's metadata) but is terrible in practice. It would lead to all kinds of problems (e.g., prevent various punctuation from being used as special characters, e.g., colons for namespaces) in the future, and cause problems when the data file is imported into some client software where variable names (with square brackets?!) become identifiers). Identifiers/names have traditionally been limited to very small character sets (e.g., letters and underscores) for good reasons.
Getting involved in a CF proposal is a full time job with endless bickering. I found it hellish and won't get involved again. I hope someone else advocates against it. (Sorry Erin)
Some notes:
@BobSimons I agree that there are a lot of issues with special characters. I know DAP 4 has gone for full UTF-8 support with the exception of a slash, but DAP 2 is still on US-ASCII (but allows any US-ASCII character as they can be URL-encoded which is also how DAP 4 will resolve ambiguities in what is a special URL character and what is part of a variable or attribute name). After looking at the output files that ERDDAP currently generates, the MATLAB file, ESRI CSV, and DAP protocols seem to be the biggest limiting factors (and would bring us down to what the current standard is which aligns with MATLAB and ESRI) for variable names and I don't see those changing any time soon. That said, attribute names aren't included in MATLAB or ESRI files as far as I can tell, so I think there is more flexibility there.
Based on the idea that ERDDAP is first and foremost a DAP2 server (with bonus features), I think my proposal would be to start by moving to a full US-ASCII character set (according to DAP2's definition) for variable names and attribute names, with some notes:
- Identify and restrict certain characters that could cause issues or confusion. At a minimum, I would suggest 0x00 through 0x1F and 0x7F (the non-printable control characters) be disallowed. In addition, I would consider the set of common math and logic operators !+-\*/\\<>=, HTML special characters &%?#:, the space [], quote ", apostrophe ' (and backtick?) carefully. While allowed, I think they might present complications in escaping but we should balance that against clear naming conventions for things like chemical formulas. I would personally eliminate the space and double-quote at least and analyze what the impact of allowing the others in attribute names and variable names would be in each file type ERDDAP can output. We can then decide if it is worth further restricting the character set.
- We would then have to make sure each file output has proper escaping for variable and attribute names and (where necessary) a mechanism to convert invalid names to valid names (notably for ESRI and MATLAB). We would need to replace the characters (with underscores?) and then ensure the no-leading-underscore rule is met and that there aren't duplicates.
- We should also add a note to the documentation that we recommend using [A-Za-z0-9_] for full compatibility with all output formats and variable/attribute names may be modified in some formats if this is deviated from.
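The conversion step in the list above (replace invalid characters, avoid a leading underscore, avoid duplicates) could look roughly like the following Python sketch. This is an illustration of the idea, not the logic in GenerateDatasetsXml; make_safe_names, the "v_" prefix, and the "_2" style suffix for collisions are all hypothetical choices.

```python
import re

# Anything outside the recommended [A-Za-z0-9_] set is treated as unsafe.
_UNSAFE = re.compile(r"[^A-Za-z0-9_]")


def make_safe_names(source_names):
    """Map each source name to a unique name made of [A-Za-z0-9_] only.

    Hypothetical helper: the prefix and suffix rules are assumptions.
    """
    result = {}
    used = set()
    for name in source_names:
        # Replace every unsafe character with an underscore.
        safe = _UNSAFE.sub("_", name)
        # Enforce the no-leading-underscore rule.
        safe = safe.lstrip("_")
        # Names must start with a letter; prefix anything that does not.
        if not safe or not safe[0].isalpha():
            safe = "v_" + safe
        # Resolve collisions created by the replacement step.
        candidate = safe
        n = 2
        while candidate in used:
            candidate = f"{safe}_{n}"
            n += 1
        used.add(candidate)
        result[name] = candidate
    return result


print(make_safe_names(["temp[0]", "temp(0)", "_qc", "2m_temp"]))
```

Note how "temp[0]" and "temp(0)" collapse to the same string after replacement, which is exactly the duplicate case the list warns about.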
I assume if DAP4 gets approved, ERDDAP might look at supporting it and when that happens we can address full Unicode compatibility. Unicode does have general categories we can use to simplify life (e.g. ban all control characters which is General Category Cc).
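As a rough illustration of how general categories could define "letters only", here is a Python sketch using the standard unicodedata module. The specific policy (letters in any L* category, decimal digits, and underscore, after NFC normalization) is an assumption for the example, not something CF or ERDDAP has agreed on, and disallowed_characters is a made-up name.

```python
import unicodedata


def disallowed_characters(name: str):
    """List the characters in name that fall outside an assumed 'letters only' policy.

    Assumed policy: Unicode letters (general categories starting with 'L'),
    decimal digits ('Nd'), and underscore. Control characters ('Cc') and all
    punctuation are rejected as a consequence.
    """
    # Normalize first so that 'e' + combining accent is treated as one letter.
    normalized = unicodedata.normalize("NFC", name)
    bad = []
    for ch in normalized:
        category = unicodedata.category(ch)
        if ch == "_" or category.startswith("L") or category == "Nd":
            continue
        bad.append(ch)
    return bad


print(disallowed_characters("température_été"))  # []
print(disallowed_characters("temp[0]"))          # ['[', ']']
```

A rule stated in terms of categories would be short enough to document, though the set of characters it admits still depends on the Unicode version in use.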
In terms of security issues, this is an assumption on my part but most DAP libraries should have support for proper encoding of URLs since they are allowed in the specification. Since ERDDAP does generate URLs though, we would have to look at where we are generating DAP or other URLs and ensure proper escaping is applied (and then handled properly). This seems a bigger project for variable names than attribute names, so perhaps we can start with attribute names (which is most of my use case anyways) and move on to variable names after? I don't think attribute names appear in URLs in ERDDAP, just in HTML text. Proper escaping in HTML would still need to be applied.
In terms of localization, my proposal to CF was to simply provide "title" (for English or a different locale as specified in the "locale_default" attribute), "title_fr", "title_jp" etc. and have tools like ERDDAP fall back to the default attribute if it wasn't available. All of the available locales are documented in "locale_others" so there is a simple list. This aligns with BCP 47 which deals with localization in web applications where a requested language isn't available. I would not propose translating the attribute names themselves, I think that is a logistical nightmare as you said, but I think it is useful to support a suffix on standard attributes for offering the content itself in different locales.
ERDDAP already offers an internationalized interface which is great. But it's actually an accessibility issue at the moment because it puts English text (from dataset names and details) into a French (or other language) web page without noting that the text is not in the language of the page (violation of WCAG 3.1.2 Language of Parts, an AA criterion that organizations here in Canada are often required to meet and that the US federal government aims for under Section 508). Duplicating the dataset just makes the accessibility worse and, honestly, I would find it confusing as a user (two datasets that are the same but in different languages?). What I've done for now is use a pipe to separate the English and French, then write a small Javascript tool that separates them, displays the appropriate one, and adds the language attributes as needed to make it accessible. A proper solution will need to understand what the language of the dataset is (maybe assuming it's English) and then at least display the language attributes (then duplicating the dataset isn't an accessibility issue at least), but I think it's even better if ERDDAP could understand the metadata in multiple languages and display it.
In ERDDAP terms, this would mean the following:
- Read the XML in dataset
- Define the default title, summary and long_name for each attribute in the current way but note that they have a locale as noted in locale_default. If there is no locale_default, I think en is a good assumption and we can note that in the documentation.
- Read the locale_others attribute, if present. If it is present, split it by spaces and look for whatever convention CF ends up defining for localized metadata (e.g. title_fr if the locale is fr). Note these alternatives with their locale.
- When making a request that displays the title, summary or long name to the user in HTML, use the locale requested (as per the language switcher which adds it as a prefix in the URL) to identify the best match from the locales we have (BCP47 has rules on how to do this, but given the limited selection of locales for ERDDAP, the algorithm should be straightforward). If none match, use the "default" one (i.e. the one without prefix) as I've proposed in the CF conventions.
- Display the matching title, summary and long names to the user. If the locale isn't an exact match to the one requested (which should be in the lang attribute on the html tag), then add a lang attribute to the closest html tag to the text with the locale in it (or add a span tag with the lang attribute).
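The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the title_fr suffix convention and the locale_default / locale_others attributes are the proposal under discussion (not a finalized CF rule), the fallback is a simplified prefix-truncation lookup rather than a full BCP 47 implementation, and localized_attribute and render are hypothetical names.

```python
import html


def localized_attribute(attrs, name, requested):
    """Return (value, locale) for the best match to the requested locale.

    attrs is a plain dict of global attributes. Falls back to the
    unsuffixed attribute, tagged with locale_default (assumed 'en').
    """
    default_locale = attrs.get("locale_default", "en")
    others = attrs.get("locale_others", "").split()
    tag = requested
    while tag:
        # The unsuffixed attribute is already in the default locale.
        if tag.lower() == default_locale.lower() and name in attrs:
            return attrs[name], default_locale
        for loc in others:
            key = f"{name}_{loc}"
            if loc.lower() == tag.lower() and key in attrs:
                return attrs[key], loc
        # Truncate "fr-CA" to "fr"; a tag without "-" becomes "" and ends the loop.
        tag = tag.rpartition("-")[0]
    return attrs.get(name), default_locale


def render(text, text_locale, page_locale):
    """Escape text for HTML and mark its language when it differs from the page."""
    escaped = html.escape(text)
    if text_locale.lower() == page_locale.lower():
        return escaped
    return f'<span lang="{html.escape(text_locale)}">{escaped}</span>'


attrs = {
    "title": "Sea temperature",
    "title_fr": "Température de la mer",
    "locale_default": "en",
    "locale_others": "fr",
}
value, locale = localized_attribute(attrs, "title", "de")
print(render(value, locale, "de"))  # <span lang="en">Sea temperature</span>
```

Even when no localized metadata exists, the second function alone covers the accessibility case: English text on a non-English page gets a lang attribute.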
This change will greatly improve the accessibility of ERDDAP in handling alternative languages even if nobody localizes their metadata because it will properly add the lang attribute as needed to English text and make ERDDAP more WCAG compliant (good for us all). It also will mean organizations like mine (Fisheries and Oceans Canada, and other Canadian organizations) will be able to offer a full French language equivalent as required by law here.
In terms of safety and using the method in lots of situations in code, I think I would (if I were writing it), create a new method and replace it where appropriate in code. That way we avoid unintended consequences.
As for ISO-19115, I agree the format itself is unilingual, and I'm not proposing we change attribute names in CF/ACDD for exactly that reason - the interoperability is important. I only want to add suffixes for providing the value in different languages. This is something that ISO-19115 does provide via PT_FreeText, for example, and it is then used in tools like CKAN to display the metadata in multiple languages:
<gmd:organisationName xsi:type="gmd:PT_FreeText_PropertyType">
  <gco:CharacterString>Government of Canada; Fisheries and Oceans Canada; Fishery &amp; Assessment Data Section</gco:CharacterString>
  <gmd:PT_FreeText>
    <gmd:textGroup>
      <gmd:LocalisedCharacterString xmlns="" locale="#fr">Gouvernement du Canada; Pêches et Océans Canada; Section des données de pêche et d'évaluation</gmd:LocalisedCharacterString>
    </gmd:textGroup>
  </gmd:PT_FreeText>
</gmd:organisationName>
In CDL/NetCDF/CF parlance, I propose we adopt a similar convention, but without XML we can't do a nested structure. So instead, my proposal is
organization_name: "Government of Canada"
organization_name_fr: "Gouvernement du Canada"
I would further note that I think localization can be fully separated from the expanded character set support - a naming convention like "title_fr" can be done with the existing attribute naming conventions. However, with the CF workshop groups heavily leaning towards expanding support to full Unicode with exceptions (to align with the NetCDF standard), I thought it worth mentioning it here.
Reading back, I also think there was some confusion over my proposal and I wish to be clear that I don't want to translate CF attribute names, I think that's a nightmare. I just want a convention for having alternative content in different languages, but the attribute names themselves can be English-only. My original thought was to do "attribute_name_locale" but the CF convention people are now discussing if it is worth using the expansion of allowed characters under CF to separate the locale from the attribute name.
That said, it seems like it adds a lot of complexity and I'm going to push them for a localization convention that doesn't require expanding the character set based on this discussion since I think it will take a long time to test this to ensure there are no unintended consequences and I'd rather work with small changes to see localization happen faster and worry about Unicode/US-ASCII later.
Wow^2! So many issues. If convenient, for simplicity, please make inline responses to my responses.
I think supporting additional characters, notably new non-letter characters, in variable and attribute names is a bad idea. Chris can weigh in. Like all computer languages, most analysis software (R, Python, Igor), standards (CF, ACDD, DAP, XML), etc that I know of, ERDDAP restricts the characters to _a-zA-Z (others sometimes support 1 or 2 additional characters, which ERDDAP deals with by being more strict to avoid trouble). Note that parsing algorithms in various places rely on these definitions, so changing the definition in ERDDAP will cause all kinds of problems in all kinds of software. It doesn't make sense to modify ERDDAP in a way that goes against the standards and causes all kinds of problems when exporting data to those other systems. Plus, no one is smart enough to foresee all of the consequences of changes like this.
Plus, it would be in violation of the CF, ACDD, DAP, etc. standards. As Roy said, if ERDDAP (when it emits data files) supports fewer characters in identifiers, ERDDAP is still in compliance with CF, ACDD, and DAP, but if it supports more characters, ERDDAP isn't in compliance. Sometimes strict compliance doesn't matter, but in this case I think it does. If you get CF, ACDD (which is inactive), DAP (which has refused to make DAP 2.1), and others to change, then I will think it is a valid thing to consider (but I still think it is a very bad idea). (Note that I don't consider "a working group is considering this" to be anywhere near "the new version of CF supports this".)
CF used to require that proposals include a real life example(s) of why the change is useful and needed. That is a great idea. Can you please give me examples of what you want to do with added chars in identifiers?
Next issue: You say "Duplicating the dataset just makes the accessibility worse and, honestly, I would find confusing as a user (two datasets that are the same but in different languages?)." Why is accessibility worse? You can explicitly say on your ERDDAP home page that "All datasets with datasetID's that end in _fr are identical to the _en datasets except that the attribute values are in French". A French speaking user will choose to work with the French version of the dataset. With this approach, you get to translate all of the attribute values, you don't need any changes to the CF or ACDD (which is inactive) standards, or ERDDAP and you can do it today. (Doesn't that solve all the problems? Hallelujah!) If you don't want French users to see any English (and vice versa), then set up 2 ERDDAPs, one with the English datasets and one with the French datasets. (You could even be fancy at the Apache/Tomcat level and direct all requests to https://...erddap/fr/... to the French ERDDAP and perhaps vice versa.) I still think this is a good idea. Please tell me why you think it isn't.
I'll point out that the European Union has 24 official languages (and counting) which means that the metadata for a dataset might become quite voluminous.
As with the characters-in-identifiers issue, I don't think ERDDAP should be the trailblazer because it makes ERDDAP not standards compliant. If you get CF, ACDD (which is inactive), and others to add support for e.g., _fr at the end of attribute names to identify the language, then I will think it is a valid thing to consider. I think this is a reasonable proposal (although it is not necessary if you make separate datasets).
You wanted to add localized titles via title_fr, title_jp, etc. That seems like a slippery slope and doesn't solve the problem you said you wanted to solve: offering e.g., French versions of all of the metadata. Don't you want/need all the other text attributes to be localizable with this system? If so, then please be straightforward and say what you want. If not, then tell me why you don't want e.g., summary to be treated the same way.
What would the implications be for ERDDAP if CF expands the character set allowed for attribute names to include hyphen -, as well as either period . or the two square brackets [ ]? See recent comments in https://github.com/cf-convention/discuss/issues/244 for background.
@larsbarring
This is more a Bob question, I do not know enough of all the ins and outs of the code. Remember Bob is retired and deals with these when he feels like it (and well he should), so a response may be a couple of days in coming.
Just for convenience, if the CF conversation moves on, here is what I meant with "... recent comments ...": https://github.com/cf-convention/discuss/issues/244#issuecomment-1773572248 https://github.com/cf-convention/discuss/issues/244#issuecomment-1773858500 https://github.com/cf-convention/discuss/issues/244#issuecomment-1773861426 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775087332 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775257647 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775623793
And for an alternative approach: https://github.com/cf-convention/discuss/issues/244#issuecomment-1773636422 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775281409 https://github.com/cf-convention/discuss/issues/244#issuecomment-1776079418
@larsbarring, I just read the very first part of the CF discussion. Of the choices you are considering, I would recommend requesting a change to allow attribute names like title_fr_ca, because it is a simple extension of the existing attribute names (so it's easy to read and understand) and requires no other change (i.e., allowing other characters in attribute names).
The problem with allowing the characters ".-[]" in attribute names is that those characters are already used for other purposes in languages (Python, etc.), software (R, DAP 2.0, etc.), and file types (.mat, etc.). The trouble arises when the data from ERDDAP gets into those languages/software/files, notably when attribute names are converted into identifiers (e.g., datasetID.variableName.attributeName): those characters are not allowed in identifiers and are already reserved for other purposes. "." is used as a separator to show parent.child relationships. "-" is used for negation and subtraction. "[]" are used for dimensions. Since we aren't going to change R, DAP 2.0, Python, etc., making this change to CF and ERDDAP seems pointless and troublesome.
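A small Python sketch may make the parsing concern concrete. It asks Python's own parser how it reads a dotted path whose last component is an attribute name containing punctuation. The names `ds`, `var`, `fr`, and `CA` are purely illustrative, and this only shows Python's behavior; other languages and DAP 2.0 differ in detail.

```python
import ast

def describe(expression: str) -> str:
    """Return the type name of the top-level node Python's parser builds."""
    tree = ast.parse(expression, mode="eval")
    return type(tree.body).__name__

# Hypothetical dotted paths; ds, var, fr and CA are illustrative names only.
plain = describe("ds.var.title_fr_CA")       # a single chain of attribute lookups
punctuated = describe("ds.var.title.fr-CA")  # read as (ds.var.title.fr) minus CA
bracketed = describe("ds.var.title[fr]")     # read as indexing into ds.var.title

print(plain, punctuated, bracketed)
```

In other words, the parser never sees `title.fr-CA` as one name: the period becomes another attribute lookup and the hyphen becomes a subtraction.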
Note that ERDDAP already deals with a similar problem: some file types (e.g., Matlab's .mat) only support short variable names. ERDDAP deals with this by shortening longer names in a way that makes it unlikely that similar long variable names will be converted to the same short name. But solutions like this are not ideal because ERDDAP is promising one thing (certain variable names) and returning a file with something else (shorter, different names). So ERDDAP could sanitize ".-[]" characters, but it would have to do so in many places, for widely used languages/software/file types, and that would be a really kludgy solution.
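For readers unfamiliar with this kind of name shortening, here is one possible scheme, sketched in Python. It is not ERDDAP's actual algorithm (which is not described in this thread); the function name `shorten_name`, the 31-character limit, and the digest-suffix approach are all assumptions made for illustration.

```python
import hashlib

def shorten_name(name: str, max_len: int = 31) -> str:
    """Shorten `name` to at most `max_len` characters (hypothetical scheme)."""
    if len(name) <= max_len:
        return name
    # Keep a readable prefix and append a short digest of the full name,
    # so that names sharing a long prefix are unlikely to collide.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:6]
    prefix = name[: max_len - len(digest) - 1]
    return f"{prefix}_{digest}"

# Two illustrative long names that differ only at the very end.
a = shorten_name("sea_surface_temperature_anomaly_monthly_mean")
b = shorten_name("sea_surface_temperature_anomaly_monthly_max")
print(a, b)
```

The sketch also shows the drawback mentioned above: the file ends up with names that differ from the ones the dataset advertises.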
Zooming out: I think your request to allow ".-[]" in attribute names is a bad idea because it goes against a widely used standard in the computing world (languages, file types, software, etc): the characters used in identifiers are quite limited (generally they must start with _a-zA-Z and then include just _a-zA-Z0-9 ). Going against this is inviting trouble (as shown by the examples above).
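The identifier rule described here (first character from _a-zA-Z, remaining characters from _a-zA-Z0-9) can be written as a short check. This is a sketch of the rule as stated in this comment, not ERDDAP's own validation code; `is_safe_name` is a hypothetical helper name.

```python
import re

# Assumed rule, taken from the description above: first character in
# [_a-zA-Z], all remaining characters in [_a-zA-Z0-9].
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_safe_name(name: str) -> bool:
    """Return True if the whole name matches the restricted identifier rule."""
    return SAFE_NAME.fullmatch(name) is not None

for candidate in ["title", "title_fr_ca", "title.fr-CA", "sst[0]", "2m_temp"]:
    print(candidate, is_safe_name(candidate))
```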
I'll add: if e.g., title_fr_ca becomes valid in the CF world, I think it would be straight-forward to add support for it in ERDDAP, given ERDDAP's existing support for other languages in the web interface. For example, in some situations (e.g., the table of matching datasets which is returned when users do a search for datasets of interest), ERDDAP could display the appropriate title_x_y and summary_x_y variant as the dataset's title and summary. Although, in many circumstances (the attribute lists at the bottom of the Data Access Forms), I think all of the metadata should be shown as is.
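As a rough illustration of how a title_x_y lookup with fallback might work, here is a Python sketch. It is not ERDDAP code; the function `localized_attribute`, the fallback order (language plus region, then language alone, then the unsuffixed attribute), and the lower-case suffix convention are all assumptions.

```python
from typing import Mapping, Optional

def localized_attribute(
    attributes: Mapping[str, str], name: str, language: str
) -> Optional[str]:
    """Pick the best variant of `name` for `language` (e.g. 'fr_ca')."""
    parts = language.lower().split("_")
    # Try the most specific suffix first (fr_ca), then the less specific (fr).
    while parts:
        suffix = "_".join(parts)
        key = f"{name}_{suffix}"
        if key in attributes:
            return attributes[key]
        parts.pop()
    # Fall back to the unsuffixed attribute, or None if it is absent.
    return attributes.get(name)

# Illustrative metadata only.
attrs = {
    "title": "Sea surface temperature",
    "title_fr": "Température de la surface de la mer",
}
print(localized_attribute(attrs, "title", "fr_ca"))
print(localized_attribute(attrs, "title", "de"))
```

Attribute values can carry any Unicode text while the attribute names stay within the restricted character set.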
I hope that is clear and makes sense. If I need to address some specific point from the CF discussions, please let me know.
Best wishes.
@BobSimons thank you for your explanation. I am aware that all the suggested characters have special, and different, meanings in various languages, as do any number of reserved words, and more. I think that how these matters are handled depends on the programming style (even paradigm).
If the example pattern you provide ("datasetID.variableName.attributeName") is indeed used in ERDDAP, I think that datasetID.variableName.attributeName.locale seems like a natural extension for those attributes that have a localized version, whereas datasetID.variableName.attributeName_locale breaks this pattern, and may be further complicated by the fact that the attributeName part may have one or several underscores, and the locale part may have one underscore.
Anyway, I have no insight into the inner workings of ERDDAP, so I am not commenting on this, but I do understand that it might be a substantial task to make the necessary changes. To round off, I would like to mention that there are now and then (and, as I reckon, increasingly often) well-motivated requests from various communities to relax the CF restrictions regarding which characters are allowed in variables and attributes.
I think attributeName.locale as an attribute name would cause problems in e.g., R and Python, because the .locale would be part of the attribute name. So it would become something like datasetID.variableName."attributeName.locale", which is messy/confusing.
I see your point about the possibility of _ in an attribute name that then also has _language or _language_locale. But
- Presumably, ERDDAP would only look for/ care about language specifications added to specific attributes (e.g., title, summary).
- Perhaps the revised suggestion is to use notation like title__fr_ca (2 underscores before the language id).
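To show how the double-underscore idea would remove the ambiguity, here is a short Python sketch. The function name `split_localized_name` and the choice to split at the last double underscore are assumptions for illustration; nothing here is an agreed CF or ERDDAP convention.

```python
from typing import Optional, Tuple

def split_localized_name(attribute_name: str) -> Tuple[str, Optional[str]]:
    """Split 'title__fr_ca' into ('title', 'fr_ca').

    Names without a double-underscore marker are returned unchanged
    with None as the language.
    """
    base, marker, language = attribute_name.rpartition("__")
    if not marker or not base or not language:
        return attribute_name, None
    return base, language

print(split_localized_name("title__fr_ca"))
print(split_localized_name("time_coverage_start"))
print(split_localized_name("time_coverage_start__fr"))
```

With a single underscore, `time_coverage_start` could be misread as attribute `time_coverage` in language `start`; the double underscore avoids that without adding any new characters.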
I am more open to adding support for e.g., accented characters from ISO-8859-1 in attribute names, but even this will probably cause a lot of problems in other software, making it a bad idea. Again, other software tends to be very restrictive about what characters are allowed in identifiers. If CF changes, then we'll probably change ERDDAP, but you'll never change Python, R, Matlab, ESRI files, Igor, DAP 2.0, etc., so changing CF looks like a bad idea to me.
I remain doubtful that allowing other punctuation characters is a good idea. Punctuation characters are often used for other purposes in other software and, again, most software is very restrictive about allowing punctuation characters in identifiers.
One of my objections to CF in general when I tried to work with them (a disaster) was that it was mostly run by scientists, with little input or care for what software developers thought (at least what I thought). So they may decide that allowing additional characters makes sense to them (and it may be fine in the CF/netcdf bubble), but it may cause lots of problems in the wider software world (Python, R, Matlab, ESRI files, Igor, DAP 2.0, etc.).
Best wishes.
Hello. I am the advocate for CF issue #237 to remove character set restrictions on the names of netCDF variables, attributes, etc.
I appreciate the concerns about breaking existing ERDDAP code and applications. Would it be feasible to implement some kind of modifier to ERDDAP client requests, such that a knowledgeable application could request original, unmodified netCDF object names, and skip the name sanitizer? Existing applications, both internal and external, would be safe because they would continue to be exposed to only the default, sanitized names.
A scheme like this would allow gradual migration for localization as well as other internationalization and naming strategies.
I think you are missing my main point: Yes, these changes will cause problems within ERDDAP (e.g., because of DAP 2.0 limitations) that are hard to deal with, but the far bigger issue is that these changes are incompatible with major external applications (R, Python, Igor, Esri, etc.) where the data is actually used. Maybe we can find solutions to the problems in ERDDAP, but you will never find solutions for / make changes to R, Python, Igor, Esri, etc. You're going against a tradition of limited characters in identifiers that has always been the dominant system in the computer world.
I don't get why you are so adamant about this when it isn't needed for localization and internationalization. Allowing all Unicode characters in String data and in attribute values is what you really need and we largely have that (other than some legacy file types). Why isn't your problem solved (although not in the way you want) with e.g., title_fr and then attribute values that allow Unicode? That is a tiny change to CF (that doesn't require a change to the characters allowed in attribute names) and causes no problems with all of the external computer languages and software.
@BobSimons, no, I am not missing your main point. I am suggesting some kind of ERDDAP bypass mechanism for aware software that will completely avoid inserting raw netCDF names into code and command namespace. I imagine this would not be very difficult for ERDDAP, but I do not know ERDDAP internals.
My proposed changes are not at all incompatible with various programming languages. They merely require some discipline to keep data names inside string variables, rather than in code namespace.
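A brief Python sketch of the distinction being drawn here, between names held as string data and names used in code namespace. The attribute values are illustrative, and the sketch only shows Python's behavior; it says nothing about how R, Matlab, DAP 2.0, or other clients would cope.

```python
from types import SimpleNamespace

# Names kept as string data: any characters work as dictionary keys.
attributes = {
    "title": "Sea surface temperature",
    "title.fr-CA": "Température de la surface de la mer",
}
from_dict = attributes["title.fr-CA"]

# Names pushed into an object's namespace: setattr/getattr accept the
# string, but the dotted syntax ns.title.fr-CA cannot express this name.
ns = SimpleNamespace()
for key, value in attributes.items():
    setattr(ns, key, value)
from_namespace = getattr(ns, "title.fr-CA")

print(from_dict == from_namespace)
```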
Developers of core data formats such as HDF5 and netCDF went out of their way, many years ago, to enable a very wide character set for data storage names. Many users would like to utilize more parts of that character set. CF and related standards are blocking that. I understand the temptation to insert raw variable names into code namespace. This is how COARDS, CF, and ERDDAP evolved. However, IMO this particular "tradition" is flawed. I think it is time to step forward and embrace a modern character set and a more flexible way of working with data names.
Erin's localization proposal is a tiny subset of the character set debate. It bothers me a little that this has been conflated with the full UTF-8 proposal. However, here we are. The core technical issue is the same in both cases -- how to safely support an expanded character set.
@Dave-Allured @BobSimons
When this first came up I suggested it would sure be nice to have this tested with the major netcdf (and ERDDAP) clients, exercising their full capabilities, before this suggestion becomes part of the standard. At least be certain what does and doesn't work, so a decision is made with all the facts. As I have said, I understand where the request is coming from, but I don't understand all of its consequences, and what I know of some of the clients suggests that they will break, but I don't know that for certain, and even more I don't understand the rush to come to a decision. Let's be certain of all of these types of things first.
What a mess. These conversations (especially in the CF mailing list, but here, too) always end up with all kinds of misunderstandings and mischaracterizations, debating different options at the same time.
@Dave-Allured, I'm sorry I said you missed my main point, but you did express your "concerns about breaking existing ERDDAP code". But I don't like your main solution (a switch to request sanitized or unsanitized variable and attribute names) because that is way too messy -- it presumes people always know ahead of time in which applications the data file will be read and the limitations of those applications. And requesting "sanitized" names is messy because different applications will need different sanitation procedures. That is largely why ERDDAP settled long, long ago on the naming conventions that it has. The fact that you will never change how Python, R, Igor, DAP 2.0, etc. work should convince you that these proposals to allow more characters are a bad idea. When viewed in the isolation of nc4 and hdf5 files, the new characters are appealing and cause no problems -- the problem is the wider world of file types and applications. As I've said many times: for 60+ years, the world of software languages and applications has used very limited character sets for identifier names so that punctuation could be used for other things (e.g., . for parent.child, - for negation and subtraction, and [] for dimensions). You're going against an ocean of precedent.
And with your solution of sanitized names, you're neutering what you wanted: new options for variable and attribute names. You'll have the CF docs saying things like "although the preferred form is e.g., title-fr, in some applications this will appear as title_fr." If e.g., title-fr is going to end up as title_fr in some places, why not just use title_fr all of the time?
Yes, it is unfortunate that we're discussing a couple of proposals simultaneously (allowing .-[] in names vs allowing any Unicode character), but @turnbullerin started this thread by mentioning the full Unicode option and then changed to adding a limited set of punctuation, then changed to allowing .-[] (sorry if I got that a little wrong). And even you have mentioned different proposals are in the works. But to me, any proposal to support punctuation in identifiers is a bad idea.
In the spirit of CF's requirement that people give examples of why a change is really necessary, it would be really nice if you gave more examples showing why these changes are really needed so we can debate those one at a time. You're just jumping to the changes needed without giving the reasons (other than requesting e.g., title-fr, but you could use title_fr so that is to me not a good reason).
Finally, I don't know what the developers of hdf5 were thinking when they allowed full Unicode in variable and attribute names (which nc4 developers then utilized), but it could easily be that they were simply future-proofing their new file format. To me, that is very reasonable. I probably would have done the same. Then other groups (e.g. CF) can choose which characters are allowed in general for their domain and which have special meanings (e.g., / is disallowed because it is used to separate parent/subgroups). It was easy for hdf5/nc4 developers to support full Unicode since hdf5/nc4, when viewed by themselves, are self-contained worlds. The developers didn't concern themselves with the downstream effects of different characters in different client software because it wasn't their problem. But it is CF's problem. [I know they are focused largely on nc4, but they should also be focused on the languages and applications in which those files will be read.] And it is ERDDAP's problem. So I don't think allowing .-[], or punctuation in general, or full Unicode, is a good idea because of the problems in the wider world of software languages, file types, and applications.
Best wishes.
@BobSimons, thank you for your detailed response. I think a new switch to request original netCDF names would not be messy. I do not know ERDDAP, so that is only my off-the-cuff opinion. We can disagree about that. I will try to stop talking about a wide character set now.
My preferred localization requires two new characters, ASCII only: period and hyphen, such as title.fr-CA. I consider this optimal for future purposes; as in, better than e.g. title_fr_CA. You have a point in your earlier comment that this is a seemingly tiny change from other strategies which would be fully compatible and effective. Nonetheless, would it be easy and non-messy for ERDDAP to provide a limited switch for attribute names only, such that knowledgeable applications may request that those two characters be preserved?
You're asking if something is possible. I want to answer "is it a good idea?"
Again, you ask if changes to ERDDAP are possible, but you ignore the downstream effects (other than the messy solution of offering a switch that knowledgeable users can use to request sanitized attribute names), which is my big point.
I think that in R and Python, when a datafile is read in, attribute names can be represented as e.g., datasetID.variableName.attributeName . In that case, the '.' and '-' become trouble because '.' is used to indicate parent.child relationships and '-' is used for negation and subtraction. Maybe you can use datasetID.variableName."attributeNameWith.And-" but I don't know. And what about all the other analysis programs that you and I don't know about and don't even know they exist? Simple character sets avoid trouble.
I think the idea of a switch to request sanitized names is a bad idea. It is counter to the norm of metadata appearing in a consistent way in different places. There is no standard DAP way to make this request (although ERDDAP could add one, e.g., &sanitizedAttributeNames ). And again, this presumes the user knows ahead of time which apps s/he is going to use the file in and the requirements of the app. And when ERDDAP reads data files from a dataset with a mix of unsanitized and sanitized att names, how is it supposed to know that the sanitized attributes should be unsanitized? But the bigger point is: you and I don't know all about all of the clients so we don't know where this will cause problems. And I'm pretty sure this will be a big annoyance for users who get snagged by invalid attribute names.
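The round-trip problem raised here can be illustrated with a tiny Python sketch. The `sanitize` function below is hypothetical (ERDDAP has no such switch or function today); it simply replaces each of the characters . - [ ] with an underscore.

```python
import re

def sanitize(name: str) -> str:
    """Hypothetical sanitizer: replace each of . - [ ] with an underscore."""
    return re.sub(r"[.\-\[\]]", "_", name)

original = "title.fr-CA"
already_safe = "title_fr_CA"

# Both names map to the same sanitized form, so the mapping cannot be
# reversed: given title_fr_CA, there is no way to tell which one it was.
print(sanitize(original), sanitize(already_safe))
print(len({sanitize(original), sanitize(already_safe)}))
```

Because two distinct source names collapse to one sanitized name, software reading a mix of files cannot reliably restore the unsanitized form.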
Partly, with things like this, it feels like inviting trouble. I'm smart enough to see there will be problems in various places, but I'm not smart enough (I think no one is) to foresee all of the consequences. It's a bad idea to make changes when you can't foresee the consequences.
All of this just seems like so much trouble (and we won't know how much until we do it and it can't be easily undone) for so little benefit. title_fr_CA is such a simple extension of the CF standard and will not cause any problems with any client file types or analysis programs.
I find it interesting that people treat CF, ERDDAP, etc as so malleable. (Well, in a sense, they are.) But if the program you wanted to change were, e.g., Python, R, Igor, ArcGIS, MS Excel, or Postgresql, you wouldn't think to ask for a significant change like adding punctuation characters to identifiers. You would just use title_fr_CA and be back at work in 1 minute. Instead you're asking for significant changes to ERDDAP and you are unconcerned (apparently) about possible problems (like users being confused about when they need to request sanitized att names).
And partly, I think this is a slippery slope. If you get one or two punctuation characters approved, the requests for other characters will be easier. But again, they will cause problems in various places and you and I wouldn't be able to predict all of those consequences.
Again, I worked hard to expand attribute strings so they could be Unicode as much as possible. That was the important change.
Let me rephrase and emphasize one point: With your proposal to optionally sanitize attribute names, CF would have to say that both title.fr-CA and title_fr_CA are legal. That is a crazy situation (and an extra pain for software tasked with reading and interpreting DAP .das, .nc, .nccsv, and other files) given that the standard could be for just title_fr_CA.
Here's another consequence of your proposal to allow .-[] in attribute names and have a switch to specify whether a request should return .-[] in the names or sanitize the names:
- If the default for the switch is to return .-[] in the ERDDAP response data file, then it is likely that existing workflows of some users will suddenly fail because the .nc files they get from ERDDAP will suddenly have .-[] (when the data set adds title.fr-CA). I've tried really hard to not break/change existing behavior in ERDDAP partly because a big chunk of my time (and Roy's) was spent dealing with source datasets (I'm looking at you NCEI and NASA) where things changed (server moved, datasetID changed, directory changed, variable names changed, ...). It's bad when existing workflows break.
- If the default is for the switch to sanitize .-[], then what was the point of having .-[] in the attribute names on ERDDAP web pages but not in the data file response? And we are back to the problem of CF having to say that both title.fr-CA and title_fr_CA are legal.