User story As a Compliance Manager (Catherine M) I would like the country of origin [and date of collection] added to the sample metadata submitted for accessioning so that we comply with the Nagoya protocol

UPDATE Initial batch of work has standardised data collection in the manifests, and improved the interface in the rarely used front-end. We do not currently have any additional validation, or send the information to the EBI. These steps should follow

Who are the primary contacts for this story Catherine M Tom W

Acceptance criteria To be considered successful the solution must allow:

[x] Align country of origin options for samples to the EBI / ENA interface
[x] Establish if the EGA database requires the same treatment
[x] Establish if there is a fixed set of country names/abbreviations that should be employed
[x] Provide a mechanism (manifest?) for users to enter the country of origin per sample
- This already exists, but the interface will change once we add the controlled vocabulary.
[x] Provide a mechanism for users to correct the country of origin per sample
- This already exists, but the interface will change once we add the controlled vocabulary.
[ ] ~Validate that samples selected for accessioning have the required metadata in the correct format~
[ ] Validate country is supported before sending to EBI
[ ] ~Samples that fail validation should have a clear error message detailing the problem and steps to resolve~
[ ] Add accessioning report to indicate why samples aren't accessioning with clear description of failure
[ ] Update the Accessioning documentation on confluence to include country of origin
[ ] Inform stakeholders and Compliance Managers of the change and anticipated date of enforcement

Dependencies

References See Confluence for details of Accessioning in SequenceScape

Additional context Over the coming year the ENA database will enforce country of origin for all samples submitted This is to comply with the Nagoya Protocol which is an agreement aiming to share the benefits arising from the use of genetic resources in a fair and equitable way

ENA programmatic submission documentation Current default sample checklist

Dec 17 '21 13:12 TWJW-SANGER

Validation and collection

Country of origin

We already have country_of_origin in the sample metadata. It is a varchar(255) field.
There is ~no~ basic rails validation on country_of_origin which ensures that it can be encoded as utf8mb3. Essentially this prevents the use of emoji and rare kanji characters.
The field is present in most, but not all manifests.
- Present plate_default, plate_full, plate_library, plate_chromium_library, tube_default, tube_full, tube_multiplexed_library, tube_library_with_tag_sequences, tube_chromium_library, tube_multiplexed_library_with_tag_sequences, tube_multiplexed_chromium_library, tube_rack_default, heron
- Absent plate_rnachip, tube_rnachip, saphyr, long_read, cardinal
The manifests restrict length of country of origin to 30 characters
The field is highlighted in blue, indicating it should be filled in, but this is pretty much just advisory
The manifest contains the instruction: Please add a description up to a maximum of 30 uppercase letters. However nothing enforces this uppercase restriction

Date of collection

We already have date_of_sample_collection in the sample metadata, it is a varchar(255) rather than a DATE or DATETIME field. There is a # TODO[xxx]: Date field? in the code next to the definition
There is no rails validation on the field, not even the basic restriction to utf8mb3
The field is present in most, but not all manifests.
- Present plate_default, plate_full, plate_library, plate_chromium_library, tube_default, tube_full, tube_multiplexed_library, tube_library_with_tag_sequences, tube_chromium_library, tube_multiplexed_library_with_tag_sequences, tube_multiplexed_chromium_library, tube_rack_default, heron
- Absent plate_rnachip, tube_rnachip, saphyr, long_read, cardinal
The manifest restricts the field to 5 characters
The field is highlighted in blue, indicating it should be filled in, but this is pretty much just advisory
The manifest contains the heading DATE OF SAMPLE COLLECTION (MM/YY or YYYY only) with the instruction Please Enter either a Month and Year or a complete Year e.g. 04/05 or 2004. and validation error This must be either a combination of month and year, or a whole year, with no spaces.

Feb 03 '22 14:02 JamesGlover

Nagoya protocol: https://www.cbd.int/abs/doc/protocol/nagoya-protocol-en.pdf

Feb 03 '22 14:02 JamesGlover

Data Integrity

Country of origin

6563426 samples have no (NULL) data in the field

There are 471 different values, while a lot of these appear to be valid countries we also have:

Nationalities and ethnicities

A lot of these appear to be populated directly from forms. (eg. Any other Asian background)

Non-country geographical regions (Ie. Cities, Continents, Counties)
Invalid data (eg. 0, 50)
Corrupt data (eg. ?sterreich)
Abuses of the field (eg. RNA, Blank DNA neg control)
Various flavours of blank / Not applicable
Probably invalid but descriptive fields: (eg. lab_strain)
A few spelling errors
Lots of different flavours of some countries (UK, United Kingdom)

Very few of them are in block capitals.

I didn't see any fields that may be inadvertently exposing personal data, however it is possible that some of the geographic regions may be small enough that, combined with other data, an individual would be identifiable. The chances of this occurring increase with the invalid smaller regions, but could potentially be true for countries.

Date of collection

6856299 samples have no (NULL) data in the field

22661 samples have 0 in the field
2 are blank
20 are 00/01
9 are ?
1 is #NA
1 is ?2010
Some incomplete or ambiguous dates like 01-Aug. It is likely that this is a result of Excel auto-typing '01/08' which it 'helpfully' converts to '01-Aug' when in actuality, this was probably intended to refer to Jan 2008 given the requirements in the manifest. I'm a bit concerned that the same may apply to dates in the format 01-Feb-21 as well.
Invalid dates like: 0117-11-10
I'm a little suspicious of some of the dates I'm spotting. We have 1888 and 1891, and then regular dates from 1900 onwards. However given some of these appear more than once I assume they may represent historical samples. However we also have lots of samples for 2023-44359, which definitely indicates issues with using Excel auto-fill (especially as many of these follow ranges).
Very broad dates like 2013

The remainder are mostly what appear to be legitimate dates, but most not in the format suggested in the manifest. In many ways this is a good thing, as it means tightening up our collection and storage of this information will improve reportability, rather than just breaking existing reports. A few actually have timestamps, and while most of these are midnight, a few are more precise

Feb 03 '22 15:02 JamesGlover

Exposure

Country of origin

Internal

Country of origin is exposed in the warehouses, and via the V1 API
Country of origin is not currently exposed by the V2 api
Country of origin is exposed on sample pages in Sequencescape
Country of origin is exposed in the study reports

External

This data is not currently sent to either the ENA or EGA

Date of collection

Internal

Date of collection is exposed in the ML warehouse varchar(255), and via the V1 API (raw string)
Date of collection is not currently exposed by the V2 api
Date of collection is exposed on sample pages in Sequencescape

Feb 03 '22 15:02 JamesGlover

Exceptions

The EBI provide scope for exceptions https://www.ebi.ac.uk/about/news/technology-and-innovation/ena-new-metadata

Although the spatio-temporal information will become mandatory in most cases, some exceptions will be allowed when it is deemed necessary and the exception indicated to users.

Feb 04 '22 08:02 JamesGlover

ENA Requirements

The current default requirements are available here: https://www.ebi.ac.uk/ena/browser/view/ERC000011 No fields, including geographic data are currently flagged as required on the base checklist.

Country of origin

From the current list the most applicable fields would appear to be geographic location (country and/or sea) which has the following help text:

The geographical origin of the sample as defined by the country or sea. Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html).

However currently the options dropdown also contains some non-country options:

not applicable
not provided
not collected
restricted access

This list is accessible programmatically via: https://www.ebi.ac.uk/ena/browser/api/xml/ERC000011

The INSDC country list was last updated 'October 31, 2014' so it doesn't seem to be particularly volatile.

Some of the other checklists have stricter requirements. For example, in the Tree of Life checklist (https://www.ebi.ac.uk/ena/browser/view/ERC000053) the field is already flagged as required. (The options list appears to be the same though; not sure if this is true for ALL lists.)

Note, the sample XML schema definition doesn't validate individual attributes.

Collection Date

The linked document also mentions a requirement for 'collection date'

This maps to the field: collection_date which has the following help-text:

date the specimen was collected

Validated by a regex:

(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)|(^not collected$)|(^not provided$)|(^restricted access$)

This is accessible programmatically via: https://www.ebi.ac.uk/ena/browser/api/xml/ERC000011
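A minimal Ruby sketch of checking a value against the collection_date pattern quoted above. The constant and method names are hypothetical and not taken from Sequencescape. Ruby treats `^` and `$` as line anchors rather than string anchors, so the sketch rejects multi-line input explicitly instead of altering the published pattern.

```ruby
# The ENA collection_date pattern, copied verbatim from the checklist.
# ENA_COLLECTION_DATE and ena_collection_date? are hypothetical names.
ENA_COLLECTION_DATE = Regexp.new(
  '(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])' \
  '(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?' \
  '(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)' \
  '|(^not collected$)|(^not provided$)|(^restricted access$)'
)

# Returns true when the value would pass the checklist pattern.
def ena_collection_date?(value)
  text = value.to_s
  # In Ruby ^ and $ match at line boundaries, so refuse embedded newlines.
  !text.include?("\n") && ENA_COLLECTION_DATE.match?(text)
end

ena_collection_date?('2004')         # year only is accepted
ena_collection_date?('2004-05')      # year and month is accepted
ena_collection_date?('04/05')        # the MM/YY manifest format is rejected
ena_collection_date?('not provided') # missing-value terms are accepted
```

Note that the pattern only checks the shape of the value: a non-existent date such as 2013-02-30 still matches, so any calendar check would need to happen separately.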

EGA requirements

I haven't been able to find any guidelines regarding whether the EGA will be affected by these changes. I have reached out to helpdesk for comment.

The EGA have responded:

Thank you for contacting the EGA helpdesk team. At this time, we do not plan to require these fields for metadata submission. We are actively working to improve the metadata on EGA and this review may happen in Q3 of the year. However, at this time it is difficult to estimate if this will be implemented.

Feb 04 '22 09:02 JamesGlover

Synopsis of progress

It seems prudent to wait for the additional communication promised before April 1st before making substantial changes
We currently collect all the required information but do not send it to the ENA/EGA
We lack validation on the fields, thus have issues with data-integrity
The current recommendations on the manifest are incompatible with the EBI requirements at best, and damaging at worst.

Plan of action

[x] Draft RFC to warehouse users proposing changes to date_of_sample_collection (conversion to DATE) and country_of_origin columns (Removal of invalid fields). May involve creation of 'legacy' columns for incompatible historic data.
[ ] Update manifest recommendations to match upcoming requirements
[ ] Communicate change in requirements with SSRs
[ ] Migrate existing columns to 'legacy_*' versions
[ ] Create new DATE column for date_of_sample_collection
[ ] Create new sample_metadata_countries table for country names add country_of_origin_record association to sample_metadata. sample_metadata_countries should include valid_for_submission column
[ ] Populate sample_metadata_countries with current EBI list (Ie. including fields that may be removed)
[ ] Add accession reports to better identify failures. Can be hooked up to studies or manifests.

Post requirements

[ ] Send data to the ENA
[ ] Depending on response, send data to EGA. If not required may be safer not to send. Determine with data governance.

Feb 04 '22 12:02 JamesGlover

Draft RFC

RFC: Proposed changes to multi-lims warehouse sample table

Feedback can be contributed via the github discussion [Link] or directly via email.

In order to improve the value of the data stored within the ENA, and to meet commitments of the Nagoya protocol [1], the EBI will soon be requiring spatio-temporal information for all submitted samples [2]. We currently anticipate that this will cover the 'country_of_origin' and 'date_of_sample_collection' fields as collected in Sequencescape and presented in the multi-lims warehouse. Neither field is currently sent to the ENA or EGA.

As part of an initial investigation into supporting these requirements we've investigated the validation, persistence and data-integrity of the existing data. And as part of this we anticipate making some changes to the multi-lims warehouse. We hope that ultimately these will improve the quality of the persisted data however they will result in schema changes, and some differences in data.

country_of_origin

This is currently a free-text field in Sequencescape, however the requirements in the EBI[3] indicate a controlled vocabulary. This list is based on the INSDC country list, although it currently also supports non-country meta-entities, such as 'not collected' and 'restricted access'.

A brief analysis of data integrity revealed that this field is currently mainly unpopulated. However it also contains several entries that will cease to be valid with the new restrictions. Examples include clearly invalid data such as numbers, non-country geographical regions, such as 'Africa' or 'Forrest of dean', and synonyms such as 'UK' or spelling errors. There are also a large number of cases of the field being used to store nationality, or ethnic background.

There are also cases where it appears that the field has been repurposed to track other non-geographic information, such as containing RNA and IBS, neither of which appear to be valid three letter country codes.

In future we hope this column will match the controlled vocabulary used by the EBI. This change will obviously result in historical data changing, but should hopefully improve the quality of downstream reporting. In cases where it is not possible to unambiguously match data to a valid field, we hope to consult with the original owners of the sample metadata to provide corrected values. However we expect that it will not be possible in all situations, and in these cases the field will be populated with NULL.

NULL will be used to represent any fields when country_of_origin has not been specified. We welcome any discussion on whether 'not provided', part of the current EBI controlled vocabulary, would be more appropriate.

date_of_collection

This is also currently a free text field in Sequencescape and the multi-lims warehouse. The EBI requirements[2] specify that in future they will require 'The collection date of the sample, recording at least the year of collection.' Currently this data is validated by a regular expression [3].

In future we hope to convert this column to a DATETIME field. We hope that this greatly simplifies any reporting using this field. We've opted for DATETIME over date as some of our existing data has non-midnight timestamps attached, and the EBI supports higher resolution timestamps.

Currently this column is largely unpopulated. However along with obviously invalid data (#N/A, 0) the column contains a range of dates in a variety of formats. Unfortunately it also appears that Excel may have resulted in two data integrity issues.

We see several dates in the format '01-Aug', which initially appear to be ambiguous. However if a date is supplied in the MM/YY format the manifest suggests, then Excel converts 01/08 (January 2008) to 01/08/current_year which gets displayed as '01-Aug'. I have some concerns that dates in the format '02-Dec-19' may also be a side effect of this 'helpful' feature.

There is also reason to suspect that some years provided are invalid, as we have collection dates in the future. Given these often follow on consecutively, I suspect this is a side effect of Excel's auto-fill feature.

We hope to migrate all unambiguous dates to the date-time columns, and will work with data owners to try to update any dates which are ambiguous, or may have fallen foul of Excel's data-conversion. Any dates that can't be unambiguously migrated, or which were absent, will have a value of NULL.

legacy_data

We are keen to receive feedback on whether anyone feels the need to maintain legacy data, and are happy to work out the best ways to achieve this. Where possible it is likely we'll be able to migrate data to other columns (such as 'geographic_region') but we are willing to consider moving data to explicitly 'legacy' columns if absolutely necessary.

References
[1] Nagoya Protocol https://www.cbd.int/abs/
[2] EBI notification https://www.ebi.ac.uk/about/news/press-releases/ena-new-metadata
[3] EBI Default sample checklist: https://www.ebi.ac.uk/ena/browser/view/ERC000011
[4] INSDC country list https://www.insdc.org/country.html

Feb 04 '22 14:02 JamesGlover

INSDC Missing Value Reporting Terms

| INSDC term (top level) | INSDC term (lower level) | Definition |
| --- | --- | --- |
| not applicable | | information is inappropriate to report, can indicate that the standard itself fails to model or represent the information appropriately |
| missing | not collected | information of an expected format was not given because it has not been collected |
| | not provided | information of an expected format was not given, a value may be given at the later stage |
| | restricted access | information exists but can not be released openly because of privacy concerns |

Source: https://ena-docs.readthedocs.io/en/latest/submit/samples/missing-values.html#insdc-missing-value-reporting-terms

Feb 08 '22 20:02 JamesGlover

Now I've got the full list pulled down I've found 309 different values which cannot be mapped back to countries from the valid list. I've decided before touching any of the data, including the fairly safe corrections 'UK -> United Kingdom' I'd like to get some of the initial changes out.

I think I'd like to provide a tool to assist with some of the safer, simple corrections.
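A rough Ruby sketch of what the safer corrections could look like, assuming a lookup against the controlled vocabulary plus a small, hand-reviewed synonym map. The lists below are tiny illustrative samples and every name is hypothetical; the real vocabulary would come from the EBI checklist.

```ruby
# Illustrative subset of the controlled vocabulary (hypothetical constant).
VALID_COUNTRIES = ['United Kingdom', 'Austria', 'not provided', 'not collected'].freeze

# Hand-reviewed synonyms considered safe to correct automatically
# (hypothetical; each entry would need agreeing with data owners).
SAFE_CORRECTIONS = { 'uk' => 'United Kingdom' }.freeze

# Case-insensitive index of the valid names.
BY_DOWNCASED_NAME = VALID_COUNTRIES.to_h { |name| [name.downcase, name] }.freeze

# Returns the canonical name for a raw value, or nil when no safe
# mapping exists and a human needs to look at it.
def suggest_country(raw)
  key = raw.to_s.strip.downcase
  return nil if key.empty?

  BY_DOWNCASED_NAME[key] || SAFE_CORRECTIONS[key]
end

suggest_country('UK')                # 'United Kingdom' via the synonym map
suggest_country(' united kingdom ')  # 'United Kingdom' via case-insensitive match
suggest_country('RNA')               # nil, left for manual review
```

Returning nil rather than guessing keeps the tool limited to unambiguous cases; anything else stays untouched until someone reviews it.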

Feb 09 '22 13:02 JamesGlover

Having a bit of trouble handling dates:

Excel is a bit of a pain, and even setting a column to a date-type allows nonsense input
You can mitigate this a bit with some validation, such as ensuring a date is earlier than some point in the future
But we have the difficulty that we want to support low-precision dates, such as just a year, or a year and a month

And the latter causes issues when reaching Ruby, as the ruby date library doesn't allow non-existing dates. (MySQL does, with the right permissions attached)

I'm leaning towards 'YYYY-MM-DD', but probably as a text field still to allow arbitrary precision.
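As a sketch of how a text field could still be validated at variable precision: the Ruby below accepts 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD', fills any missing part with 1 purely for the calendar check, and rejects future dates. The method name and the `today:` parameter are hypothetical, and this is only one possible approach.

```ruby
require 'date'

# Year, optionally followed by month, optionally followed by day.
PARTIAL_DATE = /\A(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?\z/

# Returns true for an existing, non-future date at any supported precision.
# today: is injectable so the check can be tested deterministically.
def valid_partial_date?(value, today: Date.today)
  match = PARTIAL_DATE.match(value.to_s)
  return false unless match

  year = match[1].to_i
  # Missing month or day defaults to 1, only for the purpose of validation.
  month = match[2]&.to_i || 1
  day = match[3]&.to_i || 1

  Date.valid_date?(year, month, day) && Date.new(year, month, day) <= today
end

valid_partial_date?('2013')       # year only
valid_partial_date?('2013-02')    # year and month
valid_partial_date?('2013-02-30') # rejected, the day does not exist
valid_partial_date?('01-Aug')     # rejected, wrong shape
```

The original string is what gets stored, so the precision the user supplied is preserved; the padded date is never persisted.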

Feb 09 '22 16:02 JamesGlover

Checking with the EBI if they mind us redistributing the XML, as it would simplify the process and reduce our load on their systems.

Feb 22 '22 12:02 JamesGlover

Need to identify items for new user story and then can be closed and moved to Done.

Nov 17 '22 15:11 SujitDey2022

We now have more details from EBI/ENA here: https://www.insdc.org/news/insdc-spatiotemporal-metadata-minimum-standards-update-03-03-2023/

Mar 20 '23 10:03 TWJW-SANGER

List of tasks Divided in 2 stories:

First stage Strict solution to make it work only with right data, and all wrong historic data will have a default NULL value for these fields:

[x] Add all this part inside a feature flag
[x] Add country_of_origin and collection date to list of tags for ENA for sample (add in app/models/sample.rb a line include_tag(:country_of_origin) and same for collection date)
[x] In app/models/accessionable/base.rb class Tag change the label name to use the field names: geographic location (country and/or sea) and collection_date when generating the XML.
[ ] In app/models/accessionable/base.rb class Tag add validation so we send null values for country of origin and collection date if it does not match the required regular expressions/list of values.
[ ] Update manifest recommendations to match upcoming requirements
[ ] Communicate change in requirements with SSRs

Second stage Curate all historic data:

[ ] Migrate existing columns to 'legacy_*' versions
[ ] Create new DATE column for date_of_sample_collection
[ ] Create new sample_metadata_countries table for country names, add country_of_origin_record association to sample_metadata, that may have NULL values if the sample metadata doesn't have country.
[ ] Populate sample_metadata_countries with current EBI list (Ie. including fields that may be removed)
[ ] Test in the ENA dev testing environment and check with them that it is sending it right
[ ] Check if this is needed for EGA

Apr 03 '23 15:04 emrojo

Hi, Quick query. There is no business value in curating the historic data prior to this requirement, is the second stage above related to enforcing the strict requirement in the database? And if so, is enforcing the requirements at an application level good enough that we could drop the second stage? Many thanks, Tom

Apr 11 '23 13:04 TWJW-SANGER

As discussed with @SujitDey2022 , could there be an addition to this story, whereby the mandatory columns in manifests are highlighted in red so it is clear to the service user which columns are mandatory?

Apr 12 '23 14:04 LizCook-ec20

Post talk with Neil and Tom:

We'll send the following flag:

not provided	Information of an expected format was not given, a value may be given at the later stage	data agreement established pre-2023

for everything that is not after 15/May/2023 and does not match the regular expression for the field.
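One way this rule could be sketched in Ruby. It assumes the 15 May 2023 comparison is made against the sample's creation date, which the note above does not spell out, and it leaves the handling of newer invalid values to the caller. All names are hypothetical.

```ruby
require 'date'

# Cutoff from the discussion above; which date is compared against it
# (sample creation is assumed here) is an assumption.
CUTOFF = Date.new(2023, 5, 15)
MISSING_VALUE = 'not provided'

# value:      the raw metadata value
# created_on: Date the sample was created (assumed basis for the cutoff)
# valid:      callable returning true when the value is acceptable to the ENA
def value_for_accessioning(value, created_on:, valid:)
  return value if valid.call(value)
  # Not after the cutoff and invalid: send the agreed missing-value term.
  return MISSING_VALUE if created_on <= CUTOFF

  # After the cutoff and invalid: no agreed fallback, so signal it to the caller.
  nil
end

is_country = ->(v) { v == 'United Kingdom' } # stand-in validator for the example
value_for_accessioning('RNA', created_on: Date.new(2020, 1, 1), valid: is_country)
```

Passing the validator in as a callable lets the same helper serve both fields, with the country list check for one and the collection_date pattern for the other.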

Apr 24 '23 08:04 emrojo

How to test the contents of the sample published:

curl -v -X GET <testing_server_url_and_path>/<accession_number> -u "<username>:<password>"

May 05 '23 14:05 emrojo

Some more documentation about the change happening in ENA:

https://ena-docs.readthedocs.io/en/latest/faq/spatiotemporal-metadata.html

May 18 '23 15:05 emrojo

DPL-211 As a Compliance Manager (Catherine M) I would like the country of origin added to the sample metadata submitted for accessioning so that we comply with the Nagoya protocol (C=M, V=4)

