sequencescape
DPL-211 As a Compliance Manager (Catherine M) I would like the country of origin added to the sample metadata submitted for accessioning so that we comply with the Nagoya protocol (C=M, V=4)
User story As a Compliance Manager (Catherine M) I would like the country of origin [and date of collection] added to the sample metadata submitted for accessioning so that we comply with the Nagoya protocol
UPDATE The initial batch of work has standardised data collection in the manifests, and improved the interface in the rarely used front-end. We do not yet have any additional validation, nor do we send the information to the EBI. These steps should follow.
Who are the primary contacts for this story Catherine M Tom W
Acceptance criteria To be considered successful the solution must allow:
- [x] Align country of origin options for samples to the EBI / ENA interface
- [x] Establish if the EGA database requires the same treatment
- [x] Establish if there is a fixed set of country names/abbreviations that should be employed
- [x] Provide a mechanism (manifest?) for users to enter the country of origin per sample
- This already exists, but the interface will change once we add the controlled vocabulary.
- [x] Provide a mechanism for users to correct the country of origin per sample
- This already exists, but the interface will change once we add the controlled vocabulary.
- [ ] ~Validate that samples selected for accessioning have the required metadata in the correct format~
- [ ] Validate country is supported before sending to EBI
- [ ] ~Samples that fail validation should have a clear error message detailing the problem and steps to resolve~
- [ ] Add accessioning report to indicate why samples aren't accessioning with clear description of failure
- [ ] Update the Accessioning documentation on confluence to include country of origin
- [ ] Inform stakeholders and Compliance Managers of the change and anticipated date of enforcement
Dependencies
References See Confluence for details of Accessioning in SequenceScape
Additional context Over the coming year the ENA database will enforce country of origin for all samples submitted. This is to comply with the Nagoya Protocol, which is an agreement aiming to share the benefits arising from the use of genetic resources in a fair and equitable way.
ENA programmatic submission documentation Current default sample checklist
Validation and collection
Country of origin
- We already have `country_of_origin` in the sample metadata. It is a `varchar(255)` field.
- There is ~no~ basic rails validation on `country_of_origin` which ensures that it can be encoded as utf8mb3. Essentially this prevents the use of emoji and rare kanji characters.
- The field is present in most, but not all manifests.
- Present plate_default, plate_full, plate_library, plate_chromium_library, tube_default, tube_full, tube_multiplexed_library, tube_library_with_tag_sequences, tube_chromium_library, tube_multiplexed_library_with_tag_sequences, tube_multiplexed_chromium_library, tube_rack_default, heron
- Absent plate_rnachip, tube_rnachip, saphyr, long_read, cardinal
- The manifests restrict length of country of origin to 30 characters
- The field is highlighted in blue, indicating it should be filled in, but this is pretty much just advisory
- The manifest contains the instruction: "Please add a description up to a maximum of 30 uppercase letters." However, nothing enforces this uppercase restriction.
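To illustrate what the utf8mb3 restriction mentioned above means in practice: utf8mb3 stores at most three bytes per character, so it only covers code points up to U+FFFF. The sketch below is not the validation used in Sequencescape; `utf8mb3_compatible?` is a hypothetical helper written only to show which values pass.

```ruby
# Hypothetical helper, not the actual Sequencescape validator.
# utf8mb3 can only store code points up to U+FFFF (three bytes in UTF-8),
# so emoji and rare kanji outside that range are rejected.
def utf8mb3_compatible?(value)
  value.to_s.each_char.all? { |char| char.ord <= 0xFFFF }
end

utf8mb3_compatible?('United Kingdom') # plain ASCII passes
utf8mb3_compatible?("\u{1F600}")      # an emoji does not
```

Note that a check like this says nothing about whether the value is a real country, which is why a controlled vocabulary is still needed.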
Date of collection
- We already have `date_of_sample_collection` in the sample metadata. It is a `varchar(255)` rather than a DATE or DATETIME field. There is a `# TODO[xxx]: Date field?` in the code next to the definition.
- There is no rails validation on the field, not even the basic restriction to utf8mb3
- The field is present in most, but not all manifests.
- Present plate_default, plate_full, plate_library, plate_chromium_library, tube_default, tube_full, tube_multiplexed_library, tube_library_with_tag_sequences, tube_chromium_library, tube_multiplexed_library_with_tag_sequences, tube_multiplexed_chromium_library, tube_rack_default, heron
- Absent plate_rnachip, tube_rnachip, saphyr, long_read, cardinal
- The manifest restricts the field to 5 characters
- The field is highlighted in blue, indicating it should be filled in, but this is pretty much just advisory
- The manifest contains the heading `DATE OF SAMPLE COLLECTION (MM/YY or YYYY only)` with the instruction `Please Enter either a Month and Year or a complete Year e.g. 04/05 or 2004.` and the validation error `This must be either a combination of month and year, or a whole year, with no spaces.`
Nagoya protocol: https://www.cbd.int/abs/doc/protocol/nagoya-protocol-en.pdf
Data Integrity
Country of origin
6563426 samples have no (NULL) data in the field
There are 471 different values. While a lot of these appear to be valid countries, we also have:
- Nationalities and ethnicities
- A lot of these appear to be populated directly from forms. (eg. Any other Asian background)
- Non-country geographical regions (Ie. Cities, Continents, Counties)
- Invalid data (eg. 0, 50)
- Corrupt data (eg. ?sterreich)
- Abuses of the field (eg. RNA, Blank DNA neg control)
- Various flavours of blank / Not applicable
- Probably invalid but descriptive fields: (eg. lab_strain)
- A few spelling errors
- Lots of different flavours of the same country (UK, United Kingdom)
Very few of them are in block capitals.
I didn't see any fields that may be inadvertently exposing personal data. However, it is possible that some of the geographic regions are small enough that, combined with other data, an individual would be identifiable. The chance of this occurring increases with the invalid smaller regions, but could potentially be true for countries.
Date of collection
6856299 samples have no (NULL) data in the field
- 22661 samples have 0 in the field
- 2 are blank
- 20 are 00/01
- 9 are ?
- 1 is #NA
- 1 is ?2010
- Some incomplete or ambiguous dates like 01-Aug. This is likely a result of Excel auto-typing '01/08', which it 'helpfully' converts to '01-Aug'. Given the requirements in the manifest, the value was probably intended to refer to Jan 2008. I'm a bit concerned that the same may apply to dates in the format 01-Feb-21 as well.
- Invalid dates like: 0117-11-10
- I'm a little suspicious of some of the dates I'm spotting. We have 1888 and 1891, and then regular dates from 1900 onwards. Given that some of these appear more than once, I assume they may represent historical samples. However, we also have lots of samples with dates like 2023-44359, which definitely indicates issues with Excel auto-fill (especially as many of these follow ranges).
- Very broad dates like 2013
The remainder are mostly what appear to be legitimate dates, but most not in the format suggested in the manifest. In many ways this is a good thing, as it means tightening up our collection and storage of this information will improve reportability, rather than just breaking existing reports. A few actually have timestamps, and while most of these are midnight, a few are more precise
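The categories above could be triaged mechanically before anyone looks at individual records. The following is a minimal sketch under stated assumptions: `triage_collection_date`, the category names, and the patterns are all hypothetical and derived only from the example values listed in this comment, not from the full data set.

```ruby
# Placeholder values seen in the data that carry no date information.
PLACEHOLDERS = ['', '0', '?', '#NA', '00/01'].freeze

# Hypothetical triage helper: sorts a raw date_of_sample_collection
# string into a rough category for later review.
def triage_collection_date(raw)
  value = raw.to_s.strip
  return :placeholder if PLACEHOLDERS.include?(value)

  case value
  when /\A\d{2}-[A-Z][a-z]{2}(-\d{2})?\z/ then :excel_suspect          # '01-Aug', '01-Feb-21'
  when /\A\d{4}-\d{5}\z/                  then :excel_autofill_suspect # '2023-44359'
  when /\A[12]\d{3}(-\d{2}(-\d{2})?)?\z/  then :iso_like               # '2013', '2013-04-01'
  else :needs_review                                                   # '0117-11-10', '?2010'
  end
end
```

The `:iso_like` branch only checks the shape of the value, not whether the month and day exist, so it is a first pass rather than a validator.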
Exposure
Country of origin
Internal
- Country of origin is exposed in the warehouses, and via the V1 API
- Country of origin is not currently exposed by the V2 api
- Country of origin is exposed on sample pages in Sequencescape
- Country of origin is exposed in the study reports
External
- This data is not currently sent to either the ENA or EGA
Date of collection
Internal
- Date of collection is exposed in the ML warehouse (`varchar(255)`), and via the V1 API (raw string)
- Date of collection is not currently exposed by the V2 api
- Date of collection is exposed on sample pages in Sequencescape
Exceptions
The EBI provide scope for exceptions https://www.ebi.ac.uk/about/news/technology-and-innovation/ena-new-metadata
Although the spatio-temporal information will become mandatory in most cases, some exceptions will be allowed when it is deemed necessary and the exception indicated to users.
ENA Requirements
The current default requirements are available here: https://www.ebi.ac.uk/ena/browser/view/ERC000011 No fields, including geographic data, are currently flagged as required on the base checklist.
Country of origin
From the current list the most applicable fields would appear to be geographic location (country and/or sea)
which has the following help text:
The geographical origin of the sample as defined by the country or sea. Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html).
However currently the options dropdown also contains some non-country options:
- not applicable
- not provided
- not collected
- restricted access
This list is accessible programmatically via: https://www.ebi.ac.uk/ena/browser/api/xml/ERC000011
The INSDC country list was last updated 'October 31, 2014' so it doesn't seem to be particularly volatile.
Some of the other checklists have stricter requirements. For example, the Tree of Life checklist (https://www.ebi.ac.uk/ena/browser/view/ERC000053) already flags the field as required. (The options list appears to be the same though; I'm not sure if this is true for ALL lists.)
Note, the sample XML schema definition doesn't validate individual attributes.
Collection Date
The linked document also mentions a requirement for 'collection date'
This maps to the field: collection_date
which has the following help-text:
date the specimen was collected
Validated by a regex:
(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)|(^not collected$)|(^not provided$)|(^restricted access$)
This is accessible programmatically via: https://www.ebi.ac.uk/ena/browser/api/xml/ERC000011
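As a rough illustration of how the regex above could be applied on our side, here is a sketch of a Ruby port. The constant and method names are hypothetical. The one deliberate change is that `^` and `$` are replaced by `\A` and `\z`, because in Ruby `^` and `$` match at line breaks rather than only at the start and end of the string.

```ruby
# Ruby port of the checklist regex quoted above, with the three
# alternatives folded into one group and anchored to the whole string.
ENA_COLLECTION_DATE = %r{\A(?:[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?|not collected|not provided|restricted access)\z}

# Hypothetical helper: true when the value would pass the checklist regex.
def ena_collection_date?(value)
  value.is_a?(String) && ENA_COLLECTION_DATE.match?(value)
end

ena_collection_date?('2023-05')  # year and month are accepted
ena_collection_date?('04/05')    # the MM/YY format from the manifest is not
```

This makes the mismatch with the current manifest instructions concrete: values in the MM/YY format the manifest asks for do not match.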
EGA requirements
I haven't been able to find any guidelines regarding whether the EGA will be affected by these changes. I have reached out to helpdesk for comment.
The EGA have responded:
Thank you for contacting the EGA helpdesk team. At this time, we do not plan to require these fields for metadata submission. We are actively working to improve the metadata on EGA and this review may happen in Q3 of the year. However, at this time it is difficult to estimate if this will be implemented.
Synopsis of progress
- It seems prudent to wait for the additional communication promised before April 1st before making substantial changes
- We currently collect all the required information but do not send it to the ENA/EGA
- We lack validation on the fields, thus have issues with data-integrity
- The current recommendations on the manifest are incompatible with the EBI requirements at best, and damaging at worst.
Plan of action
- [x] Draft RFC to warehouse users proposing changes to date_of_sample_collection (conversion to DATE) and country_of_origin columns (Removal of invalid fields). May involve creation of 'legacy' columns for incompatible historic data.
- [ ] Update manifest recommendations to match upcoming requirements
- [ ] Communicate change in requirements with SSRs
- [ ] Migrate existing columns to 'legacy_*' versions
- [ ] Create new DATE column for date_of_sample_collection
- [ ] Create new `sample_metadata_countries` table for country names, add `country_of_origin_record` association to `sample_metadata`. `sample_metadata_countries` should include a `valid_for_submission` column
- [ ] Populate `sample_metadata_countries` with current EBI list (i.e. including fields that may be removed)
- [ ] Add accession reports to better identify failures. Can be hooked up to studies or manifests.
Post requirements
- [ ] Send data to the ENA
- [ ] Depending on response, send data to EGA. If not required may be safer not to send. Determine with data governance.
Draft RFC
RFC: Proposed changes to multi-lims warehouse sample table
Feedback can be contributed via the github discussion [Link] or directly via email.
In order to improve the value of the data stored within the ENA, and to meet commitments of the Nagoya protocol [1], the EBI will soon require spatio-temporal information for all submitted samples [2]. We currently anticipate that this will cover the 'country_of_origin' and 'date_of_sample_collection' fields as collected in Sequencescape and presented in the multi-lims warehouse. Neither field is currently sent to the ENA or EGA.
As part of an initial investigation into supporting these requirements we've investigated the validation, persistence and data-integrity of the existing data. As a result, we anticipate making some changes to the multi-lims warehouse. We hope that ultimately these will improve the quality of the persisted data; however, they will result in schema changes, and some differences in data.
country_of_origin
This is currently a free-text field in Sequencescape, however the requirements in the EBI[3] indicate a controlled vocabulary. This list is based on the INSDC country list, although it currently also supports non-country meta-entities, such as 'not collected' and 'restricted access'.
A brief analysis of data integrity revealed that this field is currently mainly unpopulated. However, it also contains several entries that will cease to be valid with the new restrictions. Examples include clearly invalid data such as numbers; non-country geographical regions, such as 'Africa' or 'Forrest of dean'; synonyms such as 'UK'; and spelling errors. There are also a large number of cases of the field being used to store nationality, or ethnic background.
There are also cases where it appears that the field has been repurposed to track other non-geographic information, such as containing RNA and IBS, neither of which appears to be a valid three letter country code.
In future we hope this column will match the controlled vocabulary used by the EBI. This change will obviously result in historical data changing, but should hopefully improve the quality of downstream reporting. In cases where it is not possible to unambiguously match data to a valid field, we hope to consult with the original owners of the sample metadata to provide corrected values. However, we expect that this will not be possible in all situations, and in these cases the field will be populated with NULL.
NULL will be used to represent any fields when country_of_origin has not been specified. We welcome any discussion on whether 'not provided', part of the current EBI controlled vocabulary, would be more appropriate.
date_of_collection
This is also currently a free text field in Sequencescape and the multi-lims warehouse. The EBI requirements[2] specify that in future they will require 'The collection date of the sample, recording at least the year of collection.' Currently this data is validated by a regular expression [3].
In future we hope to convert this column to a DATETIME field. We hope that this greatly simplifies any reporting using this field. We've opted for DATETIME over DATE as some of our existing data has non-midnight timestamps attached, and the EBI supports higher resolution timestamps.
Currently this column is largely unpopulated. However, along with obviously invalid data (#N/A, 0) the column contains a range of dates in a variety of formats. Unfortunately it also appears that Excel may have resulted in two data integrity issues.
We see several dates in the format '01-Aug', which initially appear to be ambiguous. However if a date is supplied in the MM/YY format the manifest suggests, then Excel converts 01/08 (January 2008) to 01/08/current_year which gets displayed as '01-Aug'. I have some concerns that dates in the format '02-Dec-19' may also be a side effect of this 'helpful' feature.
There is also reason to suspect that some years provided are invalid, as we have collection dates in the future. Given these often follow on consecutively, I suspect this is a side effect of Excel's auto-fill feature.
We hope to migrate all unambiguous dates to the date-time columns, and will work with data owners to try to update any dates which are ambiguous, or may have fallen foul of Excel's data-conversion. Any dates that can't be unambiguously migrated, or which were absent, will have a value of NULL.
legacy_data
We are keen to receive feedback on whether anyone feels the need to maintain legacy data, and are happy to work out the best ways to achieve this. Where possible it is likely we'll be able to migrate data to other columns (such as 'geographic_region') but we are willing to consider moving data to explicitly 'legacy' columns if absolutely necessary.
References
[1] Nagoya Protocol https://www.cbd.int/abs/
[2] EBI notification https://www.ebi.ac.uk/about/news/press-releases/ena-new-metadata
[3] EBI Default sample checklist: https://www.ebi.ac.uk/ena/browser/view/ERC000011
[4] INSDC country list https://www.insdc.org/country.html
INSDC Missing Value Reporting Terms
| INSDC term (top level) | INSDC term (lower level) | Definition |
|---|---|---|
| not applicable | | information is inappropriate to report, can indicate that the standard itself fails to model or represent the information appropriately |
| missing | not collected | information of an expected format was not given because it has not been collected |
| | not provided | information of an expected format was not given, a value may be given at the later stage |
| | restricted access | information exists but can not be released openly because of privacy concerns |

Source: https://ena-docs.readthedocs.io/en/latest/submit/samples/missing-values.html#insdc-missing-value-reporting-terms
Now I've got the full list pulled down I've found 309 different values which cannot be mapped back to countries from the valid list. Before touching any of the data, including fairly safe corrections such as 'UK -> United Kingdom', I'd like to get some of the initial changes out.
I think I'd like to provide a tool to assist with some of the safer, simple corrections.
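A tool for the safer corrections might look something like the sketch below. Everything here is an assumption for illustration: `suggest_country`, the two constants, and the tiny vocabulary are hypothetical, and the real allowed list would come from the EBI checklist rather than being hard-coded.

```ruby
# Tiny illustrative subset of the controlled vocabulary (assumption:
# the real list would be loaded from the EBI checklist).
ALLOWED_COUNTRIES = ['United Kingdom', 'Austria', 'not provided', 'not collected'].freeze

# Hypothetical hand-curated synonyms considered safe to apply automatically.
SAFE_SYNONYMS = { 'uk' => 'United Kingdom' }.freeze

# Returns the controlled-vocabulary term for a raw value, or nil when
# no unambiguous match exists and a human needs to look at it.
def suggest_country(raw)
  key = raw.to_s.strip.downcase
  return nil if key.empty?

  ALLOWED_COUNTRIES.find { |name| name.downcase == key } || SAFE_SYNONYMS[key]
end

suggest_country('UK')         # maps via the synonym table
suggest_country('lab_strain') # no safe match, left for manual review
```

Returning nil rather than guessing keeps the tool limited to corrections that need no judgement; anything else stays with the data owners.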
Having a bit of trouble handling dates:
- Excel is a bit of a pain, and even setting a column to a date-type allows nonsense input
- You can work around this to some extent with validation, such as ensuring a date is not in the future
- But we have the difficulty that we want to support low-precision dates, such as just a year, or a year and a month
And the latter causes issues when reaching Ruby, as the Ruby date library doesn't allow non-existent dates. (MySQL does, depending on the configured SQL mode.)
I'm leaning towards 'YYYY-MM-DD', but probably as a text field still to allow arbitrary precision.
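A text field with arbitrary precision can still be checked for impossible dates. The sketch below is one possible approach, not a decided design: `valid_partial_date?` and `PARTIAL_DATE` are hypothetical names, and it only covers the 'YYYY', 'YYYY-MM' and 'YYYY-MM-DD' shapes discussed here.

```ruby
require 'date'

# Year, with optional month and optional day.
PARTIAL_DATE = /\A(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?\z/

# Hypothetical helper: accepts low-precision dates stored as text while
# still rejecting dates that do not exist, such as 30 February.
def valid_partial_date?(text)
  match = PARTIAL_DATE.match(text.to_s)
  return false unless match

  year, month, day = match.captures.map { |part| part&.to_i }
  # Missing month/day default to 1 so Date only checks the parts supplied.
  Date.valid_date?(year, month || 1, day || 1)
end

valid_partial_date?('2023')       # year only is fine
valid_partial_date?('2023-02-30') # a non-existent day is rejected
```

Because the value stays a string, the stored precision is exactly what the user supplied, and the same string can be passed on for submission without padding.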
Checking with the EBI if they mind us redistributing the XML, as it would simplify the process and reduce our load on their systems.
Need to identify items for new user story and then can be closed and moved to Done.
We now have more details from EBI/ENA here: https://www.insdc.org/news/insdc-spatiotemporal-metadata-minimum-standards-update-03-03-2023/
List of tasks Divided in 2 stories:
First stage Strict solution to make it work only with right data, and all wrong historic data will have a default NULL value for these fields:
- [x] Add all this part inside a feature flag
- [x] Add country_of_origin and collection date to list of tags for ENA for sample (add in app/models/sample.rb a line `include_tag(:country_of_origin)` and same for collection date)
- [x] In app/models/accessionable/base.rb class Tag change the label name to use the field names `geographic location (country and/or sea)` and `collection_date` when generating the XML.
- [ ] In app/models/accessionable/base.rb class Tag add validation so we send null values for country of origin and collection date if it does not match the required regular expressions/list of values.
- [ ] Update manifest recommendations to match upcoming requirements
- [ ] Communicate change in requirements with SSRs
Second stage Curate all historic data:
- [ ] Migrate existing columns to 'legacy_*' versions
- [ ] Create new DATE column for date_of_sample_collection
- [ ] Create new sample_metadata_countries table for country names, add country_of_origin_record association to sample_metadata, which may have NULL values if the sample metadata doesn't have a country.
- [ ] Populate sample_metadata_countries with current EBI list (Ie. including fields that may be removed)
- [ ] Test in the ENA dev testing environment and check with them that it is sending it right
- [ ] Check if this is needed for EGA
Hi, Quick query. There is no business value in curating the historic data prior to this requirement, is the second stage above related to enforcing the strict requirement in the database? And if so, is enforcing the requirements at an application level good enough that we could drop the second stage? Many thanks, Tom
As discussed with @SujitDey2022 , could there be an addition to this story, whereby the mandatory columns in manifests are highlighted in red so it is clear to the service user which columns are mandatory?
Post talk with Neil and Tom:
We'll send the following flag:
not provided | Information of an expected format was not given, a value may be given at the later stage | data agreement established pre-2023 |
---|
for everything that is not after 15/May/2023 and does not match the regular expression for the field.
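The rule agreed above could be sketched roughly as follows. This is an illustration under several assumptions, none of which are confirmed in this thread: `value_for_submission` is a hypothetical helper, the date compared against the cut-off is assumed to be the date the sample was recorded, and 'not provided' is assumed to be the literal term sent.

```ruby
require 'date'

CUTOFF = Date.new(2023, 5, 15)  # 15/May/2023, from the discussion above
MISSING_VALUE = 'not provided'  # assumption: the literal term submitted

# Hypothetical helper. Returns the value to submit for one field:
# - the value itself when it matches the expected format
# - 'not provided' when it is invalid and the sample is not after the cut-off
# - nil otherwise, leaving the failure to be reported elsewhere
def value_for_submission(value, recorded_on:, format:)
  return value if value.is_a?(String) && format.match?(value)
  return MISSING_VALUE if recorded_on <= CUTOFF

  nil
end

# Simplified stand-in for the real collection date regex.
simple_format = /\A[12]\d{3}(-\d{2})?\z/
value_for_submission('01-Aug', recorded_on: Date.new(2019, 1, 1), format: simple_format)
```

What should happen to invalid values on samples after the cut-off is not stated here, so the sketch returns nil for those rather than choosing a behaviour.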
How to test the contents of the sample published:
curl -v -X GET <testing_server_url_and_path>/<accession_number> -u "<username>:<password>"
Some more documentation about the change happening in ENA:
https://ena-docs.readthedocs.io/en/latest/faq/spatiotemporal-metadata.html