CoordinateCleaner
New metagenomics (MGnify) filter: proposal and ideas
Background
GBIF has recently begun publishing records from a metagenomics publisher, MGnify: https://www.gbif.org/publisher/ab733144-7043-4e88-bd4f-fca7bf858880
Typically these records are bacteria or other microbes. Often, however, they are trace DNA of a plant, animal, insect or some other organism.
https://www.gbif.org/occurrence/taxonomy?publishing_org=ab733144-7043-4e88-bd4f-fca7bf858880
Problems
- Right now I think it is difficult for the average user to judge whether a certain taxonomic hit from a metagenomics study should count as an accurate occurrence.
- Metagenomics studies can also produce thousands of occurrences of a single taxon at one location, which might cause problems for naive users.
- Probably the fitness-for-use of metagenomics data for different purposes needs to be studied...
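The second point can be checked on a download before deciding how to handle it. The following is a minimal sketch in Python (used here only to illustrate the logic, not as R code for the package). It assumes records are plain dictionaries carrying the GBIF fields `taxonKey`, `decimalLatitude` and `decimalLongitude`; the function name `find_pileups`, the threshold and the toy records are hypothetical.

```python
from collections import Counter

def find_pileups(records, threshold=1000):
    """Return {(taxonKey, lat, lon): count} for groups with at least `threshold` records."""
    counts = Counter(
        (rec["taxonKey"], rec["decimalLatitude"], rec["decimalLongitude"])
        for rec in records
    )
    return {key: n for key, n in counts.items() if n >= threshold}

# Invented toy data: three records of one taxon at one point, one record elsewhere.
toy_records = [
    {"taxonKey": 1, "decimalLatitude": 55.0, "decimalLongitude": 12.0},
    {"taxonKey": 1, "decimalLatitude": 55.0, "decimalLongitude": 12.0},
    {"taxonKey": 1, "decimalLatitude": 55.0, "decimalLongitude": 12.0},
    {"taxonKey": 2, "decimalLatitude": 40.0, "decimalLongitude": -3.0},
]

# A low threshold is used so the toy data triggers it.
pileups = find_pileups(toy_records, threshold=3)
```

A suitable threshold would depend on the use case; the default of 1000 simply mirrors the order of magnitude mentioned above.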
Solutions
- The simplest solution would be something like `cc_metagenome()`, which simply filters out datasets published by MGnify or other metagenomics publishers.
- Other solutions might try to use the fields "Organism quantity" and "Sample size value" to judge the quality of the resulting taxon label, but this would probably need expert input.
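The first option could look roughly like the sketch below. It is written in Python only to show the logic; it is not an existing CoordinateCleaner function. It assumes each record is a dictionary with a `publishingOrgKey` field holding the GBIF publisher UUID (the MGnify key is taken from the publisher URL above). The `value` argument with `"clean"` and `"flagged"` options is loosely modelled on the package's other `cc_*` functions and is an assumption, as are the toy records.

```python
# UUID from the MGnify publisher page linked in the Background section.
MGNIFY_KEY = "ab733144-7043-4e88-bd4f-fca7bf858880"

def cc_metagenome(records, publisher_keys=(MGNIFY_KEY,),
                  key_field="publishingOrgKey", value="clean"):
    """Hypothetical filter: drop or flag records from metagenomics publishers.

    value="clean"   -> list of records not published by any key in publisher_keys
    value="flagged" -> list of booleans, True where the record passes the test
    """
    blocked = set(publisher_keys)
    passed = [rec.get(key_field) not in blocked for rec in records]
    if value == "flagged":
        return passed
    if value == "clean":
        return [rec for rec, ok in zip(records, passed) if ok]
    raise ValueError("value must be 'clean' or 'flagged'")

# Invented toy records for illustration only.
records = [
    {"species": "Escherichia coli", "publishingOrgKey": MGNIFY_KEY},
    {"species": "Quercus robur", "publishingOrgKey": "some-other-publisher"},
    {"species": "Homo sapiens", "publishingOrgKey": MGNIFY_KEY},
]

cleaned = cc_metagenome(records)
flags = cc_metagenome(records, value="flagged")
```

Passing the publisher keys as an argument keeps the list of metagenomics publishers outside the function, so it can grow without changing the filter itself.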
A new post on the GBIF data blog outlines some of the problems with this type of data:
https://data-blog.gbif.org/post/gbif-molecular-data-quality/
Thanks for the suggestions. Yes, these genomic data can be problematic. I am not sure we should add a separate function for this, since the metadata are probably the best way to address the problem. For instance, the "individualCount" information provided with GBIF data can be very helpful! Are you aware of a list of all providers in GBIF that publish metagenomics data?
This issue is discussed more here: https://discourse.gbif.org/t/metagenomics-and-metacrap/1583/13
This issue has been partly solved on the GBIF side, but "the problem" will likely continue to get worse.