CoordinateCleaner

new metagenomics (MGnify) filter proposition and ideas

Open jhnwllr opened this issue 5 years ago • 3 comments

Background

GBIF has recently begun publishing records from the metagenomics publisher MGnify: https://www.gbif.org/publisher/ab733144-7043-4e88-bd4f-fca7bf858880

Typically these records are bacteria or other microbes. Often, however, they are trace DNA of a plant, animal, insect or something else.

https://www.gbif.org/occurrence/taxonomy?publishing_org=ab733144-7043-4e88-bd4f-fca7bf858880

Problems

  • Right now I think it is difficult for the average user to judge whether a given taxonomic hit from a metagenomics study should count as an accurate occurrence.
  • Metagenomics studies can also produce 1000s of occurrences of a single taxon at one location, and this might cause problems for naive users.
  • The fitness-for-use of metagenomics data for different purposes probably still needs to be studied...
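To make the second problem concrete, here is a minimal sketch of how one might flag taxon/location combinations that pile up. It is written in Python only for illustration and is not part of CoordinateCleaner. The function name `crowded_sites`, the default threshold, and the assumption that records are dicts with `species`, `decimalLatitude` and `decimalLongitude` fields (as in a GBIF download) are all hypothetical choices.

```python
from collections import Counter


def crowded_sites(records, threshold=1000):
    """Return {(species, lat, lon): count} for combinations with >= threshold records.

    Hypothetical helper: field names assume a GBIF-style download.
    """
    counts = Counter(
        (r["species"], r["decimalLatitude"], r["decimalLongitude"])
        for r in records
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Toy data: one taxon repeated at a single point, plus one ordinary record.
records = (
    [{"species": "Quercus robur", "decimalLatitude": 55.7, "decimalLongitude": 12.6}] * 5
    + [{"species": "Fagus sylvatica", "decimalLatitude": 48.1, "decimalLongitude": 11.6}]
)
flagged = crowded_sites(records, threshold=3)
```

Whether flagged combinations should be dropped, collapsed to one record, or kept is a fitness-for-use question this sketch does not answer.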

Solutions

  • The simplest solution would be something like cc_metagenome() that filters out datasets published by MGnify or other metagenomics publishers.
  • Other solutions might use the fields Organism quantity and Sample size value to judge the quality of the resulting taxon label, but this would probably need expert input.
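The first idea could look roughly like the sketch below. It is written in Python for illustration only: cc_metagenome() does not exist, and `drop_metagenomics` is a hypothetical stand-in for it. The publisher key is taken from the MGnify URL above. The field name `publishingOrgKey` assumes records come from a GBIF download, and a complete list of metagenomics publishers is an open question, so the set here contains only MGnify.

```python
# Publisher key copied from the MGnify publisher URL in this issue.
# Assumption: other metagenomics publishers would be added to this set by hand.
METAGENOMICS_PUBLISHERS = {"ab733144-7043-4e88-bd4f-fca7bf858880"}


def drop_metagenomics(records, publishers=METAGENOMICS_PUBLISHERS,
                      key_field="publishingOrgKey"):
    """Keep only records whose publisher key is not in `publishers`.

    Records without the key field are kept, since they cannot be attributed.
    """
    return [r for r in records if r.get(key_field) not in publishers]


records = [
    {"species": "Quercus robur",
     "publishingOrgKey": "ab733144-7043-4e88-bd4f-fca7bf858880"},
    {"species": "Quercus robur",
     "publishingOrgKey": "hypothetical-other-publisher"},
    {"species": "Fagus sylvatica"},  # no publisher key at all
]
kept = drop_metagenomics(records)
```

Filtering on the publisher key is blunt: it removes every record from those publishers, including ones that may be perfectly usable.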

jhnwllr, Mar 28 '19 12:03

A new post on the GBIF data blog outlines some of the problems with this type of data:

https://data-blog.gbif.org/post/gbif-molecular-data-quality/

jhnwllr, Apr 29 '19 08:04

Thanks for the suggestions. Yes, these genomic data can be problematic. I am not sure we should add a separate function for this, since the metadata are probably the best way to address the problem. For instance, the "IndividualCount" information provided with GBIF data can be very helpful! Are you aware of a list of all providers in GBIF that provide metagenomics data?

azizka, May 04 '20 20:05

This issue is discussed further here: https://discourse.gbif.org/t/metagenomics-and-metacrap/1583/13

It has been partly solved on the GBIF side, but "the problem" will likely continue to get worse.

jhnwllr, May 05 '20 09:05