Create repository of extractors

Open vsoch opened this issue 7 years ago • 6 comments

Right now these are living in the dockerfiles repo (a full example), but we should also provide simple examples in a separate repo, with the goal of being able to plug easily into other tools (e.g., datalad, @yarikoptic).

These extractors (in progress!) will be here: https://github.com/openschemas/extractors

@yarikoptic I'm done with the schemaorg python tooling, and I'm waiting to hear from the library about use cases to do the first implementations with datalad. I'll also have "ImageDefinition" examples finished soon, just waiting on a few PRs into container-diff to get all the metadata that I want. There will be a full "dockerfiles" example with embedded metadata for schemaorg also soon (it's parsing now).

The general goal will be that if there is a datalad user with some dataset thing that fits a schema.org definition, they can grab one of these extractors to use with datalad (and schemaorg) to generate the metadata (web view) for their dataset.
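
As a rough illustration of the kind of output such an extractor would produce, here is a minimal sketch that builds a schema.org "Dataset" record and wraps it for embedding in a web page. It uses only the Python standard library and does not show the schemaorg package or any datalad API; the helper names (`make_dataset_jsonld`, `to_script_tag`) and the example values are hypothetical.

```python
import json


def make_dataset_jsonld(name, description, url, keywords=()):
    """Build a schema.org Dataset record as a plain dict (hypothetical helper)."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    if keywords:
        record["keywords"] = list(keywords)
    return record


def to_script_tag(record):
    """Wrap the record in the script tag commonly used to embed JSON-LD in HTML."""
    body = json.dumps(record, indent=2, sort_keys=True)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Placeholder values, not a real dataset.
record = make_dataset_jsonld(
    name="Example dataset",
    description="Placeholder description of an example dataset.",
    url="https://example.org/dataset",
    keywords=["example"],
)
print(to_script_tag(record))
```

A real extractor would fill these fields from the dataset itself rather than from hard-coded arguments.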

Another question for you - do you have any datasets / community needs that would do well with a Python extractor with datalad? Since these are ready to go and I'm really wanting to get started working (and I'm not sure how long the library would take) it might be faster to find another use case too.

vsoch avatar Nov 13 '18 19:11 vsoch

oh, you were busy indeed, weren't you @vsoch ?

any datasets / community needs that would do well with a Python extractor with datalad?

I am not quite sure what you are asking for: all the extractors we have are written in Python as well... Here you seem to concentrate on container/image definitions, so if you are asking about those, then we do not have many of them in datalad land yet. Within our niceman project we are trying to achieve similar extraction, though, while concentrating on information sufficient to identify the entire component (package, container image, etc.), so that we would have clear versioning semantics (where available) and origin information, and later on the same environment could be reconstructed, or multiple environments compared (similarly to container-diff).

yarikoptic avatar Nov 15 '18 18:11 yarikoptic

It doesn't have to be containers; my aim is to develop the integration with datalad, so I'm good with whatever :) I am using Dockerfiles (containers) just because I spent a day last year creating a little database of over 100K, so it's good to test things with. An extractor, the way I'm doing it, would likely use datalad with a schemaorg extraction so the metadata also plugs nicely into search.

vsoch avatar Nov 15 '18 19:11 vsoch

Here is the little writeup for the dockerfiles example and extractors, although I haven't finished up doing the ImageDefinition (new schemaorg definition that will get metadata via container-diff) yet. https://vsoch.github.io/2018/datasets/

vsoch avatar Nov 15 '18 19:11 vsoch

Have you looked at the extractors we already have in DataLad? e.g.

  • generic ones (such as for EXIF in images, XMP in PDFs etc) https://github.com/datalad/datalad/tree/master/datalad/metadata/extractors
  • neuroimaging specific ones: https://github.com/datalad/datalad-neuroimaging/tree/master/datalad_neuroimaging/extractors

What we aren't doing in those ATM at all is harmonization: we stick to whatever the underlying metadata standard is (e.g. XMP) and just collate/manage all the extracted metadata so it becomes available for search etc. In our previous version (0.9) we did such harmonization for basic terms (description etc.) but then decided to take a step back and concentrate on aggregation, while eventually picking up harmonization efforts of others, probably implemented as a "meta-extractor": an extractor from extracted metadata. Maybe that is where your effort could help? ATM, through the datasets available from http://datasets.datalad.org, there are over a thousand terms from different metadata extractors:

$> datalad_ search --show-keys full | nl
     1	annex.MRI
     2	 in  1 datasets
     3	 has 1 unique values: u'yes'
     4	annex.age
     5	 in  1 datasets
     6	 has 1 unique values: 'unhashable 1688 out of 1690 entries'
     7	annex.dcterms_format
     8	 in  1 datasets
     9	 has 1 unique values: u'image/nifti'
    10	annex.diagnosis
...
  3712	xmp.xmpTPg-PlateNames
  3713	 in  1 datasets
  3714	 has 1 unique values: 'unhashable 0 out of 1 entries'
  3715	xmp.xmpTPg-SwatchGroups<xmpG-groupName>
  3716	 in  1 datasets
  3717	 has 1 unique values: 'unhashable 0 out of 1 entries'
  3718	xmp.xmpTPg-SwatchGroups<xmpG-groupType>
  3719	 in  1 datasets
  3720	 has 1 unique values: 'unhashable 0 out of 1 entries'
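
The "meta-extractor" idea above (an extractor that works on already-extracted metadata) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `KEY_MAP` table and the `harmonize` function are hypothetical, the left-hand keys follow the `<extractor>.<term>` pattern visible in the listing, and the target names are illustrative rather than an agreed vocabulary. No DataLad API is used.

```python
# Hypothetical mapping from extractor-specific keys to harmonized terms.
# The target names on the right are illustrative only.
KEY_MAP = {
    "annex.dcterms_format": "encodingFormat",
    "annex.diagnosis": "diagnosis",
}


def harmonize(extracted, key_map=KEY_MAP):
    """Split already-extracted metadata into (harmonized, unmapped) dicts."""
    harmonized, unmapped = {}, {}
    for key, value in extracted.items():
        target = key_map.get(key)
        if target is None:
            # Keep unknown keys untouched so nothing is lost.
            unmapped[key] = value
        else:
            # Several source keys may map to one term, so collect values in a list.
            harmonized.setdefault(target, []).append(value)
    return harmonized, unmapped


# Example input using two keys and values from the listing above.
extracted = {"annex.dcterms_format": "image/nifti", "annex.MRI": "yes"}
harmonized, unmapped = harmonize(extracted)
print(harmonized)
print(unmapped)
```

Keeping the unmapped keys separate means the aggregated metadata stays searchable as before, while the harmonized subset can feed a web view or a dataset description.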

Harmonization, at least at the level of a dataset description, is also needed in our case for our rudimentary datasets browser (again on the same http://datasets.datalad.org): https://github.com/datalad/datalad/issues/2403, and for our datasets to finally get indexed by Google Datasets (https://github.com/datalad/datalad/issues/2793).

On our end, we could within http://datasets.datalad.org at least

  • include your datasets into our distribution and thus make them searchable etc
  • adopt your setup for visualizing metadata
  • looking forward, we could probably adopt the schemaorg extractor as the one to provide that basic metadata (description) for our webview, with or without any other harmonization effort.

yarikoptic avatar Nov 15 '18 23:11 yarikoptic

@nsheff you might want to take a look at Datalad for another way to have (some) metadata be parsed automatically. I shamefully have not worked on it yet because I don't have many (real use case) datasets to manage.

vsoch avatar Apr 25 '19 14:04 vsoch

Just want to add another note here - if anyone has a dataset that would conform to Google Datasets (or schema.org) and wants a Datalad extractor, I'm looking for this use case to better develop, and I can offer to help out.

vsoch avatar Nov 11 '19 15:11 vsoch