connect ODISCat as the main ODIS config

Open jmckenna opened this issue 1 year ago • 5 comments

  • from discussions, also to clarify that ODISCat purpose is to list your organization's Products or Services

cc @arnounesco @pbuttigieg @fils

jmckenna avatar Aug 17 '23 14:08 jmckenna

ODISCat purpose is to list your organization's Products or Services

it's to list data sources - what those data sources describe is secondary. One organisation / individual in OceanExpert can be linked to one or more ODISCat data source entries.

These can be APIs, web services, portals, or any other mechanism through which (meta)data can be acquired

pbuttigieg avatar Oct 19 '23 14:10 pbuttigieg

For reference, the new ODISCat pattern template (which we drafted together on 2023-08-17) is here: odisCatOrganization-example.json (link updated on 2024-04-26 to the new repo)

jmckenna avatar Oct 19 '23 14:10 jmckenna

see #308

arnounesco avatar Oct 19 '23 15:10 arnounesco

should not be closed, #308 is about the format of the pattern, not about ODISCat providing it

arnounesco avatar Oct 19 '23 15:10 arnounesco

setting urgency label here (@pbuttigieg please adjust as necessary - I created 3 new labels for urgency)

jmckenna avatar Apr 26 '24 12:04 jmckenna

reporting an internal ODISCat issue here as well, as it applies to this ticket:

@arnounesco some important questions/points for the ODISCat-ODIS connection:

  • is the ODISCat sitemap (and the JSON-LD) accessible on Production? (here was the staging sitemap)
    • for example, ODISCat id #40 does not contain any JSON-LD (see validator), but it does on staging
  • for the ODISCat-ODIS connection, in the JSON-LD for ODISCat, we will use the url value for itemOffered (see template), which will point to the ODIS-Arch URL value from the ODISCat entry
  • we need a way for an ODISCat administrator to mark/flag an ODISCat entry as ready-for-harvest-into-ODIS
    • previously this was done by manually maintaining a YAML file in the ODIS repo, but instead this will be done through ODISCat itself
    • similarly, we also need a way for that administrator to mark the ODISCat entry as disable-ODIS-harvest, as over time a partner's endpoint could become unmaintained, therefore affecting the ODIS graph/searches (to discuss in tomorrow's WP2 meeting)

related to https://github.com/iodepo/ODISCat/issues/103

cc @pbuttigieg

jmckenna avatar Oct 08 '24 15:10 jmckenna

@arnounesco I've updated the ODISCat JSON-LD template with @pbuttigieg's changes (to use @type CreativeWork)

jmckenna avatar Oct 09 '24 14:10 jmckenna

@arnounesco This looks good.. just one small issue...

  {
            "@context": {
                "@vocab": "https://schema.org/"
            },
            "@id": "https://catalogue.odis.org/view/256
        ",
            "@type": "Organization",
                        "email": "info@ico",

There is a control character \n at the end of the @id value. Would not be an issue in the object literals, but in the subject IRI it's not a valid character.

I can parse such things out of course client side, but better to have it valid server side.

Note that google validator (https://validator.schema.org/#url=http%3A%2F%2Fcatalogue.odis.org%2Fview%2F256) fixes such things. Sometimes I kinda wish they wouldn't. Or at least have a "strict" mode.

If you can use a trim function on the strings or something like that, it is likely a simple fix.
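
As a rough illustration of the trim suggestion, here is a minimal Python sketch. The ODISCat server code is not shown in this thread, so the language and the helper name `trim_ids` are assumptions made only for the example; the same idea applies in whatever language renders the JSON-LD.

```python
import json


def trim_ids(node):
    """Recursively strip surrounding whitespace from every "@id" string."""
    if isinstance(node, dict):
        return {
            key: value.strip()
            if key == "@id" and isinstance(value, str)
            else trim_ids(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [trim_ids(item) for item in node]
    return node


# Sample mirroring the snippet above: the "@id" ends with "\n" plus spaces.
raw = '{"@id": "https://catalogue.odis.org/view/256\\n        ", "@type": "Organization"}'
cleaned = trim_ids(json.loads(raw))
print(json.dumps(cleaned, indent=2))
```

Only "@id" values are touched here, so whitespace inside ordinary literals is left as the author wrote it.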

Thanks Doug

fils avatar Oct 23 '24 12:10 fils

@fils I cannot reproduce this. How did you get that content? I tried viewing the source and downloading the page, and there is no newline anywhere. There is also no newline to be seen in the code. None of this means you are wrong, but I cannot check the result of any action I take.

arnounesco avatar Oct 23 '24 15:10 arnounesco

I also cannot reproduce. (I use the command :set list inside vi on Ubuntu, to show hidden characters for the test entry)

  curl -OL https://catalogue.odis.org/view/256
  vi 256
    :set list

gives:

{$
            "@context": {$
                "@vocab": "https://schema.org/"$
            },$
            "@id": "https://catalogue.odis.org/view/256",$
            "@type": "Organization",$
                        "email": "info@xxxx",$

jmckenna avatar Oct 23 '24 16:10 jmckenna

Interesting.. I see what you are both seeing too.

Let me check if the python library is messing something up. There might be a processing setting I need to play with.

fils avatar Oct 23 '24 20:10 fils

So I tried with extruct rather than BeautifulSoup and I still see it.

I get

{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@id": "https://catalogue.odis.org/view/263\n        ",
  "@type": "Organization",
  "email": "[email protected]",

So now you see the \n with spaces or a tab after it..

I'm trying to resolve why I see this in python, with two different libraries, but you don't see it in vi.

fils avatar Oct 23 '24 21:10 fils

@jmckenna @arnounesco

really odd, if I look in the "source view" of the browser it looks fine.. but no matter how I pull it down with python, I get

{'@context': {'@vocab': 'https://schema.org/'}, '@id': 'https://catalogue.odis.org/view/257\n        ', '@type': 'Organization', 'email': '[email protected]', 'contactPoint': 

with the \n in the @id string.

Still trying to explain this.

fils avatar Oct 23 '24 21:10 fils

OK, I think I found it. The python package "response" seems to be the issue, I replaced it with httpx and that seems to be working now. Very odd, but no interest in resolving the issue with that package, will simply use httpx.

Thanks!

FYI, I also indexed with Gleaner, which did work but found 1 error in the record at https://catalogue.odis.org/view/1105, which is confirmed at: https://validator.schema.org/#url=https%3A%2F%2Fcatalogue.odis.org%2Fview%2F1105

Gleaner reports

 Error in unmarshaling json: invalid character ' ' in string escape code"
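
For anyone who wants to catch this class of error before harvesting, here is a minimal Python sketch using only the standard `json` module. The `bad` sample is made up for illustration (a backslash followed by a space inside a string, which is not a valid JSON escape); the actual content of record 1105 is not reproduced in this thread, and `find_json_error` is a hypothetical helper name.

```python
import json


def find_json_error(text):
    """Return None if text parses as JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.msg} (line {err.lineno}, column {err.colno})"
    return None


# Hypothetical samples: "\ " is an invalid escape sequence in a JSON string.
bad = '{"name": "Ocean\\ Data"}'
good = '{"name": "Ocean Data"}'

print(find_json_error(bad))
print(find_json_error(good))
```

Strict parsers such as this one reject the record outright, whereas more lenient tools may repair it silently.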

fils avatar Oct 23 '24 22:10 fils

So, I went ahead and activated the GitHub action for the configuration builder for ODISCat.
After fixing a few typos in the requirements.txt it seems to be working, but there is an odd regression in the YAML output. I need to check the version of Python and the libraries installed in the action VM.

There also seems to be an odd error condition when the generated config file doesn't have any changes from the previous version. Reviewing this.

In the end, there are some items we use in the config file that are not currently in the ODIS Catalog properties.

Will build a list of these for this issue.

fils avatar Nov 04 '24 03:11 fils

The YAML issue is resolved; the code now generates the config.

Some observations:

  • The config generated from the code will work, but lacks some of the properties we maintain currently
  • There are 88 items pulled from the ODISCat
  • Of those 88, 40 lack a URL, leaving only 48 items
  • Of those 48, two are duplicated, leaving only 46 unique sitemap/sitegraph URLs
  • The previous production config file has 53 items, so 7 entries in the latest prod file seem to be missing from ODISCat

Note:

  • I can put in a check to address the URL issue with a regex to check for a valid URL.
  • To ensure the names are unique, I prefix them with the ODISCat record ID, but we can change this
  • I push the null properties in; we can prune them later. Some of these may or may not be in the graph, I need to check on that.
  • Assuming the invalid URLs are removed, this config file will work (and does), but I am using the "current prod" below until all the entries are in sync. However, I can switch at any time with basically just a file reference swap.
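
The filtering described in the notes above could look roughly like the following Python sketch. It is not the actual config builder: the record fields (`id`, `name`, `url`), the name format, and the deliberately loose URL pattern are all assumptions made for illustration.

```python
import re

# Loose check (an assumption): http(s) scheme, a host without whitespace, optional path.
URL_PATTERN = re.compile(r"^https?://[^\s/]+(/\S*)?$")


def build_sources(records):
    """Keep records with a valid URL, drop duplicate URLs, prefix names with the record ID."""
    seen = set()
    sources = []
    for record in records:
        url = (record.get("url") or "").strip()
        if not URL_PATTERN.match(url):
            continue  # missing or malformed URL
        if url in seen:
            continue  # this sitemap/sitegraph URL is already listed
        seen.add(url)
        sources.append({"name": f"{record['id']}_{record['name']}", "url": url})
    return sources


# Hypothetical records standing in for what is pulled from ODISCat.
records = [
    {"id": 40, "name": "example-a", "url": "https://example.org/sitemap.xml"},
    {"id": 41, "name": "example-b", "url": None},
    {"id": 42, "name": "example-c", "url": "https://example.org/sitemap.xml"},
    {"id": 43, "name": "example-d", "url": "not a url"},
]
sources = build_sources(records)
print(sources)
```

With duplicates, the first record seen wins; a different tie-break rule may be preferable once the duplicated entries have been reviewed.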

Refs:

  • current prod: https://github.com/iodepo/odis-arch/blob/master/collection/config/production-sources.yaml

  • latest generated config: https://github.com/iodepo/odis-arch/blob/schema-dev-df/workflows/actions/odiscat/gleanerconfig.yaml

fils avatar Nov 05 '24 17:11 fils