Inf 595/update oa schemas

Open kathrynnapier opened this issue 1 year ago • 5 comments

Schema descriptions updated for the aggregate and doi json files.

New files created as new fields have been added to both schemas.

Aggregate schema based off the current author table, as it had the most fields (please let me know if this should be changed!).

Formatting improvements to come in future updates. If the descriptions for the most part make sense, it would be good to get them into the bigquery tables sooner rather than later and we can continue to improve over time.

Schemas have been uploaded to coki-scratch-space.Kathryn.test_agg_schema and coki-scratch-space.Kathryn.test_doi_schema as a test and for ease of viewing.

kathrynnapier avatar Apr 04 '23 14:04 kathrynnapier

And my sincere apologies for the mess I made by creating the branch off a VERY old version of develop!

kathrynnapier avatar Apr 04 '23 14:04 kathrynnapier

Codecov Report

All modified lines are covered by tests :white_check_mark:

Comparison is base (ffa9d4d) 95.18% compared to head (0059e67) 95.22%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #164      +/-   ##
===========================================
+ Coverage    95.18%   95.22%   +0.04%     
===========================================
  Files           20       20              
  Lines         5209     5238      +29     
  Branches       720      727       +7     
===========================================
+ Hits          4958     4988      +30     
  Misses         161      161              
+ Partials        90       89       -1     
Files Coverage Δ
...ic_observatory_workflows/workflows/doi_workflow.py 94.35% <100.00%> (+0.60%) :arrow_up:

... and 1 file with indirect coverage changes


codecov[bot] avatar Apr 04 '23 15:04 codecov[bot]

@jdddog do the new schemas necessarily need to have updated dates?

keegansmith21 avatar Apr 05 '23 02:04 keegansmith21

> @jdddog do the new schemas necessarily need to have updated dates?

Yeah, the schemas don't need dates, as dated schemas are only used when backfilling older versions of a dataset.
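To illustrate the point about dated schemas, here is a minimal sketch of one plausible selection rule: use the most recent dated schema that is not newer than the release being backfilled, and fall back to the undated schema otherwise. The function name `pick_schema`, the file names, and the rule itself are assumptions for illustration, not the project's actual lookup code.

```python
from datetime import date
from typing import Dict, Optional


def pick_schema(available: Dict[Optional[date], str], release: date) -> str:
    """Pick a schema file for a release (hypothetical helper).

    `available` maps a schema date to a file name; the key None marks the
    undated schema. The latest dated schema on or before `release` wins,
    otherwise the undated schema is used.
    """
    dated = sorted(d for d in available if d is not None and d <= release)
    if dated:
        return available[dated[-1]]
    if None in available:
        return available[None]
    raise ValueError(f"No schema available for release {release}")


# Hypothetical file names, for illustration only.
schemas = {
    None: "doi.json",
    date(2021, 1, 1): "doi_2021-01-01.json",
    date(2022, 6, 1): "doi_2022-06-01.json",
}
backfill_schema = pick_schema(schemas, date(2021, 8, 1))
undated_only = pick_schema({None: "doi.json"}, date(2023, 4, 4))
```

With only an undated file present, as in this PR, every release resolves to that one file.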

jdddog avatar Sep 06 '23 02:09 jdddog

As requested, I have added a function to create the DOI schema based off definitions instead of pulling it from the doi_.json file. The paths to the schemas are now attached to the SQLQuery object instead of being passed down with the kwargs for each parallel task.

For tables such as Unpaywall, PubMed and OpenAlex, all of the fields from their respective source tables are brought into the DOI table. However, the Crossref Events (events), open_citations, coki and affiliation parts of the DOI table have been separated out into their own schemas and placed in the "intermediate" folder, as they all contain calculated fields produced in either the "intermediate_" task or the "create_doi" stage of the workflow.

The definition of the Crossref metadata part of the schema is messy in comparison, as it combines a few original fields with new ones calculated when creating the intermediate table.
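As a rough sketch of the approach described above, the DOI schema can be assembled by nesting each source or intermediate schema as a RECORD field in BigQuery's JSON schema format. The function names (`make_record`, `build_doi_schema`) and the sample fields are hypothetical and are not taken from the actual code in this PR.

```python
from typing import Dict, List


def make_record(name: str, fields: List[dict], description: str = "") -> dict:
    """Wrap a list of BigQuery field definitions in a nullable RECORD (hypothetical helper)."""
    return {
        "name": name,
        "type": "RECORD",
        "mode": "NULLABLE",
        "description": description,
        "fields": fields,
    }


def build_doi_schema(
    source_schemas: Dict[str, List[dict]],
    intermediate_schemas: Dict[str, List[dict]],
) -> List[dict]:
    """Compose a DOI table schema from per-dataset definitions (illustrative only).

    Source schemas are copied in whole; intermediate schemas describe the
    calculated fields and are nested in the same way.
    """
    schema = [{"name": "doi", "type": "STRING", "mode": "REQUIRED", "description": "The DOI."}]
    for name, fields in source_schemas.items():
        schema.append(make_record(name, fields))
    for name, fields in intermediate_schemas.items():
        schema.append(make_record(name, fields))
    return schema


# Made-up field definitions, for illustration only.
doi_schema = build_doi_schema(
    {"unpaywall": [{"name": "is_oa", "type": "BOOLEAN", "mode": "NULLABLE"}]},
    {"events": [{"name": "count", "type": "INTEGER", "mode": "NULLABLE"}]},
)
```

In practice the field lists would be loaded from the JSON schema files rather than written inline.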

The schemas for the aggregate tables still need to be addressed. I will separate them out into their own schemas soon.

alexmassen-hane avatar Sep 27 '23 08:09 alexmassen-hane