
Use XML values for volume metadata in API

Open jcushman opened this issue 7 years ago • 1 comments

For example, https://api.case.law/v1/volumes/NOTALEPH001226/ shows start_year 0, when the xml_start_year value in the database is 1980.

jcushman avatar Nov 28 '18 15:11 jcushman

Values Overview

  • spine_[start/end]_year

    • used in API
    • in 18,614 volumes, both fields have a value of 0
    • in 1,145 volumes, both fields are null
    • 4 volumes have years that don't fit (1|2)[0-9]{3}. The values are 194444, 197, 20001 and 66
  • xml_[start/end]_year

    • used only as a backup to populate start_year
    • in 2,178 volumes, xml_start_year is null
    • in 725 volumes, xml_end_year is null
  • [start/end]_year

    • doesn't really seem to be used anywhere; it seems to be referred to almost exclusively in the models file or the fastcase ingest, where it's used as a backup
    • in 1,147 volumes, both fields are null
    • in 20,240 volumes, start_year is set to 0
    • in 20,246 volumes, end_year is set to 0
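
A minimal sketch of how the counts above could be bucketed, assuming each year field comes back as an int or None. The function name `classify_year` and the bucket labels are made up for illustration and are not from the codebase; the regex is the one quoted in the list.

```python
import re

# Pattern quoted in the overview: a four-digit year starting with 1 or 2.
YEAR_RE = re.compile(r"(1|2)[0-9]{3}")


def classify_year(value):
    """Bucket a raw year value as 'null', 'zero', 'malformed', or 'ok'.

    Hypothetical helper; assumes `value` is an int or None.
    """
    if value is None:
        return "null"
    if value == 0:
        return "zero"
    # fullmatch so that 194444 or 20001 don't pass on a four-digit prefix
    if YEAR_RE.fullmatch(str(value)):
        return "ok"
    return "malformed"


# The four outliers listed above all land in the 'malformed' bucket.
outliers = [194444, 197, 20001, 66]
outlier_buckets = [classify_year(v) for v in outliers]
```

Using `fullmatch` rather than `match` matters here, since 194444 and 20001 both start with something that looks like a valid year.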

Differences

All three values agree in only 264 volumes.

xml_start_year = spine_start_year: 448
xml_end_year = spine_end_year: 585

start_year = spine_start_year: 30,163
end_year = spine_end_year: 30,286

start_year = xml_start_year: 396
end_year = xml_end_year: 585

Digging into spine_end_year and end_year differences

Here are the instances where they disagree, neither value is null, and neither value is zero. They mostly look like simple keying or transcription errors, or OCR errors, that could be corrected fairly easily and automatically by comparing against the cases. I also included xml_end_year, but it was not used in the calculation.

end_year_spine_end_year_disagreements.csv
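
For reference, the filter described above can be sketched roughly like this. It assumes the volumes are available as dicts keyed by the field names; the helper names and the sample rows are hypothetical (the rows are invented, not real volumes), and this is not the query that produced the CSV.

```python
def usable(value):
    """A year is usable for comparison if it is neither null nor zero."""
    return value is not None and value != 0


def end_year_disagreements(volumes):
    """Return volumes where end_year and spine_end_year are both usable but differ.

    xml_end_year is carried along in the rows but plays no part in the filter.
    """
    return [
        v for v in volumes
        if usable(v.get("end_year"))
        and usable(v.get("spine_end_year"))
        and v["end_year"] != v["spine_end_year"]
    ]


# Invented rows, purely to show the filter's behavior.
sample = [
    {"end_year": 1980, "spine_end_year": 1980, "xml_end_year": 1980},  # agree
    {"end_year": 1908, "spine_end_year": 1980, "xml_end_year": 1980},  # disagree
    {"end_year": 0, "spine_end_year": 1975, "xml_end_year": None},     # zero, skipped
    {"end_year": None, "spine_end_year": 1960, "xml_end_year": 1960},  # null, skipped
]
rows = end_year_disagreements(sample)
```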

Potential Remediation

Though this obviously isn't a comprehensive analysis, I can't see any reason to keep more than one of these values. Instances where there's only one value that really is a year wouldn't be worse off regardless of accuracy. I think we could figure out most of the larger differences (e.g. off by n * 10, off by n * 100, and larger) with a partially automated or script-assisted process based on the other volume metadata values or case metadata. The off-by-1 issues might be a little trickier to fix because the problems are less obvious and probably more varied (may require looking at the PDFs?), and there are a lot of them.
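
As a rough illustration of the triage I have in mind, something like the following could sort disagreeing pairs into the buckets mentioned above. The function name and bucket labels are hypothetical, and it only looks at the numeric difference; deciding which of the two values is right would still need the other volume or case metadata.

```python
def classify_difference(a, b):
    """Bucket the gap between two year values (both assumed to be real years)."""
    diff = abs(a - b)
    if diff == 0:
        return "agree"
    if diff == 1:
        return "off_by_1"
    # Check multiples of 100 first, since they are also multiples of 10.
    if diff % 100 == 0:
        return "off_by_n_hundred"
    if diff % 10 == 0:
        return "off_by_n_ten"
    return "other"


# Invented pairs, one per bucket.
examples = [(1980, 1980), (1980, 1981), (1880, 1980), (1970, 1980), (1977, 1980)]
buckets = [classify_difference(a, b) for a, b in examples]
```

The "other" bucket would catch things like transposed digits, which probably need a separate check.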

ChefAndy avatar Nov 16 '21 22:11 ChefAndy