Use XML values for volume metadata in API
For example, https://api.case.law/v1/volumes/NOTALEPH001226/ shows start_year 0, when the xml_start_year value in the database is 1980.
Values Overview
- spine_[start/end]_year - used in API
  - in 18,614 volumes, both fields have a value of 0
  - in 1,145 volumes, both fields are null
  - 4 volumes have years that don't fit (1|2)[0-9]{3}. The values are 194444, 197, 20001 and 66
- xml_[start/end]_year - used only as a backup to populate start_year
  - in 2,178 volumes, xml_start_year is null
  - in 725 volumes, xml_end_year is null
- [start/end]_year - doesn't really seem to be used anywhere; it seems to be referred to almost exclusively in the models file or the fastcase ingest, where it's used as a backup
  - in 1,147 volumes, both fields are null
  - in 20,240 volumes, start_year is set to 0
  - in 20,246 volumes, end_year is set to 0
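The buckets in the list above (null, zero, or not matching the year pattern) could be reproduced with something like the following. This is only a sketch: `classify_year` and the sample list are hypothetical, and it works on plain Python values, not the actual database models.

```python
import re
from collections import Counter

# Same pattern quoted in the list above; fullmatch() anchors it to the whole
# value, so 194444 and 20001 are rejected rather than matched on a prefix.
YEAR_PATTERN = re.compile(r"(1|2)[0-9]{3}")


def classify_year(value):
    """Return 'null', 'zero', 'malformed', or 'ok' for one stored year value."""
    if value is None:
        return "null"
    if value == 0:
        return "zero"
    if YEAR_PATTERN.fullmatch(str(value)) is None:
        return "malformed"
    return "ok"


# Hypothetical sample: the four malformed values reported above, plus one
# zero, one null, and one plausible year.
sample = [194444, 197, 20001, 66, 0, None, 1980]
counts = Counter(classify_year(v) for v in sample)
```

Running the same classifier over each of the six fields would give per-field counts comparable to the ones listed here.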
Differences
All of the values agree in only 264 volumes. Pairwise matches:
xml_start_year = spine_start_year: 448
xml_end_year = spine_end_year: 585
start_year = spine_start_year: 30,163
end_year = spine_end_year: 30,286
start_year = xml_start_year: 396
end_year = xml_end_year: 585
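For reference, a rough sketch of how pairwise counts like these might be computed. `agreement_counts` and the sample rows are hypothetical, and the sketch assumes that two null values count as a match, which may not be the rule used for the numbers above.

```python
from itertools import combinations

# Field names taken from this issue; swap in the *_end_year fields as needed.
FIELDS = ("start_year", "spine_start_year", "xml_start_year")


def agreement_counts(volumes, fields=FIELDS):
    """Count volumes where each pair of fields is equal, and where all are equal.

    Assumption: None == None is treated as agreement.
    """
    pairs = list(combinations(fields, 2))
    pair_counts = {pair: 0 for pair in pairs}
    all_agree = 0
    for vol in volumes:
        for a, b in pairs:
            if vol.get(a) == vol.get(b):
                pair_counts[(a, b)] += 1
        if len({vol.get(f) for f in fields}) == 1:
            all_agree += 1
    return pair_counts, all_agree


# Hypothetical rows, not real volume data.
volumes = [
    {"start_year": 1980, "spine_start_year": 1980, "xml_start_year": 1980},
    {"start_year": 0, "spine_start_year": 0, "xml_start_year": 1980},
    {"start_year": 1901, "spine_start_year": 1910, "xml_start_year": 1901},
]
pair_counts, all_agree = agreement_counts(volumes)
```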
Digging into spine_end_year and end_year differences
Here are the instances where they disagree and neither value is null or zero. They mostly look like simple keying or OCR errors that could be corrected pretty easily, and automatically, by comparing to the cases. I also included xml_end_year, but it was not used in the calculation.
end_year_spine_end_year_disagreements.csv
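The filter behind that file amounts to roughly the following. This is a sketch with a hypothetical helper name and made-up rows, not the query that actually produced the CSV.

```python
def is_real_disagreement(end_year, spine_end_year):
    """True when both values are present, non-zero, and different."""
    if end_year is None or spine_end_year is None:
        return False
    if end_year == 0 or spine_end_year == 0:
        return False
    return end_year != spine_end_year


# Hypothetical rows illustrating each case.
rows = [
    {"end_year": 1985, "spine_end_year": 1958},  # kept: both set, different
    {"end_year": 0, "spine_end_year": 1958},     # dropped: zero
    {"end_year": None, "spine_end_year": 1958},  # dropped: null
    {"end_year": 1958, "spine_end_year": 1958},  # dropped: values agree
]
disagreements = [
    r for r in rows if is_real_disagreement(r["end_year"], r["spine_end_year"])
]
```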
Potential Remediation
Though this obviously isn't a comprehensive analysis, I can't see any reason to keep more than one of these values. Instances where there's only one value that really is a year wouldn't be worse off regardless of accuracy. I think we could figure out most of the larger differences (e.g. off by n * 10, off by n * 100, and larger) with a partially automated or script-assisted process based on the other volume metadata values or case metadata. The off-by-1 issues might be a little trickier to pin down because the problems are less obvious and probably more varied (may require looking at the PDFs?), and there are a lot of them.
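As a starting point for the script-assisted pass, disagreements could be bucketed by the size of the difference. The bucket names and the `classify_difference` helper are hypothetical; deciding which of the two values is correct would still need the case metadata or the PDFs.

```python
def classify_difference(year_a, year_b):
    """Bucket two non-null, non-zero years by how far apart they are."""
    diff = abs(year_a - year_b)
    if diff == 0:
        return "agree"
    if diff == 1:
        return "off_by_1"
    # Check multiples of 100 first, since they are also multiples of 10.
    if diff % 100 == 0:
        return "off_by_n_times_100"
    if diff % 10 == 0:
        return "off_by_n_times_10"
    return "other"
```

For instance, `classify_difference(1901, 1911)` lands in the n * 10 bucket and `classify_difference(1880, 1980)` in the n * 100 bucket, while a pair such as 1952 and 1959 falls through to "other" for manual review.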