Logging / reporting indexing issues when adding EADs
As discussed on the 9/9 PO call:
How will Arclight report or log issues encountered when trying to index new fixtures (EADs)?
For example, if I'm adding new EADs to the index, particularly in batch, how will Arclight flag the EADs that are malformed or otherwise cause indexing problems so that I can troubleshoot them?
As we discussed on the PO call, there may be a continuum of issues to report. For example:
- EAD is malformed XML
- EAD doesn't validate against EAD2002 schema
- EAD lacks specific elements deemed necessary for Arclight indexing (if we develop a set of minimum required elements for indexing, what guides that decision? DACS?)
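To make the continuum above concrete, here is a rough sketch of a pre-indexing triage pass. It is written in Python with only the standard library, and it is not Arclight or Traject code. The `triage_ead` function and the `REQUIRED_PATHS` list are hypothetical; the required elements shown are placeholders, since the minimum set has not been decided. Schema validation against EAD2002 is left out because the standard library has no XSD validator, so that middle level would need an external tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimum set, for illustration only; the real list is undecided.
REQUIRED_PATHS = ["archdesc", "archdesc/did/unitid", "archdesc/did/unittitle"]


def strip_namespaces(root):
    """Drop namespace prefixes so the same paths match namespaced and plain EADs."""
    for element in root.iter():
        if isinstance(element.tag, str) and "}" in element.tag:
            element.tag = element.tag.split("}", 1)[1]


def triage_ead(xml_text):
    """Return a list of (level, message) tuples; an empty list means no issues found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Level 1: malformed XML. Nothing else can be checked, so stop here.
        line, column = err.position
        return [("malformed", f"line {line}, column {column}: {err}")]

    # Level 2 (schema validation) would go here, using an external validator.

    # Level 3: elements assumed to be required for indexing.
    strip_namespaces(root)
    issues = []
    for path in REQUIRED_PATHS:
        if root.find(path) is None:
            issues.append(("missing_element", path))
    return issues
```

Run over a batch, something like this would give one short list of issues per file, so the EADs that need attention can be found without reading the full indexing output.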
cc @anarchivist
Here's an example of some output where indexing failed for one of the IU finding aids. I'll note that Traject currently outputs the full "record" (in this case, the full finding aid). The most important parts can be seen on lines 1-2 and lines 13198-13228.
I wonder if we can limit the output to the context where the error actually occurs (e.g., I think this is an issue with the collection-level <archdesc /> tag, but I'm not certain).
I think we should have plenty of control over how the error message gets reported; it's only particularly verbose because we're doing error handling at the top level instead of at the level of individual components.
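As a rough illustration of component-level error handling, the Python sketch below catches failures per component and records only a short snippet of the failing element instead of the whole finding aid. It is not Traject's API; `index_components`, the `index_one` callback, and the `max_snippet` limit are hypothetical names chosen for the example.

```python
import re
import xml.etree.ElementTree as ET

# EAD components are <c> or the numbered forms <c01> through <c12>.
COMPONENT_TAG = re.compile(r"c(0[1-9]|1[0-2])?")


def index_components(root, index_one, max_snippet=200):
    """Index each component separately and return a list of failure reports."""
    failures = []
    components = [el for el in root.iter() if COMPONENT_TAG.fullmatch(el.tag)]
    for position, component in enumerate(components):
        try:
            index_one(component)
        except Exception as err:
            # Keep only enough context to locate the problem.
            snippet = ET.tostring(component, encoding="unicode")[:max_snippet]
            failures.append({
                "position": position,
                "id": component.get("id"),
                "error": f"{type(err).__name__}: {err}",
                "snippet": snippet,
            })
    return failures
```

With this shape, one bad component produces a single short report and the remaining components are still processed.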
I also wonder if we could make our indexing code more robust to failure, ensure we always output Solr documents, and include both error and validation information in the produced Solr documents (perhaps with a flag, so they can be easily hidden from end users).
In Stanford's exhibits app, we started tracking some of this information in the database and, in retrospect, I'm not sure it was a great idea. At least if we have documents representing error or validation conditions, we could show them in context a little more easily and provide handy tools for filtering/sorting/etc.
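One possible shape for the "always output a document" idea, sketched in Python rather than in Arclight's own code: each field is extracted independently, failures are collected on the document itself, and a boolean flag marks documents that had problems. The field names `indexing_errors` and `has_indexing_errors` and the `build_document` helper are assumptions made for this example, not existing Arclight fields.

```python
def build_document(component_id, extractors, record):
    """Always return a document; record per-field failures instead of raising."""
    doc = {"id": component_id, "indexing_errors": []}
    for field, extract in extractors.items():
        try:
            doc[field] = extract(record)
        except Exception as err:
            # The field is left out, and the reason is stored on the document.
            doc["indexing_errors"].append(f"{field}: {type(err).__name__}: {err}")
    # Flag that a search front end could filter on to hide or surface these.
    doc["has_indexing_errors"] = bool(doc["indexing_errors"])
    return doc


extractors = {
    "title": lambda record: record["title"],
    "date": lambda record: record["date"],
}
doc = build_document("c1", extractors, {"title": "Papers"})
```

Here `doc` still carries the title, while the missing date shows up as an entry in `indexing_errors` and the flag is set, so the problem travels with the document rather than living in a separate database table.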