Add OAI to CSV toolchain to support migrations from Islandora 7.x to CLAW
https://github.com/Islandora-CLAW/CLAW/issues/452 asks whether we can use Drupal 8's migration API to batch ingest content into CLAW. I've got an MIK toolchain that harvests content from 7.x using OAI-PMH and writes out input for a Migrate Plus ingest. Still working on it while travelling but will have something substantially complete within a couple days.
BTW, doing this work is also a good test of MIK's developer documentation. I'll probably be opening a couple issues resulting from this work.
Related issue: #378.
Got this to the point where you can harvest a collection via OAI-PMH and end up with a CSV file similar to the one prepared by @seth-shaw-unlv at the CLAW issue linked above. Sample .ini file is:
; MIK configuration file for migrating content from an Islandora
; instance to the format required by the Migrate+ module, for ingesting
; into Islandora CLAW.
[SYSTEM]
[CONFIG]
config_id = MIK OAI to CSV toolchain
last_updated_on = "2018-04-16"
last_update_by = "Mark Jordan"
[FETCHER]
class = Oaipmh
oai_endpoint = "http://localhost:8000/oai2"
set_spec = doitest_collection
temp_directory = "/tmp/oai_to_csv_temp"
[METADATA_PARSER]
class = csv\DcToCsv
; The column named in record_key is added to the output CSV and contains the item's unique ID.
record_key = ID
; DC element names are used as CSV column headings.
dc_elements[] = title
dc_elements[] = identifier
dc_elements[] = description
dc_elements[] = format
[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = "/tmp/oai_to_csv_temp"
datastream_ids[] = OBJ
[WRITER]
class = OaipmhCsv
output_file = "/tmp/oai_to_csv_output/metadata.csv"
output_directory = "/tmp/oai_to_csv_output"
; metadata_only = true
[MANIPULATORS]
[LOGGING]
path_to_log = "/tmp/oai_to_csv_output/mik.log"
path_to_manipulator_log = "/tmp/oai_to_csv_output/manipulator.log"
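For anyone unfamiliar with what the [FETCHER] settings translate to on the wire, here is a minimal Python sketch of the OAI-PMH `ListRecords` request that corresponds to `oai_endpoint` and `set_spec` above. It is not MIK's own code. The function name `list_records_url` is hypothetical, and the `oai_dc` metadata prefix is an assumption (the config does not set `metadata_prefix`, and the DcToCsv parser suggests Dublin Core is being harvested).

```python
from urllib.parse import urlencode


def list_records_url(oai_endpoint, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a single set.

    Hypothetical helper for illustration; metadata_prefix defaults to
    oai_dc, which is an assumption about this toolchain.
    """
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return f"{oai_endpoint}?{query}"


# Values taken from the [FETCHER] section of the sample configuration.
url = list_records_url("http://localhost:8000/oai2", "doitest_collection")
print(url)
```

Pasting the printed URL into a browser is a quick way to check that the 7.x OAI endpoint and set are reachable before running the toolchain.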
Here's the CSV file MIK produced from that configuration:
ID,title,identifier,description,format
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic"
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C."
oai%3Adrupal-site.org%3Adoitest_5,"Second test object.",doitest:3,"This record was harvested on a Thursday."
oai%3Adrupal-site.org%3Adoitest_6,"Has DOI?",doitest:6,"This record was harvested on a Thursday.",globe
oai%3Adrupal-site.org%3Adoitest_12,"autogen 6",doitest:12,"This record was harvested on a Thursday.","nonprojected graphic"
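Two things about this output matter to whatever consumes it: the ID column appears to be a URL-encoded OAI identifier, and rows with no `format` value have four fields instead of five. Below is a rough Python sketch of reading the file with both in mind. The helper name `read_rows` is hypothetical, and filling missing trailing columns with an empty string is my assumption about what a consumer would want, not something the toolchain or Migrate Plus prescribes.

```python
import csv
import io
from urllib.parse import unquote

# First two data rows from the CSV output above.
SAMPLE = '''ID,title,identifier,description,format
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic"
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C."
'''


def read_rows(handle):
    """Read the harvested CSV, tolerating rows that lack trailing columns."""
    rows = []
    # restval="" fills columns missing from short rows (assumed behaviour).
    for row in csv.DictReader(handle, restval=""):
        # Decode the percent-encoded OAI identifier for readability.
        row["ID"] = unquote(row["ID"])
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["ID"], "|", row["title"], "|", repr(row["format"]))
```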
Based on discussion at the April 18 CLAW Technical call, I've added an option to output an XML file containing the harvested DC or MODS instead of a CSV file. This output file is not generated by the OAI to CSV toolchain, but by a shutdown hook script used with the existing OAI Islandora toolchain:
[SYSTEM]
[CONFIG]
config_id = MIK OAI toolchain
last_updated_on = "2018-04-18"
last_update_by = "Mark Jordan"
[FETCHER]
class = Oaipmh
oai_endpoint = "http://localhost:8000/oai2"
set_spec = clawcall_collection
metadata_prefix = mods
temp_directory = /tmp/claw_call_tmp
[METADATA_PARSER]
; We don't use the new csv\DcToCsv parser here.
class = mods\OaiToMods
[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = /tmp/claw_call_tmp
datastream_ids[] = OBJ
[WRITER]
; We don't use the new OaipmhCsv writer here.
class = Oaipmh
output_directory = "/tmp/claw_call"
; This is the new shutdown hook script.
shutdownhooks[] = "php extras/scripts/shutdownhooks/concatentate_xml_files.php"
[MANIPULATORS]
[LOGGING]
path_to_log = "/tmp/claw_call/mik.log"
path_to_manipulator_log = "/tmp/claw_call/manipulator.log"
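To give a sense of what a concatenation step like this involves, here is a minimal Python sketch that gathers per-object XML files into one wrapper document. It is not the shutdown hook script itself, which is PHP, and it makes several assumptions: the function name `concatenate_xml_files`, the `records` wrapper element, and the idea that each harvested record sits in its own `.xml` file are all hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def concatenate_xml_files(input_dir, output_file, root_tag="records"):
    """Wrap the root element of every .xml file in input_dir in one document.

    root_tag is a hypothetical wrapper element name. Returns the number of
    files that were combined.
    """
    root = ET.Element(root_tag)
    for path in sorted(Path(input_dir).glob("*.xml")):
        root.append(ET.parse(str(path)).getroot())
    ET.ElementTree(root).write(str(output_file), encoding="utf-8", xml_declaration=True)
    return len(root)


# Small demonstration using throwaway files instead of a real harvest.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "harvested"
    input_dir.mkdir()
    (input_dir / "a.xml").write_text("<mods><title>First</title></mods>", encoding="utf-8")
    (input_dir / "b.xml").write_text("<mods><title>Second</title></mods>", encoding="utf-8")
    # Written outside input_dir so a rerun does not pick up its own output.
    output_file = Path(tmp) / "combined.xml"
    count = concatenate_xml_files(input_dir, output_file)
    titles = [t.text for t in ET.parse(str(output_file)).getroot().iter("title")]

print(count, titles)
```

With real MODS or DC records, ElementTree may rewrite namespace prefixes (for example to `ns0`) unless they are registered first, so treat this as an outline of the approach only.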