Add OAI to CSV toolchain to support migrations from Islandora 7.x to CLAW

Open mjordan opened this issue 6 years ago • 4 comments

https://github.com/Islandora-CLAW/CLAW/issues/452 asks whether we can use Drupal 8's migration API to batch ingest content into CLAW. I've got an MIK toolchain that harvests content from 7.x using OAI-PMH and writes out input for a Migrate Plus ingest. I'm still working on it while travelling, but will have something substantially complete within a couple of days.

mjordan avatar Apr 12 '18 18:04 mjordan

BTW, doing this work is also a good test of MIK's developer documentation. I'll probably be opening a couple of issues as a result.

mjordan avatar Apr 12 '18 18:04 mjordan

Related issue: #378.

mjordan avatar Apr 12 '18 18:04 mjordan

Got this to the point where you can harvest a collection via OAI-PMH and end up with a CSV file similar to the one prepared by @seth-shaw-unlv at the CLAW issue linked above. A sample .ini file is:

; MIK configuration file for migrating content from an Islandora
; instance to the format required by the Migrate+ module, for ingesting
; into Islandora CLAW.

[SYSTEM]

[CONFIG]
config_id = MIK OAI to CSV toolchain
last_updated_on = "2018-04-16"
last_update_by = "Mark Jordan"

[FETCHER]
class = Oaipmh
oai_endpoint = "http://localhost:8000/oai2"
set_spec = doitest_collection
temp_directory = "/tmp/oai_to_csv_temp"

[METADATA_PARSER]
class = csv\DcToCsv
; The column named in record_key is added to the output CSV and holds each item's unique ID.
record_key = ID
; DC element names are used as CSV column headings.
dc_elements[] = title
dc_elements[] = identifier
dc_elements[] = description
dc_elements[] = format

[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = "/tmp/oai_to_csv_temp"
datastream_ids[] = OBJ

[WRITER]
class = OaipmhCsv
output_file = "/tmp/oai_to_csv_output/metadata.csv"
output_directory = "/tmp/oai_to_csv_output"
; metadata_only = true

[MANIPULATORS]

[LOGGING]
path_to_log = "/tmp/oai_to_csv_output/mik.log"
path_to_manipulator_log = "/tmp/oai_to_csv_output/manipulator.log"
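
For anyone unfamiliar with OAI-PMH, the [FETCHER] settings above correspond to a standard ListRecords request. The following is only a minimal Python sketch of that request URL, not MIK's own code: the function name `list_records_url` is hypothetical, and the `oai_dc` default is an assumption, since this config does not set `metadata_prefix`.

```python
from urllib.parse import urlencode


def list_records_url(oai_endpoint, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a single set.

    Hypothetical helper for illustration; MIK's Oaipmh fetcher may
    assemble its requests differently.
    """
    query = urlencode({
        "verb": "ListRecords",              # OAI-PMH verb for harvesting full records
        "metadataPrefix": metadata_prefix,  # assumed default, not taken from the config
        "set": set_spec,                    # limits the harvest to one collection
    })
    return f"{oai_endpoint}?{query}"


# Values copied from the [FETCHER] section of the sample config.
url = list_records_url("http://localhost:8000/oai2", "doitest_collection")
print(url)
```

Pasting the printed URL into a browser is a quick way to check that the endpoint and set spec are right before running the toolchain.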

Here's the resulting CSV file:

ID,title,identifier,description,format
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic"
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C."
oai%3Adrupal-site.org%3Adoitest_5,"Second test object.",doitest:3,"This record was harvested on a Thursday."
oai%3Adrupal-site.org%3Adoitest_6,"Has DOI?",doitest:6,"This record was harvested on a Thursday.",globe
oai%3Adrupal-site.org%3Adoitest_12,"autogen 6",doitest:12,"This record was harvested on a Thursday.","nonprojected graphic"
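
To illustrate how the `dc_elements[]` settings relate to the CSV columns, here is a rough Python sketch of the mapping. It is not the csv\DcToCsv parser itself. The function `dc_to_row` is hypothetical, the sample XML is reconstructed from the doitest:6 row above rather than copied from a real harvest, and taking only the first value of a repeated element is an assumption. The ID values in the CSV look like URL-encoded OAI identifiers, so the sketch encodes the identifier the same way.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import quote

DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Sample oai_dc record reconstructed from the doitest:6 CSV row (illustrative only).
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Has DOI?</dc:title>
  <dc:identifier>doitest:6</dc:identifier>
  <dc:description>This record was harvested on a Thursday.</dc:description>
  <dc:format>globe</dc:format>
</oai_dc:dc>"""


def dc_to_row(oai_identifier, dc_xml, dc_elements, record_key="ID"):
    """Map one oai_dc record to a dict keyed by CSV column name."""
    root = ET.fromstring(dc_xml)
    # URL-encode the OAI identifier, matching the ID column in the CSV above.
    row = {record_key: quote(oai_identifier, safe="")}
    for name in dc_elements:
        values = [el.text or "" for el in root.iter(DC_NS + name)]
        # Assumption: keep only the first value when an element repeats.
        row[name] = values[0] if values else ""
    return row


dc_elements = ["title", "identifier", "description", "format"]
row = dc_to_row("oai:drupal-site.org:doitest_6", SAMPLE_DC, dc_elements)

# Write a header plus the single row to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID"] + dc_elements)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Python's csv module quotes fields differently from the toolchain's output, so the printed text will not match the file above character for character.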

mjordan avatar Apr 12 '18 20:04 mjordan

Based on discussion at the April 18 CLAW Technical call, I've added an option to output an XML file containing the harvested DC or MODS instead of a CSV file. This output file is not generated by the OAI to CSV toolchain, but by a shutdown hook script used with the existing OAI Islandora toolchain:

[SYSTEM]

[CONFIG]
config_id = MIK OAI toolchain
last_updated_on = "2018-04-18"
last_update_by = "Mark Jordan"

[FETCHER]
class = Oaipmh
oai_endpoint = "http://localhost:8000/oai2"
set_spec = clawcall_collection
metadata_prefix = mods
temp_directory = /tmp/claw_call_tmp

[METADATA_PARSER]
; We don't use the new csv\DcToCsv parser here.
class = mods\OaiToMods

[FILE_GETTER]
class = OaipmhIslandoraObj
temp_directory = /tmp/claw_call_tmp
datastream_ids[] = OBJ

[WRITER]
; We don't use the new OaipmhCsv writer here.
class = Oaipmh
output_directory = "/tmp/claw_call"
; This is the new shutdown hook script.
shutdownhooks[] = "php extras/scripts/shutdownhooks/concatentate_xml_files.php"

[MANIPULATORS]

[LOGGING]
path_to_log = "/tmp/claw_call/mik.log"
path_to_manipulator_log = "/tmp/claw_call/manipulator.log"
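
The general idea of combining per-object XML files into one document can be sketched in Python as follows. This is not the logic of the shutdown hook script, which is PHP and may work differently. The function name, the `records` wrapper element, the flat `*.xml` directory layout, and the sample MODS files are all assumptions made for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

MODS_NS = "http://www.loc.gov/mods/v3"


def concatenate_xml_files(input_dir, output_file, wrapper_tag="records"):
    """Wrap the root element of every *.xml file in input_dir in one document.

    Returns the number of files combined. The wrapper element name is a
    hypothetical choice, not something defined by MIK.
    """
    out = Path(output_file)
    wrapper = ET.Element(wrapper_tag)
    for path in sorted(Path(input_dir).glob("*.xml")):
        if path.resolve() == out.resolve():
            continue  # do not fold a previous combined file back into itself
        wrapper.append(ET.parse(str(path)).getroot())
    ET.ElementTree(wrapper).write(str(out), encoding="utf-8", xml_declaration=True)
    return len(wrapper)


# Demonstration with two tiny, made-up MODS files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    for n, title in enumerate(["First", "Second"], start=1):
        (src / f"{n}.xml").write_text(
            f'<mods xmlns="{MODS_NS}"><titleInfo><title>{title}</title></titleInfo></mods>',
            encoding="utf-8",
        )
    combined = src / "combined.xml"
    count = concatenate_xml_files(src, combined)
    combined_root = ET.parse(str(combined)).getroot()
    titles = [t.text for t in combined_root.iter(f"{{{MODS_NS}}}title")]

print(count, titles)
```

Parsing each file and re-serializing, rather than appending raw text, keeps the combined file well-formed even when the inputs carry their own XML declarations.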

mjordan avatar Apr 19 '18 14:04 mjordan