HOS-MetadataTransformations
DEPRECATED - no longer actively maintained.
Automated workflow for harvesting, transforming and indexing of metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack.
Use case
- Harvest metadata in different standards (Dublin Core, DataCite, ...) from multiple OAI-PMH endpoints
- Transform harvested data with specific rules for each source to produce normalized and enriched data
- Load transformed data into a Solr search index (which serves as a backend for a discovery system, e.g. HOS-TYPO3-find)
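To make the harvesting step more concrete: OAI-PMH requests are plain HTTP GET requests that carry a `verb` parameter. The helper below is a small sketch and not part of this repository (the workflow itself harvests with metha); the function name `oai_url` is an assumption made for illustration. It only builds request URLs, so it can be tried without network access.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository: build an OAI-PMH request URL.
# $1 = base URL of the endpoint, $2 = OAI-PMH verb, $3 = optional metadataPrefix
oai_url() {
    if [ -n "${3:-}" ]; then
        printf '%s?verb=%s&metadataPrefix=%s\n' "$1" "$2" "$3"
    else
        printf '%s?verb=%s\n' "$1" "$2"
    fi
}

# Usage (needs network access, so it is left commented out):
# curl -s "$(oai_url http://ediss.sub.uni-hamburg.de/oai2/oai2.php Identify)"
# curl -s "$(oai_url http://ediss.sub.uni-hamburg.de/oai2/oai2.php ListRecords oai_dc)"
```

`Identify` and `ListRecords` are standard OAI-PMH verbs, and `oai_dc` is the metadata prefix for Dublin Core that every OAI-PMH endpoint has to support.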
Data Flow
Source: flowchart.mmd (try mermaid live editor)
- Simple automated cronjob-ready workflow: one bash script for each data source and an additional script to run all scripts in parallel
- Cache for incremental OAI harvesting (via metha)
- Graphical user interface (OpenRefine) for exploring the data, creating the transformation rules and checking the results; it is accessible in the local network via a web browser; data will be updated automatically
- Results are made available in the preinstalled local Solr core or in an external Solr core. You can set (and reset) the Solr schema via a bash script.
- Data is stored in the filesystem in common formats (xml, tsv) so you can extend the workflow with command line tools to further manipulate the data.
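As an illustration of extending the workflow with command line tools, the sketch below counts how often each value occurs in one column of a tab-separated file. The function name `count_by_column` and the column name used in the usage line are assumptions; the actual file locations and column names depend on your data source.

```shell
#!/bin/sh
# Hypothetical helper: count how often each value occurs in one column of a
# tab-separated file that has a header row.
# $1 = path to the TSV file, $2 = column name as it appears in the header row
count_by_column() {
    awk -F '\t' -v col="$2" '
        # header row: remember the index of the requested column
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        # data rows: count non-empty values of that column
        c && $c != "" { n[$c]++ }
        END { for (v in n) printf "%d\t%s\n", n[v], v }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage (file and column name are placeholders):
# count_by_column path/to/yourdatasource.tsv language
```

The output is sorted by descending count, so the most frequent values appear first.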
System requirements
- minimum: 2GB RAM
- recommended: 8GB RAM (to run all scripts in parallel)
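A quick way to compare a machine against these figures is to read `MemTotal` from `/proc/meminfo`. The following is a minimal sketch; the function names and the 10% tolerance are assumptions (Linux usually reports slightly less than the nominal RAM size, so a strict comparison would misclassify a nominal 2GB or 8GB machine).

```shell
#!/bin/sh
# Classify an amount of RAM (in kB, as reported by /proc/meminfo) against the
# requirements above: 2 GB minimum, 8 GB recommended.
# Assumption: allow 10% slack, because MemTotal is typically a bit below the
# nominal RAM size.
ram_tier() {
    kb=$1
    recommended_kb=$((8 * 1024 * 1024 * 9 / 10))
    minimum_kb=$((2 * 1024 * 1024 * 9 / 10))
    if [ "$kb" -ge "$recommended_kb" ]; then
        echo "recommended"
    elif [ "$kb" -ge "$minimum_kb" ]; then
        echo "minimum"
    else
        echo "insufficient"
    fi
}

# Read the total RAM in kB on Linux.
total_ram_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Usage: ram_tier "$(total_ram_kb)"
```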
tested with Ubuntu 16.04 LTS and Ubuntu 18.04 LTS
install git:
sudo apt install git
clone this git repository:
git clone https://github.com/subhh/HOS-MetadataTransformations.git
cd HOS-MetadataTransformations
install openjdk-8-jre-headless, zip, curl, jq, metha 1.29, OpenRefine 3.2 beta, openrefine-client 0.3.4 and Solr 7.3.1:
sudo ./install.sh
Configure Solr schema:
Data will be available after first run at:
- Solr admin: http://localhost:8983/solr/#/hos
- Solr browse: http://localhost:8983/solr/hos/browse
- OpenRefine: http://localhost:3333
Run workflow with data source "uhhediss" and load data into local Solr (-s) and local OpenRefine service (-d):
bin/uhhediss.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
Run workflow with all data sources in parallel and load data into local Solr (-s) and local OpenRefine service (-d):
./run.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
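Running all data sources in parallel can be pictured as starting every per-source script as a background job and then waiting for all of them. The following is a simplified sketch of that pattern, not the actual contents of run.sh; the function name `run_all` and its directory argument are assumptions.

```shell
#!/bin/sh
# Simplified sketch, not the repository's run.sh: start every *.sh script in a
# directory as a background job, pass the remaining arguments through to each
# script, then wait for all jobs. Returns non-zero if any script failed.
# The scripts are started with "sh" to keep the sketch self-contained; real
# scripts would normally be invoked via their own shebang line.
run_all() {
    dir=$1
    shift
    pids=""
    for script in "$dir"/*.sh; do
        [ -f "$script" ] || continue
        sh "$script" "$@" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "failed jobs: $failed" >&2
    [ "$failed" -eq 0 ]
}

# Usage (placeholders): run_all bin -s http://localhost:8983/solr/hos -d http://localhost:3333
```

Since every job runs at the same time, memory use adds up across data sources, which is why more RAM is recommended for the parallel run.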
Run workflow with all data sources and load data into two external Solr cores (-s) and external OpenRefine service (-d):
./run.sh -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -s https://openscience.hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80
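The example above passes -s twice. One common way to support a repeatable option in a shell script is to collect its values in a list while parsing with `getopts`. The sketch below shows that technique; it is an assumption about how such parsing could look, not the code used in this repository, and the names `parse_opts`, `solr_urls` and `openrefine_url` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of option handling (not the repository's actual code):
# -s may be given several times (one Solr core URL each), -d is given once.
parse_opts() {
    solr_urls=""        # space-separated list of all -s values
    openrefine_url=""   # value of -d
    OPTIND=1
    while getopts "s:d:" opt; do
        case "$opt" in
            s) solr_urls="${solr_urls:+$solr_urls }$OPTARG" ;;
            d) openrefine_url=$OPTARG ;;
            *) return 1 ;;
        esac
    done
}

# Usage:
# parse_opts -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -s https://openscience.hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80
# for url in $solr_urls; do echo "would load into $url"; done
```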
Solr authentication
If your external Solr is secured with username/password (Basic Authentication Plugin), you may provide the credentials by copying cfg/solr/credentials.example to cfg/solr/credentials and filling in username and password.
cp cfg/solr/credentials.example cfg/solr/credentials
nano cfg/solr/credentials
chmod 400 cfg/solr/credentials
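The `chmod 400` step keeps the credentials readable by the owner only. As a small safeguard, a script could verify this before using the file. The check below is a sketch under the assumption that GNU `stat` is available (as on Ubuntu); the function name `credentials_are_private` is hypothetical.

```shell
#!/bin/sh
# Return 0 if the file exists and grants no permissions to group or others
# (for example mode 400 or 600), otherwise return 1.
# Uses GNU stat (-c), as found on Ubuntu.
credentials_are_private() {
    [ -f "$1" ] || return 1
    mode=$(stat -c '%a' "$1") || return 1
    case "$mode" in
        [0-7]00) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# credentials_are_private cfg/solr/credentials || echo "please run: chmod 400 cfg/solr/credentials" >&2
```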
Example of a daily cronjob at 00:35 that runs the workflow with all data sources, loads data into an external Solr core (-s) and an external OpenRefine service (-d), and deletes files older than 7 days (-x):
command="$(readlink -f run.sh) -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80 -x 7"
job="35 0 * * * $command"
cat <(fgrep -i -v "$command" <(crontab -l)) <(echo "$job") | crontab -
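The last line removes any existing crontab entry that contains the command and then appends the new job, so running it repeatedly does not create duplicate entries. The sketch below expresses the same logic as a filter on standard input, which makes it possible to preview the result without touching the real crontab. The function name `merge_job` is an assumption; `grep -F` is used as the equivalent of `fgrep`.

```shell
#!/bin/sh
# Same idea as the crontab one-liner above, written as a filter: read an
# existing crontab on stdin, drop every line that contains the command
# (fixed string, case-insensitive), then append the new job line.
# $1 = command, $2 = full job line (schedule + command)
merge_job() {
    # grep exits with status 1 when it selects no lines; that is not an error here
    grep -F -i -v -- "$1" || true
    printf '%s\n' "$2"
}

# Preview the resulting crontab without installing it:
# crontab -l | merge_job "$command" "$job"
```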
Add a data source
- Step 1: Harvest the new OAI-PMH endpoint and load the data into OpenRefine. Example for a new data source called yourdatasource with OAI-PMH endpoint http://ediss.sub.uni-hamburg.de/oai2/oai2.php:
./load-new-data.sh -c yourdatasource -i http://ediss.sub.uni-hamburg.de/oai2/oai2.php
- Step 2: Explore the data in OpenRefine at http://localhost:3333 (in the project created in Step 1) and create transformations until the data looks fine and suits the Solr schema.
- Step 3: Extract the OpenRefine project history in JSON format and save it in a subdirectory of cfg/.
- Step 4: Copy an existing bash shell script (e.g. bin/uhhediss.sh) to bin/yourdatasource.sh and edit line 17 (codename of the source, e.g. yourdatasource) and line 18 (URL of the OAI-PMH endpoint, e.g. http://ediss.sub.uni-hamburg.de/oai2/oai2.php). If you load a big dataset, you may need to allocate more memory to OpenRefine (line 19).
cp -a bin/uhhediss.sh bin/yourdatasource.sh
gedit bin/yourdatasource.sh
- Step 5: Run your shell script (or full workflow)
bin/yourdatasource.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
- Step 6: Check the results in OpenRefine at http://localhost:3333 (in the project for your data source) and in Solr (query: collectionId:yourdatasource)
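The Solr check from Step 6 can also be done from the command line. The helper below is a sketch that builds a Solr select URL for the query `collectionId:yourdatasource`; the function name `solr_count_url` is hypothetical, and `rows=0` is used so that Solr returns only the number of matching documents.

```shell
#!/bin/sh
# Hypothetical helper: build a Solr select URL that asks only for the number
# of documents of one data source (rows=0 returns counts but no documents).
# $1 = Solr core URL, $2 = codename of the data source
solr_count_url() {
    printf '%s/select?q=collectionId:%s&rows=0&wt=json\n' "$1" "$2"
}

# Usage (needs a running Solr; jq is installed by install.sh):
# curl -s "$(solr_count_url http://localhost:8983/solr/hos yourdatasource)" | jq '.response.numFound'
```

A count of 0 after a successful run would suggest that the records were not indexed or that the codename in the query does not match the one set in your script.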