climate
PDF processing
Can you point me to the part of ContentMine, or the instructions, for processing and extracting PDF parts? Also, is there an example of a source document and its outputs?
I am asking as some colleagues have a PDF document set that they need to extract and enrich components from.
ami-pdf will read the PDFs in bulk and split into characters and images. After that we need to know the application.
Try http://discuss.contentmine.org/t/cm-ucl-ii-semantic-content-enhancement-of-table-data/396/2 for an overview of extracting tables
You need to be able to run the latest ami-pdf, which is available in the ami-jars repo: https://github.com/petermr/ami-jars. There is no simple tutorial; for text only I would use GROBID, for tables and diagrams AMI.
In haste - more later.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
will have a go, much appreciated
Much of this is available through Java tests on petermr/normami, now moved to petermr/ami3. ami3 has the tests but not the data. It's image-based, so probably of limited value. Back in 20 mins.
How many documents do you have? The first step is to turn them into a CProject. Put them in a directory, e.g. simon20190919; then ami-makeproject gives the help, and ami-makeproject -p simon20190919 -f pdf should do it. Please record everything here, including the new CProject.
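The setup described above, as a minimal shell sketch (the directory name and PDF source path are placeholder assumptions; ami-makeproject itself needs the ami-jars bin directory on your PATH, so it is left as a comment here):

```shell
# Stage a directory of PDFs as input for a CProject.
# "simon20190919" and the source path are example names only.
mkdir -p simon20190919
cp /path/to/your/pdfs/*.pdf simon20190919/ 2>/dev/null || true
# Then, with ami-jars on PATH:
#   ami-makeproject                          # prints the help
#   ami-makeproject -p simon20190919 -f pdf  # builds the CProject
ls simon20190919
```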
25k docs I think, very mixed over multiple decades :-) I'll send you a sample doc and quickly describe what we want to extract. And thank you for your time. If you can give your view on the doc I send, it might shortcut things a little. You can just say 'yay' or 'nay' on whether we're going to have any luck.
Here's a stack of ami commands
#! /bin/sh
# your path should include the /bin directory of the appassembler distrib, e.g.
# ami-forestplot => /Users/pm286/workspace/cmdev/normami/target/appassembler/bin/ami-forestplot
# edit this to your own directory
# STATA="/Users/pm286/projects/forestplots/stataforestplots"
# STATA="/Users/pm286/projects/forestplots/_stataok"
WORKSPACE=$HOME/workspace/
FOREST_TOP=$WORKSPACE/projects/forestplots
MID_DIR=test20190804
FOREST_MID=$FOREST_TOP/$MID_DIR
LOW_DIR=_stataok
FOREST_DIR=$FOREST_MID/$LOW_DIR
CPROJECT=$FOREST_DIR
CTREE_NAME=PMC6127950
#CTREE_NAME=PMC5882397
CTREE=$CPROJECT/$CTREE_NAME
echo CTREE $CTREE
while getopts p:t: option
do
case "${option}" in
p) CPROJECT=${OPTARG};;
t) CTREE=${OPTARG};;
esac
done
# choose the first SOURCE to run a single CTree, the second to run a CProject (long).
# Comment in the one you want
SOURCE=" -t $CTREE"
# SOURCE=" -p $CPROJECT"
echo $CTREE
ls $CTREE
# images
RAW=raw
RAW230DS=raw_thr_230_ds
RAWS4230DS=raw_s4_thr_230_ds
#subimages
# regions of image
HEADER=header
BODY=body
LTABLE=ltable
RTABLE=rtable
SCALE=scale
HEADERS120D=${HEADER}"_s4_thr_120_ds"
LTABLES120D=${LTABLE}"_s4_thr_120_ds"
RTABLES120D=${RTABLE}"_s4_thr_120_ds"
SLEEP1=1
SLEEP5=5
# make project from a directory (CPROJECT) containing PDFs.
# a no-op here as EuPMC has already done this
ami-makeproject -p $CPROJECT --rawfiletypes pdf
# convert PDFs to CTrees
ami-pdf $SOURCE
# image processing at 3 threshold levels (later will try to make this an AMI loop)
ami-image $SOURCE --sharpen sharpen4 --threshold 150 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 230 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 240 --despeckle true
echo "===============Finished AmiImage============="
sleep $SLEEP1
# run OCR both types
ami-ocr $SOURCE --gocr /usr/local/bin/gocr --extractlines gocr --forcemake
ami-ocr $SOURCE --tesseract /usr/local/bin/tesseract --extractlines hocr --html false --forcemake
echo "===============Finished AmiOcr============="
sleep $SLEEP1
# extract the pixels and project onto axes to get subimage regions
# further project the scale subimage (y(2)) to get the tick values
# in this case do it for the threshold 230 version only
# the stylesheet location (xsl) is hard-coded into the distrib, but it could be
# more general.
# This *generates* raw_thr_230_ds/template.xml. Its variables (e.g. $RAW.$HEADER) are specified
# in the stylesheet, and values are computed by applying ami-pixel to the images
ami-pixel $SOURCE --projections --yprojection 0.8 --xprojection 0.5 \
--minheight -1 --rings -1 --islands 0 \
--inputname $RAW230DS \
--subimage statascale y 2 delta 10 projection x \
--templateinput $RAW230DS/projections.xml \
--templateoutput template.xml \
--templatexsl /org/contentmine/ami/tools/stataTemplate.xsl
echo "===============Finished AmiPixel============="
sleep $SLEEP5
# use the generated template.xml in each CTree/*/image*/raw_thr_230_ds/ directory to segment the image
# this will create subimages $RAW.$HEADER, $RAW.$BODY.$LTABLE, raw.body.graph, $RAW.$BODY.$RTABLE and raw.scale
# these subimages will be written to *.png in the CTree/*/image* directory
ami-forestplot $SOURCE --template $RAW230DS/template.xml
echo "===============Finished AmiForest============="
sleep $SLEEP5
#now re-run ami-image to enhance each subimage separately
ami-image $SOURCE --inputname $RAW.$HEADER --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$LTABLE --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$RTABLE --sharpen sharpen4 --threshold 120 --despeckle true
echo "===============Finished Sharpen Threshold============="
sleep $SLEEP5
# and rerun tesseract on each subimage (suspect Tesseract gets confused by the whole
# image, including the graph and lines).
ami-ocr $SOURCE --inputname $RAW.$HEADERS120D --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr
echo "===============Finished Tesseract ============="
sleep $SLEEP5
ami-ocr $SOURCE --inputname $RAW.$HEADERS120D --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr
echo "===============Finished GOCR ============="
sleep $SLEEP5
Don't send it; add it in a new folder here, unless there are copyright issues.
from the 25k, try to select ca. 20 which:
- are newish (old docs are problematic, but maybe that is the point)
- are born digital if possible
- are OPEN (we cannot have takedowns)
- show the range of problems
- make clear what needs to be extracted
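One quick way to pull such a sample, sketched here against placeholder files (the scratch directories stand in for the real 25k collection):

```shell
# Demo in scratch directories: create 100 placeholder PDFs,
# then copy a random sample of 20 into a trial directory.
SRC=$(mktemp -d)
DEST=$(mktemp -d)
for i in $(seq 1 100); do : > "$SRC/doc$i.pdf"; done
ls "$SRC"/*.pdf | shuf -n 20 | xargs -I{} cp {} "$DEST"/
echo "sampled: $(ls "$DEST" | wc -l)"   # prints: sampled: 20
```

Eyeballing the sample by hand is still needed for the "range of problems" criterion; shuf only handles the volume.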
I think there are copyright questions, yes, but I'll check first.
If it's publicly visible I'm happy; we did that with phylotrees. We are allowed to extract data if we can legally read it somewhere; it doesn't have to be CC BY. Also, I don't think stopping climate research is good PR.
Happy to talk on phone/Skype if it helps.
If you have 100-year-old records as bitmaps I am happy to try those, but they must be homogeneous in type.
I need to wait for colleagues to get docs :-)
see table extraction at http://discuss.contentmine.org/t/ami-eppi-cm-ucl-table-extraction-project/322/14
Even one doc would be a useful start; I can tackle it in the next 1.5 hours.
Would like to show something for my school visit in 10 days.
https://edocs.tib.eu/files/e01fb19/1676027963.pdf has the licence https://creativecommons.org/licenses/by/3.0/de. I'll look for some more; it might take a few minutes.
http://creativecommons.org/licenses/by/4.0/, https://edocs.tib.eu/files/e01fb19/1666373214.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1670198502.pdf
https://creativecommons.org/licenses/by-nc-nd/4.0/, https://edocs.tib.eu/files/e01fb19/1667335782.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1665279796.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/166506773X.pdf
Some more for testing. Sorry, I could deliver some dozens more, but I hope that's enough for a trial.
I have processed your first PDF and uploaded the results. It extracts the bitmaps and characters as SVG. I will revisit my SVG-to-text conversion.
See if you can make some sense of it. The SVG is in pages.
The next 5 don't seem very relevant to climate change? It's not clear what would be extracted.
I want to stick to climate and specific types of information, e.g. tables/graphs vs. time.
We'll assemble a small climate change collection; it will take a few days though. I will also get hold of an example list of items we want to extract. The context is wanting to make final research reports more visible, so as to make them part of the research corpus in a more usable way. The climate change related reports would sit within the bigger body of research reports. If you can share back the current SVG outputs, that would be great.
Here is a set of research reports that are CC licensed. This is not a priority, but it would be interesting to know some time whether entities like 'Abstract', 'Introduction' and 'Conclusion' can be extracted. The context is making German research reports more visible and usable, and obviously helping future research. The ambition is to make the national collection easier to use; and if it can be done for one collection, why not more.
Files
http://creativecommons.org/licenses/by-sa/3.0/de, https://edocs.tib.eu/files/e01fb19/1676027963.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0, https://edocs.tib.eu/files/e01fb18/1028076258.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0, https://edocs.tib.eu/files/e01fb18/1028076134.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0, https://edocs.tib.eu/files/e01fb18/1027897045.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0, https://edocs.tib.eu/files/e01fb18/1027879500.pdf
http://creativecommons.org/licenses/by-nd/4.0/deed/, https://edocs.tib.eu/files/e01fn18/1018823859.pdf
https://creativecommons.org/licenses/by-nd/4.0/deed.en, https://edocs.tib.eu/files/e01fn17/893648477.pdf
http://creativecommons.org/licenses/by/4.0/, https://edocs.tib.eu/files/e01fb17/881442836.pdf
http://creativecommons.org/licenses/by-nd/3.0/de/, https://edocs.tib.eu/files/e01fn16/864300328.pdf
http://creativecommons.org/licenses/by-nd/3.0/de/, http://edok01.tib.uni-hannover.de/edoks/e01fn17/857413724.pdf
http://creativecommons.org/licenses/by-nc-nd/3.0/de/, https://edocs.tib.eu/files/e01fn13/739959433.pdf
http://creativecommons.org/licenses/by-nc-nd/3.0/de/, https://edocs.tib.eu/files/e01fn13/719349311.pdf
Oh, some more context :-) https://twitter.com/Lambo/status/1176901945249939463
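On the Abstract/Introduction/Conclusion question: once a report is flattened to plain text (e.g. with pdftotext, or from AMI's character/SVG output), a rough first pass can be a simple heading scan. A minimal sketch on an inline sample (the sample text and the regex are illustrative assumptions, not AMI functionality):

```shell
# Write a tiny sample "report" and scan it for section headings.
cat > report.txt <<'EOF'
Abstract
We study the effect of X on Y.
1 Introduction
Background and motivation.
5 Conclusion
Summary of findings.
EOF
# -n: show line numbers; the pattern allows an optional numeric prefix
grep -nE '^[0-9]*[[:space:]]*(Abstract|Introduction|Conclusion)' report.txt
```

Real reports would need a larger heading vocabulary (including German section names such as "Zusammenfassung" and "Einleitung") and some tolerance for OCR noise.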
I am back in Cambridge so can start working on this.