PDF processing

Open mrchristian opened this issue 6 years ago • 23 comments

Can you point me to the part of ContentMine, or the instructions, for processing and extracting PDF parts? Also, is there an example of a source document and the outputs?

I am asking as some colleagues have a PDF document set that they need to extract and enrich components from.

mrchristian avatar Sep 19 '19 09:09 mrchristian

ami-pdf will read the PDFs in bulk and split into characters and images. After that we need to know the application.

Try http://discuss.contentmine.org/t/cm-ucl-ii-semantic-content-enhancement-of-table-data/396/2 for an overview of extracting tables

You need to be able to run the latest ami-pdf, which is available in the ami-jars repo (https://github.com/petermr/ami-jars). There is no simple tutorial. For text only I would use GROBID; for tables and diagrams, AMI.

In haste - more later.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr avatar Sep 19 '19 09:09 petermr

will have a go, much appreciated

mrchristian avatar Sep 19 '19 09:09 mrchristian

Much of this is available through the Java tests in petermr/normami, now moved to petermr/ami3. ami3 has the tests but not the data. It's image-based, so probably of limited value. Back in 20 mins.

petermr avatar Sep 19 '19 10:09 petermr

How many documents do you have? The first step is to turn them into a CProject: put them in a directory, e.g. simon20190919. Then ami-makeproject gives the help, and ami-makeproject -p simon20190919 -f pdf should do it. Please record everything here, including the new CProject.

petermr avatar Sep 19 '19 10:09 petermr
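
The CProject step described above can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: PDF_SOURCE is a hypothetical location for the incoming PDFs, and the ami-makeproject call is guarded so the script still stages the files when the AMI tools are not on the PATH.

```shell
#!/bin/sh
# Sketch only: stage PDFs in a project directory, then run ami-makeproject.
# PDF_SOURCE is a hypothetical location; point it at your own files.
PROJECT=simon20190919
PDF_SOURCE=${PDF_SOURCE:-./incoming_pdfs}

mkdir -p "$PROJECT"

# Copy any PDFs found in the source directory (does nothing if there are none).
for pdf in "$PDF_SOURCE"/*.pdf; do
    if [ -f "$pdf" ]; then
        cp "$pdf" "$PROJECT"/
    fi
done

# Only call ami-makeproject if it is actually installed.
if command -v ami-makeproject >/dev/null 2>&1; then
    ami-makeproject -p "$PROJECT" -f pdf
else
    echo "ami-makeproject not found on PATH; staged PDFs in $PROJECT only" >&2
fi
```

Keeping the staging step separate from the AMI call makes it easy to record in the thread exactly which files went into the project.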

25k docs I think, very mixed over multiple decades :-) I'll send you a sample doc and quickly describe what we want to extract. And thank you for your time. If you can give your view on the doc I send it might shortcut things a little. You can just say 'yay', 'nay' if we're going to have any luck.

mrchristian avatar Sep 19 '19 11:09 mrchristian

Here's a stack of ami commands

#! /bin/sh

# your path should include the /bin directory of the appassembler distrib, e.g.
# ami-forestplot => /Users/pm286/workspace/cmdev/normami/target/appassembler/bin/ami-forestplot

# edit this to your own directory
# STATA="/Users/pm286/projects/forestplots/stataforestplots"
# STATA="/Users/pm286/projects/forestplots/_stataok"
WORKSPACE=$HOME/workspace/
FOREST_TOP=$WORKSPACE/projects/forestplots
MID_DIR=test20190804
FOREST_MID=$FOREST_TOP/$MID_DIR
LOW_DIR=_stataok
FOREST_DIR=$FOREST_MID/$LOW_DIR

CPROJECT=$FOREST_DIR
CTREE_NAME=PMC6127950
#CTREE_NAME=PMC5882397
CTREE=$CPROJECT/$CTREE_NAME

echo CTREE $CTREE

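# -p and -t override the defaults above; CTREE is not recomputed from a new
# CPROJECT, so pass -t as well if you change the project
while getopts p:t: option
do
    case "${option}" in
        p) CPROJECT=${OPTARG};;
        t) CTREE=${OPTARG};;
    esac
done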


# choose the first SOURCE to run a single CTree, the second to run a CProject (long).
# Uncomment the one you want
SOURCE=" -t $CTREE"
# SOURCE=" -p $CPROJECT"
echo $CTREE
ls $CTREE

# images 
RAW=raw
RAW230DS=raw_thr_230_ds
RAWS4230DS=raw_s4_thr_230_ds
#subimages

# regions of image
HEADER=header
BODY=body
LTABLE=ltable
RTABLE=rtable
SCALE=scale

HEADERS120D=${HEADER}"_s4_thr_120_ds"
LTABLES120D=${LTABLE}"_s4_thr_120_ds"
RTABLES120D=${RTABLE}"_s4_thr_120_ds"

SLEEP1=1
SLEEP5=5

# make project from a directory (CPROJECT) containing PDFs. 
# a no-op here as EuPMC has already done this

ami-makeproject -p $CPROJECT --rawfiletypes pdf

# convert PDFs to CTrees

ami-pdf $SOURCE

# image processing at 3 threshold levels (later will try to make this an AMI loop)

ami-image $SOURCE --sharpen sharpen4 --threshold 150 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 230 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 240 --despeckle true

echo "===============Finished AmiImage============="
sleep $SLEEP1

# run OCR both types

ami-ocr $SOURCE --gocr      /usr/local/bin/gocr      --extractlines gocr               --forcemake
ami-ocr $SOURCE --tesseract /usr/local/bin/tesseract --extractlines hocr --html false  --forcemake

echo "===============Finished AmiOcr============="
sleep $SLEEP1

# extract the pixels and project onto axes to get subimage regions
# further project the scale subimage (y(2)) to get the tick values 
# in this case do it for the threshold 230 version only
# the stylesheet location (xsl) is hard coded into the distrib but it could be
# more general.
# This *generates* raw_thr_230_ds/template.xml. Its variables (e.g. $RAW.$HEADER) are specified
# in the stylesheet and values computed from applying ami-pixel to the images

ami-pixel $SOURCE --projections --yprojection 0.8 --xprojection 0.5 \
    --minheight -1 --rings -1 --islands 0 \
    --inputname $RAW230DS \
    --subimage statascale y 2 delta 10 projection x \
    --templateinput $RAW230DS/projections.xml \
    --templateoutput template.xml \
    --templatexsl /org/contentmine/ami/tools/stataTemplate.xsl

echo "===============Finished AmiPixel============="
sleep $SLEEP5

# use the generated template.xml in each CTree/*/image*/raw_thr_230_ds/ directory to segment the image
# this will create subimages $RAW.$HEADER, $RAW.$BODY.$LTABLE, raw.body.graph, $RAW.$BODY.$RTABLE and raw.scale
# these subimages will be written to *.png in the CTree/*/image* directory
			    
ami-forestplot $SOURCE --template $RAW230DS/template.xml

echo "===============Finished AmiForest============="
sleep $SLEEP5

#now re-run ami-image to enhance each subimage separately

ami-image $SOURCE --inputname $RAW.$HEADER --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$LTABLE --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$RTABLE --sharpen sharpen4 --threshold 120 --despeckle true

echo "===============Finished Sharpen Threshold============="
sleep $SLEEP5

# and rerun tesseract on each subimage (suspect Tesseract gets confused by the whole
# image including the graph and lines).

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr

echo "===============Finished Tesseract ============="
sleep $SLEEP5

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr

echo "===============Finished GOCR ============="
sleep $SLEEP5

petermr avatar Sep 19 '19 11:09 petermr
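
The script notes that the three ami-image threshold runs might later become a loop. A minimal sketch of that idea in plain shell is shown here. It reuses only the ami-image flags that appear in the script; the CTREE default is a placeholder, and the fallback to printing commands when ami-image is missing is an addition for illustration.

```shell
#!/bin/sh
# Sketch: the three ami-image threshold calls from the script, as a loop.
# CTREE is a placeholder path; replace it with a real CTree directory.
CTREE=${CTREE:-path/to/ctree}
SOURCE="-t $CTREE"

# Fall back to printing the commands when ami-image is not installed.
if command -v ami-image >/dev/null 2>&1; then
    RUN=""
else
    RUN="echo"
fi

COUNT=0
for THRESHOLD in 150 230 240; do
    # $SOURCE is left unquoted on purpose so it splits into "-t" and the path
    # (this breaks if the path contains spaces).
    $RUN ami-image $SOURCE --sharpen sharpen4 --threshold "$THRESHOLD" --despeckle true
    COUNT=$((COUNT + 1))
done
echo "ran $COUNT thresholds"
```

Adding or removing a threshold then means editing one list rather than copying a line.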

Don't send it; add it in a new folder here, unless there are copyright issues.

petermr avatar Sep 19 '19 11:09 petermr

From the 25K, try to select ca. 20 which are:

  • newish (old docs are problematic, but maybe that is the point)
  • born digital if possible
  • OPEN (we cannot have takedowns)
  • show the range of problems
  • make clear what needs to be extracted

petermr avatar Sep 19 '19 11:09 petermr

I think there are copyright questions, yes, but I'll check first.

mrchristian avatar Sep 19 '19 11:09 mrchristian

If it's publicly visible I'm happy. We did that with phylotrees. We are allowed to extract data if we can legally read it somewhere; it doesn't have to be CC BY. Also, I don't think stopping climate research is good PR.

petermr avatar Sep 19 '19 11:09 petermr

happy to talk on phone/skype if helps

petermr avatar Sep 19 '19 11:09 petermr

If you have 100-year-old records as bitmaps I am happy to try those, but they must be homogeneous in type.

petermr avatar Sep 19 '19 11:09 petermr

I need to wait for colleagues to get docs :-)

mrchristian avatar Sep 19 '19 11:09 mrchristian

see table extraction at http://discuss.contentmine.org/t/ami-eppi-cm-ucl-table-extraction-project/322/14

petermr avatar Sep 19 '19 12:09 petermr

even one doc would be a useful start. can tackle it in next 1.5 hours

petermr avatar Sep 19 '19 12:09 petermr

Would like to show something for my school visit in 10 days.

petermr avatar Sep 19 '19 12:09 petermr

https://edocs.tib.eu/files/e01fb19/1676027963.pdf has https://creativecommons.org/licenses/by/3.0/de. I'll look for some more, might take some minutes.

hauschke avatar Sep 19 '19 12:09 hauschke

http://creativecommons.org/licenses/by/4.0/, https://edocs.tib.eu/files/e01fb19/1666373214.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1670198502.pdf
https://creativecommons.org/licenses/by-nc-nd/4.0/, https://edocs.tib.eu/files/e01fb19/1667335782.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/1665279796.pdf
http://creativecommons.org/licenses/by-sa/4.0/, https://edocs.tib.eu/files/e01fb19/166506773X.pdf

Some more for testing. Sorry, I could deliver some dozens more, but I hope that's enough for a trial.

hauschke avatar Sep 19 '19 12:09 hauschke

I have processed your first PDF and uploaded the results. It extracts the bitmaps and characters as SVG. I will revisit my SVG 2 text.

See if you can make some sense of it. The SVG is in pages.

petermr avatar Sep 19 '19 12:09 petermr
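
For a first look at the text in the SVG pages, something like the following may be enough. It is a rough sketch under a strong assumption: that each character or string sits in its own single-line <text> element. The real ami-pdf output may be structured differently, and this is not the SVG-to-text converter mentioned above. The sample page is hand-made for illustration.

```shell
#!/bin/sh
# Sketch: pull the character data out of <text> elements in one SVG page.
# Assumes each <text> element is on one line and holds plain characters;
# XML entities such as &amp; are not decoded.
svg_text() {
    grep -o '<text[^>]*>[^<]*</text>' "$1" \
        | sed -e 's/<[^>]*>//g' \
        | tr -d '\n'
    echo
}

# Tiny hand-made page standing in for a real one.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
<svg xmlns="http://www.w3.org/2000/svg">
  <text x="10" y="20">C</text>
  <text x="18" y="20">O</text>
  <text x="26" y="20">2</text>
</svg>
EOF

svg_text "$SAMPLE"   # prints: CO2
```

This ignores the x/y coordinates, so it only preserves reading order when the elements are already written out in that order; grep -o is a common extension (GNU and BSD grep) rather than strict POSIX.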

The next 5 don't seem very relevant to climate change? It's not clear what would be extracted.

I want to stick to climate and specific types of information - tables/graphs vs time, e.g.

petermr avatar Sep 19 '19 13:09 petermr

We'll assemble a small climate change collection; it will take a few days though. We will also get hold of an example list of the items we want to extract. The context is wanting to make final research reports more visible so as to make them part of the research corpus in a more usable way. The climate change related reports would sit within the bigger body of research reports. If you can share back the current SVG outputs that would be great.

mrchristian avatar Sep 19 '19 14:09 mrchristian

Here is a set of 10 research reports that are CC licensed. This is not a priority, but interesting to know some time if entities like 'Abstract, Introduction and Conclusion' can be extracted. The context is in terms of making German research reports more visible, usable, and obviously help future research. The ambition is to make the national collection easier to use, and well if it can be done for one collection, why not more.

Files

http://creativecommons.org/licenses/by-sa/3.0/de,https://edocs.tib.eu/files/e01fb19/1676027963.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1028076258.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1028076134.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1027897045.pdf
http://creativecommons.org/licenses/by-nc-nd/4.0,https://edocs.tib.eu/files/e01fb18/1027879500.pdf
http://creativecommons.org/licenses/by-nd/4.0/deed/,https://edocs.tib.eu/files/e01fn18/1018823859.pdf
https://creativecommons.org/licenses/by-nd/4.0/deed.en,https://edocs.tib.eu/files/e01fn17/893648477.pdf
http://creativecommons.org/licenses/by/4.0/,https://edocs.tib.eu/files/e01fb17/881442836.pdf
http://creativecommons.org/licenses/by-nd/3.0/de/,https://edocs.tib.eu/files/e01fn16/864300328.pdf
http://creativecommons.org/licenses/by-nd/3.0/de/,http://edok01.tib.uni-hannover.de/edoks/e01fn17/857413724.pdf
http://creativecommons.org/licenses/by-nc-nd/3.0/de/,https://edocs.tib.eu/files/e01fn13/739959433.pdf
http://creativecommons.org/licenses/by-nc-nd/3.0/de/,https://edocs.tib.eu/files/e01fn13/719349311.pdf

Oh, some more context :-) https://twitter.com/Lambo/status/1176901945249939463

mrchristian avatar Sep 26 '19 08:09 mrchristian
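
Since the file list above is already in "license,url" form, it can be treated as a small manifest. The sketch below reads such a manifest and plans one download per row into a directory that could then be handed to ami-makeproject. MANIFEST, OUTDIR and the DOWNLOAD switch are hypothetical names introduced for illustration; only two rows from the list are included, and nothing is fetched unless DOWNLOAD=1 is set.

```shell
#!/bin/sh
# Sketch: turn "license,url" lines into a download plan.
# MANIFEST and OUTDIR are hypothetical names; the two rows are copied from the list above.
MANIFEST=$(mktemp)
cat > "$MANIFEST" <<'EOF'
http://creativecommons.org/licenses/by-sa/3.0/de,https://edocs.tib.eu/files/e01fb19/1676027963.pdf
http://creativecommons.org/licenses/by/4.0/,https://edocs.tib.eu/files/e01fb17/881442836.pdf
EOF

OUTDIR=${OUTDIR:-tib_reports}
mkdir -p "$OUTDIR"

# Set DOWNLOAD=1 to fetch the files; by default only the plan is printed.
PLANNED=0
while IFS=, read -r LICENSE URL; do
    [ -n "$URL" ] || continue
    FILE=$OUTDIR/$(basename "$URL")
    echo "$FILE  <-  $URL  ($LICENSE)"
    if [ "${DOWNLOAD:-0}" = 1 ]; then
        curl -fsSL -o "$FILE" "$URL"
    fi
    PLANNED=$((PLANNED + 1))
done < "$MANIFEST"
echo "$PLANNED files planned"
```

Keeping the license next to each URL means the provenance of every PDF stays visible when the results are recorded in the repository.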

I am back in Cambridge so can start working on this.

petermr avatar Sep 27 '19 07:09 petermr