CEVOpen Explore KNIME functionality on articles and CProjects

@deadlyvices has been exploring this and reporting in email.

ACTION: copy any relevant past emails here...

Oct 14 '19 12:10 petermr

I've made some headway in processing ContentMine output using KNIME. I can now read in the full text of articles and tag it up using OSCAR.

We should also be able to tag using the dictionaries that Ambarish is creating.

Oct 14 '19 13:10 petermr

I'm picking up the conversation from our initial email chat. I'm currently using KNIME to see if we can use the output of getpapers and ami as feedstock for some further analysis.

There are two main areas I'm investigating:

Rule-based tagging using KNIME's own built-in nodes.
Dictionary-based tagging using Ambarish's work.

I should hopefully have something to show over the next few days or so.

Oct 14 '19 14:10 deadlyvices

So: I suppose the next question is - if we're looking for telling correlations between conditions and substances, should we be looking in the abstract or the body? I'd say the former as it's most likely to spell out key conclusions.

Oct 14 '19 14:10 deadlyvices

So: I've now got KNIME reading in the dictionaries and tagging up documents with them. This is good, but it would be even better if there were an easy way of defining one's own tag set; for now I have to make do with the standard set of tags. The only way of defining a custom tag set is in Java, and I know absolutely nothing about Java programming. So if anyone wants to take this on, please, be my guest.

Oct 14 '19 15:10 deadlyvices

On Mon, Oct 14, 2019 at 4:06 PM Clyde Davies [email protected] wrote:

So: I've now got KNIME reading in the dictionaries and tagging up documents with them. This is good, but would be even better if there was an easy way of defining one's own tag set. I have to make do with the standard set of tags.

What are these tags?

The only way of doing this is in Java, https://www.knime.com/for-developers-integration-of-custom-tag-sets and I know absolutely nothing about Java programming.

I can understand what is in the tutorial. It also uses Eclipse which is a standard Java IDE and I'm familiar with it.

So if anyone wants to take this on, please, be my guest.

If you can spell out what is required we can estimate the effort. (Most things at this stage are tweaking examples, not writing code from scratch).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/petermr/CEVOpen/issues/38?email_source=notifications&email_token=AAFTCS62SHPZQRZJFT4FNDLQOSDGBA5CNFSM4JAOYUA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBFDMTI#issuecomment-541734477, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTCSZRRWNUYHNKK75S6CTQOSDGBANCNFSM4JAOYUAQ .

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

Oct 14 '19 16:10 petermr

KNIME passes data tables from node to node. These support a Document column type. Tagging nodes insert, well, tags into the document to mark recognised terms. There are the typical part-of-speech (POS) tag categories but also some more specialised ones. The blue text [OSCAR(ONT)] shown in the screenshot means that the OSCAR category has recognised an ontology entity and tagged it up appropriately. I think if we are going to take this further then we probably need an AMI category with PLANT, ACTIVITY, INSTRUMENT, PLANTPART etc. tags, one for each dictionary and the entity class it recognises. Currently I'm having to use the OSCAR tag type with a tag value of CUST to mark these up. This is nowhere near granular enough for our purposes. And it's wrong.
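
To make the proposal concrete, here is a minimal Python sketch of the data model only. It is not KNIME code and not the Java extension point: the names AmiTag, Tag and ami_tag are hypothetical, and the only details taken from this thread are the "OSCAR(ONT)" label shape, the CUST workaround and the four suggested tag values.

```python
from dataclasses import dataclass
from enum import Enum


class AmiTag(Enum):
    """Hypothetical AMI tag set: one tag value per dictionary."""
    PLANT = "PLANT"
    PLANTPART = "PLANTPART"
    ACTIVITY = "ACTIVITY"
    INSTRUMENT = "INSTRUMENT"


@dataclass(frozen=True)
class Tag:
    tag_type: str   # the category (tag set) name, e.g. "OSCAR" or "AMI"
    tag_value: str  # the value within that category, e.g. "ONT" or "PLANT"

    def label(self) -> str:
        # Same shape as the "OSCAR(ONT)" label described above
        return f"{self.tag_type}({self.tag_value})"


def ami_tag(value: AmiTag) -> Tag:
    """Build a tag in the (assumed) AMI category."""
    return Tag("AMI", value.value)


# Today's workaround: every dictionary hit carries the same label
current = Tag("OSCAR", "CUST")
# What a dedicated AMI category would allow: the label names the dictionary
proposed = ami_tag(AmiTag.PLANTPART)
print(current.label(), "->", proposed.label())
```

With a separate value per dictionary, a hit from the plant-part dictionary can be told apart from a hit from the activity dictionary, which the single CUST value cannot do.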

Oct 14 '19 18:10 deadlyvices

This is what I've got so far. You can see the three taggers at the end of the workflow:

I leave the dictionary tagging until last

Oct 14 '19 18:10 deadlyvices

You can see what happens: it's recognised all the terms in the dictionaries but has been unable to differentiate between them:

Oct 14 '19 18:10 deadlyvices

I think the most immediate advantage of this approach is that it allows us to visually test the dictionaries. The word 'and' is tagged, but why this should be so is unknown.
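
One way to chase down a hit like 'and' is to scan the dictionaries themselves for entries that collapse to a common word once case is ignored. That cause is only a guess and is not confirmed anywhere in this thread. The sketch below is plain Python, not a KNIME node; the dictionary names and terms are made up for illustration, and the stopword list is deliberately tiny.

```python
from typing import Dict, List, Set, Tuple

# A small illustrative stopword list; a real check would use a fuller one
STOPWORDS = {"and", "or", "the", "of", "in", "to", "a"}


def suspicious_entries(dictionaries: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """List (dictionary name, term) pairs whose lowercased form is a stopword."""
    found = []
    for name, terms in dictionaries.items():
        for term in terms:
            if term.strip().lower() in STOPWORDS:
                found.append((name, term))
    return sorted(found)


# Hypothetical dictionary contents, made up for illustration only
dictionaries = {
    "activity": {"antifungal", "AND"},
    "plant_part": {"leaf", "root"},
}
print(suspicious_entries(dictionaries))
```

If a check like this comes back empty, the next thing to look at would be how the tagger tokenises and matches terms rather than the dictionary contents.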

Oct 14 '19 19:10 deadlyvices

I just discovered a delightful feature of KNIME Hub which makes it incredibly easy to overwrite a workflow with an old version! So I am going to have to recreate that one. But I have the screenshot, so that at least will save me time figuring it out all over again.

Oct 14 '19 20:10 deadlyvices

What dictionaries are running and how does each work? I am a bit mystified by the words which are tagged. I now understand that each dictionary provides a single class of tag. I think your approach is reasonable, but I need to know how it decides what to tag.

Oct 14 '19 21:10 petermr

I'm just lumping all the dictionaries together right now. I think I can get away with just loading each dictionary and assigning a custom tag value individually to the matched terms. That would disambiguate effectively as there aren't so many of them.
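
The idea of loading each dictionary separately and giving its matches their own tag value can be sketched as follows. This is a plain Python illustration under stated assumptions, not the KNIME workflow: the tag values follow the names suggested earlier in the thread, the terms are invented, and whole-word case-insensitive matching is an assumption about how matching should behave.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical dictionaries keyed by tag value; real terms would come
# from the project's dictionary files
DICTIONARIES: Dict[str, Set[str]] = {
    "PLANT": {"lavender", "mentha piperita"},
    "PLANTPART": {"leaf", "flower"},
    "ACTIVITY": {"antibacterial"},
}


def tag_text(text: str, dictionaries: Dict[str, Set[str]]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched text, tag value) for every dictionary hit."""
    hits = []
    for tag_value, terms in dictionaries.items():
        for term in terms:
            # Whole-word, case-insensitive match; re.escape protects
            # terms that contain regex metacharacters
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            for match in pattern.finditer(text):
                hits.append((match.start(), match.end(), match.group(0), tag_value))
    # Sort by position so the hits read in document order
    return sorted(hits)


sample = "Lavender flower oil showed antibacterial activity."
for start, end, matched, tag_value in tag_text(sample, DICTIONARIES):
    print(f"{start:3d}-{end:3d} {matched!r} -> {tag_value}")
```

Because each hit carries the tag value of the dictionary it came from, the terms are no longer lumped together, which is the disambiguation described above.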

Oct 14 '19 21:10 deadlyvices

Thanks. Can you post a typical output?

Oct 15 '19 08:10 petermr

I'll have to rework that workflow anyway, so when I do I will get it to generate the output. Might be a little while doing that, though.

Oct 15 '19 09:10 deadlyvices

OK, I've been thinking about the best way to share this. And it's the most obvious way: get the workflows into GitHub. I suggest we create a KNIME folder (not sure where exactly) in the repo and put the KNIME workflows in it as immediate children. A workflow is simply a folder hierarchy, so it should fit in nicely.
This will also allow us to use relative paths when referencing our existing data files, which should mean no pesky config changes for new users.
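
The point about relative paths can be illustrated with a little path arithmetic. This is a Python sketch, not KNIME configuration: the folder and file names below are hypothetical, as are the two clone locations. It shows only that a reference written relative to the workflow folder resolves to the right file wherever the repo is cloned.

```python
import posixpath

# Hypothetical layout inside the repo, for illustration only
WORKFLOW_DIR = "workflows/knime/tag_articles"
DATA_FILE = "dictionaries/plant_part.xml"


def relative_reference(from_dir: str, target: str) -> str:
    """Path to `target` as seen from `from_dir`, both relative to the repo root."""
    return posixpath.relpath(target, start=from_dir)


def resolve(clone_root: str, from_dir: str, reference: str) -> str:
    """Resolve a workflow-relative reference for one particular clone of the repo."""
    return posixpath.normpath(posixpath.join(clone_root, from_dir, reference))


ref = relative_reference(WORKFLOW_DIR, DATA_FILE)
print(ref)

# Two made-up clone locations; the stored reference is identical for both
for clone_root in ("/home/alice/CEVOpen", "/srv/checkouts/CEVOpen"):
    print(resolve(clone_root, WORKFLOW_DIR, ref))
```

The reference stored with the workflow never mentions the clone location, so a new user should not need to edit it after cloning.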

Oct 15 '19 10:10 deadlyvices

Great, as always happy to talk.

Oct 15 '19 12:10 petermr

How about if I create a top-level folder, workflows, and then one immediately under it, knime? Then I put my workflows into that? That way, if we adopt any other tools, we can put them into their own tool-specific folders.

Oct 15 '19 13:10 deadlyvices

Quick question: I have a good working knowledge of git but am no expert. Do .gitignore files only work at the top level of the repo, or can they be declared lower down so they're folder-specific?

Oct 15 '19 13:10 deadlyvices

Do whatever makes sense. GitHub is free; we can create a fresh repo if it doesn't work...

Oct 15 '19 14:10 petermr

Do you update the master branch directly or is it all through pull requests? I've just created a branch that I'm happy with and could do with merging into master.

Oct 15 '19 14:10 deadlyvices

At present we generally all push to master directly. That's because maintaining a consistent policy on branches is not easy when people aren't familiar with GitHub. I know it's crude, but so far there have been no problems.

Oct 15 '19 14:10 petermr

OK. On Chem4Word we're used to processing changes through pull requests. I'll still create task branches, just to keep things isolated, but I'll merge in directly.

Oct 15 '19 14:10 deadlyvices

Sure, it's more critical for code, especially where it overlaps. Here you are creating your own contribution and, so far, there probably won't be potential for conflicts. It might happen if several people want to author a dictionary.

Oct 15 '19 15:10 petermr

I'll work as I suggested then, until we end up with more people working on the workflows.

Oct 15 '19 15:10 deadlyvices

Huge thanks for all the work. I am installing KNIME for Mac OS X. Then we can work together. I'd be surprised if we couldn't make rapid progress. We probably need to talk.

UPDATE: I have installed it. Point me at a CEV workflow!

Oct 16 '19 08:10 petermr

Yes, we probably do. I might have some time tomorrow night. After that it will be Sunday at the earliest. I'm hoping I can have some more to show you by then.

Oct 16 '19 08:10 deadlyvices

OK name a time... (Check this is 2019-10-16)

Oct 16 '19 09:10 petermr

Let's say 21:00 UTC Thursday (8 PM) - provisionally

Oct 16 '19 11:10 deadlyvices

Which parallel universe? 21:00 Wednesday, Coordinated Universal Time (UTC) is 22:00 Wednesday, in Cambridge, UK

Oct 16 '19 11:10 petermr

Oh sorry. Having a 'blonde' moment. 19:00 UTC! (8 PM!)

Oct 16 '19 11:10 deadlyvices
