Relation extraction

Open vladd-bit opened this issue 3 years ago • 8 comments

vladd-bit avatar Nov 16 '21 10:11 vladd-bit

@vladd-bit this should be a spaCy component, so it needs to have __call__ and pipe, as well as save and load, with the same args and structure as in meta_cat.py.

The relation_extraction.py is currently a mix of everything at once: it has training pre-processing functions, losses, datasets, train/test code, tokenizers and so on. It would be good to organize this a bit.
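For reference, a minimal sketch of the kind of component shape being asked for here (class name, arguments and internals are illustrative placeholders, not the actual MedCAT API), loosely mirroring the meta_cat.py contract:

```python
from typing import Iterable, Iterator
from spacy.tokens import Doc

# Hypothetical skeleton illustrating the expected spaCy-component shape,
# not the actual RelCAT implementation.
class RelationExtractionComponent:
    name = "rel_extraction"

    def __init__(self, model, tokenizer, config):
        self.model = model
        self.tokenizer = tokenizer
        self.config = config
        # Relations get attached to the Doc via a custom extension.
        if not Doc.has_extension("relations"):
            Doc.set_extension("relations", default=[])

    def __call__(self, doc: Doc) -> Doc:
        # Predict relations between entity pairs and attach them to the doc.
        doc._.relations = self._predict(doc)
        return doc

    def pipe(self, stream: Iterable[Doc], batch_size: int = 32) -> Iterator[Doc]:
        # Process docs lazily, mirroring spaCy's pipe() contract.
        for doc in stream:
            yield self(doc)

    def save(self, save_dir_path: str) -> None:
        # Persist model weights, tokenizer and config under save_dir_path.
        ...

    @classmethod
    def load(cls, save_dir_path: str) -> "RelationExtractionComponent":
        # Restore a component from a directory produced by save().
        ...

    def _predict(self, doc: Doc):
        # Placeholder for the actual model inference.
        return []
```

The key point is that the component follows the same __call__ / pipe / save / load contract that meta_cat.py already uses, so it can be added to the MedCAT pipe like any other addon.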

w-is-h avatar Nov 16 '21 11:11 w-is-h

@vladd-bit - I don't think this is ready to be merged. I think we're after a simple 'reference' or baseline implementation of a RelationExtraction model.

Is there a reference paper that you've built this implementation on - is it this one? And is there some reference data / are there results that you've tested it on?

Is this PR essentially a conversion of that code to PyTorch and an integration into spaCy Docs, so that once relations are found by the model they are attached to the Doc?

  • Why was the re-implementation of the BERT components necessary, i.e. why is all of the module https://github.com/CogStack/MedCAT/pull/173/files#diff-cc36c887710fb29a577dce375998afef4bfe9b54ba77ad0438b70688a8f81b51 needed? It seems like you're just re-implementing the entirety of the model. Are there any differences in the implementation worth noting? If so, please comment on them.
  • From what I could tell, BERTMLMHead and BERTOnlyNSPHead aren't used anywhere?
  • The import of BERT_RelationExctracton is wrong and should be from medcat.utils.relation_extraction.models import ... Are you sure this PR works and produces results?
  • This implementation should work with relation annotation data as outputted by the trainer, similarly to how the CAT and MetaCAT classes accept MedCATtrainer_export.json exports for training.

tomolopolis avatar Nov 18 '21 13:11 tomolopolis

The structure is now separated: the utils/tokenizer/models and other dataset code are located in /utils/relation_extraction, and RelCAT now contains most of the PipeRunner implementation, somewhat similar to MetaCAT.

vladd-bit avatar Jan 19 '22 16:01 vladd-bit

Implementation references: https://github.com/dmis-lab/biobert (https://academic.oup.com/bioinformatics/article/36/4/1234/5566506) and https://github.com/uf-hobi-informatics-lab/ClinicalTransformerRelationExtraction (https://arxiv.org/abs/1906.03158).
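As a quick illustration of the entity-marker scheme those references describe (a hedged sketch only, not the PR's exact code; the model name, marker tokens and index handling are placeholders):

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Hedged sketch of entity-marker relation classification, roughly the scheme
# from arXiv:1906.03158: wrap both candidate entities in special tokens, then
# classify from the encoder states at the two entity-start markers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[s1]", "[e1]", "[s2]", "[e2]"]})


class RelClassifier(nn.Module):
    def __init__(self, n_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        # Account for the newly added marker tokens.
        self.encoder.resize_token_embeddings(len(tokenizer))
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(2 * hidden, n_labels)

    def forward(self, input_ids, attention_mask, s1_idx, s2_idx):
        # Token-level hidden states for the whole batch.
        states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        batch = torch.arange(input_ids.size(0))
        # Concatenate the states at the two entity-start markers and classify.
        pair = torch.cat([states[batch, s1_idx], states[batch, s2_idx]], dim=-1)
        return self.classifier(pair)
```

Input text would look like "[s1] lisinopril [e1] was prescribed for [s2] hypertension [e2]", with s1_idx / s2_idx pointing at the positions of the [s1] and [s2] markers in each sequence.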

All previous code related to BERTModel has been removed; the only reference to the model is now in /medcat/utils/relation_extraction/models.py.

The tool now supports MedCAT export JSON files, as seen in the 'create_relations_from_export' method in /medcat/utils/relation_extraction/rel_dataset.py.
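For anyone wanting to try this, a rough sketch of how relation instances could be pulled out of a MedCATtrainer export (the nested field names below are assumptions for illustration; the authoritative mapping is create_relations_from_export in rel_dataset.py):

```python
import json

def load_relations(export_path: str):
    # Assumed export layout: projects -> documents -> relations.
    # Field names here are illustrative; see create_relations_from_export
    # in /medcat/utils/relation_extraction/rel_dataset.py for the real logic.
    with open(export_path, encoding="utf-8") as f:
        export = json.load(f)

    samples = []
    for project in export.get("projects", []):
        for doc in project.get("documents", []):
            text = doc.get("text", "")
            for rel in doc.get("relations", []):
                samples.append({
                    "text": text,
                    "ent1": rel.get("start_entity_value"),
                    "ent2": rel.get("end_entity_value"),
                    "label": rel.get("relation"),
                })
    return samples

# relations = load_relations("MedCATtrainer_export.json")
```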

Some first results on the i2b2 dataset:

Dataset relation labels:

    1. Medical problems and treatments, classes: (TrIP = treatment improves medical problem), (TrWP = treatment worsens medical problem), (TrCP = treatment causes medical problem), (TrAP = treatment is administered for medical problem), (TrNAP = treatment is not administered because of medical problem)
    2. Medical problems and tests: (TeRP = test reveals medical problem), (TeCP = test conducted to investigate medical problem)
    3. Medical problems and other medical problems: (PIP = medical problem indicates medical problem)

Losses at Epoch 14: 0.21785
Train accuracy at Epoch 14: 0.13141
Evaluating test samples...

==================== Evaluation Results ===================
no. of batches: 13
accuracy = 0.044
f1 = 0.044
loss = 2.188
precision = 0.044
recall = 0.044
----------------------- class stats -----------------------
label: TrCP  | f1: 0.028 | prec: 0.028 | acc: 0.554 | recall: 0.028
label: TrAP  | f1: 0.015 | prec: 0.015 | acc: 0.932 | recall: 0.015
label: PIP   | f1: 0.000 | prec: 0.000 | acc: 0.538 | recall: 0.000
label: TrIP  | f1: 0.000 | prec: 0.000 | acc: 0.308 | recall: 0.000
label: TeCP  | f1: 0.000 | prec: 0.000 | acc: 0.308 | recall: 0.000
label: TeRP  | f1: 0.000 | prec: 0.000 | acc: 0.769 | recall: 0.000
label: TrNAP | f1: 0.000 | prec: 0.000 | acc: 0.231 | recall: 0.000

===========================================================
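For reproducibility, per-class stats like the ones above can be computed with scikit-learn; a minimal sketch (label names follow the i2b2 scheme, the data below is dummy and purely to show the calls):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

labels = ["TrIP", "TrWP", "TrCP", "TrAP", "TrNAP", "TeRP", "TeCP", "PIP"]

# Dummy predictions only, to demonstrate the metric computation.
y_true = ["TrAP", "TeRP", "PIP", "TrCP", "TeRP"]
y_pred = ["TrAP", "TeRP", "TrCP", "TrCP", "PIP"]

prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
print("accuracy =", accuracy_score(y_true, y_pred))
for name, p, r, f, s in zip(labels, prec, rec, f1, support):
    print(f"label: {name} | f1: {f:.3f} | prec: {p:.3f} | recall: {r:.3f} | support: {s}")
```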

vladd-bit avatar Jan 24 '22 12:01 vladd-bit

Evaluation results:
- train samples: ~2500 over the three classes
- test samples: 300 samples, 100 for each label

==================== Evaluation Results ===================
no. of batches: 12
accuracy = 0.383
f1 = 0.383
loss = 1.279
precision = 0.383
recall = 0.383
----------------------- class stats -----------------------
label: TeRP | f1: 0.183 | prec: 0.183 | acc: 0.536 | recall: 0.183
label: TrAP | f1: 0.113 | prec: 0.113 | acc: 0.661 | recall: 0.113
label: PIP  | f1: 0.087 | prec: 0.087 | acc: 0.893 | recall: 0.087

===========================================================

vladd-bit avatar Jan 31 '22 14:01 vladd-bit

Good find on the stat fixes. What are the updated results?

tomolopolis avatar Feb 03 '22 22:02 tomolopolis

@vladd-bit This is marked as draft. Is that still the case? Or would I be fine to start a review?

mart-r avatar Feb 08 '24 14:02 mart-r

Leave it as is for now; there are a few things which need to be added, mainly tests and a bit of code cleanup.

vladd-bit avatar Feb 08 '24 14:02 vladd-bit

@vladd-bit @shubham-s-agarwal I take it this is now done? I still don't see any tests. Would be good to have at the very least some basic tests. We don't want these things to start failing after some (seemingly) unrelated changes elsewhere in the project.
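To make the ask concrete, even a smoke test along these lines would help (the import path, load call and fixture below are hypothetical placeholders and would need to match the real RelCAT API):

```python
import unittest

class RelCATSmokeTest(unittest.TestCase):
    # Hypothetical smoke-test skeleton; module path, constructor and
    # fixture paths are placeholders, not the actual RelCAT interface.
    def test_predict_attaches_relations(self):
        from medcat.rel_cat import RelCAT  # assumed module path

        rel_cat = RelCAT.load("tests/resources/relcat_model")  # assumed fixture
        doc = rel_cat("The patient was started on lisinopril for hypertension.")
        self.assertTrue(hasattr(doc._, "relations"))

if __name__ == "__main__":
    unittest.main()
```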

mart-r avatar Apr 09 '24 08:04 mart-r

Will push the tests when I'm back (tomorrow); they are ready, just minor changes to the latest commits.

vladd-bit avatar Apr 09 '24 09:04 vladd-bit

@vladd-bit - still working on those tests?

tomolopolis avatar Apr 12 '24 10:04 tomolopolis

@vladd-bit Do you mind fixing the typing stuff?

mart-r avatar Apr 15 '24 09:04 mart-r

flake8 fixes required

tomolopolis avatar Apr 19 '24 13:04 tomolopolis