
Compatibility with newer spaCy versions

Open plandes opened this issue 3 years ago • 9 comments

What problem does your feature solve? Add instructions on how to retrain the models against newer versions of the packages (specifically spaCy 2.3.5, and later 3.0), or better, one robust, easy-to-run script that does so.

Describe the solution you'd like I'd like an easy, reproducible way to retrain the model on an updated set of packages (numpy/msgpack/msgpack-numpy, torch, etc.), since I'm using this package with newer versions of its dependencies, which are currently pinned to specific versions (e.g. spaCy 2.2.2).

Describe alternatives you've considered Using current versions of the packages works, but with warnings, and I don't trust the results given that the word vectors might have changed, along with other data serialized to (for example) medacy-model-clinical-notes.
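The drift worry above could be made concrete with a small pre-flight check. This is an illustrative sketch, not part of medaCy; the pin list is only the example version from this thread, and the helper itself is hypothetical:

```python
# Illustrative sketch (not part of medaCy): fail fast when installed
# packages differ from the pins a serialized model was trained against.
from importlib import metadata

# Example pin from the thread; adjust to the model's training environment.
PINNED = {"spacy": "2.2.2"}

def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatched pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `check_pins(PINNED)` before loading a serialized model would surface exactly the kind of silent drift those warnings hint at.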

Additional context If you can point me to the resources, I can write a script/process to do this automatically.

plandes avatar Mar 18 '21 17:03 plandes

We have not yet investigated what it would take to make medaCy compatible with the latest versions of its dependencies.

swfarnsworth avatar Mar 18 '21 18:03 swfarnsworth

Fair enough. Do you have a documented process of how to train the clinical notes model so I can do the work?

plandes avatar Mar 19 '21 12:03 plandes

That model is trained on the n2c2 2018 track 2 dataset, which is described here. One can then train a model over it using the command line interface.

Updating the package's dependencies will likely be a future project.

swfarnsworth avatar Mar 19 '21 13:03 swfarnsworth

Thanks for your links on how to train the model. Also, thank you for writing this software and making it available for the public--it is well written.

I have generated two models with an updated set of dependencies.

Changes

  • Updated the setup.py file's dependencies to spaCy 2.3.5, which appears to be the last stable 2.x release. I tried upgrading to 3.0.5 a few times and ran into issues; it doesn't seem ready for prime time just yet.

  • Created an automated process to train the same model as the medaCy_model_clinical_notes in a forked repo (see below more about this process).

  • Retrained the clinical model in the forked medaCy_model_clinical_notes repo.

  • Trained another clinical model using the ClinicalBERT pre-trained embeddings with CRF. (Note: I tried to train the model without CRF, but got index errors, so perhaps an API has changed in a return value from the transformers package.)

    This model repo has the same form as the clinical notes repo. However, it is an LFS-enabled repo since the PyTorch model exceeds 100MB. I have tested installing it via pip using a git+https URL and it works fine, since Git LFS handles the large files during the clone that pip performs.

Re-Training

The automated process for training lives in the train subdirectory of the repository root and is driven by GNU make, with instructions in the README.md file. It provides a way to train the two models I've mentioned (both clinical; one CRF, the other BERT+CRF). It does the following:

  • Clones the respective model repo when not present locally.
  • Provides the configuration and parameters to the medaCy CLI.
  • Runs the cross-fold validation on the generated model.
  • Copies the corresponding model file(s) to the model repo.
  • Parses and copies the output performance metrics to the respective model_data.txt in the model directory and the model repo's README.md.

The downside is that each of these steps requires a separate command-line invocation, since many things can go wrong and the process must be babysat. However, the steps are short and (as mentioned) documented.
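The babysat flow above amounts to running each step as its own command and stopping at the first failure. A minimal sketch, assuming hypothetical make target names (the real targets live in the train directory's README.md):

```python
import subprocess

# Hypothetical target names standing in for the real make targets in the
# train/ directory; the actual process is documented in its README.md.
STEPS = ["clone", "train", "validate", "copymodel", "metrics"]

def run_steps(steps, runner=subprocess.run):
    """Run each step as its own command, stopping at the first failure."""
    for step in steps:
        result = runner(["make", step])
        if result.returncode != 0:
            return step  # report which step failed so it can be inspected
    return None  # every step succeeded
```

Keeping each step a separate invocation trades convenience for the ability to inspect and rerun exactly the step that broke.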

Action Items

From here, I can either:

  • Create a pull request to the medaCy and/or existing clinical model repos, or
  • Do nothing and leave the forked repos where they are.

Please let me know what you want.

Future Work

I plan to first incorporate this work into my research, then more than likely add a model for the n2c2 2014 De-identification & Heart Disease data to tag PHI.

Thanks again for this great software.

plandes avatar Mar 27 '21 15:03 plandes

@plandes It sounds like you put a lot of work into adapting medaCy for your needs, and I really appreciate that you've clearly documented the workflow that you used.

Am I correct in understanding that you have a copy of n2c2 2018, that you were able to train a CRF model with no source code modifications using spaCy 2.3.5, and that you received an IndexError when attempting to train a BERT model with a certain version of BERT that may not be compatible with the version of transformers used in medaCy? If so, we can probably set a range of permissible versions of spaCy that include the currently required version up to 2.3.5 (following regression tests).
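The permissible range floated above would normally be declared in setup.py as something like `spacy>=2.2.2,<=2.3.5`. A minimal sketch of the same check, assuming plain numeric version strings:

```python
def parse_version(v):
    """Turn '2.3.5' into (2, 3, 5) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

def spacy_version_ok(version, low="2.2.2", high="2.3.5"):
    """True when version falls inside the permissible range (inclusive)."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)
```

Real dependency specifiers also handle pre-release and post-release tags, which this numeric sketch ignores.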

swfarnsworth avatar Mar 27 '21 16:03 swfarnsworth

@swfarnsworth Correct, this was trained on the n2c2 2018 task 2 corpus.

Also, to be specific, these were the Clinical BERT embeddings (see the paper), trained for 10 epochs; they showed better performance than the default cased embeddings (at least on the folds trained with the default 3 epochs before I stopped it).

Also correct on the IndexError, which I'm guessing results from the library dependency change; the transformers dependency I changed is pinned to the latest version.

I also forgot to mention that PyTorch 1.8 doesn't appear to be stable either (along with spaCy 3.0), so I backed off to 1.7.

plandes avatar Mar 27 '21 17:03 plandes

Update: I have:

  1. Renamed the model repo from bert to bert_crf and updated the link in my previous responses in this thread.
  2. Fixed the bug in bert_learner.py. The issue was that the newer version of 🤗 transformers returns more data now (the loss, logits, hidden states, and attentions).
  3. Trained a new model in this new repo (note that it has the same name as the previous model, because it was trained without the CRF layer).
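The change in item 2 can be illustrated generically. This is not the actual bert_learner.py patch; it assumes labels are supplied during training, so the loss comes first in the newer tuple-style output:

```python
def extract_logits(output):
    """Pick out the logits from either the old or new transformers output.

    Older versions returned the logits tensor directly; newer versions
    return a tuple of (loss, logits, hidden_states, attentions) when
    labels are supplied. Indexing the output positionally without
    accounting for this is the kind of change that produces an
    IndexError like the one described earlier in the thread.
    """
    if isinstance(output, tuple):
        return output[1] if len(output) > 1 else output[0]
    return output  # old behavior: logits returned directly
```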

The newly trained model has scores very similar to the BERT+CRF model, with the non-CRF model performing slightly higher. I trained it for 3 epochs instead of 10, but I doubt that makes a difference. I'll be taking down the medaCy_bertcrf_model_clinical_notes repo at some point, because it uses a lot of Git LFS space. A better long-term strategy might be to make the models pip-installable, since they take up so much space.

Speaking of which, will you please indicate whether you plan to incorporate these changes? If not, that's fine, but I'll change all repo URLs to my forked repos for the main source and models. Then I'll release medaCy to PyPI.

Thanks again. This will be helpful in my own research.

plandes avatar Mar 31 '21 16:03 plandes

@plandes, I am graduating at the end of this semester and I don't know if I will be able to review any changes to the package before then, though I will see if any of my colleagues might be able to.

That being said, NLP@VCU will continue to support medaCy after my departure, and we appreciate how thoroughly you've documented your workflow. We will need to discuss internally what changes you've made and which we can merge into the main repository.

Am I to understand that you were planning to publish your fork of medaCy to PyPI, or this one?

swfarnsworth avatar Mar 31 '21 19:03 swfarnsworth

@swfarnsworth Congrats on graduating! Seems like a dream to me at this point.

Yes, I totally understand--take the time you need, and there's no reason I can't publish what I need for my own purposes; we can fold the changes back in to NLP@VCU later.

Yes, I'd publish medaCy with the BERT fix and updated dependencies under my own namespace (zensols), along with the models (assuming PyPI doesn't have size constraints), for my own purposes. However, if you can review the changes somewhat soon, I'll hold off and wait for that integration. If we can get everything merged back under NLP@VCU, then I'll take down my work from GitHub.

plandes avatar Mar 31 '21 20:03 plandes