ehr_deidentification
Adding threshold to Transformers pipeline
I'm using this code to run inference:
# Use a pipeline as a high-level helper
from transformers import pipeline
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("obi/deid_bert_i2b2")
pipe = pipeline("token-classification", tokenizer=tokenizer, model="obi/deid_bert_i2b2",
                aggregation_strategy="first")
I'm trying to increase the threshold, but can't find a config for it. Is it possible with my setup?
Hi, sorry for the late response.
Unfortunately, a threshold can't be set through the Hugging Face pipelines. What you could do is check whether you can get the raw logits from the pipeline. If you can, you can process the raw logit values using the code given here: Threshold max or Threshold sum
Let us know if you have any other questions!
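For readers unfamiliar with the idea, a max-threshold rule over raw logits can be sketched as follows. This is a minimal pure-Python illustration, not the robust_deid implementation: the function names, the three-entry label list, and the convention that index 0 is the outside label "O" are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert one token's logits into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_max(logits, threshold, outside_index=0):
    """Pick the most probable non-outside label if its probability
    reaches the threshold; otherwise fall back to the outside label."""
    probs = softmax(logits)
    candidates = [i for i in range(len(probs)) if i != outside_index]
    best = max(candidates, key=lambda i: probs[i])
    return best if probs[best] >= threshold else outside_index

labels = ["O", "B-PATIENT", "B-DATE"]  # hypothetical label list
token_logits = [4.0, 1.0, 2.5]         # made-up logits for one token

low = labels[threshold_max(token_logits, threshold=0.05)]
high = labels[threshold_max(token_logits, threshold=0.5)]
print(low, high)
```

In this sketch, a lower threshold lets a non-"O" label win even when "O" is the most probable class, and a higher threshold makes the rule fall back to "O" more often.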
Not sure if I'm doing this right, but this is the code I have so far:
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits
print(PostProcessPicker.get_threshold_max(predictions, 1.8982457699258832e-06))
I'm getting this error message when I run the code:
/usr/local/lib/python3.10/dist-packages/robust_deid/sequence_tagging/post_process/model_outputs/post_process_picker.py in get_threshold_max(self, threshold)
     56             (ThresholdProcessMax): Return Threshold Max post processor
     57         """
---> 58         return ThresholdProcessMax(self._label_list, threshold=threshold)
     59
     60     def get_threshold_sum(self, threshold) -> ThresholdProcessSum:
AttributeError: 'Tensor' object has no attribute '_label_list'
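The traceback shows the signature get_threshold_max(self, threshold), so calling the method on the class itself binds the first argument (here, the logits tensor) to self. A minimal reproduction of that mechanism, using a hypothetical stand-in class rather than the library's PostProcessPicker:

```python
class Picker:
    """Hypothetical stand-in that mimics the method signature in the traceback."""

    def __init__(self, label_list):
        self._label_list = label_list

    def get_threshold_max(self, threshold):
        # Needs an instance, because it reads self._label_list
        return (self._label_list, threshold)

# Called on the class: the first positional argument becomes `self`,
# so the attribute lookup happens on the list, not on a Picker.
try:
    Picker.get_threshold_max([0.1, 0.9], 0.5)
    error = None
except AttributeError as exc:
    error = str(exc)

# Called on an instance: works as intended.
picker = Picker(label_list=["O", "B-DATE"])
result = picker.get_threshold_max(threshold=0.5)
print(error)
print(result)
```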
Hi,
Could you add the following lines of code:
# Import the respective classes from the respective locations
# Initialize labels
ner_labels = NERLabels(notation='BIO', ner_types=["PATIENT", "STAFF", "AGE", "DATE", "PHONE", "ID", "EMAIL", "PATORG", "LOC", "HOSP", "OTHERPHI"])
label_list = ner_labels.get_label_list()
# Get the post processing object
picker = PostProcessPicker(label_list=label_list)
# This creates an object of the threshold max class which you can use to process the predictions with the threshold
threshold_max = picker.get_threshold_max(threshold=1.8982457699258832e-06)
# Get the model predictions
outputs = model(**inputs)
predictions = outputs.logits
# There are two ways to process the predictions -
# Case 1: Get the predictions - no additional filtering
# The label list converts ids of labels back to string form
final_preds = [[label_list[threshold_max.process_prediction(p)] for p in prediction] for prediction in predictions]
# Case 2: Get the predictions - where we also pass a labels list (that can be used to ignore predictions at certain positions etc.)
# Use the pre-defined function
final_preds, final_labels = threshold_max.decode(predictions, labels)
Let us know if this piece of code did not work!
I'm not understanding this piece of code:
# Case 2: Get the predictions - where we also pass a labels list(that can be used to ignore predictions at certain positions etc.)
# Use the pre-defined function
final_preds, final_labels = threshold_max.decode(predictions, labels)
What is the "labels" list supposed to be?
We have an option to ignore the predictions for certain tokens, which can be specified via the labels argument. If we pass [O, NA, O, O, NA] as labels (assuming 5 input tokens), the function ignores the predictions at positions 1 and 4 (zero-indexed) and returns [P0, P2, P3], the predictions at the remaining three positions.
You don't need to use it; it's optional.
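The filtering described above can be sketched in a few lines. This assumes "NA" is the marker for positions to skip, as in the example; filter_ignored is a hypothetical helper written for illustration and is not the library's decode method.

```python
def filter_ignored(predictions, labels, ignore_label="NA"):
    """Keep only the positions whose label is not the ignore marker."""
    kept_preds, kept_labels = [], []
    for pred, label in zip(predictions, labels):
        if label != ignore_label:
            kept_preds.append(pred)
            kept_labels.append(label)
    return kept_preds, kept_labels

# One sequence of 5 tokens, matching the example in the reply
predictions = ["P0", "P1", "P2", "P3", "P4"]
labels = ["O", "NA", "O", "O", "NA"]

kept_preds, kept_labels = filter_ignored(predictions, labels)
print(kept_preds)
```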
The decode method seems to require the labels list. I've tried to create a labels list with the same shape as the predictions tensor, but I'm getting a different error.
Code:
tensor_shape = torch.Size([1, 105, 45])
labels = [["O"] * tensor_shape[2] for _ in range(tensor_shape[1])]
final_preds, final_labels = threshold_max.decode(predictions, labels)
Error message:
/usr/local/lib/python3.10/dist-packages/numpy/ma/core.py in __new__(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
   2904                     msg = "Mask and data not compatible: data size is %i, " +
   2905                           "mask size is %i."
-> 2906                     raise MaskError(msg % (nd, nm))
   2907                 copy = True
   2908             # Set the mask to the new value
MaskError: Mask and data not compatible: data size is 45, mask size is 23.
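One possible reading of the sizes in this error, offered as an assumption rather than a confirmed diagnosis: with the 11 entity types listed earlier, a BIO label list has 2 × 11 + 1 = 23 entries, while a BILOU label list has 4 × 11 + 1 = 45, which matches the last dimension of the predictions tensor. If the mask is derived from the label list, the notation passed to NERLabels may not match the notation the model was trained with. The helper below is hypothetical and only reproduces that arithmetic; it is not the library's NERLabels class.

```python
def build_label_list(ner_types, notation):
    """Hypothetical helper: one outside label plus one label per prefix and type."""
    prefixes = {"BIO": ["B", "I"], "BILOU": ["B", "I", "L", "U"]}[notation]
    return ["O"] + [f"{p}-{t}" for t in ner_types for p in prefixes]

ner_types = ["PATIENT", "STAFF", "AGE", "DATE", "PHONE", "ID",
             "EMAIL", "PATORG", "LOC", "HOSP", "OTHERPHI"]

bio = build_label_list(ner_types, "BIO")
bilou = build_label_list(ner_types, "BILOU")
print(len(bio), len(bilou))
```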