
Adding threshold to Transformers pipeline

Open joshpopelka20 opened this issue 2 years ago • 6 comments

I'm using this code to run inference:


# Use a pipeline as a high-level helper
from transformers import pipeline

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification


tokenizer = AutoTokenizer.from_pretrained("obi/deid_bert_i2b2")

pipe = pipeline("token-classification", tokenizer=tokenizer, model="obi/deid_bert_i2b2",
                aggregation_strategy="first")

I'm trying to increase the threshold, but can't find a config for it. Is it possible with my setup?

joshpopelka20 avatar Nov 21 '23 22:11 joshpopelka20

Hi, sorry for the late response.

Unfortunately, the threshold can't be added via the HuggingFace pipelines. What you could do is see if you can get the raw logits from the pipeline - if you can, then you can process the raw logit values using the code given here: Threshold max or Threshold sum
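
To illustrate the idea of thresholding raw logits, here is a toy pure-Python sketch (my own illustration, not the repo's `ThresholdProcessMax`): take the argmax label per token, but fall back to `"O"` whenever the winning softmax probability is below the threshold. The label list and logits here are made up for the example.

```python
import math

def softmax(row):
    # Numerically stable softmax over one token's logits
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_max(logits, label_list, threshold, fallback="O"):
    """Argmax decoding per token, falling back to `fallback`
    when the winning class probability is below `threshold`."""
    out = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        out.append(label_list[best] if probs[best] >= threshold else fallback)
    return out

# Toy example: 3 tokens, 3 labels
label_list = ["O", "B-PATIENT", "B-DATE"]
logits = [[5.0, 0.1, 0.1],    # confident "O"
          [0.1, 4.0, 0.1],    # confident "B-PATIENT"
          [0.4, 0.5, 0.45]]   # low confidence -> falls back to "O"
print(threshold_max(logits, label_list, threshold=0.9))
# -> ['O', 'B-PATIENT', 'O']
```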

Let us know if you have any other questions!

prajwal967 avatar Dec 20 '23 09:12 prajwal967

Not sure if I'm doing this right, but this is the code I have so far:

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits
print(PostProcessPicker.get_threshold_max(predictions, 1.8982457699258832e-06))

I'm getting this error message when I run the code:

/usr/local/lib/python3.10/dist-packages/robust_deid/sequence_tagging/post_process/model_outputs/post_process_picker.py in get_threshold_max(self, threshold)
     56             (ThresholdProcessMax): Return Threshold Max post processor
     57         """
---> 58         return ThresholdProcessMax(self._label_list, threshold=threshold)
     59
     60     def get_threshold_sum(self, threshold) -> ThresholdProcessSum:

AttributeError: 'Tensor' object has no attribute '_label_list'

joshpopelka20 avatar Dec 20 '23 16:12 joshpopelka20

Hi,

Could you add the following lines of code:

# Import the respective classes from the respective locations

# Initialize labels
ner_labels = NERLabels(notation='BIO', ner_types=["PATIENT", "STAFF", "AGE", "DATE", "PHONE", "ID", "EMAIL", "PATORG", "LOC", "HOSP", "OTHERPHI"])
label_list = ner_labels.get_label_list()

# Get the post processing object
picker = PostProcessPicker(label_list=label_list)
# This creates an object of the threshold max class which you can use to process the predictions with the threshold
threshold_max = picker.get_threshold_max(threshold=1.8982457699258832e-06)

# Get the model predictions
outputs = model(**inputs)
predictions = outputs.logits

# There are two ways to process the predictions -
# Case 1: Get the predictions - no additional filtering
# The label list converts ids of labels back to string form
final_preds = [[label_list[threshold_max.process_prediction(p)] for p in prediction] for prediction in predictions]

# Case 2: Get the predictions - where we also pass a labels list (that can be used to ignore predictions at certain positions, etc.)
# Use the pre-defined function
final_preds, final_labels = threshold_max.decode(predictions, labels)

Let us know if this piece of code did not work!

prajwal967 avatar Dec 31 '23 11:12 prajwal967

I'm not understanding this piece of code:

   # Case 2: Get the predictions - where we also pass a labels list(that can be used to ignore predictions at certain positions etc.)
   # Use the pre-defined function
   final_preds, final_labels = threshold_max.decode(predictions, labels)

what is the "labels" list supposed to be?

joshpopelka20 avatar Jan 02 '24 20:01 joshpopelka20

We have an option to ignore the predictions for certain tokens, which can be specified via the labels argument. If we pass [O, NA, O, O, NA] as labels (assuming we have 5 tokens as input), the function ignores the predictions at positions 1 and 4 (0-indexed) and returns [P0, P2, P3] (the predictions at the remaining three positions).

You don't need to use it; this is optional.
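
The filtering described above can be sketched in a few lines of plain Python (a toy illustration of the idea, not the repo's actual decode implementation):

```python
# Toy sketch: drop the prediction wherever the labels list says "NA"
predictions = ["P0", "P1", "P2", "P3", "P4"]
labels = ["O", "NA", "O", "O", "NA"]

kept = [p for p, l in zip(predictions, labels) if l != "NA"]
print(kept)  # -> ['P0', 'P2', 'P3']
```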

prajwal967 avatar Jan 04 '24 08:01 prajwal967

The decode method seems to require the labels list. I've tried to create a labels list with the same shape as the predictions tensor, but I'm getting a different error.

Code:

import torch

tensor_shape = torch.Size([1, 105, 45])
labels = [["O"] * tensor_shape[2] for _ in range(tensor_shape[1])]
final_preds, final_labels = threshold_max.decode(predictions, labels)

Error message:

/usr/local/lib/python3.10/dist-packages/numpy/ma/core.py in __new__(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
   2904                 msg = "Mask and data not compatible: data size is %i, " + \
   2905                       "mask size is %i."
-> 2906                 raise MaskError(msg % (nd, nm))
   2907             copy = True
   2908         # Set the mask to the new value

MaskError: Mask and data not compatible: data size is 45, mask size is 23.
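
For what it's worth, laying out the shapes suggests a possible mismatch (my guess, not confirmed against the repo): the logits have shape (batch=1, seq_len=105, num_labels=45), so a labels list with one entry per token would have shape (1, 105), rather than the (105, 45) list built above.

```python
# Hypothetical sketch (an assumption, not verified against decode):
# one label per token, matching the first two dimensions of the logits
batch, seq_len, num_labels = 1, 105, 45
labels = [["O"] * seq_len for _ in range(batch)]
print(len(labels), len(labels[0]))  # -> 1 105
```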

joshpopelka20 avatar Jan 04 '24 21:01 joshpopelka20