Clean up table transformer warning statements

Open magallardo opened this issue 1 year ago • 8 comments

Using the table transformer in unstructured produces the following warning statement. The goal of this issue is to clean up the warnings. The original issue description is posted below.

```
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
```

Original issue

**Describe the bug** I am getting the following when calling `partition_pdf`:

```python
from unstructured.partition.pdf import partition_pdf

path = "/app/example-docs/"
fname = "list-item-example.pdf"
raw_pdf_documents = partition_pdf(
    filename=path + fname,
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
```

```
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
```

  • This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

**To Reproduce** Just run the code above.

**Expected behavior** No errors or warnings when running `partition_pdf`.


**Environment Info** Docker image running on a MacPro with an Intel Core i9.

The Docker container is started as follows:

```shell
# create the container
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest

# this will drop you into a bash shell where the Docker image is running
docker exec -it unstructured bash
```


magallardo avatar Jun 25 '24 00:06 magallardo

Hi @magallardo - are you using the arm64 image? And are you getting an error or just warnings?

MthwRobinson avatar Jun 25 '24 11:06 MthwRobinson

@MthwRobinson I am running the container on a MacPro with an AMD64 chip (not arm). I pulled the Docker image with the following command, and when I list images on my machine it shows up as follows:

```shell
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
```

```
downloads.unstructured.io/unstructured-io/unstructured   latest   24326ebafc76   10 hours ago   11.6GB
```

I am getting the reported message every time I make a call to the partition_pdf function. I am not sure if it is expected or unexpected.

Thanks, Marcelo

magallardo avatar Jun 25 '24 12:06 magallardo

@magallardo - Were you able to get output from partition_pdf? raw_pdf_documents should be a list of Element objects

MthwRobinson avatar Jun 25 '24 13:06 MthwRobinson

@MthwRobinson Thanks for the update.
As for your question, the operation is producing some output. I just wanted to make sure that it is actually returning valid responses and that the operation completed, as the warning message is not very clear.

Thanks Marcelo

magallardo avatar Jun 25 '24 13:06 magallardo

Got it, thanks for clarifying @magallardo . Yeah your output is valid. We run the unit tests from inside the docker container during our build process and the outputs get checked then.

That said, we should suppress those warning statements so people don't worry. I'll update the scope of this issue to reflect that.
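
One way to do that is to raise the threshold of the library's logger, assuming the message is emitted through the `transformers` logger at WARNING level (transformers also exposes `transformers.logging.set_verbosity_error()` for the same effect). A minimal stdlib-only sketch of the mechanism, with a capturing handler standing in for console output:

```python
import logging

# The "Some weights of the model checkpoint ... were not used" message is
# emitted through the "transformers" logger at WARNING level, so raising
# that logger's threshold to ERROR hides it without touching other logs.
logging.getLogger("transformers").setLevel(logging.ERROR)

# Demonstration with a capturing handler (stand-in for console output):
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

log = logging.getLogger("transformers")
log.addHandler(ListHandler())

log.warning("Some weights of the model checkpoint were not used ...")  # suppressed
log.error("a genuine failure")                                         # still shown

print(captured)  # ['a genuine failure']
```

The upside of this approach over a blanket filter is that genuine errors from the same logger still get through.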

MthwRobinson avatar Jun 25 '24 13:06 MthwRobinson

@MthwRobinson this is giving me different output from what I had before, when it did not produce that warning, and the quality of the output is slightly worse. Can you please look into this further, as it might be a bigger issue than initially thought? This issue only seems to have come up recently, so perhaps some of the underlying packages you are using have changed?

atangsyd avatar Jun 28 '24 05:06 atangsyd

Thanks @atangsyd we'll take a look. FYI @leah1985

MthwRobinson avatar Jun 28 '24 12:06 MthwRobinson

Let me go over this issue and my thinking.

The TableTransformer model is implemented inside the unstructured-inference library, so I reproduced the bug on unstructured-inference 0.7.36 (the most recent version) by simply running:

```python
from unstructured_inference.models.tables import load_agent
load_agent()
```

And indeed I got the same warning. I then checked whether the warning occurs on older versions, and it does: both on 0.7.30 (1 May) and on 0.7.23 (18 January).

So in general there should be no issue here. Changes in the partitioning outputs are expected, as many things are evolving; moreover, table output is also affected by other modules, such as OCR and table detection by the object-detection model.

As a final step, I verified what 'num_batches_tracked' is and whether it matters. To the best of my knowledge, if you check BatchNorm2d (https://pytorch.org/docs/stable/_modules/torch/nn/modules/batchnorm.html#BatchNorm2d), this is the variable counting how many batches have passed through the layer, and that statistic feeds the calculation of running means etc. So the variable is used only during training, not during inference. Moreover, if we check the implementation in Hugging Face (https://github.com/huggingface/transformers/blob/main/src/transformers/models/table_transformer/modeling_table_transformer.py#L218), the layer is a FrozenBatchNorm, which doesn't use that parameter at all.
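
To illustrate the mechanism with a hypothetical, torch-free sketch (not the actual Hugging Face code): a frozen batch-norm layer only holds the weight, bias and running statistics, so a checkpoint entry for `num_batches_tracked` has no matching attribute to load into and gets reported as unused:

```python
# Hypothetical sketch: a frozen batch-norm layer keeps only weight, bias
# and the running statistics, so the "num_batches_tracked" checkpoint key
# has nowhere to land and is reported as "not used when initializing".
class FrozenBatchNorm:
    PARAMS = ("weight", "bias", "running_mean", "running_var")

    def load_state_dict(self, state_dict):
        unused = []
        for key, value in state_dict.items():
            if key in self.PARAMS:
                setattr(self, key, value)
            else:
                unused.append(key)  # surfaced in the warning message
        return unused

checkpoint = {
    "weight": 1.0,
    "bias": 0.0,
    "running_mean": 0.0,
    "running_var": 1.0,
    "num_batches_tracked": 1042,  # only ever updated during training
}

layer = FrozenBatchNorm()
unused = layer.load_state_dict(checkpoint)
print(unused)  # ['num_batches_tracked']
```

Since inference only needs the frozen running statistics, dropping this key cannot change the model's outputs.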

To conclude, I think everything is okay; I will just hide the warning ;)

plutasnyy avatar Jul 08 '24 14:07 plutasnyy

I also get the warning and receive relevant outputs. However, this warning message from Hugging Face seems like an important one in my opinion. Although the model outputs relevant elements, I believe it is not performing at its full potential if not all parameters are loaded correctly.

hasansalimkanmaz avatar Oct 08 '24 11:10 hasansalimkanmaz

Closing, assumed resolved.

scanny avatar Dec 16 '24 19:12 scanny