
feat/Use local model for hi_res partition

Open AntoninLeroy opened this issue 11 months ago • 15 comments

Hello,

Maybe this feature already exists, but I didn't manage to implement it. I work on a network that blocks Hugging Face, and I would like to run:

elements = partition_pdf(filename=PDF_PATH, strategy='hi_res', infer_table_structure=True)

But the function cannot run because it's trying to access the yolox model on the hub:

SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))"), '(Request ID: 757ef56e-88d9-4a7a-88ef-ff3fade2139c)')

My question is: if I manage to download the model to my machine somehow, how can I use it with the unstructured library without it making the HTTPS request?

I hope my explanation makes sense.

Thanks in advance.

AntoninLeroy avatar Mar 11 '24 13:03 AntoninLeroy

Sorry, I don't have an answer, but I'd like to ask: is it trying to download the model files, or does it do inference elsewhere? I also saw a page in the Unstructured docs that might help: https://unstructured-io.github.io/unstructured/installation/full_installation.html#setting-up-unstructured-for-local-inference

sanjaydokula avatar Mar 14 '24 11:03 sanjaydokula

It's trying to download the model from Hugging Face, yes; that's the default behaviour of the unstructured library.

AntoninLeroy avatar Mar 21 '24 15:03 AntoninLeroy

I have encountered the same issue. I explored the source code a little bit and found these lines in site-packages\unstructured_inference\models\yolox.py (lines 34-59):

MODEL_TYPES = {
    "yolox": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_tiny": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_tiny.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_quantized": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05_quantized.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
}

So I guess the models are hard-coded to download from Hugging Face at the moment. Could we get a feature to configure the models to point at a pre-downloaded path in a later version? Thanks!

peixin-lin avatar Apr 01 '24 08:04 peixin-lin
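Until something like that exists, one possible workaround is to override the hard-coded entry from your own code before partitioning, instead of editing the installed package. This is only a sketch: the path is a placeholder, and it assumes LazyDict accepts a plain string for model_path (the same assumption made by the manual site-packages edits shown later in this thread).

from unstructured_inference.models import yolox

# Point the "yolox" entry at a locally downloaded copy of the .onnx file.
# Do this before the first call that loads the model.
yolox.MODEL_TYPES["yolox"] = yolox.LazyDict(
    model_path="/path/to/local/yolox_l0.05.onnx",
    label_map=yolox.YOLOX_LABEL_MAP,
)

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

Note this only covers the yolox detection model; with infer_table_structure=True the table-transformer weights are still fetched from the Hub unless they are available locally too (see the cache-based approach further down).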

Met the same problem; I'd also like to see local model support added.

kexuedaishu avatar Apr 12 '24 08:04 kexuedaishu

Not sure if it will help, but have you tried specifying a custom model (undocumented)? https://github.com/Unstructured-IO/unstructured/pull/2462

ericfeunekes avatar May 04 '24 20:05 ericfeunekes

I haven't dug into all the details of HuggingFace caching, but this page from their website seems like an excellent resource: https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache

I expect some sort of "download-separately-and-manually-install-into-cache" solution is possible. It's probably not something we'll get to in the short term, but if someone is willing to work out how to do that, we'd be very interested in it, at least to have a working solution here that folks can find via search, or possibly add to the documentation.

scanny avatar May 09 '24 19:05 scanny
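For what it's worth, here is a rough sketch of that "download separately and install into the cache" idea, based only on standard huggingface_hub behaviour (nothing unstructured-specific, so treat it as a starting point rather than a verified recipe):

# Step 1 (on a machine WITH internet access): warm the standard HF cache.
from huggingface_hub import hf_hub_download

hf_hub_download("unstructuredio/yolo_x_layout", "yolox_l0.05.onnx")

# Step 2: copy ~/.cache/huggingface/hub (or wherever HF_HOME points) to the
# same location on the offline machine.

# Step 3 (on the offline machine): switch the hub to offline mode BEFORE
# anything imports huggingface_hub/transformers, so cached files are served
# without any network call:
#   export HF_HUB_OFFLINE=1
#   export TRANSFORMERS_OFFLINE=1

With infer_table_structure=True, the table-transformer checkpoint would need to be in the cache as well; exactly which repo that is depends on your unstructured_inference version.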

Hey, I haven't made any progress on this unfortunately; I'm lost in the module code... There's probably a modification to make in the get_model() function of the unstructured-inference dependency.

The best implementation, to me, would be to simply pass a new argument, "hi_res_model_path", to the partition_pdf function: elements = partition_pdf(filename=PDF_PATH, strategy='hi_res', hi_res_model_path='/path/to/model', infer_table_structure=True)

Is anyone able to evaluate the amount of work needed to develop this?

AntoninLeroy avatar May 14 '24 14:05 AntoninLeroy

I encountered the same problem too

Vincewz avatar May 14 '24 15:05 Vincewz

I finally made it work locally with these changes. Depending on the settings you are using, you need to give a local path as one of the inputs. You can download the models once and save them, or download them directly and point to their path.

1) env/lib/python3.9/site-packages/unstructured_inference/models/tables.py

logger.info("Loading the table structure model ...")
model_path = 'path to table_transformer_recognition'
self.model = TableTransformerForObjectDetection.from_pretrained(model_path, use_pretrained_backbone=False)
self.model.eval()
# self.model.save_pretrained("path to save model")  # uncomment this once, while connected to the Hub, to download and save the model locally; then comment it out again

2) In case you are using yolox in your settings, you also need to change env/lib/python3.9/site-packages/unstructured_inference/models/yolox.py:

MODEL_TYPES = {
    "yolox": LazyDict(
        model_path='path to the yolox_l0.05.onnx file downloaded from Hugging Face',  # point at the .onnx file itself, not its folder
        label_map=YOLOX_LABEL_MAP,
    ),
}

Hope it can solve your issue

mmaryam2020 avatar May 15 '24 15:05 mmaryam2020
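To spell out the "download once, then load from disk" flow that the tables.py snippet above relies on (a sketch only; the checkpoint id below is an assumption, so check tables.py for the exact default your version loads):

from transformers import TableTransformerForObjectDetection

# On a machine that can reach huggingface.co: fetch the checkpoint and save it.
model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)
model.save_pretrained("/models/table-transformer-structure-recognition")

# On the offline machine: from_pretrained also accepts a local directory,
# so pointing model_path at the saved folder avoids any network call.
model = TableTransformerForObjectDetection.from_pretrained(
    "/models/table-transformer-structure-recognition"
)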

Thanks all, we'll plan to add better support for this.

MthwRobinson avatar May 24 '24 14:05 MthwRobinson

Same problem. Any progress there?

zzw1123 avatar May 28 '24 07:05 zzw1123

> I finally made it work locally with these changes. […]

Thank you for posting, this worked for me as well!

FennFlyer avatar May 28 '24 17:05 FennFlyer

Hi all! I am facing the same error due to a corporate proxy. I need to extract tables and images from PDF files, but since unstructured is trying to call Hugging Face, it fails to do so. I have downloaded the yolo_x_layout/yolox_l0.05.onnx model from Hugging Face and tried to change the following in C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\unstructured_inference\models\yolox.py:

 "yolox": LazyDict(
        model_path=r"C:\Users\Username\Desktop\Models\yolo_x_layout\yolox_l0.05.onnx",
        label_map=YOLOX_LABEL_MAP,
    )

I am facing the following error:

Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from C:\Users\k.divyanshu\Desktop\Assignment_3\Models\yolo_x_layout 
failed:system error number 13

Can someone let me know how to pass the downloaded model correctly, so that I can extract the tables and images from a given PDF file?

kd10041 avatar Jul 04 '24 08:07 kd10041
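One way to narrow that down (a debugging sketch, not a known fix): check whether the downloaded .onnx file opens directly with ONNX Runtime. "System error number 13" is typically a permission/access problem, and the path in the error message ends at the yolo_x_layout folder rather than at the .onnx file, so it is worth verifying both the path and the file permissions outside of unstructured:

import onnxruntime as ort

# If this also fails, the problem is the file path or its permissions,
# not the unstructured code that wraps it.
session = ort.InferenceSession(
    r"C:\Users\Username\Desktop\Models\yolo_x_layout\yolox_l0.05.onnx"
)
print([inp.name for inp in session.get_inputs()])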

@kd10041 If you encounter the same issue, you can use the following workaround. Execute these steps in the directory where you run the script:

mkdir -p unstructuredio/yolo_x_layout/
wget --no-check-certificate https://huggingface.co/unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx -O unstructuredio/yolo_x_layout/yolox_l0.05.onnx

Alternatively, you can download the file manually from the following link:

Download yolox_l0.05.onnx

After downloading, copy the yolox_l0.05.onnx file to the unstructuredio/yolo_x_layout/ directory:

cp path/to/downloaded/yolox_l0.05.onnx unstructuredio/yolo_x_layout/

nullbyte91 avatar Aug 04 '24 03:08 nullbyte91