unstructured
feat/Use local model for hi_res partition
Hello,
Maybe this feature already exists, but I didn't manage to find it. I work on a network that blocks huggingface and I would like to run:
elements = partition_pdf(filename=PDF_PATH, strategy='hi_res', infer_table_structure=True)
But the function cannot run because it's trying to access the yolox model on the hub:
SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))"), '(Request ID: 757ef56e-88d9-4a7a-88ef-ff3fade2139c)')
My question is: if I manage to download the model onto my machine somehow, how can I use it with the unstructured library without it making the HTTPS request?
I hope my explanation makes sense.
Thanks in advance.
Sorry, I don't have an answer, but I'd like to ask: is it trying to download the model files, or to run inference elsewhere? I also saw a page from the unstructured docs that might help: https://unstructured-io.github.io/unstructured/installation/full_installation.html#setting-up-unstructured-for-local-inference
Yes, it's trying to download the model from HF; that's the default behaviour of the unstructured library.
I have encountered the same issue.
I explored the source code a little and found these lines in site-packages\unstructured_inference\models\yolox.py (lines 34-59):
```python
MODEL_TYPES = {
    "yolox": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_tiny": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_tiny.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_quantized": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05_quantized.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
}
```
So I guess the models are hard-coded to download from the Hugging Face URL at the moment. Could we get a way to configure the models to load from a pre-downloaded path in a later version? Thanks!
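For context on why replacing `model_path` with a plain string avoids the download: in the snippet above, `model_path` is a `LazyEvaluateInfo`, so `hf_hub_download` only runs when the value is first read. Below is a minimal, simplified re-implementation of that pattern for illustration only (the real classes live in `unstructured_inference.utils` and differ in detail):

```python
# Simplified sketch of the lazy-evaluation pattern used by MODEL_TYPES
# (hypothetical re-implementation, not the library's actual code).
class LazyEvaluateInfo:
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

class LazyDict(dict):
    def __getitem__(self, key):
        value = super().__getitem__(key)
        if isinstance(value, LazyEvaluateInfo):
            value = value.fn(*value.args)    # e.g. hf_hub_download(repo, filename)
            super().__setitem__(key, value)  # cache the resolved value
        return value

def fake_hub_download(repo, filename):
    # Stand-in for hf_hub_download: this is where the network call would happen.
    return f"/fake/cache/{repo}/{filename}"

entry = LazyDict(
    model_path=LazyEvaluateInfo(
        fake_hub_download, "unstructuredio/yolo_x_layout", "yolox_l0.05.onnx"
    ),
)
print(entry["model_path"])  # the "download" is triggered only on this first access

# A plain string is returned untouched by __getitem__, so no download is attempted:
entry["model_path"] = "/local/models/yolox_l0.05.onnx"
print(entry["model_path"])
```

This is why the site-packages edits people describe later in the thread work: swapping the `LazyEvaluateInfo` for a literal path string short-circuits the lazy download entirely.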
Met the same problem; I'd also like local model support.
Not sure if it will help, but have you tried specifying the custom model (undocumented): https://github.com/Unstructured-IO/unstructured/pull/2462
I haven't dug into all the details of HuggingFace caching, but this page from their website seems like an excellent resource: https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache
I expect some sort of "download-separately-and-manually-install-into-cache" solution is possible. Probably not something we'll get to in the short-term, but if someone is willing to work out how to do that we'd be very interested in it. At least to have a working solution here that folks can get to on search, or possibly add to the documentation.
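One hedged sketch of the download-separately route, relying on two standard pieces of huggingface_hub behaviour: the default cache under `~/.cache/huggingface/hub`, and the `HF_HUB_OFFLINE` switch that forces resolution from the local cache. Exact cache layout can vary by huggingface_hub version, so verify the printed path on your setup:

```shell
# One-time, on a machine WITH internet access (requires huggingface_hub):
#   python -c "from huggingface_hub import hf_hub_download; \
#              print(hf_hub_download('unstructuredio/yolo_x_layout', 'yolox_l0.05.onnx'))"
# Copy the printed file together with its surrounding ~/.cache/huggingface/hub tree
# to the same location on the offline machine, then block network lookups so
# huggingface_hub resolves from the local cache only:
export HF_HUB_OFFLINE=1
```

With the cache seeded and `HF_HUB_OFFLINE=1` set, `hf_hub_download` should return the cached file path without attempting the HTTPS request that fails behind the proxy.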
Hey, I haven't made any progress on this, unfortunately; I'm lost in the module code... There's probably a modification to make in the get_model() function in the unstructured-inference dependency.
The best implementation to me would be to simply pass a new argument, "hi_res_model_path", to the partition_pdf function:
```python
elements = partition_pdf(filename=PDF_PATH, strategy='hi_res', hi_res_model_path='/path/to/model', infer_table_structure=True)
```
Anyone able to evaluate the amount of work needed to develop this?
I encountered the same problem too
I finally made it work locally with these changes. Depending on the settings you use, you need to give a local path as one of the inputs. You can download the models once and save them, or download them directly and point to their path.
1) env/lib/python3.9/site-packages/unstructured_inference/models/tables.py
```python
logger.info("Loading the table structure model ...")
model_path = 'path to table_transformer_recognition'
self.model = TableTransformerForObjectDetection.from_pretrained(model_path, use_pretrained_backbone=False)
self.model.eval()
# self.model.save_pretrained("path to save model")  # saves the model the first time,
# while connected to the hub; run it once to download the model, then comment it out.
```
In case you are using yolox in your settings, you also need to change 2) env/lib/python3.9/site-packages/unstructured_inference/models/yolox.py:
```python
MODEL_TYPES = {
    "yolox": LazyDict(
        model_path='path to yolox_l0.05.onnx downloaded from Hugging Face',
        label_map=YOLOX_LABEL_MAP,
    ),
```
Hope it solves your issue.
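If you would rather not edit files inside site-packages, the same effect can possibly be achieved by patching the entry at runtime before calling partition_pdf. This is only a sketch, under the assumption that MODEL_TYPES accepts plain item assignment (it is a LazyDict, so this may not hold on every version of unstructured-inference; verify on yours):

```python
# Hypothetical runtime patch: point the hard-coded yolox entry at a local .onnx
# file before partitioning, instead of editing site-packages. Guarded so the
# sketch also runs in environments where unstructured_inference is not installed.
LOCAL_YOLOX = "/local/models/yolox_l0.05.onnx"  # assumed pre-downloaded file

try:
    from unstructured_inference.models import yolox
except ImportError:
    yolox = None  # unstructured_inference not available here

if yolox is not None:
    # Assumes LazyDict supports dict-style assignment; check your installed version.
    yolox.MODEL_TYPES["yolox"]["model_path"] = LOCAL_YOLOX
```

Run this patch in your own script before the first call to partition_pdf, so the lazy hf_hub_download path is never evaluated.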
Thanks all, we'll plan to add better support for this.
Same problem. Any progress there?
Thank you for posting, this worked for me as well!
Hi all! I am facing the same error due to a corporate proxy. I need to extract tables and images from PDF files, but since unstructured tries to call Hugging Face, it fails to do so. I downloaded the yolo_x_layout/yolox_l0.05.onnx model from Hugging Face and tried to change the following in C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\unstructured_inference\models\yolox.py:
```python
"yolox": LazyDict(
    model_path=r"C:\Users\Username\Desktop\Models\yolo_x_layout\yolox_l0.05.onnx",
    label_map=YOLOX_LABEL_MAP,
)
```
I am now facing the following error:
```
Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from C:\Users\k.divyanshu\Desktop\Assignment_3\Models\yolo_x_layout failed: system error number 13
```
Can someone let me know how to pass the downloaded model correctly so I can extract the tables and images from a given PDF file?
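For what it's worth, system error number 13 is EACCES ("permission denied"), and the path in the error message ends at the yolo_x_layout folder rather than at the .onnx file, so onnxruntime may be receiving a directory instead of the model file. A small hypothetical pre-flight check (not part of unstructured) you can run before wiring a local path into yolox.py:

```python
import os

def check_onnx_path(path: str) -> str:
    """Validate a local model path before handing it to onnxruntime."""
    if os.path.isdir(path):
        raise ValueError(f"{path} is a directory; pass the .onnx file itself")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    if not os.access(path, os.R_OK):
        raise PermissionError(f"cannot read {path} (system error 13 / EACCES)")
    return path
```

If this check passes on the full `...\yolox_l0.05.onnx` file path but the error persists, the remaining suspects are file permissions on the copied model or an incomplete download.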
@kd10041 If you encounter the same issue, you can use the following workaround. Execute these steps in the directory where you run the script:
```shell
mkdir -p unstructuredio/yolo_x_layout/
wget --no-check-certificate https://huggingface.co/unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx -O unstructuredio/yolo_x_layout/yolox_l0.05.onnx
```
Alternatively, you can download the file manually from the following link:
After downloading, copy the yolox_l0.05.onnx file to the unstructuredio/yolo_x_layout/ directory:
```shell
cp path/to/downloaded/yolox_l0.05.onnx unstructuredio/yolo_x_layout/
```