Refa: Optimize pptx shape extraction to reduce content loss
What problem does this PR solve?
When parsing pptx files, some shapes have no usable shape_type attribute: accessing it raises an exception, so the original code fails during extraction and the content of those shapes is lost. This change adds handling for such anomalous shapes, making extraction safer and more robust.
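The handling described above could look roughly like the sketch below. It is illustrative only and not the code in this PR: `safe_shape_type`, `extract_text`, and the `UnrecognizedShape` stand-in are hypothetical names, and the stand-in only mimics a shape whose `shape_type` property raises `NotImplementedError`, as in the reported traceback.

```python
class UnrecognizedShape:
    """Hypothetical stand-in for a shape whose shape_type property raises."""

    text = "Quarterly results"

    @property
    def shape_type(self):
        raise NotImplementedError("Shape instance of unrecognized shape type")


def safe_shape_type(shape):
    """Return shape.shape_type, or None when the shape cannot report one."""
    try:
        return shape.shape_type
    except (NotImplementedError, AttributeError):
        return None


def extract_text(shape):
    """Return (shape_type, text) without letting an unknown type abort extraction."""
    shape_type = safe_shape_type(shape)  # None for unrecognized shapes
    text = getattr(shape, "text", "") or ""  # assumption: text lives in .text
    return shape_type, text.strip()


if __name__ == "__main__":
    print(extract_text(UnrecognizedShape()))
```

With a guard like this, callers can branch on a `None` type and still keep whatever text the shape carries, instead of dropping the whole slide on the first unrecognized shape.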
Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
@zhudongwork Thanks for raising this issue. Could you please upload a pptx file that reproduces the problem, so we can quickly locate the bug?
Thank you for submitting the code. It resolves the error we had before, but the text parsing results of your version are not as good as those of our original code. We hope you can revise it further. The logged error has changed from
`Traceback (most recent call last):
File "/ragflow/deepdoc/parser/ppt_parser.py", line 73, in __call__
txt = self.__extract(shape)
File "/ragflow/deepdoc/parser/ppt_parser.py", line 34, in __extract
if shape.shape_type == 19:
File "/ragflow/.venv/lib/python3.10/site-packages/pptx/shapes/autoshape.py", line 325, in shape_type
raise NotImplementedError("Shape instance of unrecognized shape type")
NotImplementedError: Shape instance of unrecognized shape type`
to
`2025-04-01 17:58:59,015 INFO 27 HTTP Request: POST https://open.bigmodel.cn/api/paas/v4/embeddings "HTTP/1.1 200 OK"
2025-04-01 17:58:59,076 INFO 27 HEAD http://es01:9200/ragflow_748ba2da0edc11f0b42b726aca92dd24 [status:200 duration:0.006s]
2025-04-01 17:58:59,341 INFO 27 From minio(0.26492000406142324) demo.pptx/demo.pptx
2025-04-01 17:58:59,652 ERROR 27 Error processing shape: 'Ppt' object has no attribute 'get_bulleted_text'
2025-04-01 17:58:59,653 ERROR 27 Error processing shape: 'Ppt' object has no attribute 'get_bulleted_text'
2025-04-01 17:58:59,653 ERROR 27 Error processing shape: 'Ppt' object has no attribute 'get_bulleted_text'
2025-04-01 17:58:59,665 INFO 27 set_progress(ee4414a60edf11f096d5726aca92dd24), progress: 0.5, progress_msg: 17:58:59 Page(1~100000001): Text extraction finished.
2025-04-01 17:58:59,784 INFO 27 set_progress(ee4414a60edf11f096d5726aca92dd24), progress: 0.9, progress_msg: 17:58:59 Page(1~100000001): Image extraction finished`
The function naming has been corrected (the underscore was missing), and it can now run properly.
@zhudongwork @KevinHuSh Successfully tested the latest version with the following results:
- File chunking functionality working as expected
- No errors detected on backend services
