drunkpig
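@shibainu-gbq Headings come in too many forms: paragraph spacing, font, color, weight, and background can all determine whether a line is a heading. It is hard to find a universal method.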
`--method ocr` uses paddle to extract text from the PDF, while `--method text` uses PyMuPDF. The difference lies in that the bounding boxes obtained by...
@freedom1993 We will record the behavior you reported as a bug and investigate the root cause.
@freedom1993 Can you send me this PDF?
It is here: https://github.com/opendatalab/magic-html
## make pdf index

A pdf index looks like this:

```json
{
    "track_id": "afeda417-5a33-4ec8-bd79-56222763f832",
    "path": "s3://mybook/pdf/book-name.pdf",
    "file_type": "pdf",
    "title": "My book Name"
}
```

## batch inference

```python
if __name__ ==...
```
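As a rough illustration of the index entry format shown under "make pdf index", the sketch below builds one record with the same four fields. The helper name `make_index_record` is hypothetical, and generating `track_id` with a random UUID is an assumption based only on the shape of the example value.

```python
import json
import uuid


def make_index_record(path, title):
    """Build one index entry with the four fields shown in the example.

    Assumption: track_id is a random UUID; the original comment does not
    say how it is generated.
    """
    return {
        "track_id": str(uuid.uuid4()),
        "path": path,
        "file_type": "pdf",
        "title": title,
    }


if __name__ == "__main__":
    record = make_index_record("s3://mybook/pdf/book-name.pdf", "My book Name")
    # json.dumps never emits a trailing comma, so the output is valid JSON.
    print(json.dumps(record, indent=4))
```

Serializing through `json.dumps` rather than writing the text by hand avoids small syntax slips such as a trailing comma after the last field.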
@Alan-zhong Use the LibreOffice command line to convert the Office file to PDF, then process the PDF:

```shell
soffice --headless --convert-to pdf path/to/your/file.docx
```
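If the conversion has to run from a script, the same command can be wrapped in Python. This is only a sketch: the function names are hypothetical, `--outdir` is LibreOffice's option for choosing the output directory, and `soffice` is assumed to be on `PATH`.

```python
import subprocess
from pathlib import Path


def build_soffice_command(src, outdir):
    """Return the argument list for a headless LibreOffice PDF conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(src),
    ]


def convert_to_pdf(src, outdir="."):
    """Run the conversion and return the path where the PDF is expected."""
    # check=True raises CalledProcessError if soffice exits with an error.
    subprocess.run(build_soffice_command(src, outdir), check=True)
    # LibreOffice names the output after the input file, with a .pdf suffix.
    return Path(outdir) / (Path(src).stem + ".pdf")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.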
```"models-dir":"~/tools/PDF-Extract-Kit/models/"``` ==> ```"models-dir":"/abs/path/to/tools/PDF-Extract-Kit/models/",```
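The change above replaces `~` with an absolute path, presumably because the value is used as-is and the tilde is not expanded when the config is read (an assumption; the comment does not say why). A minimal sketch of normalizing such a value before writing it to the config, using a hypothetical helper `resolve_models_dir`:

```python
import os


def resolve_models_dir(config):
    """Return a copy of config with "models-dir" as an absolute path.

    Hypothetical helper, not part of the tool itself.
    """
    resolved = dict(config)
    # expanduser turns a leading "~" into the home directory;
    # abspath then normalizes the result (and drops a trailing slash).
    resolved["models-dir"] = os.path.abspath(
        os.path.expanduser(config["models-dir"])
    )
    return resolved


if __name__ == "__main__":
    print(resolve_models_dir({"models-dir": "~/tools/PDF-Extract-Kit/models/"}))
```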
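@strongerfly This produced quite a few conflicts. Please pull the latest code from the dev branch, make your changes there, and submit the PR against the dev branch. Thanks.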
@ProseGuys Please commit your code to the dev branch.