drunkpig

Results 91 comments of drunkpig

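@shibainu-gbq Headings come in too many forms: paragraph spacing, font, color, weight, and background can all determine whether a line is a heading. It is hard to find a universal method.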

`--method ocr` uses paddle to extract text from the PDF, while `--method text` uses PyMuPDF. The difference lies in that the bounding boxes obtained by...

It is here: https://github.com/opendatalab/magic-html

## make pdf index

A PDF index looks like this:

```json
{
  "track_id": "afeda417-5a33-4ec8-bd79-56222763f832",
  "path": "s3://mybook/pdf/book-name.pdf",
  "file_type": "pdf",
  "title": "My book Name"
}
```

## batch inference

```python
if __name__ ==...
```
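For the index format shown under "make pdf index", a minimal sketch of generating one record per PDF might look like the following. The helper names `make_index_entry` and `to_json_line` are hypothetical, and writing one JSON object per line is an assumption, not something the comment specifies; only the four field names come from the example record.

```python
import json
import uuid


def make_index_entry(path: str, title: str, file_type: str = "pdf") -> dict:
    """Build one index record with the four fields from the example above."""
    return {
        # Assumption: any unique string is acceptable as a track_id.
        "track_id": str(uuid.uuid4()),
        "path": path,
        "file_type": file_type,
        "title": title,
    }


def to_json_line(entry: dict) -> str:
    """Serialize a record as a single line of valid JSON (no trailing comma)."""
    # ensure_ascii=False keeps non-ASCII book titles readable.
    return json.dumps(entry, ensure_ascii=False)


entry = make_index_entry("s3://mybook/pdf/book-name.pdf", "My book Name")
line = to_json_line(entry)
print(line)
```

Going through `json.dumps` rather than hand-writing the records avoids small syntax slips, such as a trailing comma after the last field, that strict JSON parsers reject.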

@Alan-zhong Use the LibreOffice command line to convert the Office file to PDF, then process the PDF:

```shell
soffice --headless --convert-to pdf path/to/your/file.docx
```
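If the conversion needs to run from a script, the same `soffice` command can be wrapped roughly as below. This is a sketch under assumptions: `build_convert_command` and `convert_to_pdf` are hypothetical helper names, `soffice` is assumed to be on `PATH`, and the output file is assumed to be named after the input stem inside the `--outdir` directory.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(src: str, outdir: str) -> List[str]:
    """Return the soffice argument list for a headless conversion to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, src]


def convert_to_pdf(src: str, outdir: str) -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    # check=True raises CalledProcessError if soffice exits with an error.
    subprocess.run(build_convert_command(src, outdir), check=True)
    # Assumption: LibreOffice writes <input stem>.pdf into outdir.
    return Path(outdir) / (Path(src).stem + ".pdf")
```

Calling `convert_to_pdf("path/to/your/file.docx", "out")` would then be expected to produce `out/file.pdf`, which can be passed on to the PDF pipeline.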

```"models-dir":"~/tools/PDF-Extract-Kit/models/"``` ==> ```"models-dir":"/abs/path/to/tools/PDF-Extract-Kit/models/",```
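The change above replaces a `~` path with an absolute one. As a sketch, the absolute value can be computed instead of typed by hand; `absolutize_models_dir` is a hypothetical helper, and treating the configuration as a plain dictionary with a `models-dir` key is an assumption based on the snippet.

```python
import json
import os


def absolutize_models_dir(config: dict) -> dict:
    """Return a copy of config whose "models-dir" is an absolute path."""
    fixed = dict(config)  # shallow copy so the caller's dict is left untouched
    # expanduser resolves "~"; abspath makes the result absolute and normalized.
    fixed["models-dir"] = os.path.abspath(os.path.expanduser(config["models-dir"]))
    return fixed


config = {"models-dir": "~/tools/PDF-Extract-Kit/models/"}
fixed = absolutize_models_dir(config)
print(json.dumps(fixed, indent=2))
```

Note that `os.path.abspath` drops the trailing slash, so add it back if the consuming code relies on it.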

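@strongerfly This produced quite a few conflicts. I suggest pulling the code from the dev branch, making your changes there, and submitting the PR against the dev branch. Thanks.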

@ProseGuys Please commit your code to the dev branch.