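Markdown produced from PDF parsing contains garbled text
Description of the bug
With CUDA acceleration enabled and OCR mode selected, the Markdown parsed from the PDF file is garbled.
How to reproduce the bug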
sys.platform linux
Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
numpy 1.26.4
detectron2 0.6 @/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/detectron2
Compiler GCC 11.4
CUDA compiler not available
DETECTRON2_ENV_MODULE
- scihub_train TRAIN:
- scihub_train
GLOBAL:
HACK: 1.0
ICDAR_DATA_DIR_TEST: ''
ICDAR_DATA_DIR_TRAIN: ''
INPUT:
CROP:
ENABLED: true
SIZE:
- 384
-
- 600 TYPE: absolute_range FORMAT: RGB MASK_FORMAT: polygon MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MIN_SIZE_TRAIN:
- 480
- 512
- 544
- 576
- 608
- 640
- 672
- 704
- 736
- 768
- 800
MIN_SIZE_TRAIN_SAMPLING: choice
RANDOM_FLIP: horizontal
MODEL:
ANCHOR_GENERATOR:
ANGLES:
-
- -90
- 0
- 90 ASPECT_RATIOS:
-
- 0.5
- 1.0
- 2.0 NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES:
-
- 32
-
- 64
-
- 128
-
- 256
-
- 512 BACKBONE: FREEZE_AT: 2 NAME: build_vit_fpn_backbone CONFIG_PATH: '' DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES:
- layer3
- layer5
- layer7
- layer11 NORM: '' OUT_CHANNELS: 256 IMAGE_ONLY: true KEYPOINT_ON: false LOAD_PROPOSALS: false MASK_ON: true META_ARCHITECTURE: VLGeneralizedRCNN PANOPTIC_FPN: COMBINE: ENABLED: true INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN:
-
- 127.5
- 127.5
- 127.5 PIXEL_STD:
- 127.5
- 127.5
- 127.5
PROPOSAL_GENERATOR:
MIN_SIZE: 0
NAME: RPN
RESNETS:
DEFORM_MODULATED: false
DEFORM_NUM_GROUPS: 1
DEFORM_ON_PER_STAGE:
- false
- false
- false
- false DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES:
- res4 RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: true WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_WEIGHTS:
- 1.0
- 1.0
- 1.0
- 1.0 FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES:
- p3
- p4
- p5
- p6
- p7 IOU_LABELS:
- 0
- -1
- 1 IOU_THRESHOLDS:
- 0.4
- 0.5 NMS_THRESH_TEST: 0.5 NORM: '' NUM_CLASSES: 10 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS:
-
- 10.0
- 10.0
- 5.0
- 5.0
-
- 20.0
- 20.0
- 10.0
- 10.0
-
- 30.0
- 30.0
- 15.0
- 15.0 IOUS:
- 0.5
- 0.6
- 0.7 ROI_BOX_HEAD: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS:
- 10.0
- 10.0
- 5.0
- 5.0 CLS_AGNOSTIC_BBOX_REG: true CONV_DIM: 256 FC_DIM: 1024 NAME: FastRCNNConvFCHead NORM: '' NUM_CONV: 0 NUM_FC: 2 POOLER_RESOLUTION: 7 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: false ROI_HEADS: BATCH_SIZE_PER_IMAGE: 512 IN_FEATURES:
- p2
- p3
- p4
- p5 IOU_LABELS:
- 0
- 1 IOU_THRESHOLDS:
- 0.5 NAME: CascadeROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 10 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: true SCORE_THRESH_TEST: 0.05 ROI_KEYPOINT_HEAD: CONV_DIMS:
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512 LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: false CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: '' NUM_CONV: 4 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS:
- 1.0
- 1.0
- 1.0
- 1.0 BOUNDARY_THRESH: -1 CONV_DIMS:
- -1 HEAD_NAME: StandardRPNHead IN_FEATURES:
- p2
- p3
- p4
- p5
- p6 IOU_LABELS:
- 0
- -1
- 1 IOU_THRESHOLDS:
- 0.3
- 0.7 LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 2000 PRE_NMS_TOPK_TEST: 1000 PRE_NMS_TOPK_TRAIN: 2000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: COMMON_STRIDE: 4 CONVS_DIM: 128 IGNORE_VALUE: 255 IN_FEATURES:
- p2
- p3
- p4
- p5 LOSS_WEIGHT: 1.0 NAME: SemSegFPNHead NORM: GN NUM_CLASSES: 10 VIT: DROP_PATH: 0.1 IMG_SIZE:
- 224
- 224 NAME: layoutlmv3_base OUT_FEATURES:
- layer3
- layer5
- layer7
- layer11 POS_TYPE: abs WEIGHTS: OUTPUT_DIR: SCIHUB_DATA_DIR_TRAIN: ~/publaynet/layout_scihub/train SEED: 42 SOLVER: AMP: ENABLED: true BACKBONE_MULTIPLIER: 1.0 BASE_LR: 0.0002 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 2000 CLIP_GRADIENTS: CLIP_TYPE: full_model CLIP_VALUE: 1.0 ENABLED: true NORM_TYPE: 2.0 GAMMA: 0.1 GRADIENT_ACCUMULATION_STEPS: 1 IMS_PER_BATCH: 32 LR_SCHEDULER_NAME: WarmupCosineLR MAX_ITER: 20000 MOMENTUM: 0.9 NESTEROV: false OPTIMIZER: ADAMW REFERENCE_WORLD_SIZE: 0 STEPS:
- 10000
WARMUP_FACTOR: 0.01
WARMUP_ITERS: 333
WARMUP_METHOD: linear
WEIGHT_DECAY: 0.05
WEIGHT_DECAY_BIAS: null
WEIGHT_DECAY_NORM: 0.0
TEST:
AUG:
ENABLED: false
FLIP: true
MAX_SIZE: 4000
MIN_SIZES:
- 400
- 500
- 600
- 700
- 800
- 900
- 1000
- 1100
- 1200 DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 1000 EXPECTED_RESULTS: [] KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: false NUM_ITER: 200 VERSION: 2 VIS_PERIOD: 0
Command: magic-pdf -m ocr -p 奥迪Q5混合动力技术培训.pdf
Configuration file:
{
    "bucket_info": {
        "bucket-name-1": ["ak", "sk", "endpoint"],
        "bucket-name-2": ["ak", "sk", "endpoint"]
    },
    "models-dir": "/home/gllg/PDF-Extract-Kit/models",
    "device-mode": "cuda",
    "table-config": {
        "is_table_recog_enable": false,
        "max_time": 400
    }
}
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.7.x
Device mode
cuda
Could you upload a sample file?
奥迪Q5混合动力技术培训.pdf @myhloli
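@myhloli I tested several files on Ubuntu and every one of them came out garbled. One of them is a plain-text PDF: extracting it in auto mode produces no garbled text, but OCR mode does.
I could not reproduce the garbled output in my tests. If OCR output is garbled for every file on your machine, the most likely cause is an incompatibility between the paddleocr library and your device; you could report it in the paddleocr repository.
OK, I'll try again.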
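One way to check whether the garbling is tied to the GPU path is to rerun the same OCR command with the "device-mode" key from the configuration file in the report switched away from "cuda". The sketch below only edits that key. The value "cpu" and the file location `~/magic-pdf.json` are assumptions, not something stated in this thread, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def with_device_mode(config: dict, mode: str) -> dict:
    """Return a copy of the config with "device-mode" set to `mode`."""
    updated = dict(config)
    updated["device-mode"] = mode
    return updated


def switch_device_mode(config_path: Path, mode: str) -> None:
    """Rewrite the JSON config file in place with the new device mode."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config_path.write_text(
        json.dumps(with_device_mode(config, mode), indent=4, ensure_ascii=False),
        encoding="utf-8",
    )


# Example call (assumed path and value, adjust to your setup):
# switch_device_mode(Path.home() / "magic-pdf.json", "cpu")
```

If the output is clean after the switch, that points at the GPU stack rather than at the PDF itself. It does not confirm which library is at fault.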
After further checking, the cause turned out to be the Tesla P100 GPU: it is too old, which leads to a compatibility problem.
May I ask how you ended up solving this? I have the same GPU.
Not solved yet. I also opened an issue with paddleocr, and they said the GPU is too old and is not supported.
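To see whether a machine falls into the same "GPU too old" situation, the compute capability of the installed card can be read from `nvidia-smi`. The Tesla P100 is a Pascal card with compute capability 6.0. The sketch below assumes a driver recent enough to support the `compute_cap` query field. The minimum capability that a given paddle build requires is not stated in this thread, so it is left as a caller-supplied value, and the function names are hypothetical.

```python
import subprocess


def parse_compute_caps(output: str) -> list[tuple[str, tuple[int, int]]]:
    """Parse lines like "Tesla P100-PCIE-16GB, 6.0" into (name, (major, minor))."""
    gpus = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, cap = line.rsplit(",", 1)
        major, minor = cap.strip().split(".")
        gpus.append((name.strip(), (int(major), int(minor))))
    return gpus


def is_older_than(cap: tuple[int, int], minimum: tuple[int, int]) -> bool:
    """True if the capability is below the caller-supplied minimum."""
    return cap < minimum


def query_gpus() -> list[tuple[str, tuple[int, int]]]:
    """Ask nvidia-smi for each GPU's name and compute capability."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_caps(output)
```

Calling `query_gpus()` on the affected machine and comparing the result against the requirements published for the installed paddle build would show whether the card is below what that build targets.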
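Same here, and it is a headache. In my other projects I can run paddle on the P100, but the environment in this project is so complicated that the problem probably cannot be fixed.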