Markdown output parsed from PDF is garbled

Open zuanzuanshao opened this issue 1 year ago • 5 comments

Description of the bug

With CUDA acceleration enabled and OCR mode selected, the output parsed from the PDF file is garbled.

How to reproduce the bug

sys.platform                     linux
Python                           3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
numpy                            1.26.4
detectron2                       0.6 @/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/detectron2
Compiler                         GCC 11.4
CUDA compiler                    not available
DETECTRON2_ENV_MODULE
PyTorch                          2.3.1+cu121 @/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0,1                          Tesla P100-PCIE-12GB (arch=6.0)
Driver version                   535.183.01
CUDA_HOME                        None - invalid!
Pillow                           10.4.0
torchvision                      0.18.1+cu121 @/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/torchvision
torchvision arch flags           /root/miniconda3/envs/MinerU/lib/python3.10/site-packages/torchvision/_C.so
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.6.0

[08/16 04:21:11 detectron2]: Command line arguments: {'config_file': '/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/home/gllg/PDF-Extract-Kit/models/Layout/model_final.pth']}

[08/16 04:21:11 detectron2]: Contents of args.config_file=/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:

    AUG:
      DETR: true
    CACHE_DIR: ~/cache/huggingface
    CUDNN_BENCHMARK: false
    DATALOADER:
      ASPECT_RATIO_GROUPING: true
      FILTER_EMPTY_ANNOTATIONS: false
      NUM_WORKERS: 4
      REPEAT_THRESHOLD: 0.0
      SAMPLER_TRAIN: TrainingSampler
    DATASETS:
      PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
      PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
      PROPOSAL_FILES_TEST: []
      PROPOSAL_FILES_TRAIN: []
      TEST: [scihub_train]
      TRAIN: [scihub_train]
    GLOBAL:
      HACK: 1.0
    ICDAR_DATA_DIR_TEST: ''
    ICDAR_DATA_DIR_TRAIN: ''
    INPUT:
      CROP:
        ENABLED: true
        SIZE: [384, 600]
        TYPE: absolute_range
      FORMAT: RGB
      MASK_FORMAT: polygon
      MAX_SIZE_TEST: 1333
      MAX_SIZE_TRAIN: 1333
      MIN_SIZE_TEST: 800
      MIN_SIZE_TRAIN: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
      MIN_SIZE_TRAIN_SAMPLING: choice
      RANDOM_FLIP: horizontal
    MODEL:
      ANCHOR_GENERATOR:
        ANGLES: [[-90, 0, 90]]
        ASPECT_RATIOS: [[0.5, 1.0, 2.0]]
        NAME: DefaultAnchorGenerator
        OFFSET: 0.0
        SIZES: [[32], [64], [128], [256], [512]]
      BACKBONE:
        FREEZE_AT: 2
        NAME: build_vit_fpn_backbone
      CONFIG_PATH: ''
      DEVICE: cuda
      FPN:
        FUSE_TYPE: sum
        IN_FEATURES: [layer3, layer5, layer7, layer11]
        NORM: ''
        OUT_CHANNELS: 256
      IMAGE_ONLY: true
      KEYPOINT_ON: false
      LOAD_PROPOSALS: false
      MASK_ON: true
      META_ARCHITECTURE: VLGeneralizedRCNN
      PANOPTIC_FPN:
        COMBINE:
          ENABLED: true
          INSTANCES_CONFIDENCE_THRESH: 0.5
          OVERLAP_THRESH: 0.5
          STUFF_AREA_LIMIT: 4096
        INSTANCE_LOSS_WEIGHT: 1.0
      PIXEL_MEAN: [127.5, 127.5, 127.5]
      PIXEL_STD: [127.5, 127.5, 127.5]
      PROPOSAL_GENERATOR:
        MIN_SIZE: 0
        NAME: RPN
      RESNETS:
        DEFORM_MODULATED: false
        DEFORM_NUM_GROUPS: 1
        DEFORM_ON_PER_STAGE: [false, false, false, false]
        DEPTH: 50
        NORM: FrozenBN
        NUM_GROUPS: 1
        OUT_FEATURES: [res4]
        RES2_OUT_CHANNELS: 256
        RES5_DILATION: 1
        STEM_OUT_CHANNELS: 64
        STRIDE_IN_1X1: true
        WIDTH_PER_GROUP: 64
      RETINANET:
        BBOX_REG_LOSS_TYPE: smooth_l1
        BBOX_REG_WEIGHTS: [1.0, 1.0, 1.0, 1.0]
        FOCAL_LOSS_ALPHA: 0.25
        FOCAL_LOSS_GAMMA: 2.0
        IN_FEATURES: [p3, p4, p5, p6, p7]
        IOU_LABELS: [0, -1, 1]
        IOU_THRESHOLDS: [0.4, 0.5]
        NMS_THRESH_TEST: 0.5
        NORM: ''
        NUM_CLASSES: 10
        NUM_CONVS: 4
        PRIOR_PROB: 0.01
        SCORE_THRESH_TEST: 0.05
        SMOOTH_L1_LOSS_BETA: 0.1
        TOPK_CANDIDATES_TEST: 1000
      ROI_BOX_CASCADE_HEAD:
        BBOX_REG_WEIGHTS: [[10.0, 10.0, 5.0, 5.0], [20.0, 20.0, 10.0, 10.0], [30.0, 30.0, 15.0, 15.0]]
        IOUS: [0.5, 0.6, 0.7]
      ROI_BOX_HEAD:
        BBOX_REG_LOSS_TYPE: smooth_l1
        BBOX_REG_LOSS_WEIGHT: 1.0
        BBOX_REG_WEIGHTS: [10.0, 10.0, 5.0, 5.0]
        CLS_AGNOSTIC_BBOX_REG: true
        CONV_DIM: 256
        FC_DIM: 1024
        NAME: FastRCNNConvFCHead
        NORM: ''
        NUM_CONV: 0
        NUM_FC: 2
        POOLER_RESOLUTION: 7
        POOLER_SAMPLING_RATIO: 0
        POOLER_TYPE: ROIAlignV2
        SMOOTH_L1_BETA: 0.0
        TRAIN_ON_PRED_BOXES: false
      ROI_HEADS:
        BATCH_SIZE_PER_IMAGE: 512
        IN_FEATURES: [p2, p3, p4, p5]
        IOU_LABELS: [0, 1]
        IOU_THRESHOLDS: [0.5]
        NAME: CascadeROIHeads
        NMS_THRESH_TEST: 0.5
        NUM_CLASSES: 10
        POSITIVE_FRACTION: 0.25
        PROPOSAL_APPEND_GT: true
        SCORE_THRESH_TEST: 0.05
      ROI_KEYPOINT_HEAD:
        CONV_DIMS: [512, 512, 512, 512, 512, 512, 512, 512, 512, 512]
        LOSS_WEIGHT: 1.0
        MIN_KEYPOINTS_PER_IMAGE: 1
        NAME: KRCNNConvDeconvUpsampleHead
        NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
        NUM_KEYPOINTS: 17
        POOLER_RESOLUTION: 14
        POOLER_SAMPLING_RATIO: 0
        POOLER_TYPE: ROIAlignV2
      ROI_MASK_HEAD:
        CLS_AGNOSTIC_MASK: false
        CONV_DIM: 256
        NAME: MaskRCNNConvUpsampleHead
        NORM: ''
        NUM_CONV: 4
        POOLER_RESOLUTION: 14
        POOLER_SAMPLING_RATIO: 0
        POOLER_TYPE: ROIAlignV2
      RPN:
        BATCH_SIZE_PER_IMAGE: 256
        BBOX_REG_LOSS_TYPE: smooth_l1
        BBOX_REG_LOSS_WEIGHT: 1.0
        BBOX_REG_WEIGHTS: [1.0, 1.0, 1.0, 1.0]
        BOUNDARY_THRESH: -1
        CONV_DIMS: [-1]
        HEAD_NAME: StandardRPNHead
        IN_FEATURES: [p2, p3, p4, p5, p6]
        IOU_LABELS: [0, -1, 1]
        IOU_THRESHOLDS: [0.3, 0.7]
        LOSS_WEIGHT: 1.0
        NMS_THRESH: 0.7
        POSITIVE_FRACTION: 0.5
        POST_NMS_TOPK_TEST: 1000
        POST_NMS_TOPK_TRAIN: 2000
        PRE_NMS_TOPK_TEST: 1000
        PRE_NMS_TOPK_TRAIN: 2000
        SMOOTH_L1_BETA: 0.0
      SEM_SEG_HEAD:
        COMMON_STRIDE: 4
        CONVS_DIM: 128
        IGNORE_VALUE: 255
        IN_FEATURES: [p2, p3, p4, p5]
        LOSS_WEIGHT: 1.0
        NAME: SemSegFPNHead
        NORM: GN
        NUM_CLASSES: 10
      VIT:
        DROP_PATH: 0.1
        IMG_SIZE: [224, 224]
        NAME: layoutlmv3_base
        OUT_FEATURES: [layer3, layer5, layer7, layer11]
        POS_TYPE: abs
      WEIGHTS:
    OUTPUT_DIR:
    SCIHUB_DATA_DIR_TRAIN: ~/publaynet/layout_scihub/train
    SEED: 42
    SOLVER:
      AMP:
        ENABLED: true
      BACKBONE_MULTIPLIER: 1.0
      BASE_LR: 0.0002
      BIAS_LR_FACTOR: 1.0
      CHECKPOINT_PERIOD: 2000
      CLIP_GRADIENTS:
        CLIP_TYPE: full_model
        CLIP_VALUE: 1.0
        ENABLED: true
        NORM_TYPE: 2.0
      GAMMA: 0.1
      GRADIENT_ACCUMULATION_STEPS: 1
      IMS_PER_BATCH: 32
      LR_SCHEDULER_NAME: WarmupCosineLR
      MAX_ITER: 20000
      MOMENTUM: 0.9
      NESTEROV: false
      OPTIMIZER: ADAMW
      REFERENCE_WORLD_SIZE: 0
      STEPS: [10000]
      WARMUP_FACTOR: 0.01
      WARMUP_ITERS: 333
      WARMUP_METHOD: linear
      WEIGHT_DECAY: 0.05
      WEIGHT_DECAY_BIAS: null
      WEIGHT_DECAY_NORM: 0.0
    TEST:
      AUG:
        ENABLED: false
        FLIP: true
        MAX_SIZE: 4000
        MIN_SIZES: [400, 500, 600, 700, 800, 900, 1000, 1100, 1200]
      DETECTIONS_PER_IMAGE: 100
      EVAL_PERIOD: 1000
      EXPECTED_RESULTS: []
      KEYPOINT_OKS_SIGMAS: []
      PRECISE_BN:
        ENABLED: false
        NUM_ITER: 200
    VERSION: 2
    VIS_PERIOD: 0

Command: magic-pdf -m ocr -p 奥迪Q5混合动力技术培训.pdf

Configuration file:

    {
      "bucket_info": {
        "bucket-name-1": ["ak", "sk", "endpoint"],
        "bucket-name-2": ["ak", "sk", "endpoint"]
      },
      "models-dir": "/home/gllg/PDF-Extract-Kit/models",
      "device-mode": "cuda",
      "table-config": {
        "is_table_recog_enable": false,
        "max_time": 400
      }
    }

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

zuanzuanshao avatar Aug 16 '24 04:08 zuanzuanshao

Could you upload a sample file?

myhloli avatar Aug 16 '24 04:08 myhloli

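@myhloli I tested several files on Ubuntu and every one came out garbled. One of them is a plain-text PDF: extracting it in auto mode produces no garbled text, but ocr mode does. [image]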
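The comparison described above (same file, auto mode versus ocr mode) can be scripted so both runs use identical arguments. This is only a sketch: `build_commands` is a hypothetical helper, and the `-m auto` value is assumed from the mention of auto mode in this comment; only `-m ocr -p <file>` appears verbatim in the issue.

```python
import shlex


def build_commands(pdf_path, modes=("auto", "ocr")):
    """Return one magic-pdf argument list per parse mode for the same PDF.

    The "-m" and "-p" flags are taken from the command quoted in the issue;
    the "auto" mode name is an assumption based on the reporter's comment.
    """
    return {mode: ["magic-pdf", "-m", mode, "-p", pdf_path] for mode in modes}


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for mode, cmd in build_commands("奥迪Q5混合动力技术培训.pdf").items():
        print(mode, "->", shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to execute it; comparing the two Markdown outputs then shows whether only the OCR path is affected.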

zuanzuanshao avatar Aug 16 '24 05:08 zuanzuanshao

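[image]

I could not reproduce the garbling in my tests. If every OCR run on your machine produces garbled text, the most likely cause is an incompatibility between the paddleocr library and your hardware. You could report it in the paddleocr repository.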

myhloli avatar Aug 16 '24 06:08 myhloli

OK, I'll try again.

zuanzuanshao avatar Aug 16 '24 07:08 zuanzuanshao

After testing, I found that the Tesla P100 GPU is too old, which causes compatibility problems.
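When reporting this kind of hardware incompatibility upstream, it helps to state the GPU's compute capability, which the environment dump in this issue shows as `arch=6.0`. The sketch below reads that value either from such a log line or from the local machine. `parse_gpu_arch` and `local_gpu_archs` are hypothetical helper names, and the thread does not say which capability paddleocr requires, so the code only reports the value and makes no pass/fail judgment.

```python
import re


def parse_gpu_arch(gpu_line):
    """Extract (major, minor) from text like 'Tesla P100-PCIE-12GB (arch=6.0)'.

    Returns None when the text contains no '(arch=X.Y)' fragment.
    """
    match = re.search(r"\(arch=(\d+)\.(\d+)\)", gpu_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def local_gpu_archs():
    """List (name, (major, minor)) for each visible CUDA device.

    Returns an empty list when PyTorch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    print(parse_gpu_arch("Tesla P100-PCIE-12GB (arch=6.0)"))
    print(local_gpu_archs())
```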

zuanzuanshao avatar Aug 19 '24 05:08 zuanzuanshao

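"After testing, I found that the Tesla P100 GPU is too old, which causes compatibility problems."

How did you end up solving this? I have the same GPU.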

243006306 avatar Sep 23 '24 01:09 243006306

It is not solved yet. I also opened an issue with paddleocr, and they said the GPU is too old and not supported. [image: image.png]

zuanzuanshao avatar Sep 23 '24 03:09 zuanzuanshao

"It is not solved yet. I also opened an issue with paddleocr, and they said the GPU is too old and not supported."

Same here, and it is a headache. In my other projects I can run paddle on the P100, but the environment in this project is too complicated, so I suspect it really cannot be fixed.

chiliuliu avatar Dec 24 '24 08:12 chiliuliu