[WIP] Add batch conversion files for PDF to Markdown

Open Copilot opened this issue 3 weeks ago • 0 comments

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.


*This pull request was created as a result of the following prompt from Copilot chat.*
> Add a set of files to make it easy for users to batch-convert PDF files to Markdown with thepipe, provide a reproducible Docker environment, a Makefile, a runbook, a YAML report template, and a manual GitHub Actions workflow to run the conversion in CI. Files to add (create new files in the repository):
> 
> 1) scripts/batch_convert.sh
> 
> ```
> #!/usr/bin/env bash
> set -euo pipefail
> 
> # Simple batch conversion script (Bash)
> # Usage:
> #   ./scripts/batch_convert.sh <input_dir> <output_dir>
> # Notes:
> # - Assumes the thepipe CLI provides: thepipe pdf2md <input.pdf> -o <output.md>
> # - If the CLI differs, edit the convert_cmd_template variable below
> # - Logs are written to logs/convert-YYYYMMDD-HHMMSS.log
> # - A short report is written to reports/report-YYYYMMDD-HHMMSS.yaml (one timestamped file per run)
> 
> if [ "$#" -ne 2 ]; then
>   echo "Usage: $0 <input_dir> <output_dir>"
>   exit 1
> fi
> 
> INPUT_DIR="$1"
> OUTPUT_DIR="$2"
> TIMESTAMP=$(date +"%Y%m%d-%H%M%S")
> LOG_DIR="logs"
> REPORT_DIR="reports"
> mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}" "${REPORT_DIR}"
> 
> LOG_FILE="${LOG_DIR}/convert-${TIMESTAMP}.log"
> REPORT_FILE="${REPORT_DIR}/report-${TIMESTAMP}.yaml"
> 
> # Modify this command template if your thepipe CLI differs.
> # Use %IN% and %OUT% placeholders or replace directly with an actual command.
> convert_cmd_template='thepipe pdf2md "%IN%" -o "%OUT%"'
> 
> echo "Start conversion: $(date)" | tee "${LOG_FILE}"
> echo "input_dir: ${INPUT_DIR}" | tee -a "${LOG_FILE}"
> echo "output_dir: ${OUTPUT_DIR}" | tee -a "${LOG_FILE}"
> echo "report_file: ${REPORT_FILE}" | tee -a "${LOG_FILE}"
> 
> echo "files:" > "${REPORT_FILE}"
> 
> shopt -s nullglob
> count=0
> failed=0
> for pdf in "${INPUT_DIR}"/*.pdf; do
>   base="$(basename "$pdf" .pdf)"
>   out="${OUTPUT_DIR}/${base}.md"
>   echo "Converting: ${pdf} -> ${out}" | tee -a "${LOG_FILE}"
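>   # Substitute the quoted placeholders; the patterns must match the template literally.
>   cmd=${convert_cmd_template//'"%IN%"'/"\"${pdf}\""}
>   cmd=${cmd//'"%OUT%"'/"\"${out}\""}
>   # run conversion (quote the command so eval receives it as a single string)
>   if eval "${cmd}" >> "${LOG_FILE}" 2>&1; then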
>     echo "OK: ${out}" | tee -a "${LOG_FILE}"
>     echo "  - file: \"${pdf}\"" >> "${REPORT_FILE}"
>     echo "    output: \"${out}\"" >> "${REPORT_FILE}"
>     echo "    status: ok" >> "${REPORT_FILE}"
>   else
>     echo "FAILED: ${pdf}" | tee -a "${LOG_FILE}"
>     echo "  - file: \"${pdf}\"" >> "${REPORT_FILE}"
>     echo "    output: \"${out}\"" >> "${REPORT_FILE}"
>     echo "    status: failed" >> "${REPORT_FILE}"
>     failed=$((failed+1))
>   fi
>   count=$((count+1))
> done
> 
> if [ "$count" -eq 0 ]; then
>   echo "No PDF files found in ${INPUT_DIR}" | tee -a "${LOG_FILE}"
> else
>   echo "Finished converting ${count} PDF(s). ${failed} failed." | tee -a "${LOG_FILE}"
> fi
> 
> echo "End: $(date)" | tee -a "${LOG_FILE}"
> ```
> 
> 2) scripts/convert_pdfs.py
> 
> ```
> #!/usr/bin/env python3
> """
> Python bulk conversion wrapper.
> 
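> Usage:
>   python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
> 
> Features:
>  - Iterates over every .pdf in input-dir (sorted alphabetically)
>  - Calls the thepipe CLI for each file (default: thepipe pdf2md <in> -o <out>)
>  - Records each file's status in reports/report-<timestamp>.yaml
>  - Returns a non-zero exit code on failure (so task schedulers/CI can detect it)
>  - The executed command can be customized via --cmd-template (placeholders %IN% and %OUT%)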
> """
> import argparse
> import subprocess
> import sys
> from pathlib import Path
> from datetime import datetime
> 
> 
> def run_conversion(pdf: Path, out_md: Path, cmd_template: str):
>     cmd = cmd_template.replace("%IN%", str(pdf)).replace("%OUT%", str(out_md))
>     try:
>         subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>         return True, ""
>     except subprocess.CalledProcessError as e:
>         return False, e.stderr.decode(errors="replace")
> 
> 
> def write_report(report_path: Path, entries):
>     report_path.parent.mkdir(parents=True, exist_ok=True)
>     try:
>         import yaml
>         with report_path.open("w", encoding="utf-8") as fh:
>             yaml.safe_dump({"generated_at": datetime.utcnow().isoformat()+"Z", "entries": entries}, fh, sort_keys=False, allow_unicode=True)
>     except Exception:
>         # fallback plain text
>         with report_path.open("w", encoding="utf-8") as fh:
>             fh.write(f"generated_at: {datetime.utcnow().isoformat()}Z\n")
>             for e in entries:
>                 fh.write(str(e) + "\n")
> 
> 
> def main():
>     p = argparse.ArgumentParser()
>     p.add_argument("--input-dir", "-i", required=True)
>     p.add_argument("--output-dir", "-o", required=True)
>     p.add_argument("--cmd-template", default='thepipe pdf2md "%IN%" -o "%OUT%"',
>                    help='Command template to run; use %IN% and %OUT% placeholders.')
>     args = p.parse_args()
> 
>     input_dir = Path(args.input_dir)
>     output_dir = Path(args.output_dir)
>     if not input_dir.exists() or not input_dir.is_dir():
>         print(f"Input directory {input_dir} not found", file=sys.stderr)
>         sys.exit(2)
>     output_dir.mkdir(parents=True, exist_ok=True)
> 
>     pdfs = sorted(input_dir.glob("*.pdf"))
>     if not pdfs:
>         print(f"No PDF files found in {input_dir}")
>         return
> 
>     ts = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
>     report_path = Path("reports") / f"report-{ts}.yaml"
>     entries = []
>     any_failed = False
> 
>     for pdf in pdfs:
>         out_md = output_dir / (pdf.stem + ".md")
>         print(f"Converting {pdf} -> {out_md}")
>         ok, err = run_conversion(pdf, out_md, args.cmd_template)
>         entry = {"file": str(pdf), "output": str(out_md), "status": "ok" if ok else "failed"}
>         if not ok:
>             entry["error"] = err[:1000]
>             any_failed = True
>             print(f"Failed: {pdf}. See report {report_path} for details.", file=sys.stderr)
>         entries.append(entry)
> 
>     write_report(report_path, entries)
>     print(f"Wrote report: {report_path}")
>     if any_failed:
>         sys.exit(3)
> 
> 
> if __name__ == "__main__":
>     main()
> ```
> 
> 3) Dockerfile
> 
> ```
> FROM python:3.11-slim
> 
> # system deps: poppler-utils (pdftotext), tesseract (if OCR is needed)
> RUN apt-get update && apt-get install -y --no-install-recommends \
>     poppler-utils \
>     tesseract-ocr \
>     && rm -rf /var/lib/apt/lists/*
> 
> WORKDIR /app
> COPY . /app
> 
> # If thepipe is this repo's Python package, it can be installed here:
> # RUN pip install -e .
> 
> # Install PyYAML (used by the script)
> RUN pip install --no-cache-dir pyyaml
> 
> # Exec-form CMD with no ENTRYPOINT, so arguments passed to `docker run` replace the whole command.
> CMD ["python3", "/app/scripts/convert_pdfs.py", "-i", "pdfs", "-o", "converted-md"]
> ```
> 
> 4) Makefile
> 
> ```
> .PHONY: install convert docker-build docker-run
> 
> # Local install (optional)
> install:
> 	python3 -m venv .venv
> 	. .venv/bin/activate && pip install --upgrade pip
> 	. .venv/bin/activate && pip install -r requirements.txt || true
> 	@echo "Virtual environment created at .venv; activate it before running the scripts"
> 
> convert:
> 	python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
> 
> docker-build:
> 	docker build -t pdf2md-env .
> 
> docker-run:
> 	docker run --rm -v $(PWD)/pdfs:/data/pdfs -v $(PWD)/converted-md:/data/out pdf2md-env \
> 		python3 /app/scripts/convert_pdfs.py -i /data/pdfs -o /data/out
> ```
> 
> 5) docs/RUNBOOK.md
> 
> ````markdown
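> # PDF -> Markdown Batch Conversion Runbook
> 
> Purpose
> - Provide a reproducible, start-to-finish workflow so that anyone can batch-convert PDFs to Markdown locally or in a container, and record the process as machine-readable reports and auditable logs.
> 
> Preparation (local)
> 1. Put the PDFs to convert in the `pdfs/` directory (or any directory; pass it when running the script).
> 2. Make sure the thepipe CLI is installed (or change the command template in the scripts to the conversion tool you actually use).
>    - Example (assuming thepipe is a Python package): run `pip install -e .` in the repository root.
> 3. Optional: create a Python virtual environment (see the `install` target in the Makefile).
> 
> Running (simple)
> - Bash script:
>   chmod +x scripts/batch_convert.sh
>   ./scripts/batch_convert.sh pdfs converted-md
> 
> - Python script:
>   python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
> 
> Running (container)
> - Build the image:
>   docker build -t pdf2md-env .
> - Run:
>   docker run --rm -v $(pwd)/pdfs:/data/pdfs -v $(pwd)/converted-md:/data/out pdf2md-env \
>     python3 /app/scripts/convert_pdfs.py -i /data/pdfs -o /data/out
> 
> Logs and reports
> - The Bash script writes a detailed log file under `logs/`.
> - The Python script writes a structured report to `reports/report-<timestamp>.yaml` containing each file's status/output/error (if it failed).
> - See templates/conversion_report.yaml for a sample report.
> 
> Sharing and reuse
> - Package the whole directory as a zip and send it to others (including scripts, Dockerfile, docs).
> - Or push the Docker image to a private registry so that others only need to run the image.
> - Consider including a small sample PDF (or a pointer to a publicly available one) for quick verification.
> 
> Recording changes (optional)
> - You can manage versions locally with git (pushing to GitHub is not required).
> - For each significant change (script edits, thepipe version upgrades), add an entry to CHANGELOG.md or add metadata to the reports.
> 
> Troubleshooting
> - If the script reports "thepipe: command not found":
>   - Check that thepipe is on PATH, activate the virtual environment, or install it in the Dockerfile.
> - If the output Markdown is empty or the text order is scrambled:
>   - Check thepipe's options (whether an OCR step is needed); try debugging a single file with pdftotext first.
> - When converting many files:
>   - Process them in batches, or add parallel processing to the script (GNU parallel or multiprocessing).
> 
> If you need it
> - I can turn the script into a concurrent (multi-process) version;
> - or, once you provide thepipe's actual CLI, replace the command template in the scripts and test again;
> - or package these files as a zip for you to distribute directly.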
> ````
> 
> 6) templates/conversion_report.yaml
> 
> ```
> # conversion report template (YAML)
> # generated_at: 2025-11-14T12:00:00Z
> # entries:
> #   - file: /path/to/pdfs/example.pdf
> #     output: converted-md/example.md
> #     status: ok
> #     error: ""
> #   - file: /path/to/pdfs/bad.pdf
> #     output: converted-md/bad.md
> #     status: failed
> #     error: "Error message or excerpt"
> ```
> 
> 7) .github/workflows/convert-pdfs.yml
> 
> ```
> name: Convert PDFs to Markdown (manual)
> 
> on:
>   workflow_dispatch:
>     inputs:
>       pdf_dir:
>         description: 'Relative path to PDF directory (in repo)'
>         required: true
>         default: 'pdfs'
>       out_dir:
>         description: 'Output directory for markdown'
>         required: true
>         default: 'converted-md'
>       commit_results:
>         description: 'Commit converted files back to the repo?'
>         required: true
>         default: 'true'
> 
> jobs:
>   convert:
>     runs-on: ubuntu-latest
>     steps:
>       - name: Checkout repository
>         uses: actions/checkout@v4
>         with:
>           fetch-depth: 0
> 
>       - name: Set up Python (if needed)
>         uses: actions/setup-python@v4
>         with:
>           python-version: '3.11'
> 
>       - name: Install repository (if thepipe is python package) and dependencies
>         run: |
>           # If thepipe is this repository's Python package, it can be installed here
>           if [ -f "pyproject.toml" ] || [ -f "setup.py" ]; then
>             pip install -e .
>           fi
>           # Install any extra tools the scripts may need (adjust as required)
>           pip install -r requirements.txt || true
> 
>       - name: Run conversion script
>         env:
>           INPUT_DIR: ${{ github.event.inputs.pdf_dir }}
>           OUTPUT_DIR: ${{ github.event.inputs.out_dir }}
>         run: |
>           mkdir -p "${OUTPUT_DIR}"
>           chmod +x scripts/batch_convert.sh || true
>           ./scripts/batch_convert.sh "${INPUT_DIR}" "${OUTPUT_DIR}"
> 
>       - name: Commit converted files
>         if: ${{ github.event.inputs.commit_results == 'true' }}
>         run: |
>           git config user.name "github-actions[bot]"
>           git config user.email "github-actions[bot]@users.noreply.github.com"
>           git add "${{ github.event.inputs.out_dir }}" || true
>           if git diff --staged --quiet; then
>             echo "No changes to commit"
>           else
>             git commit -m "chore: auto-convert PDFs to markdown (from workflow)"
>             git push
>           fi
> ```
> 
> Notes and assumptions:
> - I did not set a base branch; please let me know if you want the PR target to be a different branch than the repository default.
> - The scripts assume the CLI command is `thepipe pdf2md <in> -o <out>`; if the real CLI differs please provide the correct invocation and I will update the scripts.
> - The workflow is manual (workflow_dispatch) and by default attempts to commit results if the input commit_results is 'true'.
> 
> Please create a pull request with these additions in the repository emcf/thepipe. Thank you.
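
The runbook in the prompt mentions adding parallel processing for large batches. A minimal sketch of what that could look like follows, under stated assumptions: `build_argv`, `convert_one`, `convert_all` and `max_workers` are hypothetical names that are not part of the proposed files, and the `thepipe pdf2md <in> -o <out>` invocation is still the unconfirmed assumption from the notes above. It uses a thread pool rather than multiprocessing, since each worker only waits on a subprocess, and it splits the command template with `shlex` so that no shell is involved and paths containing spaces survive intact.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_argv(cmd_template: str, pdf: Path, out_md: Path) -> list[str]:
    """Split the template once, then substitute placeholders token by token (no shell)."""
    return [
        tok.replace("%IN%", str(pdf)).replace("%OUT%", str(out_md))
        for tok in shlex.split(cmd_template)
    ]


def convert_one(pdf: Path, output_dir: Path, cmd_template: str) -> dict:
    """Run one conversion and return a report entry shaped like the YAML template."""
    out_md = output_dir / (pdf.stem + ".md")
    entry = {"file": str(pdf), "output": str(out_md), "status": "ok"}
    try:
        proc = subprocess.run(
            build_argv(cmd_template, pdf, out_md),
            capture_output=True, text=True, errors="replace",
        )
    except OSError as exc:  # e.g. the command is not on PATH
        entry.update(status="failed", error=str(exc))
        return entry
    if proc.returncode != 0:
        entry.update(status="failed", error=proc.stderr[:1000])
    return entry


def convert_all(input_dir: Path, output_dir: Path, cmd_template: str,
                max_workers: int = 4) -> list[dict]:
    """Convert every *.pdf in input_dir concurrently; entries keep sorted input order."""
    output_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(input_dir.glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: convert_one(p, output_dir, cmd_template), pdfs))
```

Calling `convert_all(Path("pdfs"), Path("converted-md"), 'thepipe pdf2md "%IN%" -o "%OUT%"')` would return one entry per PDF, which could then be passed to the existing `write_report` function. A sensible `max_workers` depends on how heavy each conversion is, so the default of 4 is only a placeholder.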


Nov 14 '25 12:11 Copilot
