[WIP] Add batch conversion files for PDF to Markdown
Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.
*This pull request was created as a result of the following prompt from Copilot chat.*
> Add a set of files to make it easy for users to batch-convert PDF files to Markdown with thepipe, provide a reproducible Docker environment, a Makefile, a runbook, a YAML report template, and a manual GitHub Actions workflow to run the conversion in CI. Files to add (create new files in the repository):
>
> 1) scripts/batch_convert.sh
>
> ```
> #!/usr/bin/env bash
> set -euo pipefail
>
> # Simple batch conversion script (Bash)
> # Usage:
> #   ./scripts/batch_convert.sh <input_dir> <output_dir>
> # Notes:
> #   - Assumes the thepipe CLI provides: thepipe pdf2md <input.pdf> -o <output.md>
> #   - If the CLI differs, edit convert_cmd_template below
> #   - Logs are written to logs/convert-YYYYMMDD-HHMMSS.log
> #   - A short report is written to reports/report-YYYYMMDD-HHMMSS.yaml (one new file per run)
>
> if [ "$#" -ne 2 ]; then
>   echo "Usage: $0 <input_dir> <output_dir>"
>   exit 1
> fi
>
> INPUT_DIR="$1"
> OUTPUT_DIR="$2"
> TIMESTAMP=$(date +"%Y%m%d-%H%M%S")
> LOG_DIR="logs"
> REPORT_DIR="reports"
> mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}" "${REPORT_DIR}"
>
> LOG_FILE="${LOG_DIR}/convert-${TIMESTAMP}.log"
> REPORT_FILE="${REPORT_DIR}/report-${TIMESTAMP}.yaml"
>
> # Modify this command template if your thepipe CLI differs.
> # Keep the quoted "%IN%" and "%OUT%" placeholders or replace them with an actual command.
> convert_cmd_template='thepipe pdf2md "%IN%" -o "%OUT%"'
>
> echo "Start conversion: $(date)" | tee "${LOG_FILE}"
> echo "input_dir: ${INPUT_DIR}" | tee -a "${LOG_FILE}"
> echo "output_dir: ${OUTPUT_DIR}" | tee -a "${LOG_FILE}"
> echo "report_file: ${REPORT_FILE}" | tee -a "${LOG_FILE}"
>
> echo "files:" > "${REPORT_FILE}"
>
> shopt -s nullglob
> count=0
> failed=0
> for pdf in "${INPUT_DIR}"/*.pdf; do
>   base="$(basename "$pdf" .pdf)"
>   out="${OUTPUT_DIR}/${base}.md"
>   echo "Converting: ${pdf} -> ${out}" | tee -a "${LOG_FILE}"
>   # Shell-quote both paths, then substitute them for the quoted placeholders,
>   # so file names containing spaces or quotes survive eval.
>   in_q=$(printf '%q' "${pdf}")
>   out_q=$(printf '%q' "${out}")
>   cmd=${convert_cmd_template//'"%IN%"'/"${in_q}"}
>   cmd=${cmd//'"%OUT%"'/"${out_q}"}
>   # run conversion
>   if eval "${cmd}" >> "${LOG_FILE}" 2>&1; then
>     echo "OK: ${out}" | tee -a "${LOG_FILE}"
>     echo "  - file: \"${pdf}\"" >> "${REPORT_FILE}"
>     echo "    output: \"${out}\"" >> "${REPORT_FILE}"
>     echo "    status: ok" >> "${REPORT_FILE}"
>   else
>     echo "FAILED: ${pdf}" | tee -a "${LOG_FILE}"
>     echo "  - file: \"${pdf}\"" >> "${REPORT_FILE}"
>     echo "    output: \"${out}\"" >> "${REPORT_FILE}"
>     echo "    status: failed" >> "${REPORT_FILE}"
>     failed=$((failed+1))
>   fi
>   count=$((count+1))
> done
>
> if [ "$count" -eq 0 ]; then
>   echo "No PDF files found in ${INPUT_DIR}" | tee -a "${LOG_FILE}"
> else
>   echo "Finished converting ${count} PDF(s). ${failed} failed." | tee -a "${LOG_FILE}"
> fi
>
> echo "End: $(date)" | tee -a "${LOG_FILE}"
> ```
>
> 2) scripts/convert_pdfs.py
>
> ```
> #!/usr/bin/env python3
> """
> Python bulk conversion wrapper.
>
> Usage:
>     python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
>
> Features:
> - Iterates over every .pdf in input-dir (sorted alphabetically)
> - Calls the thepipe CLI for each file (default: thepipe pdf2md <in> -o <out>)
> - Records each file's status in reports/report-<timestamp>.yaml
> - Exits with a non-zero code if any conversion fails (easy to detect in schedulers/CI)
> - Accepts --cmd-template to customize the command (placeholders %IN% and %OUT%)
> """
> import argparse
> import shlex
> import subprocess
> import sys
> from datetime import datetime, timezone
> from pathlib import Path
>
>
> def run_conversion(pdf: Path, out_md: Path, cmd_template: str):
>     # Split the template first, then substitute the placeholders, so paths with
>     # spaces or shell metacharacters are passed as single arguments (no shell).
>     cmd = [part.replace("%IN%", str(pdf)).replace("%OUT%", str(out_md))
>            for part in shlex.split(cmd_template)]
>     try:
>         subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>         return True, ""
>     except subprocess.CalledProcessError as e:
>         return False, e.stderr.decode(errors="replace")
>     except OSError as e:
>         # e.g. the executable named in the template is not on PATH
>         return False, str(e)
>
>
> def write_report(report_path: Path, entries):
>     report_path.parent.mkdir(parents=True, exist_ok=True)
>     generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
>     try:
>         import yaml
>         with report_path.open("w", encoding="utf-8") as fh:
>             yaml.safe_dump({"generated_at": generated_at, "entries": entries}, fh, sort_keys=False, allow_unicode=True)
>     except ImportError:
>         # fallback plain text when PyYAML is not installed
>         with report_path.open("w", encoding="utf-8") as fh:
>             fh.write(f"generated_at: {generated_at}\n")
>             for e in entries:
>                 fh.write(str(e) + "\n")
>
>
> def main():
>     p = argparse.ArgumentParser()
>     p.add_argument("--input-dir", "-i", required=True)
>     p.add_argument("--output-dir", "-o", required=True)
>     p.add_argument("--cmd-template", default='thepipe pdf2md "%IN%" -o "%OUT%"',
>                    help='Command template to run; use %%IN%% and %%OUT%% placeholders.')
>     args = p.parse_args()
>
>     input_dir = Path(args.input_dir)
>     output_dir = Path(args.output_dir)
>     if not input_dir.exists() or not input_dir.is_dir():
>         print(f"Input directory {input_dir} not found", file=sys.stderr)
>         sys.exit(2)
>     output_dir.mkdir(parents=True, exist_ok=True)
>
>     pdfs = sorted(input_dir.glob("*.pdf"))
>     if not pdfs:
>         print(f"No PDF files found in {input_dir}")
>         return
>
>     ts = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
>     report_path = Path("reports") / f"report-{ts}.yaml"
>     entries = []
>     any_failed = False
>
>     for pdf in pdfs:
>         out_md = output_dir / (pdf.stem + ".md")
>         print(f"Converting {pdf} -> {out_md}")
>         ok, err = run_conversion(pdf, out_md, args.cmd_template)
>         entry = {"file": str(pdf), "output": str(out_md), "status": "ok" if ok else "failed"}
>         if not ok:
>             entry["error"] = err[:1000]
>             any_failed = True
>             print(f"Failed: {pdf}. See report {report_path} for details.", file=sys.stderr)
>         entries.append(entry)
>
>     write_report(report_path, entries)
>     print(f"Wrote report: {report_path}")
>     if any_failed:
>         sys.exit(3)
>
>
> if __name__ == "__main__":
>     main()
> ```
>
> 3) Dockerfile
>
> ```
> FROM python:3.11-slim
>
> # system deps: poppler-utils (pdftotext), tesseract (if OCR is needed)
> RUN apt-get update && apt-get install -y --no-install-recommends \
>     poppler-utils \
>     tesseract-ocr \
>     && rm -rf /var/lib/apt/lists/*
>
> WORKDIR /app
> COPY . /app
>
> # If thepipe is this repo's Python package, install it here:
> # RUN pip install -e .
>
> # Install PyYAML (used by the scripts)
> RUN pip install --no-cache-dir pyyaml
>
> # Exec-form CMD without an ENTRYPOINT, so `docker run <image> python3 ...` can override it
> CMD ["python3", "/app/scripts/convert_pdfs.py", "-i", "pdfs", "-o", "converted-md"]
> ```
>
> 4) Makefile
>
> ```
> .PHONY: install convert docker-build docker-run
>
> # Local install (optional)
> install:
> 	python3 -m venv .venv
> 	. .venv/bin/activate && pip install --upgrade pip
> 	. .venv/bin/activate && pip install -r requirements.txt || true
> 	@echo "Virtual environment created: .venv (activate it before running the scripts)"
>
> convert:
> 	python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
>
> docker-build:
> 	docker build -t pdf2md-env .
>
> docker-run:
> 	docker run --rm -v $(PWD)/pdfs:/data/pdfs -v $(PWD)/converted-md:/data/out pdf2md-env \
> 		python3 /app/scripts/convert_pdfs.py -i /data/pdfs -o /data/out
> ```
>
> 5) docs/RUNBOOK.md
>
> ````markdown
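> # PDF -> Markdown Batch Conversion Runbook
>
> Purpose
> - Provide a reproducible, zero-to-one workflow so that anyone can batch-convert PDFs to Markdown locally or in a container, with each run recorded as a machine-readable report and an auditable log.
>
> Preparation (local)
> 1. Put the PDFs to convert in the `pdfs/` directory (or any other directory, passed to the script at run time).
> 2. Make sure the thepipe CLI is installed (or change the command template in the scripts to the conversion tool you actually use).
>    - Example (assuming thepipe is a Python package): run `pip install -e .` in the repository root.
> 3. Optional: create a Python virtual environment (see the Makefile `install` target).
>
> Run (simple)
> - Bash script:
>     chmod +x scripts/batch_convert.sh
>     ./scripts/batch_convert.sh pdfs converted-md
>
> - Python script:
>     python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
>
> Run (container)
> - Build the image:
>     docker build -t pdf2md-env .
> - Run it:
>     docker run --rm -v $(pwd)/pdfs:/data/pdfs -v $(pwd)/converted-md:/data/out pdf2md-env \
>       python3 /app/scripts/convert_pdfs.py -i /data/pdfs -o /data/out
>
> Logs and reports
> - The Bash script writes a detailed log file to `logs/`.
> - The Python script writes a structured report to `reports/report-<timestamp>.yaml` with each file's status, output, and error (on failure).
> - For a sample report, see templates/conversion_report.yaml.
>
> Sharing and reuse
> - Zip the whole directory (scripts, Dockerfile, docs) and send it to others.
> - Or push the Docker image to a private registry so that others only need to run the image.
> - Consider including a small sample PDF (or a link to a publicly available one) for quick verification.
>
> Recording changes (optional)
> - You can manage versions locally with git (pushing to GitHub is not required).
> - For each significant change (script edits, thepipe version upgrades), add an entry to CHANGELOG.md or add metadata to the reports.
>
> Troubleshooting
> - If a script reports "thepipe: command not found":
>   - Check that thepipe is on PATH, activate the virtual environment, or install it in the Dockerfile.
> - If the output Markdown is empty or the text order is scrambled:
>   - Check thepipe's parameters (an OCR step may be needed); try debugging a single file with pdftotext first.
> - When converting a large number of files:
>   - Process them in batches, or add parallel processing to the scripts (GNU parallel or multiprocessing).
>
> If you need it
> - I can turn the script into a concurrent (multi-process) version;
> - or, given thepipe's actual CLI, replace the command template in the scripts and test again;
> - or package these files as a zip for you to distribute directly.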
> ````
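
The runbook's troubleshooting section suggests parallel processing for large batches. A minimal sketch of that idea follows, assuming the same `%IN%`/`%OUT%` command template as the scripts. It uses a thread pool rather than `multiprocessing`, since each conversion already runs in its own child process. The names `build_command`, `convert_one`, and `convert_all` are hypothetical and not part of thepipe.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(template: str, pdf: Path, out_md: Path) -> list:
    """Expand the %IN%/%OUT% placeholders into an argument list (no shell)."""
    return [
        part.replace("%IN%", str(pdf)).replace("%OUT%", str(out_md))
        for part in shlex.split(template)
    ]


def convert_one(pdf: Path, out_dir: Path, template: str) -> dict:
    """Run one conversion and return a report entry like the script's."""
    out_md = out_dir / (pdf.stem + ".md")
    entry = {"file": str(pdf), "output": str(out_md), "status": "ok"}
    try:
        subprocess.run(build_command(template, pdf, out_md),
                       check=True, capture_output=True)
    except (subprocess.CalledProcessError, OSError) as exc:
        # Non-zero exit status, or the executable could not be started.
        entry["status"] = "failed"
        entry["error"] = str(exc)[:1000]
    return entry


def convert_all(pdfs, out_dir: Path, template: str, workers: int = 4) -> list:
    """Convert PDFs concurrently; results keep the input order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: convert_one(p, out_dir, template), pdfs))
```

A call such as `convert_all(sorted(Path("pdfs").glob("*.pdf")), Path("converted-md"), 'thepipe pdf2md "%IN%" -o "%OUT%"')` would mirror the sequential script, with the `thepipe pdf2md` invocation still being the assumption stated in the notes.
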
>
> 6) templates/conversion_report.yaml
>
> ```
> # conversion report template (YAML)
> # generated_at: 2025-11-14T12:00:00Z
> # entries:
> #   - file: /path/to/pdfs/example.pdf
> #     output: converted-md/example.md
> #     status: ok
> #     error: ""
> #   - file: /path/to/pdfs/bad.pdf
> #     output: converted-md/bad.md
> #     status: failed
> #     error: "Error message or excerpt"
> ```
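
A report shaped like the template above could be summarized as in the sketch below. It assumes the report has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`); the `summarize` helper is hypothetical and not part of the PR's scripts.

```python
from collections import Counter


def summarize(report: dict) -> dict:
    """Count entries by status and list the files that failed."""
    entries = report.get("entries") or []
    counts = Counter(e.get("status", "unknown") for e in entries)
    return {
        "total": len(entries),
        "ok": counts["ok"],
        "failed": counts["failed"],
        "failed_files": [e["file"] for e in entries if e.get("status") == "failed"],
    }


example = {
    "generated_at": "2025-11-14T12:00:00Z",
    "entries": [
        {"file": "/path/to/pdfs/example.pdf", "status": "ok"},
        {"file": "/path/to/pdfs/bad.pdf", "status": "failed", "error": "excerpt"},
    ],
}
print(summarize(example))
# {'total': 2, 'ok': 1, 'failed': 1, 'failed_files': ['/path/to/pdfs/bad.pdf']}
```
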
>
> 7) .github/workflows/convert-pdfs.yml
>
> ```
> name: Convert PDFs to Markdown (manual)
>
> on:
>   workflow_dispatch:
>     inputs:
>       pdf_dir:
>         description: 'Relative path to PDF directory (in repo)'
>         required: true
>         default: 'pdfs'
>       out_dir:
>         description: 'Output directory for markdown'
>         required: true
>         default: 'converted-md'
>       commit_results:
>         description: 'Commit converted files back to the repo?'
>         required: true
>         default: 'true'
>
> jobs:
>   convert:
>     runs-on: ubuntu-latest
>     permissions:
>       contents: write  # needed by the optional "git push" step
>     steps:
>       - name: Checkout repository
>         uses: actions/checkout@v4
>         with:
>           fetch-depth: 0
>
>       - name: Set up Python (if needed)
>         uses: actions/setup-python@v4
>         with:
>           python-version: '3.11'
>
>       - name: Install repository (if thepipe is python package) and dependencies
>         run: |
>           # If thepipe is this repository's Python package, install it here
>           if [ -f "pyproject.toml" ] || [ -f "setup.py" ]; then
>             pip install -e .
>           fi
>           # Install any extra tools the scripts may need (adjust as required)
>           pip install -r requirements.txt || true
>
>       - name: Run conversion script
>         env:
>           INPUT_DIR: ${{ github.event.inputs.pdf_dir }}
>           OUTPUT_DIR: ${{ github.event.inputs.out_dir }}
>         run: |
>           mkdir -p "${OUTPUT_DIR}"
>           chmod +x scripts/batch_convert.sh || true
>           ./scripts/batch_convert.sh "${INPUT_DIR}" "${OUTPUT_DIR}"
>
>       - name: Commit converted files
>         if: ${{ github.event.inputs.commit_results == 'true' }}
>         run: |
>           git config user.name "github-actions[bot]"
>           git config user.email "github-actions[bot]@users.noreply.github.com"
>           git add "${{ github.event.inputs.out_dir }}" || true
>           if git diff --staged --quiet; then
>             echo "No changes to commit"
>           else
>             git commit -m "chore: auto-convert PDFs to markdown (from workflow)"
>             git push
>           fi
> ```
>
> Notes and assumptions:
> - I did not set a base branch; please let me know if you want the PR target to be a different branch than the repository default.
> - The scripts assume the CLI command is `thepipe pdf2md <in> -o <out>`; if the real CLI differs please provide the correct invocation and I will update the scripts.
> - The workflow is manual (workflow_dispatch) and by default attempts to commit results if the input commit_results is 'true'.
>
> Please create a pull request with these additions in the repository emcf/thepipe. Thank you.
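
Because the notes above treat `thepipe pdf2md <in> -o <out>` as an assumed invocation, a wrapper could check that the executable named in the command template is actually on PATH before starting a batch. The sketch below illustrates this with the standard library's `shutil.which`; the `missing_executable` helper is hypothetical and not part of the PR's scripts.

```python
import shlex
import shutil
from typing import Optional


def missing_executable(cmd_template: str) -> Optional[str]:
    """Return the template's executable name if it cannot be found, else None."""
    parts = shlex.split(cmd_template)
    if not parts:
        raise ValueError("command template is empty")
    # shutil.which returns None when the program is not on PATH
    # (or is not an executable file when given as a path).
    return None if shutil.which(parts[0]) else parts[0]


if __name__ == "__main__":
    # The default template is the PR's assumption about the thepipe CLI.
    name = missing_executable('thepipe pdf2md "%IN%" -o "%OUT%"')
    if name:
        print(f"{name}: command not found; install it or pass a different template")
```

Running this check first turns the "command not found" case from the runbook's troubleshooting list into a single early message instead of one failure per PDF.
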