feat: support automatic language code determination with `guesslang`

Open lh0x00 opened this issue 10 months ago • 2 comments

Pull Request Description

Summary

This PR adds one feature to _CustomMarkdownify:

  1. Automatic Code Language Detection:
    • Uses guesslang to detect programming languages in code snippets.
    • Provides a fallback if guesslang is not installed.

Changes

  • Imported Guess from guesslang with a fallback.
  • Added code_language_callback to detect code languages.
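The two changes above could be sketched roughly as follows. This is an illustrative sketch, not the PR's actual diff. It assumes guesslang's `Guess().language_name(source)` call and the single-argument callback shape that markdownify's `code_language_callback` option uses (it receives the BeautifulSoup `pre` element). The helper `detect_language` and its `detector` parameter are hypothetical names introduced here so the logic can be exercised without guesslang installed.

```python
from typing import Callable, Optional

try:
    # guesslang is an optional dependency; if it (or its backend) cannot be
    # loaded, detection is simply disabled.
    from guesslang import Guess

    _guess = Guess()
except Exception:
    _guess = None


def detect_language(
    code: str,
    detector: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Return a lowercase language tag for `code`, or "" when unknown.

    `detector` is a hypothetical injection point (not part of the PR); it
    defaults to guesslang's `language_name` when guesslang is available.
    """
    if not code.strip():
        return ""  # nothing to classify
    if detector is None:
        if _guess is None:
            return ""  # fallback: guesslang missing, emit an untagged fence
        detector = _guess.language_name
    try:
        name = detector(code)
    except Exception:
        return ""  # never let detection break the conversion
    return name.lower() if name else ""


def code_language_callback(el) -> str:
    """Callback in the shape markdownify expects: `el` is the <pre> element."""
    return detect_language(el.get_text())
```

Returning an empty string on every failure path keeps the conversion working whether or not guesslang is installed. The only difference is that the code fence comes out without a language tag.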

How to Test

  1. Install guesslang via pip install guesslang.
  2. Convert a document with code snippets.
  3. Verify correct language detection.

Thank you for reviewing!

lh0x00 avatar Feb 10 '25 12:02 lh0x00

Love this idea, but I'm thinking of using magika for type detection (See #1108). Since magika can do code language detection as well, would you consider updating this PR to use it instead (to avoid another dependency)?

You should be able to just do:

from magika import Magika
m = Magika()
res = m.identify_bytes(b".... contents of the code block ...")
if res.status == "ok" and res.prediction.output.group in ["text", "code"]:
    language = res.prediction.output.label

Or something like that.

afourney avatar Mar 08 '25 18:03 afourney

I rebased it onto your magika branch and updated it as requested. It should be merged after the magika branch. Thanks!

lh0x00 avatar Mar 10 '25 03:03 lh0x00