markitdown
feat: support automatic language code determination with `guesslang`
Pull Request Description
Summary
This PR enhances `_CustomMarkdownify` with a feature:
- Automatic Code Language Detection:
  - Uses `guesslang` to detect programming languages in code snippets.
  - Provides a fallback if `guesslang` is not installed.
Changes
- Imported `Guess` from `guesslang` with a fallback.
- Added `code_language_callback` to detect code languages.
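
As a rough illustration of the two changes listed above, the optional import and the callback might look something like the sketch below. This is not the PR's actual diff. It assumes `guesslang` provides `Guess().language_name(...)`, and that the callback receives an element with a `get_text()` method, as BeautifulSoup tags do. Returning an empty string when no language can be determined is also an assumption.

```python
# Optional dependency: degrade gracefully when guesslang is not installed.
try:
    from guesslang import Guess  # third-party, optional

    _guesser = Guess()
except ImportError:
    _guesser = None  # fallback: language detection is disabled


def code_language_callback(el):
    """Return a language label for a code element, or "" if unknown.

    `el` is assumed to expose get_text(), as BeautifulSoup tags do.
    """
    source = el.get_text()
    if not source.strip():
        # Nothing to classify.
        return ""
    if _guesser is None:
        # guesslang is unavailable, so emit an unlabeled code fence.
        return ""
    name = _guesser.language_name(source)  # may be None when unsure
    # Lowercasing the name for the fence label is an assumption of this sketch.
    return name.lower() if name else ""
```

With this shape, a missing `guesslang` install only means code fences are emitted without a language label; conversion itself is unaffected.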
How to Test
- Install `guesslang` via `pip install guesslang`.
- Convert a document with code snippets.
- Verify correct language detection.
Thank you for reviewing!
Love this idea, but I'm thinking of using magika for type detection (See #1108). Since magika can do code language detection as well, would you consider updating this PR to use it instead (to avoid another dependency)?
You should be able to just do:
    from magika import Magika
    m = Magika()
    res = m.identify_bytes(b".... contents of the code block ...")
    if res.status == "ok" and res.prediction.output.group in ["text", "code"]:
        language = res.prediction.output.label
Or something like that.
I rebased it onto your magika branch and updated it as requested. It should be merged after the magika change. Thanks!