Data Extraction from Markdown with Tables
I am exploring whether langextract is a good option for extracting values from a Markdown file that includes structured tables. If I have 3 tables, each with two columns, is langextract a good way to retrieve the value for a specific row? If yes, any guidance on how to set up the examples in that scenario?
Hello. It is viable to do so. You'll need to write a clear prompt (what LX should extract for you) and keep a separate list of the fields you'd like to extract. First, load your data (the .md file); then, after getting the result from lx.extract(...), map the extracted entities to key-value pairs. A global dictionary would help here to store the required key-value pairs. Afterward, you can save the structured output to a .jsonl file and visualize that file in HTML format via lx.visualize(...).
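The mapping step could look roughly like the sketch below. It assumes each extraction exposes extraction_class and extraction_text attributes; the field names, the sample values, and the to_key_values helper are hypothetical, and SimpleNamespace objects stand in for what the library would return so the snippet runs without an API key.

```python
from types import SimpleNamespace

# Hypothetical list of fields to pull out of the tables.
WANTED_FIELDS = ["invoice_number", "total_due"]


def to_key_values(extractions, wanted_fields):
    """Map extractions to {field: value}, keeping the first hit per field."""
    values = {}
    for ext in extractions:
        key = ext.extraction_class
        if key in wanted_fields and key not in values:
            values[key] = ext.extraction_text
    return values


# Stand-in objects (made-up data); with the real library these would be
# the extractions attached to the result of lx.extract(...).
fake_extractions = [
    SimpleNamespace(extraction_class="invoice_number", extraction_text="INV-001"),
    SimpleNamespace(extraction_class="total_due", extraction_text="250.00 EUR"),
    SimpleNamespace(extraction_class="vendor", extraction_text="Acme"),
]

row_values = to_key_values(fake_extractions, WANTED_FIELDS)
print(row_values)
```

Keeping only the first hit per field is a design choice for this sketch; if the same row label appears in more than one table, you would probably want to key on an attribute as well.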
Hope it helps.
p.s. The examples should contain the fields you want to extract. Each field can be an extraction_class, and its corresponding answer should go in extraction_text. Of course, LangExtract also lets you set attributes if needed.
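A few-shot example for a two-column table might be set up along these lines. This is only a sketch: it assumes the lx.data.ExampleData and lx.data.Extraction classes as shown in the LangExtract README, and the table contents, the class names (invoice_number, total_due), and the "table" attribute are made up for illustration. The import is deferred into the function so the field list can be checked without the library installed.

```python
# Hypothetical two-column Markdown table used as the few-shot example text.
EXAMPLE_MD = """\
| Field | Value |
|---|---|
| Invoice number | INV-001 |
| Total due | 250.00 EUR |
"""

# (extraction_class, extraction_text, attributes): one entry per row to extract.
# Each extraction_text is copied verbatim from EXAMPLE_MD.
EXAMPLE_FIELDS = [
    ("invoice_number", "INV-001", {"table": "summary"}),
    ("total_due", "250.00 EUR", {"table": "summary"}),
]


def build_examples():
    """Build the few-shot examples list to pass to lx.extract(...)."""
    import langextract as lx  # deferred: needs `pip install langextract`

    extractions = [
        lx.data.Extraction(
            extraction_class=cls,
            extraction_text=text,
            attributes=attrs,
        )
        for cls, text, attrs in EXAMPLE_FIELDS
    ]
    return [lx.data.ExampleData(text=EXAMPLE_MD, extractions=extractions)]
```

The returned list would then go into the examples argument of lx.extract(...), alongside your prompt and the real .md content.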
Thanks for the comprehensive response @MeshkatShB, I completely agree. TL;DR: markdown is a pure text representation that you can model in a few-shot example for the LLM.
If anyone has a nice example of this working well, I’d love to add it as an example since I think it’s a cool use case. Another pointer is that attributes allow for slightly more flexibility vs. the extraction_class, which you can think of as the “main” attribute. For more historical context, LX was originally written with extraction_class first, but as LLMs became more powerful, attributes were added later :)