
How to ensure reliability

Open hnsqls opened this issue 4 months ago • 4 comments

Why does such a simple task go wrong? The first time I ran the program the extraction was correct, but after several more runs the output became chaotic. I'm using a locally deployed Ollama model, qwen:7b. Additionally, the Chinese characters in the generated visualization are garbled. How can I solve these issues?

My code:

import langextract as lx
import textwrap


# 1. Define the extraction task: tell the model what to extract and how (simplified)
prompt = "从文本中提取人名"  # "Extract person names from the text"


# 2. Provide examples so the model extracts more precisely (simplified example)
examples = [
    lx.data.ExampleData(
        text="张三是项目经理",  # "Zhang San is the project manager"
        extractions=[
            lx.data.Extraction(
                extraction_class="personName",
                extraction_text="张三"
            )
        ]
    )
]


# 3. Prepare the input text to extract from (simplified text)
input_text = "李四负责技术开发,李明负责项目管理"  # "Li Si handles technical development, Li Ming handles project management"



# 4. Run the extraction (using the local Ollama model)
try:
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        language_model_type=lx.inference.OllamaLanguageModel,
        model_id="qwen:7b",  # the local model you just started
        model_url="http://localhost:11434",
        fence_output=False,
        use_schema_constraints=False
    )
    
    print("✅ Model connection succeeded!")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("Please make sure the Ollama service is running")
    exit(1)


    
# 5. Inspect the results
print("=== Extraction results ===")
for extraction in result.extractions:
    print(f"Class: {extraction.extraction_class}")
    print(f"Text: {extraction.extraction_text}")
    print(f"Attributes: {extraction.attributes}")
    print("-" * 30)

# 6. Visualize the results
print("\n🎨 Generating visualization...")

# Step 1: save the extraction results as a JSONL file
print("📁 Saving extraction results to file...")
lx.io.save_annotated_documents(
    [result],                                # list of extraction results
    output_name="extraction_results.jsonl",  # output file name
    output_dir="./temp"                      # output directory
)
print("✅ Saved to: temp/extraction_results.jsonl")

# Step 2: generate the HTML visualization
print("🌐 Generating HTML visualization...")
html_content = lx.visualize("temp/extraction_results.jsonl")  # build the visualization from the file
with open("temp/visualization.html", "w", encoding="utf-8") as f:
    f.write(html_content)
print("✅ Generated: temp/visualization.html")

print("\n🎉 Visualization complete!")
print("📂 Generated files:")
print("  - temp/extraction_results.jsonl (extraction data)")
print("  - temp/visualization.html (visualization page)")
print("\n💡 Open temp/visualization.html to view the results")
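The "����" characters in the output usually mean UTF-8 bytes were decoded with the wrong codec somewhere between the model response and the file on disk. Below is a minimal stdlib-only sketch of how this kind of mojibake arises and why forcing UTF-8 (as the script above does for the HTML file) matters; it is a generic illustration, not langextract's internal code, and Latin-1 stands in for "some wrong platform default":

```python
import json

s = "李四负责技术开发"  # the Chinese input text from the script above

# Decoding UTF-8 bytes with the wrong codec produces garbage characters
# like the ones reported in this issue:
raw = s.encode("utf-8")
garbled = raw.decode("latin-1")  # wrong codec -> mojibake

# If the underlying bytes survived intact, the text is recoverable by
# reversing the wrong decode:
repaired = garbled.encode("latin-1").decode("utf-8")
assert repaired == s

# When writing JSONL yourself, force UTF-8 and keep non-ASCII readable:
line = json.dumps({"extraction_text": s}, ensure_ascii=False)
with open("check.jsonl", "w", encoding="utf-8") as f:
    f.write(line + "\n")
```

If the bytes were already replaced with U+FFFD at decode time, the round trip above cannot recover them; in that case the fix has to happen where the response is first decoded.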

hnsqls avatar Aug 08 '25 07:08 hnsqls

Try its multilingual branch.

peterhunter99001-cyber avatar Aug 08 '25 07:08 peterhunter99001-cyber

{"extractions": [{"extraction_class": "personName", "extraction_text": "����", "char_interval": null, "alignment_status": null, "extraction_index": 1, "group_index": 0, "description": null, "attributes": {}}, {"extraction_class": "personName", "extraction_text": "����", "char_interval": null, "alignment_status": null, "extraction_index": 2, "group_index": 1, "description": null, "attributes": {}}], "text": "���ĸ���������,����������Ŀ����", "document_id": "doc_1fb9b717"}
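One way to tell whether the corruption in a JSONL like the one above is reversible mojibake or an already-lossy decode is to scan for U+FFFD replacement characters. This is a hedged diagnostic sketch (the `find_mojibake` helper and the file name are illustrative, not part of langextract); it checks only top-level string fields such as `"text"`:

```python
import json

def find_mojibake(path: str):
    """Return (line_number, field, value) triples whose value contains U+FFFD."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            obj = json.loads(line)
            for key, value in obj.items():
                if isinstance(value, str) and "\ufffd" in value:
                    hits.append((n, key, value))
    return hits

# Demo: write one line that simulates a lossy decode, then scan it.
with open("temp_check.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"document_id": "doc_x", "text": "李\ufffd四"},
                       ensure_ascii=False) + "\n")

hits = find_mojibake("temp_check.jsonl")
```

If this reports hits, the characters were lost before the file was written, and re-encoding the file will not bring them back.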

hnsqls avatar Aug 08 '25 08:08 hnsqls

#98 This might be fixed in my latest PR; please try pulling the latest main branch.

wade6716 avatar Aug 08 '25 13:08 wade6716

Hi @hnsqls,

PR #239 should address the Ollama reliability issues you reported. It includes Ollama-specific optimizations that improve extraction consistency and success rates.

If reliability issues persist after the PR is merged, please reopen with details from your logs and your specific model configuration so we can investigate further.

Thank you for reporting this issue and helping improve LangExtract!
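Until the fix lands, intermittent bad outputs from a local model can be smoothed over with a retry-plus-validation loop. The sketch below is generic: `run_extract` and `validate` are hypothetical callables you supply (e.g. a closure over the `lx.extract` call from the original script, and a check that at least one extraction came back):

```python
def extract_with_retry(run_extract, validate, max_attempts=3):
    """Call run_extract() up to max_attempts times, returning the first
    result that validate() accepts; raise if every attempt fails."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = run_extract()
            if validate(result):
                return result
            last_error = ValueError(f"validation rejected attempt {attempt}")
        except Exception as exc:  # model/transport errors count as a failed attempt
            last_error = exc
    raise RuntimeError(f"extraction failed after {max_attempts} attempts") from last_error

# Demo with stand-in callables: the fake extractor returns nothing twice,
# then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return ["张三"] if calls["n"] >= 3 else []

result = extract_with_retry(flaky, validate=lambda r: len(r) > 0)
```

This does not fix the underlying model nondeterminism; it only makes a script tolerate occasional bad runs.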

aksg87 avatar Sep 12 '25 11:09 aksg87