Korean LegalQA using SentenceKoBART

  • LegalQA using SentenceKoBART
    • Setup
    • Index
    • Train
      • Learn to Rank with KoBERT
    • Search
      • With REST API
      • From the terminal
        • Approximate KNN Search
    • Presentation
    • Demo
    • Links
    • FAQ
      • Why this dataset?
      • LFS quota is exceeded
    • Citation
    • License

LegalQA using SentenceKoBART

Implementation of a legal QA system based on SentenceKoBART

  • How to train SentenceKoBART
  • Built on the neural search engine Jina v2.0
  • Provides Korean legal QA data (1,830 pairs)
  • Applies approximate KNN search with Faiss, Annoy, and Hnswlib

Setup

# Install Git LFS: https://github.com/git-lfs/git-lfs/wiki/Installation
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git clone https://github.com/haven-jeon/LegalQA.git
cd LegalQA
git lfs pull
# If the LFS quota is exceeded, download the model with the commands below instead.
# wget http://gogamza.ipdisk.co.kr:80/gogamzapubs/VOL1/URLs/models/SentenceKoBART.bin
# mv SentenceKoBART.bin model/
pip install -r requirements.txt

Index

python app.py -t index

GPU-based indexing is available as an option. To enable it, set the device in the encoder pod configuration:

  • pods/encode.yml - device: cuda

Train

SentenceKoBART is not fine-tuned on the legal task, so it gives good recall but falls short on precision. Re-ranking the top-k results with a cross-encoder makes up for the gap in precision.

  • Model: general-purpose ranking
  • Learn to Rank: task-specific ranking
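
The two-stage idea above can be sketched in plain Python. This is an illustration only, not the code used in this repository: the toy vectors stand in for SentenceKoBART embeddings, and `pair_score` is a hypothetical stand-in for the cross-encoder (here a simple word-overlap count).

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             doc_vecs: Sequence[Sequence[float]],
             top_k: int) -> List[int]:
    """Stage 1 (recall): indices of the top_k documents by embedding similarity."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:top_k]]


def rerank(query: str,
           docs: Sequence[str],
           candidates: Sequence[int],
           pair_score: Callable[[str, str], float]) -> List[int]:
    """Stage 2 (precision): reorder candidates with a scorer that sees
    the query and the document together, as a cross-encoder would."""
    return sorted(candidates, key=lambda i: pair_score(query, docs[i]), reverse=True)


def word_overlap(query: str, doc: str) -> float:
    """Hypothetical scorer used only for this demo (not a real cross-encoder)."""
    return float(len(set(query.split()) & set(doc.split())))


# Toy data: made-up 2-d vectors in place of real sentence embeddings.
docs = ["inheritance tax", "traffic accident", "inheritance dispute between siblings"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query, query_vec = "inheritance tax", [1.0, 0.1]

candidates = retrieve(query_vec, doc_vecs, top_k=2)
reranked = rerank(query, docs, candidates, word_overlap)
```

In this toy run the first stage keeps both inheritance documents and drops the unrelated one; the second stage then moves the exact match to the top.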

Learn to Rank with KoBERT

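Initial training classifies whether a title from the dataset and a question form a related pair, as in the examples below.

Why BERT?

  • To take advantage of BERT's next sentence prediction (NSP) pretraining, which already judges whether two segments belong together.

[CLS] title [SEP] question [SEP]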

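title | question | label
오토바이의 고속도로 주행금지가 행복추구권 등을 침해한 것은 아닌지 여부 (Whether the ban on motorcycles on expressways infringes the right to pursue happiness, etc.) | 甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ... (A has always liked motorcycles and toured the country by motorcycle on weekends and holidays. However, ...) | positive
피해자과실로 인한 교통사고로 개인택시사업면허가 취소된 경우 (When a private taxi business license is revoked over a traffic accident caused by the victim's negligence) | 甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ... (A has always liked motorcycles and toured the country by motorcycle on weekends and holidays. However, ...) | negative

python app.py -t train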
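
Labeled rows like the ones in the table above could be produced roughly as follows. This is a sketch under assumptions, not the repository's training code: the record keys `title` and `question` are hypothetical field names, and drawing one random unrelated title per question is just one possible negative-sampling scheme.

```python
import random
from typing import Dict, List, Tuple


def build_pairs(records: List[Dict[str, str]], seed: int = 0) -> List[Tuple[str, str, str]]:
    """Return (title, question, label) triples.

    Each question is paired with its own title (positive) and with one
    randomly chosen title from a different record (negative).
    """
    rng = random.Random(seed)
    titles = [r["title"] for r in records]
    pairs = []
    for r in records:
        pairs.append((r["title"], r["question"], "positive"))
        others = [t for t in titles if t != r["title"]]
        if others:  # skip the negative if there is no other title to draw from
            pairs.append((rng.choice(others), r["question"], "negative"))
    return pairs


def to_bert_input(title: str, question: str) -> str:
    """Show the sentence-pair layout as a plain string.

    A real tokenizer normally inserts [CLS] and [SEP] itself when given
    two segments; this string only illustrates the order of the parts.
    """
    return f"[CLS] {title} [SEP] {question} [SEP]"


# Hypothetical records; real data would come from data/legalqa.jsonlines.
records = [
    {"title": "title A", "question": "question about A"},
    {"title": "title B", "question": "question about B"},
]
pairs = build_pairs(records)
```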

The trained model is saved in the rerank_model directory.

A KoBERT model fine-tuned on LegalQA is provided as gogamza/kobert-legalqa-v1.

Search

With REST API

To start the Jina server for REST API:

# python app.py -t query_restful --query_flow flows/query_numpy_rerank.yml
python app.py -t query_restful 

Then use a client to query:

curl --request POST -d '{"parameters": {"top_k": 1},  "data": ["상속 관련 문의"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:1234/search'

Or use Jinabox with endpoint http://127.0.0.1:1234/search
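
The same request can be sent from Python with only the standard library. The sketch below mirrors the curl call above (same endpoint and payload shape); the function names are made up for this example, and the response is returned as parsed JSON without assuming anything about its fields.

```python
import json
import urllib.request


def build_search_request(query: str, top_k: int = 1,
                         endpoint: str = "http://127.0.0.1:1234/search") -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = {"parameters": {"top_k": top_k}, "data": [query]}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 1, timeout: float = 30.0):
    """Send the query to a running server and return the decoded JSON response."""
    request = build_search_request(query, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started as shown above, `search("상속 관련 문의", top_k=1)` sends the same query as the curl command.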

From the terminal

# python app.py -t query --query_flow flows/query_numpy_rerank.yml
python app.py -t query

Approximate KNN Search

python app.py -t query_restful --query_flow flows/query_hnswlib_rerank.yml

python app.py -t query_restful --query_flow flows/query_faiss_rerank.yml

python app.py -t query_restful --query_flow flows/query_annoy_rerank.yml

  • Retrieval time (sec.)
    • AMD Ryzen 5 PRO 4650U, 16 GB memory
    • Average of 100 searches
    • Excluding BertReRanker

top-k | Numpy | Hnswlib | Faiss | Annoy
10 | 1.433 | 0.101 | 0.131 | 0.118
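
An "average of N searches" figure like the one above can be measured with a small timing loop. The helper below is a sketch of one way to do it, not the script that produced the reported numbers; `average_latency` and `speedup` are names made up for this example, and `search_fn` is any callable that runs one search.

```python
import time
from statistics import mean
from typing import Callable


def average_latency(search_fn: Callable[[str], object], query: str, runs: int = 100) -> float:
    """Mean wall-clock seconds per call over `runs` calls of search_fn(query)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn(query)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def speedup(baseline_sec: float, other_sec: float) -> float:
    """How many times faster `other_sec` is than `baseline_sec`."""
    return baseline_sec / other_sec


# Retrieval times reported above (top-k = 10, seconds).
reported = {"Numpy": 1.433, "Hnswlib": 0.101, "Faiss": 0.131, "Annoy": 0.118}
speedups = {name: speedup(reported["Numpy"], sec) for name, sec in reported.items()}
```

Applied to the reported figures, the approximate indexes come out roughly 11 to 14 times faster than the exact NumPy search.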

Presentation

Demo

Links

FAQ

Why this dataset?

Legal data is full of technical terms, so it is hard to search if you are not familiar with them. That makes it a good example for showing the effectiveness of neural IR.

LFS quota is exceeded

You can download SentenceKoBART.bin from one of the two links below.

  • http://gogamza.ipdisk.co.kr:80/gogamzapubs/VOL1/URLs/models/SentenceKoBART.bin
  • https://komodels.s3.ap-northeast-2.amazonaws.com/models/SentenceKoBART.bin
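
If you prefer to script the fallback, something like the following tries each link in turn and skips the download when the file is already in place. It is a minimal sketch: `ensure_model` is a name made up for this example, and the target path assumes the model/ directory used in the Setup section.

```python
import os
import urllib.request
from typing import Callable, Sequence

MODEL_URLS = (
    "http://gogamza.ipdisk.co.kr:80/gogamzapubs/VOL1/URLs/models/SentenceKoBART.bin",
    "https://komodels.s3.ap-northeast-2.amazonaws.com/models/SentenceKoBART.bin",
)


def ensure_model(path: str = "model/SentenceKoBART.bin",
                 urls: Sequence[str] = MODEL_URLS,
                 fetch: Callable[[str, str], object] = urllib.request.urlretrieve) -> str:
    """Return `path`, downloading the file from the first working URL if it is missing."""
    if os.path.exists(path):
        return path
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    last_error = None
    for url in urls:
        try:
            fetch(url, path)
            return path
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            last_error = err
    raise RuntimeError(f"could not download {path}") from last_error
```

The `fetch` parameter exists only so the download step can be swapped out, for example in tests.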

Citation

Model training, data crawling, and demo system were all supported by the AWS Hero program.

@misc{heewon2021,
author = {Heewon Jeon},
title = {LegalQA using SentenceKoBART},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/LegalQA}}
}

License

  • The QA data (data/legalqa.jsonlines) was crawled from www.freelawfirm.co.kr in accordance with its robots.txt. It may be used for academic purposes only; commercial use is prohibited.
  • We are not responsible for any legal decisions you make based on the resources provided here.