Data augmentation for legal datasets

Open huu4ontocord opened this issue 2 years ago • 1 comments

Convert the dataset https://huggingface.co/datasets/cuad into instructions. Rewrite some of the questions already present as more natural instructions, for example "Highlight the X" -> "Please help me find the X" or "What is the X?".

Save the data in JSONL format, and we can upload it to the HF hub when done.

This will be used (after some cleanup) as additional training data to diversify our dataset.
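
For the JSONL step, a minimal sketch might look like the following. The field names (`instructions`, `answers`) are assumptions for illustration, not a schema agreed on in this issue, and `write_jsonl` is a hypothetical helper. It writes to an in-memory buffer here; in practice you would pass a file opened with `open(path, "w", encoding="utf-8")`.

```python
import io
import json


def write_jsonl(records, fp):
    """Write one JSON object per line to an open text file object."""
    for record in records:
        # ensure_ascii=False keeps any non-ASCII contract text readable.
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical record layout; adjust the keys to whatever schema is chosen.
records = [
    {
        "instructions": "What is the Document Name?",
        "answers": {"text": ["DISTRIBUTOR AGREEMENT"]},
    },
]

buffer = io.StringIO()
write_jsonl(records, buffer)
lines = buffer.getvalue().splitlines()
```

Each line of the output can be parsed independently with `json.loads`, which is what makes the format convenient for large datasets.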

huu4ontocord avatar Jan 14 '23 18:01 huu4ontocord

I'd like to pick up this issue. The idea is basically the same as you describe above:

  1. Extract the contract-related keyword (or 'Details') as X from the question.
  2. Convert the question into instructions: "Highlight the X" -> "Please help me find {X}" or "What is the {X}".

Sample:

"context": {long....text}
"answers": {'text': ['DISTRIBUTOR AGREEMENT']}
"question": 'Highlight the parts (if any) of this contract related to "Document Name" that should be reviewed by a lawyer. Details: The name of the contract'
 ->
"instructions": "Please help me find the Document Name" or "What is the Document Name?"
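
The two steps above could be sketched roughly as follows. This is not the Colab code, only an illustration: it assumes the keyword X is always the double-quoted span after "related to" in the question, as in the sample, and the names `CATEGORY_RE`, `TEMPLATES`, `extract_category`, and `to_instructions` are hypothetical.

```python
import re

# Assumption: X is the double-quoted span following "related to",
# as in the sample question above.
CATEGORY_RE = re.compile(r'related to "([^"]+)"')

# The two phrasings suggested in this issue.
TEMPLATES = ("Please help me find the {X}", "What is the {X}?")


def extract_category(question):
    """Return the quoted keyword (X) from a question, or None if absent."""
    match = CATEGORY_RE.search(question)
    return match.group(1) if match else None


def to_instructions(question):
    """Build one instruction per template; empty list if no keyword is found."""
    category = extract_category(question)
    if category is None:
        return []
    return [template.format(X=category) for template in TEMPLATES]


question = (
    'Highlight the parts (if any) of this contract related to "Document Name" '
    "that should be reviewed by a lawyer. Details: The name of the contract"
)
instructions = to_instructions(question)
```

Questions that do not match the pattern return an empty list rather than a malformed instruction, so they can be reviewed by hand.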

Check out the Colab code here to see whether I'm heading in the right direction.

I have also uploaded the augmented dataset to the HF hub as cuad-instructions.

zirui avatar Jan 28 '23 11:01 zirui

Closing old data issue.

andreaskoepf avatar Jun 14 '23 08:06 andreaskoepf