Open-Assistant
Data augmentation for legal datasets
Convert the dataset https://huggingface.co/datasets/cuad into instructions. Rephrase some of the questions already present as more natural instructions, e.g. "Highlight the X" -> "Please help me find the X" or "What is the X?".
Save the data in JSONL format and we can upload it to the HF hub when done.
This will be used (after some cleanup) as additional training data to diversify our dataset.
I'd like to pick up this issue. The idea is basically the same as you describe above:
- extract the contract-related keywords (or 'Details') as X from the question
- convert the question to instructions: "Highlight the X" -> "Please help me find {X}" or "What is the {X}".
Sample:
"context": {long....text}
"answers": {'text': ['DISTRIBUTOR AGREEMENT']}
'question': 'Highlight the parts (if any) of this contract related to "Document Name" that should be reviewed by a lawyer. Details: The name of the contract'
->
'instructions': "Please help me find the Document Name" or 'What is the Document Name? '
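The two steps above can be sketched roughly as follows. This is a minimal sketch, not the Colab code: it assumes every question follows the `related to "X"` pattern seen in the sample, and the function names and output field names (`instructions`, `context`, `answers`) are hypothetical choices for illustration.

```python
import json
import re
from typing import List, Optional

# Assumption: the keyword X always appears as: related to "X"
KEYWORD_RE = re.compile(r'related to "([^"]+)"')


def extract_keyword(question: str) -> Optional[str]:
    """Return the quoted keyword X from a CUAD-style question, or None."""
    match = KEYWORD_RE.search(question)
    return match.group(1) if match else None


def to_instructions(question: str) -> List[str]:
    """Rewrite 'Highlight the ... X ...' into more natural instructions."""
    keyword = extract_keyword(question)
    if keyword is None:
        return []  # pattern not found; leave such rows for manual review
    return [f"Please help me find the {keyword}", f"What is the {keyword}?"]


def to_jsonl_lines(row: dict) -> List[str]:
    """Serialize one row into one JSONL line per generated instruction."""
    return [
        json.dumps(
            {
                "instructions": instruction,
                "context": row["context"],
                "answers": row["answers"]["text"],
            },
            ensure_ascii=False,
        )
        for instruction in to_instructions(row["question"])
    ]


# Row shaped like the sample above (context shortened to a placeholder).
sample = {
    "context": "long contract text",
    "answers": {"text": ["DISTRIBUTOR AGREEMENT"]},
    "question": (
        'Highlight the parts (if any) of this contract related to "Document Name" '
        "that should be reviewed by a lawyer. Details: The name of the contract"
    ),
}

lines = to_jsonl_lines(sample)
```

Each element of `lines` is one JSON object, so writing them to a file one per line gives the JSONL format asked for in the issue.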
Check out the Colab code here to see whether I'm heading in the right direction.
I have also uploaded the augmented dataset to the HF hub: cuad-instructions
Closing old data issue.