openai-python
`openai tools fine_tunes.prepare_data` does not accept indented JSON files
To reproduce:
pip install --user openai[datalib]==0.26.4
# Works fine:
echo '[{"prompt": "Here is my example input 1", "completion": "Complete to 1"}, {"prompt": "Here is my example input 2", "completion": "Complete to 2"}]' > unindented.json
openai tools fine_tunes.prepare_data --quiet --file unindented.json
# Doesn't work:
cat > to_indented.py << EOF
import json
with open('unindented.json', 'rt') as f:
    data = json.loads(f.read())
# Rewrite the contents of "unindented.json" as an indented file
with open('indented.json', 'wt') as f:
    f.write(json.dumps(data, indent=2))
EOF
python to_indented.py
openai tools fine_tunes.prepare_data -f indented.json
The main issue with this is (at the risk of being over-obvious) that I like my JSON files indented for readability. Thanks for making this available!
My first suspicion is this line: https://github.com/openai/openai-python/blob/main/openai/validators.py#L525 The (supposed) JSON file path is passed directly to pandas to read a DataFrame from it, but that somehow fails. Since this is executed inside a huge try/except block, you could remove the try/except clauses to see what pandas makes of the file and why it thinks it's invalid JSON.
Another way to approach this is to build an intermediate representation: open the file, read the contents, and pass them to json.loads before handing the result to pandas. But I'm not very familiar with pandas, so I'm not sure what it expects as input from a list of dicts.
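A minimal sketch of that intermediate step, using only the standard library. The helper name `load_records` and the check that the file holds a JSON array of objects are my assumptions, not anything taken from validators.py:

```python
import json
import os
import tempfile


def load_records(path):
    """Parse the whole file as a single JSON document, so indentation is irrelevant."""
    with open(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the fine-tuning data is a JSON array of objects (a list of dicts).
    if not isinstance(data, list) or not all(isinstance(row, dict) for row in data):
        raise ValueError(f"{path}: expected a JSON array of objects")
    return data


examples = [
    {"prompt": "Here is my example input 1", "completion": "Complete to 1"},
    {"prompt": "Here is my example input 2", "completion": "Complete to 2"},
]

# Write an indented copy to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "indented.json")
    with open(path, "wt", encoding="utf-8") as f:
        json.dump(examples, f, indent=2)
    records = load_records(path)

print(records == examples)
```

The resulting list of dicts could then be handed to `pd.DataFrame(records, dtype=str).fillna("")`, which builds one row per dict. I haven't checked how that would fit into the rest of the validator.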
@BorisPower @joe-at-openai can y'all look at this? I suspect this broke with https://github.com/openai/openai-python/pull/190 for JSON parsing, specifically on `df = pd.read_json(fname, lines=True, dtype=str).fillna("")`, since indented .json files span multiple lines.
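To illustrate why line-oriented parsing would trip over an indented file, here is a rough stand-in that treats every non-empty line as its own JSON document. It is not pandas' actual implementation of `lines=True`, and `parses_as_json_lines` is a hypothetical helper written for this sketch:

```python
import json


def parses_as_json_lines(text):
    """Return True if every non-empty line is a standalone JSON document."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
    except json.JSONDecodeError:
        return False
    return True


records = [
    {"prompt": "Here is my example input 1", "completion": "Complete to 1"},
    {"prompt": "Here is my example input 2", "completion": "Complete to 2"},
]

compact = json.dumps(records)             # everything on one line
indented = json.dumps(records, indent=2)  # first line is just "["

print(parses_as_json_lines(compact))   # True
print(parses_as_json_lines(indented))  # False
```

The compact file is a single line that parses on its own, while the indented file starts with a lone `[`, which is not valid JSON by itself.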
@hallacy Yeah, no problem!
https://github.com/openai/openai-python/pull/389 should fix this