`openai tools fine_tunes.prepare_data` does not accept indented JSON files

Open reinvantveer opened this issue 2 years ago • 2 comments

To reproduce:

pip install --user openai[datalib]==0.26.4 

# Works fine:
echo '[{"prompt": "Here is my example input 1", "completion": "Complete to 1"}, {"prompt": "Here is my example input 2", "completion": "Complete to 2"}]' > unindented.json
openai tools fine_tunes.prepare_data --quiet --file unindented.json

# Doesn't work:
cat > to_indented.py << EOF
import json
with open('unindented.json', 'rt') as f:
    data = json.loads(f.read())
# Simple rewrite of the "unindented.json": output to indented version
with open('indented.json', 'wt') as f:
    f.write(json.dumps(data, indent=2))
EOF
python to_indented.py
openai tools fine_tunes.prepare_data -f indented.json

reinvantveer avatar Jan 31 '23 12:01 reinvantveer

The main issue with this is (at the risk of being over-obvious) that I like my JSON files indented for readability. Thanks for making this available!

reinvantveer avatar Jan 31 '23 13:01 reinvantveer

My first suspicion is this line: https://github.com/openai/openai-python/blob/main/openai/validators.py#L525 The (supposed) JSON file path is passed directly to pandas to read a dataframe from it, but the read somehow fails. Since this runs inside a huge try/except block, you could try removing the try/except clauses to see what pandas makes of the file and why it considers it invalid JSON.

Another approach is to build an intermediate representation: open the file, read the contents, and pass them to json.loads before handing the result to pandas. But I'm not very familiar with pandas, so I'm not sure what input it expects for a list of dicts.
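
For illustration, the intermediate-representation idea could look roughly like the sketch below. `load_records` is a hypothetical helper, not something that exists in openai-python, and the fallback to one-object-per-line parsing is an assumption about what the tool would also want to accept.

```python
import json
from pathlib import Path


def load_records(path):
    """Return a list of dicts from a JSON array file or a JSON Lines file.

    Hypothetical helper for illustration; not part of openai-python.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        # Whole-file parse: works for compact and indented JSON alike.
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single top-level object is treated as one record.
    return data if isinstance(data, list) else [data]
```

As for what pandas expects: `pd.DataFrame(...)` accepts a list of dicts directly, so the result of `load_records("indented.json")` could be handed to it as-is, with each dict becoming a row.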

reinvantveer avatar Feb 01 '23 20:02 reinvantveer

@BorisPower @joe-at-openai can y'all look at this? I suspect this broke with https://github.com/openai/openai-python/pull/190 for JSON parsing, specifically on `df = pd.read_json(fname, lines=True, dtype=str).fillna("")`, since indented .json files span multiple lines.
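
A small stdlib-only sketch of why that would matter: with `lines=True`, pandas reads the file as one JSON document per line, so an indented file no longer qualifies even though it is valid JSON as a whole. The helper name `parses_line_by_line` is made up for this illustration, and the check only mimics the per-line assumption rather than calling pandas itself.

```python
import json

records = [
    {"prompt": "Here is my example input 1", "completion": "Complete to 1"},
    {"prompt": "Here is my example input 2", "completion": "Complete to 2"},
]


def parses_line_by_line(text):
    """Return True if every non-empty line is a complete JSON document."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
    except json.JSONDecodeError:
        return False
    return True


compact = json.dumps(records)             # everything on a single line
indented = json.dumps(records, indent=2)  # same data spread over many lines

print(parses_line_by_line(compact))   # True
print(parses_line_by_line(indented))  # False: the first line is just "["
```

Both strings decode to the same data with a plain `json.loads`, which is why the indented file looks fine to a human but not to a line-oriented reader.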

hallacy avatar Apr 08 '23 15:04 hallacy

@hallacy Yeah, no problem!

joe-at-openai avatar Apr 08 '23 23:04 joe-at-openai

https://github.com/openai/openai-python/pull/389 should fix this

hallacy avatar Apr 10 '23 15:04 hallacy