Creating your own dataset load_dataset issue

Open fancellu opened this issue 1 year ago • 2 comments

https://huggingface.co/learn/nlp-course/chapter5/5?fw=pt

https://discuss.huggingface.co/t/chapter-5-questions/11744/83?u=fancellu

from datasets import load_dataset

issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")

barfs with

TypeError: Couldn't cast array of type timestamp[s] to null

Someone else saw the same error back in Sept 2023
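One way to narrow this down is to look for fields that are null in some rows and populated in others, since the message reads as if a column first inferred as null later received timestamp values. That reading is an assumption, not something confirmed in this thread. The sketch below uses only the standard library; `field_types` and `mixed_null_fields` are hypothetical helper names, and the two sample rows are made up for illustration (only `pull_request` comes from the real data).

```python
import json
from collections import defaultdict


def field_types(lines):
    """Map each (dotted) field path to the set of Python type names seen."""
    seen = defaultdict(set)

    def walk(value, path):
        # Record the type at this path, then descend into nested objects.
        if path:
            seen[path].add(type(value).__name__)
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)

    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            walk(json.loads(line), "")
    return dict(seen)


def mixed_null_fields(types_by_field):
    """Return the fields that are null in some rows and non-null in others."""
    return sorted(
        path
        for path, names in types_by_field.items()
        if "NoneType" in names and len(names) > 1
    )


# Made-up sample rows, only to show the shape of the output.
sample = [
    '{"id": 1, "pull_request": null, "closed_at": null}',
    '{"id": 2, "pull_request": {"merged_at": "2021-01-01T00:00:00Z"}, '
    '"closed_at": "2021-01-02T00:00:00Z"}',
]
print(mixed_null_fields(field_types(sample)))
```

To run it on the real file, pass an open file handle instead of `sample`, e.g. `field_types(open("datasets-issues.jsonl", encoding="utf-8"))`. Any field it reports is only a candidate for the failing cast, not a diagnosis.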

Mar 15 '24 20:03 fancellu

When I split the file into 1k-line files and run load_dataset on each, they all load fine!
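For anyone who wants to reproduce the split, here is a minimal sketch of how it can be done. The function name `split_jsonl`, the output directory, and the `chunk_0000.jsonl` naming are assumptions of this sketch, not what was actually used.

```python
from itertools import islice
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_file=1000):
    """Copy src into numbered files of at most lines_per_file lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(src, "r", encoding="utf-8") as f:
        while True:
            # Pull the next batch of lines without reading the whole file.
            chunk = list(islice(f, lines_per_file))
            if not chunk:
                break
            path = out_dir / f"chunk_{len(paths):04d}.jsonl"
            path.write_text("".join(chunk), encoding="utf-8")
            paths.append(path)
    return paths
```

Each returned path can then be passed to the same call as above, e.g. `load_dataset("json", data_files=str(path), split="train")`.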

To make this easier to solve, here is my poison payload, zipped up

datasets-issues.zip

Mar 15 '24 21:03 fancellu

Also, if I remove the pull requests from the JSONL (rows with a non-empty pull_request field), the filtered file loads just fine too, e.g.

import json

filtered_lines = []
with open("datasets-issues.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line.strip())  # Parse each line as JSON
        # Keep the row only if "pull_request" is missing, null, or empty
        if not data.get("pull_request"):
            filtered_lines.append(line)

# Write the filtered lines to a new file
with open("filtered_jsonl.jsonl", "w", encoding="utf-8") as f:
    f.writelines(filtered_lines)

Mar 16 '24 07:03 fancellu