tatoeba2
Incorrectly formatted CSV files
Hi. Some of the dump files on the Downloads page are incorrectly formatted.
The details field in user_languages.csv, for example, can contain tabs and newlines, which break the TSV format because they are its field and record separators. They should be replaced with spaces. The file also contains some lines with empty fields, which should be filled with spaces as well.
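As a rough illustration of the cleanup suggested above, here is a minimal Python sketch that replaces tabs and newlines inside a field with a space before a TSV line is written. The names `sanitize_field` and `to_tsv_line` are hypothetical and are not taken from the Tatoeba export scripts; turning an empty field into a single space simply follows the suggestion above.

```python
import re

# Characters that would be misread as TSV field or record separators.
_SEPARATORS = re.compile(r"[\t\r\n]+")

def sanitize_field(value):
    """Replace each run of tabs/newlines with one space; fill empty fields with a space."""
    cleaned = _SEPARATORS.sub(" ", value)
    return cleaned if cleaned else " "

def to_tsv_line(fields):
    """Join sanitized fields into a single tab-separated line."""
    return "\t".join(sanitize_field(field) for field in fields) + "\n"
```

With this, a details value such as "likes\ttabs\r\nand newlines" would be written as "likes tabs and newlines", so each record stays on one line with a fixed number of fields.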
The query field in queries.csv can contain commas and newlines, which break the CSV format. The file should be converted to TSV. Also, queries.csv is either not encoded in UTF-8, as it should be, or is corrupted: reading it in Python with the line of code below raises a decoding error.
for line in open("queries.csv", encoding="utf-8"): pass
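To help narrow down where the decoding fails, the sketch below reads the file in binary mode and reports every line that is not valid UTF-8. The function name `find_decode_errors` and the `context` parameter are my own for illustration. I have not confirmed which bytes in queries.csv are at fault, so this only shows how one might locate them.

```python
def find_decode_errors(path, encoding="utf-8", context=20):
    """Return (line_number, byte_offset_in_line, nearby_bytes) for each undecodable line."""
    problems = []
    with open(path, "rb") as f:
        # Iterating a binary file splits on b"\n", which never occurs inside
        # a multi-byte UTF-8 sequence, so valid characters are not cut in half.
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # Only the first bad position in each line is reported.
                start = max(err.start - context, 0)
                problems.append((line_number, err.start, raw[start:err.end + context]))
    return problems
```

Calling `find_decode_errors("queries.csv")` and printing the result should show the affected line numbers and the raw bytes around each failure, which may help tell a wrong encoding apart from a corrupted file.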
queries.csv should be updated once a year, excluding queries made in the previous year, to prevent manipulation of Tatominer.
TSV files should use the extension .tsv instead of .csv.
sentences.csv, sentences_detailed.csv, and sentences_base.csv could be consolidated into a single file.
user_languages.csv could be renamed languages.tsv and users_sentences.csv could be renamed reviews.tsv.
users_sentences.csv should be compressed, like the rest of the files.