
Incorrectly formatted CSV files

Open cangareijo opened this issue 2 years ago • 6 comments

Hi. Some of the dump files on the Downloads page are incorrectly formatted.

For example, the details field in user_languages.csv can contain tabs and newlines, which should not be allowed inside a field of a TSV file; they should be replaced with spaces. The file also contains some lines with empty fields, which should be filled with spaces as well.
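To make the suggestion concrete, here is a minimal Python sketch of the cleanup I have in mind. The helper names `sanitize_field` and `to_tsv_row` are hypothetical and are not part of the Tatoeba export code; I am assuming each row is available as a list of strings before it is written out.

```python
import re

def sanitize_field(value: str) -> str:
    """Replace every tab, carriage return, or newline with a single space."""
    return re.sub(r"[\t\r\n]", " ", value)

def to_tsv_row(fields):
    """Join fields with tabs; empty fields become a single space (assumed convention)."""
    return "\t".join(sanitize_field(f) if f else " " for f in fields)
```

With this, a details value such as "line one\nline two" would end up on a single line, so every record occupies exactly one line of the dump.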

The query field in queries.csv can contain commas and newlines, which should not be allowed in a CSV file. The file should be converted to TSV. Also, queries.csv is either not encoded in UTF-8, as it should be, or is corrupted: reading it in Python with the line of code below raises a decoding error.

for line in open("queries.csv", encoding = "utf-8"): pass
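In case it helps with tracking down the bad bytes, the following sketch reads the file in binary mode and reports where decoding fails instead of stopping at the first error. The function name `find_utf8_errors` and the `limit` parameter are my own, for illustration only, and I have not confirmed which lines of the actual dump are affected.

```python
def find_utf8_errors(path, limit=5):
    """Return up to `limit` tuples of (line_number, byte_offset_in_line, reason)."""
    errors = []
    # Binary mode avoids raising on open/iteration; lines are split on b"\n".
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as e:
                errors.append((line_number, e.start, e.reason))
                if len(errors) >= limit:
                    break
    return errors
```

Calling `find_utf8_errors("queries.csv")` should list the first few offending lines, which may show whether the file uses a different encoding or was truncated in the middle of a multi-byte character.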

queries.csv should be updated once a year, excluding queries made in the previous year, to prevent manipulation of Tatominer.

TSV files should use the extension .tsv instead of .csv.

sentences.csv, sentences_detailed.csv, and sentences_base.csv could be consolidated into a single file.

user_languages.csv could be renamed languages.tsv and users_sentences.csv could be renamed reviews.tsv.

users_sentences.csv should be compressed, like the rest of the files.

cangareijo, Aug 22 '22 10:08