OneStopEnglishCorpus

Texts-Together-OneCSVperFile are not in UTF-8

Open gsarti opened this issue 4 years ago • 5 comments

I am experiencing some difficulties when loading the files with Python's pandas library, since they do not appear to be in the standard UTF-8 format.

I tried using the charade library to detect the original encoding; it suggests ISO-8859-2, but even converting from ISO-8859-2 to UTF-8 produces wrong characters.

Is it possible to address this? It would just require converting the set of CSVs to UTF-8 and replacing the ones currently on Zenodo.
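
For reference, this is roughly what I'm doing (a minimal sketch; the file path is just a placeholder):

import chardet  # successor of charade; charade.detect has the same interface
import pandas as pd

path = "Texts-Together-OneCSVperFile/some_file.csv"  # placeholder path

# Guess the encoding from the raw bytes
with open(path, "rb") as f:
    guess = chardet.detect(f.read())
print(guess)  # suggests ISO-8859-2 for these files

# Even with the guessed encoding, some characters come out wrong
df = pd.read_csv(path, encoding=guess["encoding"])
print(df.head())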

gsarti avatar May 10 '20 18:05 gsarti

Just another user passing by. UTF-8 and ASCII definitely don't work; cp-1252 is your best bet. Delete the WNL Scarlett file before you run the code.

But be aware that cp-1252 also produces some wrong characters.
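
Something like this has worked best for me (a rough sketch; the path is a placeholder, and a few characters will still be off):

import pandas as pd

path = "Texts-Together-OneCSVperFile/some_file.csv"  # placeholder path

# cp-1252 (Windows-1252) decodes most of the files without errors,
# though a handful of characters still come out wrong
df = pd.read_csv(path, encoding="cp1252")
print(df.head())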

brucewlee avatar Nov 28 '21 02:11 brucewlee

Ran into the same issue today; it hasn't been solved as far as I can see. Really unfortunate to see noise introduced into the dataset because of misaligned encodings :(

mnbucher avatar Jan 09 '23 09:01 mnbucher

On Texts-SeparatedByReadingLevel, try this script. It will give you the output files you need. @mnbucher

from glob import glob
import csv
import os
import pandas as pd
import random
import shutil
from collections import defaultdict

# (folder name, reading level)
path_list = [("Adv-Txt", 3),
             ("Int-Txt", 2),
             ("Ele-Txt", 1)]

# num_of_texts_per_grade = 5


def txt2csv(path, level):
    # Convert every .txt file in one reading-level folder into rows of a per-level CSV
    processed_list = []
    index = 0
    output_file_name = "output/" + path + ".csv"
    output = open(output_file_name, "w", newline="", encoding="utf-8")
    writemachine = csv.writer(output)
    for fname in glob("OneStop_Raw/" + path + "/*.txt"):
        index += 1
        print("Analyzing", fname)
        target_file = open(fname, 'r', encoding='utf-8', errors='ignore')
        target_text = target_file.read()
        # Drop any non-ASCII bytes left over from the mixed encodings
        target_text = target_text.encode("ascii", "ignore").decode()
        print("Adding", fname)
        writemachine.writerow([index, target_text, str(level), path])
        processed_list.append([index, target_text, str(level), path])
        target_file.close()
    output.close()
    return index, processed_list


def pairwise_txt2csv(path_list):
    # Pair up the advanced / intermediate / elementary versions of each text
    level_lists = {}
    for folder in ("Adv-Txt", "Int-Txt", "Ele-Txt"):
        texts = []
        for fname in glob("OneStop_Raw/" + folder + "/*.txt"):
            print("Analyzing", fname)
            target_file = open(fname, 'r', encoding='utf-8', errors='ignore')
            target_text = target_file.read()
            # Drop any non-ASCII bytes left over from the mixed encodings
            target_text = target_text.encode("ascii", "ignore").decode()
            print("Adding", fname)
            texts.append((" ".join(target_text.split()), fname))
            target_file.close()
        level_lists[folder] = texts
    if len(set(len(texts) for texts in level_lists.values())) == 1:
        print("LENGTH SAME")

    # Sort by file name so the three versions (paraphrases) of the same article line up
    adv_list = sorted(level_lists["Adv-Txt"], key=lambda tup: tup[1])
    int_list = sorted(level_lists["Int-Txt"], key=lambda tup: tup[1])
    ele_list = sorted(level_lists["Ele-Txt"], key=lambda tup: tup[1])

    all_list = []
    all_index = 0
    for idx in range(len(adv_list)):
        all_index += 1
        all_list.append({"3": adv_list[idx][0], "2": int_list[idx][0], "1": ele_list[idx][0],
                         "f3": adv_list[idx][1], "f2": int_list[idx][1], "f1": ele_list[idx][1]})
        print(adv_list[idx][1], int_list[idx][1], ele_list[idx][1])
    print("pairwise number of texts:" + str(all_index))
    return all_index, all_list


def final_output(index, processed_list):
    # Shuffle the rows so reading levels are interleaved in the combined CSV
    balanced_list = random.sample(processed_list, len(processed_list))
    print("...final_output...")
    output = open("output/final_output.csv", "a", newline="", encoding="utf-8")
    writemachine = csv.writer(output)
    for row in balanced_list:
        writemachine.writerow(row)
        print("writing..." + str(row[0]))
    output.close()

def pairwise_final_output(index, processed_list):
    # Write the paired advanced/intermediate/elementary rows to their own CSV
    this_df = pd.DataFrame(processed_list)
    this_df.to_csv("output/pairwise_final_output.csv", index=False)


def count_texts():
    # Count how many rows ended up in final_output.csv for each reading level
    counts = {"1": 0, "2": 0, "3": 0}
    with open("output/final_output.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if row:
                counts[row[2]] += 1
    return counts["1"], counts["2"], counts["3"]


if __name__ == '__main__':
    try:
        shutil.rmtree("output")
    except FileNotFoundError:
        pass
    os.mkdir("output")
    # Truncate final_output.csv before the per-level runs append to it
    open("output/final_output.csv", "w").close()
    for path in path_list:
        index, processed_list = txt2csv(path[0], path[1])
        final_output(index, processed_list)
    pairwise_index, pairwise_processed_list = pairwise_txt2csv(path_list)
    pairwise_final_output(pairwise_index, pairwise_processed_list)
    count_1, count_2, count_3 = count_texts()
    print("l1:" + str(count_1) + "\n" +
          "l2:" + str(count_2) + "\n" +
          "l3:" + str(count_3))

brucewlee avatar Jan 09 '23 19:01 brucewlee

Hi @brucewlee, thanks for the code. I managed to write my own version and hope I minimized the encoding noise in it. I'm currently confused by the dataset size, though: the paper states that "The corpus consists of 189 texts, each in three versions (567 in total)", but I actually get 7,278 samples after parsing all TXTs. Has the dataset been massively extended after the original publication?

mnbucher avatar Jan 11 '23 08:01 mnbucher

@mnbucher No, in my experience the corpus size is the same as stated in the paper. In the script above, use txt2csv instead of pairwise_txt2csv; the latter creates pairwise instances.
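
A quick sanity check, assuming the raw folders are laid out as in the script above:

from glob import glob

# Each level folder should contain 189 texts, i.e. 567 in total
for folder in ("Adv-Txt", "Int-Txt", "Ele-Txt"):
    print(folder, len(glob("OneStop_Raw/" + folder + "/*.txt")))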

brucewlee avatar Jan 11 '23 14:01 brucewlee