PreSumm
Step 4. Format to Simpler Json Files
Could you clarify what Step 4 (Format to Simpler Json Files) is supposed to do? My case: I have my own dataset and I am trying to apply these steps to it. I have completed Step 3 (Sentence Splitting and Tokenization) and generated the JSON files, but on my dataset Step 4 does not produce anything. After studying the code for Step 4, in the function format_to_lines in data_builder.py, I found that this function compares the name of each of my JSON files with the names derived from the mapping files in the URL directory. I think the issue is in this loop:
for line in open(pjoin(args.map_path, 'mapping_' + corpus_type + '.txt')):
    temp.append(hashhex(line.strip()))
corpus_mapping[corpus_type] = {key.strip(): 1 for key in temp}
The lengths of corpus_mapping[corpus_type] are:
corpus_mapping valid 13368, corpus_mapping test 11490, corpus_mapping train 287227
train_files, valid_files, test_files = [], [], []
print("glob json", glob.glob(pjoin(args.raw_path, '*.json')))
for f in glob.glob(pjoin(args.raw_path, '*.json')):
    print("f", f)
    real_name = f.split('/')[-1].split('.')[0]
    print("real_name", real_name)
    if (real_name in corpus_mapping['valid']):
        valid_files.append(f)
    elif (real_name in corpus_mapping['test']):
        test_files.append(f)
    elif (real_name in corpus_mapping['train']):
        train_files.append(f)
    # else:
    #     train_files.append(f)
print("len train_files, valid_files, test_files ", len(train_files), len(valid_files), len(test_files))
The output is:
len train_files, valid_files, test_files 0 0 0
Could you help me?
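One way to confirm that the file names are what fails to match is to list the JSON file stems that appear in none of the three mapping sets. The sketch below is illustrative only: `unmatched_names` is a hypothetical helper, not part of data_builder.py, and the sample names and keys are made up.

```python
def unmatched_names(names, corpus_mapping):
    """Return the names that appear in none of the split mappings."""
    known = set()
    for split_keys in corpus_mapping.values():
        known.update(split_keys)
    return [n for n in names if n not in known]

# Made-up example: the mapping holds hashed names, the data set holds plain ones.
corpus_mapping = {
    'train': {'3f2a9c': 1},
    'valid': {'77b01e': 1},
    'test': {'c4d5aa': 1},
}
my_names = ['article_001', 'article_002', '3f2a9c']
missing = unmatched_names(my_names, corpus_mapping)
print(missing)
```

If every name from a custom dataset ends up in the returned list, the three file lists will stay empty regardless of how many JSON files exist.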
@fatmas1982 did you manage to resolve the issue?
No
I'm facing the same issue too. Let me know if anyone knows the solution, or any other way to preprocess the data.
I don't know yet.
Same problem.
I am also running into issues trying to preprocess my own data for fine-tuning. I'm not sure how I should format my mapping files for custom data.
I was able to get mine working by removing the call to hashhex in temp.append(hashhex(line.strip())). The original code seems to hash the URLs in the mapping files to generate the file names that go into each set; I made it append the raw file names instead. Not sure if that helps.
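A minimal sketch of that change, assuming each mapping_<split>.txt lists one raw file name (without extension) per line instead of a URL. The helper names `load_mapping` and `split_of` are hypothetical, and the mapping files are written to a temporary directory with made-up names purely for illustration.

```python
import os
import tempfile

def load_mapping(map_path):
    """Read mapping_<split>.txt files and keep the raw names (no hashing)."""
    mapping = {}
    for corpus_type in ['valid', 'test', 'train']:
        path = os.path.join(map_path, 'mapping_' + corpus_type + '.txt')
        with open(path) as fh:
            mapping[corpus_type] = {line.strip() for line in fh if line.strip()}
    return mapping

def split_of(real_name, mapping):
    """Return the split a file name belongs to, or None if it is unlisted."""
    for corpus_type in ['valid', 'test', 'train']:
        if real_name in mapping[corpus_type]:
            return corpus_type
    return None

# Illustration with made-up file names.
map_dir = tempfile.mkdtemp()
example_splits = {'train': ['doc1', 'doc2'], 'valid': ['doc3'], 'test': ['doc4']}
for corpus_type, names in example_splits.items():
    with open(os.path.join(map_dir, 'mapping_' + corpus_type + '.txt'), 'w') as fh:
        fh.write('\n'.join(names) + '\n')

mapping = load_mapping(map_dir)
print(split_of('doc3', mapping))
```

For this to work, the names in the mapping files have to equal what real_name evaluates to, that is, the part of the JSON file name before the first dot.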
I tried removing the call to hashhex in temp.append(hashhex(line.strip())); but there is no difference, I'm still getting nothing
For me, the 'real_name' variable was not being set correctly because I'm working on a Windows machine, and Windows uses '\' instead of '/' in its paths. So, in format_to_lines(args),
when I changed real_name = f.split('/')[-1].split('.')[0]
to real_name = f.split('\\')[-1].split('.')[0]
it worked for me.
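A separator-independent variant avoids editing the line per platform. This is a sketch, not the repository's code: `real_name_of` is a hypothetical helper that relies on `os.path.basename`, which splits on the path separator of the operating system running the script, and it keeps the original `.split('.')[0]` behaviour of taking everything before the first dot. The example path is made up.

```python
import os

def real_name_of(path):
    """Strip the directory part and everything from the first dot onward."""
    return os.path.basename(path).split('.')[0]

# os.path.join builds the path with the native separator ('/' or '\\').
example = os.path.join('json_data', 'my_article.story.json')
print(real_name_of(example))
```

Because glob.glob returns paths using the native separator, the same line should behave identically on Windows and Linux.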
Hi there, I faced the same issue and found that it is caused by the following code in data_builder.py:
# build the corpus_mapping dict according to the files in urls
corpus_mapping = {}
for corpus_type in ['valid', 'test', 'train']:
    temp = []
    for line in open(pjoin(args.map_path, 'mapping_' + corpus_type + '.txt')):
        temp.append(hashhex(line.strip()))
    corpus_mapping[corpus_type] = {key.strip(): 1 for key in temp}
train_files, valid_files, test_files = [], [], []
for f in glob.glob(pjoin(args.raw_path, '*.json')):
    real_name = f.split('/')[-1].split('.')[0]
    # since the name of our datafile is not in the corpus_mapping dict, all the following conditions would not be satisfied
    if (real_name in corpus_mapping['valid']):
        valid_files.append(f)
    elif (real_name in corpus_mapping['test']):
        test_files.append(f)
    elif (real_name in corpus_mapping['train']):
        train_files.append(f)
Since we use our own dataset, its file names do not appear in the corpus_mapping dict, which is only used to split the CNN dataset into train/test/valid. Therefore, train_files, test_files and valid_files all end up as [].
For me, I removed the if (real_name in corpus_mapping['XXX']): conditions and set a ratio for data splitting instead, e.g.:
cur = 0
valid_test_ratio = 0.01
all_size = len(glob.glob(pjoin(args.raw_path, '*.json')))
for f in glob.glob(pjoin(args.raw_path, '*.json')):
    real_name = f.split('/')[-1].split('.')[0]
    if (cur < valid_test_ratio*all_size):
        valid_files.append(f)
    elif (cur < valid_test_ratio*2*all_size):
        test_files.append(f)
    else:
        train_files.append(f)
    cur += 1
It works for me. :)
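The same idea can also be written as a standalone function. The sketch below is a variant under stated assumptions, not the code above: it sorts and then shuffles the file list with a fixed seed so the split is reproducible (glob.glob makes no ordering guarantee), and it rounds the held-out size down rather than up. `split_by_ratio` and `seed` are hypothetical names, and the file names are made up.

```python
import random

def split_by_ratio(files, valid_test_ratio=0.01, seed=0):
    """Split files into train/valid/test; valid and test each get the given ratio."""
    files = sorted(files)                  # fixed starting order
    random.Random(seed).shuffle(files)     # reproducible shuffle
    n_held_out = int(len(files) * valid_test_ratio)
    valid_files = files[:n_held_out]
    test_files = files[n_held_out:2 * n_held_out]
    train_files = files[2 * n_held_out:]
    return train_files, valid_files, test_files

# Made-up file names for illustration.
files = ['doc_%03d.json' % i for i in range(100)]
train_files, valid_files, test_files = split_by_ratio(files, valid_test_ratio=0.1)
print(len(train_files), len(valid_files), len(test_files))
```

With very small datasets the rounding can leave the valid and test lists empty, so the ratio may need to be raised.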
Worked for me thanks !!!
Thanks! It works in my project!