PreSumm Step 4. Format to Simpler Json Files

Could you clarify what Step 4 (Format to Simpler Json Files) is supposed to do? My case: I have my own dataset and I am trying to apply these steps to it. I have completed Step 3 (Sentence Splitting and Tokenization) and generated the JSON files, but on my dataset Step 4 does not produce anything. After studying the Step 4 code in the function format_to_lines in data_builder.py, I see that it compares each of my JSON files by name against the mapping file for each split in the URL directory. I think the issue is in this loop:

for line in open(pjoin(args.map_path, 'mapping_' + corpus_type + '.txt')):
            temp.append(hashhex(line.strip()))
        corpus_mapping[corpus_type] = {key.strip(): 1 for key in temp}

The lengths of corpus_mapping[corpus_type] are:

corpus_mapping valid 13368 corpus_mapping test 11490 corpus_mapping train 287227

    train_files, valid_files, test_files = [], [], []
    print("glob jason",glob.glob(pjoin(args.raw_path, '*.json')))
    for f in glob.glob(pjoin(args.raw_path, '*.json')):
        print("f",f)
        real_name = f.split('/')[-1].split('.')[0]
        print("real_name",real_name)
        if (real_name in corpus_mapping['valid']):
            valid_files.append(f)
        elif (real_name in corpus_mapping['test']):
            test_files.append(f)
        elif (real_name in corpus_mapping['train']):
            train_files.append(f)
        # else:
        #     train_files.append(f)
    print("len train_files, valid_files, test_files ",len(train_files), len(valid_files), len(test_files ))

The output is: len train_files, valid_files, test_files 0 0 0. Could you help me?

Sep 29 '19 06:09 fatmas1982

@fatmas1982 did you manage to resolve the issue

Nov 11 '19 07:11 cuthbertjohnkarawa

No

Apr 08 '20 21:04 fatmas1982

I'm facing the same issue too. Let me know if anyone knows a solution to this issue or any other way to preprocess the data.

Apr 11 '20 13:04 NandaKishoreJoshi

I still don't know.

Apr 11 '20 18:04 fatmas1982

same problem

Apr 13 '20 18:04 Ghani-25

I am also running into issues trying to preprocess my own data for fine-tuning. I'm not sure how I should format my mapping files for custom data.

Apr 15 '20 20:04 mmcmahon13

I was able to get mine working by removing the call to hashhex in temp.append(hashhex(line.strip())); the original code seems to hash the URLs in the mapping files to generate the filenames to go into each set. I instead made it append the raw file names, not sure if that helps
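To illustrate why the hashed mapping never matches custom file names, here is a minimal sketch. It assumes hashhex is a SHA-1 hex digest of the stripped line (as in the CNN/DailyMail preprocessing scripts); build_mapping and the example names are hypothetical and not part of data_builder.py.

```python
import hashlib


def hashhex(s):
    """SHA-1 hex digest of a string (assumed to match the helper in data_builder.py)."""
    h = hashlib.sha1()
    h.update(s.encode("utf-8"))
    return h.hexdigest()


def build_mapping(lines, use_hash=True):
    """Return the set of file stems that belong to one split.

    build_mapping is a hypothetical helper written for this example.
    """
    names = (line.strip() for line in lines if line.strip())
    return {hashhex(n) if use_hash else n for n in names}


# Hypothetical content of a custom mapping_train.txt: raw file stems, not URLs.
mapping_lines = ["my_doc_001\n", "my_doc_002\n"]

hashed = build_mapping(mapping_lines, use_hash=True)
raw = build_mapping(mapping_lines, use_hash=False)

# The JSON file is called my_doc_001.json, so real_name is "my_doc_001".
print("my_doc_001" in hashed)  # hashed entries are 40-character digests
print("my_doc_001" in raw)     # raw stems match the file name directly
```

With hashing left on, the mapping only contains digests, so a plain stem such as my_doc_001 can never be found in it. Listing raw file stems in the mapping files and skipping the hash is one way to make the lookup succeed.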

Apr 15 '20 21:04 mmcmahon13

I tried removing the call to hashhex in temp.append(hashhex(line.strip())); but there is no difference, I'm still getting nothing

Apr 20 '20 05:04 AanchalA

For me, the 'real_name' variable was not being set correctly because I'm working on a Windows machine, and Windows uses '\' instead of '/' in its paths. So, in format_to_lines(args), I changed real_name = f.split('/')[-1].split('.')[0] to real_name = f.split('\\')[-1].split('.')[0], and it worked for me.
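A separator-independent variant of the same fix could look like the sketch below. The real_name helper and the example paths are hypothetical; inside format_to_lines the equivalent would be os.path.basename(f).split('.')[0], since os.path resolves to ntpath on Windows and posixpath elsewhere.

```python
import ntpath
import os
import posixpath


def real_name(path, pathmod=os.path):
    """File name up to the first dot, using the given path module's separators."""
    return pathmod.basename(path).split(".")[0]


# The two path modules are passed explicitly here so that the example
# behaves the same on every platform.
print(real_name("json_data/doc_001.story.json", posixpath))
print(real_name("json_data\\doc_001.story.json", ntpath))

# Splitting on '/' leaves the directory prefix in place for a Windows-style path,
# which is why the lookup in corpus_mapping fails there.
print("json_data\\doc_001.story.json".split("/")[-1].split(".")[0])
```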

Apr 20 '20 05:04 AanchalA

Hi there, I faced the same issue and found that it is caused by the following code in data_builder.py:

    # build the corpus_mapping dict according to the files in urls
    corpus_mapping = {}
    for corpus_type in ['valid', 'test', 'train']:
        temp = []
        for line in open(pjoin(args.map_path, 'mapping_' + corpus_type + '.txt')):
            temp.append(hashhex(line.strip()))
        corpus_mapping[corpus_type] = {key.strip(): 1 for key in temp}

    train_files, valid_files, test_files = [], [], []
    for f in glob.glob(pjoin(args.raw_path, '*.json')):
        real_name = f.split('/')[-1].split('.')[0]
        #since the name of our datafile is not in the corpus_mapping dict, all the following conditions would not be satisfied
        if (real_name in corpus_mapping['valid']):
            valid_files.append(f)
        elif (real_name in corpus_mapping['test']):
            test_files.append(f)
        elif (real_name in corpus_mapping['train']):
            train_files.append(f)

Since we use our own dataset, its file names do not appear in the corpus_mapping dict, which is only used to split the CNN dataset into train/test/valid. Therefore, train_files, test_files, and valid_files all end up as [].

For me, I removed the if (real_name in corpus_mapping['XXX']): conditions, and set a ratio for data splitting. e.g.

    cur = 0
    valid_test_ratio = 0.01
    all_size = len(glob.glob(pjoin(args.raw_path, '*.json')))
    for f in glob.glob(pjoin(args.raw_path, '*.json')):
        real_name = f.split('/')[-1].split('.')[0]
        if (cur < valid_test_ratio*all_size):
            valid_files.append(f)
        elif (cur < valid_test_ratio*2*all_size):
            test_files.append(f)
        else:
            train_files.append(f)
        cur += 1

It works for me. :)
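For anyone who wants a reproducible split, a possible variant of the same idea is sketched below. It sorts and shuffles the file list with a fixed seed before slicing, because glob.glob does not guarantee any particular order. The split_files function, its parameters, and the example file names are hypothetical, not part of PreSumm.

```python
import random


def split_files(files, valid_test_ratio=0.01, seed=0):
    """Split a list of paths into (train, valid, test) by ratio.

    Sorting first and shuffling with a fixed seed keeps the split
    identical across runs and machines.
    """
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_eval = int(len(files) * valid_test_ratio)  # size of valid and of test
    valid = files[:n_eval]
    test = files[n_eval:2 * n_eval]
    train = files[2 * n_eval:]
    return train, valid, test


# Hypothetical file names standing in for glob.glob(pjoin(args.raw_path, '*.json')).
paths = ["doc_%03d.json" % i for i in range(200)]
train_files, valid_files, test_files = split_files(paths, valid_test_ratio=0.05)
print(len(train_files), len(valid_files), len(test_files))  # 180 10 10
```

Note that with a very small dataset n_eval can round down to 0, which leaves the validation and test sets empty, so the ratio may need to be raised in that case.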

Jun 26 '20 02:06 imJiawen

Worked for me thanks !!!

Jul 14 '22 08:07 kush-2418

For me, the 'real_name' variable was not being set correctly because I'm working on a Windows machine, and Windows uses '\' instead of '/' in its paths. So, in format_to_lines(args), I changed real_name = f.split('/')[-1].split('.')[0] to real_name = f.split('\\')[-1].split('.')[0], and it worked for me.

Thanks! It works in my project!

Mar 09 '24 08:03 WSChange
