BERT-pytorch IndexError: list index out of range

Jan 09 '19 02:01 ghost

No description provided.

@Marsxia Check whether your 'corpus.small' file actually contains the tab character ('\t') between the two sentences on each line. The examples in the README are not ready to use as they are.
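One way to run that check is a short script that lists every non-empty line lacking a real tab character. This is only a sketch: find_lines_without_tab is a hypothetical helper name, not part of BERT-pytorch, and the sample lines are made up for illustration.

```python
def find_lines_without_tab(lines):
    """Return (line_number, text) pairs for non-empty lines with no real tab."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # A typed backslash + "t" or a run of spaces is not a tab character.
        if text and "\t" not in text:
            problems.append((number, text))
    return problems


# Illustrative sample: a real tab, a typed "\t", and four spaces.
sample = ["Welcome to the\tthe jungle\n",
          "I can stay \\t here all night\n",
          "I can stay    here all night\n"]
for number, text in find_lines_without_tab(sample):
    print(f"line {number} has no tab: {text!r}")
```

To check a real corpus, pass the open file object instead of the sample list, for example open("data/corpus.small", encoding="utf-8").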

Mar 01 '19 09:03 marcwww

@Marsxia Check if the blank '\t' in your 'corpus.small' file. The examples in the readme file are not ready-to-use actually.

But I do have the tab ('\t') in my file, and I still ran into this problem.

Mar 05 '19 13:03 JasonLiu-THU

I am getting the same error but couldn't resolve it.

Mar 22 '19 15:03 riktimmondal

@Marsxia @riktimmondal I faced this problem too. Cleaning up the text while generating the text file fixed the issue. I cannot point out the specifics, but adapting the code below to your case might help:

import re

def cleanText(text):
    text = text.replace('\\n', '')  # drop literal backslash-n sequences
    text = text.replace('\\', '')   # drop any remaining backslashes
    #text = text.replace('\t', '')
    #text = re.sub(r'\[(.*?)\]', '', text) #removes [this one]
    # replace URLs with a placeholder token
    text = re.sub(r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\s',
                  ' __url__ ', text)
    #text = re.sub('\'', '', text)
    #text = re.sub(r'\d+', ' __number__ ', text) #replaces numbers
    #text = re.sub(r'\W', ' ', text)
    text = re.sub(' +', ' ', text)  # collapse runs of spaces
    text = text.replace('\t', '')
    text = text.replace('\n', '')
    return text

# file_list, file_path and split_text are not defined here;
# set them for your own data before running the loop.
file_write = []

for file_ in file_list:
    curr_file = file_path + file_
    with open(curr_file, "r") as f_:
        curr_text = f_.readlines()[0]  # only the first line of each file is used
    curr_text = cleanText(curr_text)
    curr_text = curr_text[2:]
    curr_text_list = curr_text.split('.')
    # keep only the sentences before the split_text marker, if present
    if split_text in curr_text_list:
        curr_text_list_trim = curr_text_list[0:curr_text_list.index(split_text)]
    else:
        curr_text_list_trim = curr_text_list
    if len(curr_text_list_trim) > 5:
        for ele in curr_text_list_trim:
            if len(ele) > 10:
                file_write.append(ele.strip() + '.')
        file_write.append("")  # empty line between documents

# remove the empty line at the end
if file_write and file_write[-1] == "":
    file_write.pop()

Mar 26 '19 06:03 vdpappu

I also ran into this issue but cannot solve it. Is there anybody who could help?

Apr 01 '19 11:04 junchen14

I also ran into this issue but cannot solve it. Is there anybody who could help?

This problem is very easy to fix. Just debug the code at line 23 in dataset.py.

Apr 02 '19 01:04 aluminumbox

I tried using these two lines, duplicated several times, as the dataset:

Welcome to the \t the jungle\n
I can stay \t here all night\n

and I get the same error.

May 26 '19 07:05 MohamedLotfyElrefai

Change the code at line 23 in dataset.py, from split("\t") --> split("\\t")

Aug 02 '19 07:08 iiiHunter

If you use the demo in README, change the code at line 23 in dataset.py, from split("\t") --> split("\\t").
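The difference between the two patterns shows up on a line where "\t" was typed as two characters (a backslash followed by t) rather than entered as a real tab. A small sketch for illustration; the variable names are made up and the snippet is not taken from dataset.py:

```python
# A line where "\t" was typed as two characters: backslash + t.
typed = "Welcome to the \\t the jungle"
# The same line with a real tab character between the two sentences.
real = "Welcome to the \t the jungle"

typed_by_tab = typed.split("\t")       # no real tab present: one element
typed_by_literal = typed.split("\\t")  # splits on backslash + t: two elements
real_by_tab = real.split("\t")         # real tab present: two elements

print(len(typed_by_tab), len(typed_by_literal), len(real_by_tab))  # prints: 1 2 2
```

Asking for the second element of the one-element list, typed_by_tab[1], raises "IndexError: list index out of range", which matches the error in the title.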

Aug 06 '19 08:08 songyingxin

I also ran into this issue but cannot solve it. Is there anybody who could help?

This problem is very easy to fix. Just debug the code at line 23 in dataset.py.

Thanks a lot! After a night of debugging, I fixed this problem. First, change line 23 of dataset.py to:

self.lines = [line[:-1].replace("\n", "").split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]

Then download this file: https://drive.google.com/file/d/1gdNG92VABX8eWc7JWnU7Y-1wa5cu5-0L/view?usp=sharing and put it in $YOUPROJECT/data/. Then run:

bert-vocab -c data/corpus.small -o data/vocab.small
bert -c data/corpus.small -v data/vocab.small -o output/bert.model

You should see that the program now runs normally.

Sep 04 '19 15:09 qiaomeng

This is how I solved this problem. My corpus looks like this:

Welcome to the\tthe jungle
I can stay\there all night

Then change the code at line 23 in dataset.py (the file in which the error is reported) from

self.lines = [line[:-1].split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]

to

self.lines = [line[:-1].replace("\n", "").split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]
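While debugging, it can help to swap the list comprehension for an explicit loop that names the line that fails to split. The sketch below assumes every corpus line should hold exactly two tab-separated sentences; load_pairs is a hypothetical helper and not a function in dataset.py.

```python
def load_pairs(lines):
    """Split each line on a real tab and fail loudly on malformed lines."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            # Report the offending line instead of failing later with IndexError.
            raise ValueError(
                f"line {number}: expected 2 tab-separated sentences, "
                f"got {len(parts)}: {line!r}"
            )
        pairs.append(parts)
    return pairs


print(load_pairs(["Welcome to the\tthe jungle\n", "I can stay\there all night\n"]))
```

A line whose separator is four spaces or a typed "\t" triggers the ValueError with its line number, so the bad line in the corpus can be found directly.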

Nov 23 '20 14:11 limengqigithub

I also hit the same problem, but the solution above did not work for me. Then I downloaded the corpus.small mentioned above, https://drive.google.com/file/d/1gdNG92VABX8eWc7JWnU7Y-1wa5cu5-0L/view?usp=sharing, and the problem was solved. I suspect the problem is caused by the editor, which is ridiculous. When I use vim, \t is automatically set to four spaces, and that is the cause of the problem. I found this out by opening corpus.small in the Ubuntu text editor.

PS: After solving the problem, I tried the following data again and there was no problem.

Welcome to the \t the jungle\n
I can stay \t here all night\n
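To rule out the editor entirely, the sample file can be written from Python, so the separator is guaranteed to be a real tab and not spaces. A minimal sketch: write_corpus is a hypothetical helper, and the temporary directory only stands in for the project's data folder.

```python
import os
import tempfile


def write_corpus(path, sentence_pairs):
    """Write one 'first<TAB>second' line per sentence pair."""
    with open(path, "w", encoding="utf-8") as f:
        for first, second in sentence_pairs:
            f.write(first + "\t" + second + "\n")


pairs = [("Welcome to the", "the jungle"), ("I can stay", "here all night")]
# Temporary location for the demo; in practice this would be data/corpus.small.
path = os.path.join(tempfile.mkdtemp(), "corpus.small")
write_corpus(path, pairs)

with open(path, encoding="utf-8") as f:
    content = f.read()
print(repr(content))
```

The repr output shows the \t escapes explicitly, which makes it easy to confirm that no editor setting replaced the tabs with spaces.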

Nov 24 '20 10:11 Emir-Liu
