BERT-pytorch

IndexError: list index out of range

Open ghost opened this issue 6 years ago • 13 comments

ghost avatar Jan 09 '19 02:01 ghost

No description provided.

@Marsxia Check whether there is a tab character ('\t') in your 'corpus.small' file. The examples in the README are not actually ready to use.

marcwww avatar Mar 01 '19 09:03 marcwww

No description provided.

@Marsxia Check whether there is a tab character ('\t') in your 'corpus.small' file. The examples in the README are not actually ready to use.

But I do have the '\t' in my file, and I still ran into this problem.

JasonLiu-THU avatar Mar 05 '19 13:03 JasonLiu-THU

I am getting the same error but couldn't resolve it

riktimmondal avatar Mar 22 '19 15:03 riktimmondal

@Marsxia @riktimmondal I faced this problem too. Cleaning up the text while generating the text file fixed the issue. I cannot point out the specifics, but adapting the code below to your case might help:

import re

def cleanText(text):
    # Drop literal backslash-n sequences and any remaining backslashes.
    text = text.replace('\\n', '')
    text = text.replace('\\', '')
    #text = text.replace('\t', '')
    #text = re.sub(r'\[(.*?)\]', '', text)  # removes [this one]
    # Replace URLs with a placeholder token.
    text = re.sub(r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\s',
                  ' __url__ ', text)
    #text = re.sub('\'', '', text)
    #text = re.sub(r'\d+', ' __number__ ', text)  # replaces numbers
    #text = re.sub(r'\W', ' ', text)
    # Collapse runs of spaces, then drop real tabs and newlines.
    text = re.sub(' +', ' ', text)
    text = text.replace('\t', '')
    text = text.replace('\n', '')
    return text

# file_list, file_path and split_text are not shown in the original post;
# they are assumed to be defined elsewhere.
file_write = []

for file_ in file_list:
    curr_file = file_path + file_
    # Only the first line of each file is used.
    with open(curr_file, "r") as f_:
        curr_text = f_.readlines()[0]
    curr_text = cleanText(curr_text)
    curr_text = curr_text[2:]
    curr_text_list = curr_text.split('.')
    # Keep only the sentences before the split_text marker, if it is present.
    if split_text in curr_text_list:
        curr_text_list_trim = curr_text_list[0:curr_text_list.index(split_text)]
    else:
        curr_text_list_trim = curr_text_list
    if len(curr_text_list_trim) > 5:
        for ele in curr_text_list_trim:
            if len(ele) > 10:
                file_write.append(ele.strip() + '.')
        # An empty string separates one document from the next.
        file_write.append("")

# Remove the empty line at the end. The original post sliced off the last two
# entries, which also dropped the final sentence.
file_write = file_write[:-1]

vdpappu avatar Mar 26 '19 06:03 vdpappu

I also ran into this issue but cannot solve it. Is there anybody who could help?

junchen14 avatar Apr 01 '19 11:04 junchen14

I also ran into this issue but cannot solve it. Is there anybody who could help?

This problem is easy to fix. Just debug the code at line 23 in dataset.py.

aluminumbox avatar Apr 02 '19 01:04 aluminumbox

I tried using these two lines, duplicated several times, as the dataset:

Welcome to the \t the jungle\n
I can stay \t here all night\n

and I get the same error: [image]

MohamedLotfyElrefai avatar May 26 '19 07:05 MohamedLotfyElrefai

Change the code at line 23 in dataset.py from split("\t") to split("\\t").

iiiHunter avatar Aug 02 '19 07:08 iiiHunter

If you use the demo in README, change the code at line 23 in dataset.py, from split("\t") --> split("\\t").
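To see why the two variants behave differently, here is a minimal sketch. The sample strings are illustrative, and the claim that the loader later reads the second field of each split line is an assumption based on the error message, not something confirmed in this thread. A line copied from the README contains the two characters backslash and "t", not a real tab, so splitting on a real tab yields a single field, and reading field [1] of a one-element list raises "IndexError: list index out of range".

```python
# A line typed as in the README demo: the characters backslash + "t", not a tab.
literal_line = "Welcome to the \\t the jungle"
# A line containing a real tab character (U+0009).
tabbed_line = "Welcome to the\tthe jungle"

# split("\t") splits on a real tab only.
literal_split_on_tab = literal_line.split("\t")      # no tab found: one field
tabbed_split_on_tab = tabbed_line.split("\t")        # two fields

# split("\\t") splits on the two-character sequence backslash + "t".
literal_split_on_escape = literal_line.split("\\t")  # two fields
tabbed_split_on_escape = tabbed_line.split("\\t")    # no match: one field

print(len(literal_split_on_tab), len(tabbed_split_on_tab))
print(len(literal_split_on_escape), len(tabbed_split_on_escape))
```

So split("\\t") only helps when the corpus holds the literal README text; a corpus with real tabs needs the original split("\t").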

songyingxin avatar Aug 06 '19 08:08 songyingxin

I also ran into this issue but cannot solve it. Is there anybody who could help?

This problem is easy to fix. Just debug the code at line 23 in dataset.py.

Thanks a lot! After a night of debugging, I fixed this problem. First, change line 23 of dataset.py to:

self.lines = [line[:-1].replace("\n", "").split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]

Then download this file: https://drive.google.com/file/d/1gdNG92VABX8eWc7JWnU7Y-1wa5cu5-0L/view?usp=sharing and put it in $YOUPROJECT/data/. Then run:

bert-vocab -c data/corpus.small -o data/vocab.small
bert -c data/corpus.small -v data/vocab.small -o output/bert.model

The program should then run normally.

qiaomeng avatar Sep 04 '19 15:09 qiaomeng

This is how I solved this problem. My corpus looks like this:

Welcome to the\tthe jungle
I can stay\there all night

Then change the code at line 23 in dataset.py (the file where the error was reported) from

self.lines = [line[:-1].split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]

to

self.lines = [line[:-1].replace("\n", "").split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]
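Before editing dataset.py, it can help to find out which corpus lines fail to split into two parts. The following is a rough sketch, not code from the repository: the function name load_pairs and the demo file are hypothetical, and it assumes the corpus format discussed in this thread (two sentences per line, separated by one real tab).

```python
import os
import tempfile


def load_pairs(path, encoding="utf-8"):
    """Split each corpus line on a real tab and collect the lines that fail.

    Returns (pairs, bad), where bad holds (line_number, text) for every line
    that does not contain exactly two non-empty tab-separated fields.
    """
    pairs, bad = [], []
    with open(path, "r", encoding=encoding) as f:
        for number, line in enumerate(f, start=1):
            text = line.rstrip("\r\n")  # strips only line endings
            fields = text.split("\t")
            if len(fields) == 2 and all(part.strip() for part in fields):
                pairs.append((fields[0], fields[1]))
            else:
                bad.append((number, text))
    return pairs, bad


# Demo file: line 1 uses a real tab, line 2 uses four spaces instead.
demo_path = os.path.join(tempfile.mkdtemp(), "corpus.small")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("Welcome to the\tthe jungle\n")
    f.write("I can stay    here all night\n")

pairs, bad = load_pairs(demo_path)
print(pairs)
print(bad)
```

Using rstrip("\r\n") instead of line[:-1] is a deliberate choice in this sketch: line[:-1] removes the last character even when the final line has no trailing newline.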

limengqigithub avatar Nov 23 '20 14:11 limengqigithub

I also hit the same problem, but I found that the solution above did not help. Then I downloaded corpus.small as described above (https://drive.google.com/file/d/1gdNG92VABX8eWc7JWnU7Y-1wa5cu5-0L/view?usp=sharing) and the problem was solved. I suspect the problem is caused by the editor, which is ridiculous. I found that when I use vim, it is set to automatically turn \t into four spaces, and this is the cause of the problem. I opened corpus.small in the Ubuntu text editor to find this out.

PS: After solving the problem, I tried the following data again, and there was no problem.

Welcome to the \t the jungle\n
I can stay \t here all night\n
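One way to rule out the editor entirely is to generate the corpus from a script, so the separator is guaranteed to be a real tab. This is only a sketch under the assumptions of this thread (one sentence pair per line, tab-separated); write_corpus and the demo path are hypothetical names, not part of BERT-pytorch.

```python
import os
import tempfile


def write_corpus(pairs, path, encoding="utf-8"):
    """Write one "first<TAB>second" line per sentence pair."""
    # newline="\n" keeps line endings as plain LF on every platform.
    with open(path, "w", encoding=encoding, newline="\n") as f:
        for first, second in pairs:
            f.write(f"{first}\t{second}\n")


demo_pairs = [
    ("Welcome to the", "the jungle"),
    ("I can stay", "here all night"),
]
demo_path = os.path.join(tempfile.mkdtemp(), "corpus.small")
write_corpus(demo_pairs, demo_path)

# Read the raw bytes back to confirm that real tab bytes were written.
with open(demo_path, "rb") as f:
    raw = f.read()
print(raw)
```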

Emir-Liu avatar Nov 24 '20 10:11 Emir-Liu