PhoBERT training code does not work

Open vanlinhtruongdang opened this issue 5 months ago • 1 comments

Hello, I am a K16 student at UIT and I am currently studying your work for a project.

I cannot run PhoBERT_HOS.ipynb because the column names it expects do not match the dataset. Specifically, the prepare_data function:

def prepare_data(file_path):
    df = pd.read_csv(file_path)

    # remove nan
    df = df.dropna()
    df = df.reset_index(drop=True)

    texts = df["text"].tolist()
    spans = df["spans"].tolist()

    # convert spans to binary representation
    binary_spans = []
    for span in spans:
        binary_span = []
        span = span.split(" ")
        for s in span:
            if s == "O":
                binary_span.append(0)
            else:
                binary_span.append(1)
        binary_spans.append(binary_span)

    return texts, binary_spans

This function reads two columns, text and spans, but CSV files such as train_BIO_Word.csv and train_BIO_syllable.csv do not contain either of them. I would appreciate a reply so that I can run this training file.

Jul 15 '25 03:07 vanlinhtruongdang

Hi @vanlinhtruongdang

Several changes are required to fix the issue you mentioned above.

Change the data paths:

# Change to Span_Extraction dataset
train_path = "../../data/Span_Extraction_based_version/train.csv"
dev_path = "../../data/Span_Extraction_based_version/dev.csv"
test_path = "../../data/Test_data/test.csv"
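Before rerunning the notebook, it may help to confirm that a file actually has the columns prepare_data reads. The snippet below is only a sketch: missing_columns is a hypothetical helper (not part of the repository), and the in-memory CSV is a made-up stand-in for a real dataset file.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header.

    `csv_file` is any open text file object; only its first row is read.
    """
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# Tiny in-memory stand-in for a dataset file (made-up row, not real data).
sample = io.StringIO('content,index_spans\n"some text","[0, 1, 2, 3]"\n')
print(missing_columns(sample, ["content", "index_spans"]))  # prints []

# A file without the expected columns reports both names as missing.
other = io.StringIO("a,b\n1,2\n")
print(missing_columns(other, ["text", "spans"]))  # prints ['text', 'spans']
```

With a real file, the same helper could be called as `missing_columns(open(train_path, newline="", encoding="utf-8"), ["content", "index_spans"])`.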

Edit prepare_data as follows:

import json

import pandas as pd

def prepare_data(file_path):
    df = pd.read_csv(file_path)

    # remove nan
    df = df.dropna()
    df = df.reset_index(drop=True)

    texts = df['content'].tolist()
    spans = df['index_spans'].tolist()

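    # convert each row's span indices to a binary mask over that row's text
    binary_spans = []
    for text, span in zip(texts, spans):
        # index_spans is stored as a JSON list of positions, e.g. "[3, 4, 5]"
        span = set(json.loads(span))
        # size the mask by this row's own text, not by texts[0]
        binary_span = [1 if i in span else 0 for i in range(len(text))]
        binary_spans.append(binary_span)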

    return texts, binary_spans
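
As a small worked example of the conversion, the sketch below assumes that index_spans holds a JSON list of character positions into content (this is inferred from the snippet, not confirmed against the dataset documentation). spans_to_mask is a hypothetical helper and the two rows are made up for illustration.

```python
import json


def spans_to_mask(text, index_spans):
    """Build one 0/1 label per character of `text`.

    `index_spans` is a JSON string such as "[4, 5, 6]"; positions listed
    there are labelled 1 and every other position is labelled 0.
    """
    flagged = set(json.loads(index_spans))
    return [1 if i in flagged else 0 for i in range(len(text))]


# Made-up rows: the second one has no flagged positions.
texts = ["abc def", "hello"]
index_spans = ["[4, 5, 6]", "[]"]

masks = [spans_to_mask(t, s) for t, s in zip(texts, index_spans)]
print(masks)  # prints [[0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0]]

# Each mask should be exactly as long as the text it labels.
for t, m in zip(texts, masks):
    assert len(t) == len(m)
```

The length check at the end is a cheap sanity test worth running on the real data too, since a mask sized by the wrong text would misalign labels and inputs.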

I hope this helps you solve the issue.

Jul 27 '25 07:07 VuHuy-cse-9