
Question about the meaning of the dataset's "length" attribute

Open zh57398 opened this issue 3 years ago • 2 comments

About your dataset: does the "length" attribute represent the length of the "text" attribute, or something else? I don't think it is the length of "text". For example, in the file "medium-345M-k40.train.jsonl", "length" = 1024, but the length of the text that I calculated is 4750. So I would like to know the meaning of the "length" attribute. I look forward to your reply. Thank you very much.

zh57398, Apr 06 '21 02:04

If you're referring to the length parameter as per this:

def interact_model(
    model_name='345M',  # only 345M/774M fit on a Pi 4B 8 GB (memory allocation issue); 1558M is too big
    seed=None,
    nsamples=1,
    batch_size=1,
    length=140,
    temperature=1.2,
    top_k=48,
    top_p=0.7,
    models_dir='models',
):

Then length refers to the maximum number of tokens (BPE sub-word units, not words or characters) the output will contain. I keep mine short and sweet at a maximum length of 140 because I use GPT-2 to generate conversational responses for my robots. But if you want it to write an article, it certainly can...

DaveXanatos, Apr 07 '21 20:04

First of all, thank you very much for your reply, but I still don't understand. I can see that 1024 is the maximum length, and I take the "text" attribute to be the text generated by GPT-2. Is that understanding correct? If so, the "length" attribute should equal the length of "text". However, when I calculated the length of the "text" attribute in the dataset you provided, it did not match the given value of the "length" attribute. So what does the "length" attribute stand for? I look forward to your help and reply.

zh57398, Apr 09 '21 07:04
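
The comparison described in this thread can be reproduced with a short script. The following is a minimal sketch, not the dataset authors' code: it assumes each line of the .jsonl file is a JSON object with "text" and "length" fields (the two attributes named in the question), and it uses a hypothetical in-memory sample with made-up values instead of the real file. The helper name compare_lengths is also an assumption made for illustration.

```python
import io
import json


def compare_lengths(lines):
    """Yield (character_count, length_field) for each JSONL record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield len(record["text"]), record["length"]


# Hypothetical two-record sample standing in for a real .jsonl file.
# The "length" values here are invented, not real token counts.
sample = io.StringIO(
    '{"text": "Hello world, this is a test.", "length": 7}\n'
    '{"text": "Short.", "length": 2}\n'
)

results = list(compare_lengths(sample))
for chars, length in results:
    ratio = chars / length
    print(f"characters={chars} length={length} chars_per_unit={ratio:.2f}")
```

To run this on the real data, replace the sample with an open file handle, for example open("medium-345M-k40.train.jsonl", encoding="utf-8"). With the numbers reported in the question, 4750 characters against "length" = 1024 gives roughly 4.6 characters per unit, which would be consistent with "length" counting tokens rather than characters, although the thread itself does not confirm this.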