plotmachines
Multiple issues had to be fixed to get this running:
- function run_batch in train.py
def run_batch(model, args, device, compute_loss_fct):
    for arg in args:
        if arg is not None:
            arg = arg.to(device)
    output = model(*args, device=device)
    allloss = compute_loss_fct(output, args[0], args[1])
    return allloss.mean()
The tensors in args never get moved to the CUDA device (the loop rebinds the local variable arg, not the list element), so subsequent functions fail almost immediately; this could never have run successfully on a CUDA device.
def run_batch(model, args, device, compute_loss_fct):
    i = 0
    for arg in args:
        if arg is not None:
            args[i] = arg.to(device)
        i += 1
    output = model(*args, device=device)
    allloss = compute_loss_fct(output, args[0], args[1])
    return allloss.mean()
seems to work fine.
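For what it's worth, the manual index counter can also be avoided with a list comprehension — a sketch of the same fix, not code from the repo:

```python
def run_batch(model, args, device, compute_loss_fct):
    # Move every non-None argument to the target device; rebuilding the
    # list avoids the manual counter while keeping the same behavior.
    args = [arg.to(device) if arg is not None else None for arg in args]
    output = model(*args, device=device)
    allloss = compute_loss_fct(output, args[0], args[1])
    return allloss.mean()
```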
- in preprocessing/README.MD:
2. Steps for extracting outlines
Run extract_outlines.py to extract the outline-labeled documents that can be used as input for fine-tuning PlotMachines models.
The output will provide you with a csv of the outlines and stories where each row is a paragraph from a story. The columns are:
- story id: our format is "storyid_{int}" with the {int} after the underscore being this paragraph's index in the story (starting at 0)
- key/abstract: this is a binary signifier for us to know where the data came from, but it's just "K" for every row in wikiplots
- outline: the outline with points delimited by [SEP]
- discourse tag: I/B/C for intro, body, conclusion paragraphs respectively
- num_paragraphs: total number of paragraphs in this story
- paragraph: the paragraph text
- previous paragraph: text from the previous paragraph in the story
the referenced plots and titles files need to be 'pre-processed' themselves to remove any newlines inside each record (replacing them with a space, for example) to avoid being garbled by extract_outlines.py - could not have worked as supplied.
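A minimal sketch of that clean-up, assuming the wikiplots layout where stories in the plots file are separated by <EOS> lines (the separator default and function name here are assumptions, not from the repo):

```python
def flatten_plots(raw: str, sep: str = "<EOS>") -> list[str]:
    # Split the raw plots file on the story separator, then collapse
    # the newlines (and runs of whitespace) inside each story into
    # single spaces so every story occupies exactly one line.
    stories = raw.split(sep)
    return [" ".join(story.split()) for story in stories if story.strip()]
```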
3. Steps for splitting into train/dev/test splits
Please use the splits from wikiplots_splits.txt to construct the train, validation and test datasets that were used in the paper. Note that some stories may need to be removed (marked "flagged") due to potentially offensive and/or harmful content.
#!/bin/bash
rm -f train_encoded.csv val_encoded.csv test_encoded.csv
echo "in script $1"
while read -r line
do
    inp=(${line})
    plot=${inp[0]}
    outfile=${inp[-1]}
    grep "${plot}" *RAKE*.csv >> "${outfile}_encoded.csv"
done < "$1"
This will do it from the generated wikiplot.kwRAKE.csv.
References to 'dev_encoded.csv' throughout should probably be 'val_encoded.csv'; the naming is somewhat inconsistent.
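A toy run of that loop (the file names and rows here are made up) shows how each line of the splits file routes matching rows into its _encoded.csv:

```shell
# Two fake RAKE-encoded rows and a two-line splits file.
printf 'plot-1_0,K,outline,I\nplot-2_0,K,outline,I\n' > toyRAKE.csv
printf 'plot-1 train\nplot-2 val\n' > toy_splits.txt

while read -r line; do
  inp=(${line})
  plot=${inp[0]}
  outfile=${inp[1]}   # last field; the real script uses ${inp[-1]}
  grep "${plot}" *RAKE*.csv >> "${outfile}_encoded.csv"
done < toy_splits.txt

# inspect the routed rows
cat train_encoded.csv
```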
Hi, I ran into the same preprocessing problem in "2. Steps for extracting outlines". Could you share the processed file? Thanks!
Hi, have you solved your problem? I've run into the same problem.
grep will work better here with a _ after ${plot} to avoid matching plot-10, plot-100, plot-101, etc... grep's -m 1 flag might accomplish the same thing, too:
#!/bin/bash
rm -f train_encoded.csv val_encoded.csv test_encoded.csv
echo "in script $1"
while read -r line
do
    inp=(${line})
    plot=${inp[0]}
    outfile=${inp[-1]}
    grep "${plot}_" *RAKE*.csv >> "${outfile}_encoded.csv"
done < "$1"
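The collision is easy to see with plain substring matching (the row ids here are made up):

```python
rows = ["plot-1_0,K", "plot-10_0,K", "plot-100_0,K"]

# A bare story id is a prefix of longer ids, so "plot-1" hits all rows.
assert [r for r in rows if "plot-1" in r] == rows

# Anchoring on the underscore that terminates the id fixes the match.
assert [r for r in rows if "plot-1_" in r] == ["plot-1_0,K"]
```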