manubot-ai-editor
Add support for custom prompts and files metadata via YAML
General
- The status of this issue is work-in-progress (will be discussed in our next progress update meeting).
- If you have any comments on this new functionality, feel free to comment on this issue.
- Lines starting with `comment:` below represent internal comments for discussion with the software engineering team.
Problem
Currently, the Manubot AI Editor offers a fixed set of section-specific prompts for advanced manuscript revision. This set of section-specific prompts is automatically generated using the manuscript title, its keywords, and the section the text belongs to. However, these prompts are fixed and carry specific instructions to improve the text by following guidelines that might not be the ones a user expects. For example, the GitHub user @dhimmel tried to use our tool on one manuscript but reported aggressive rewriting, whereas he only needed basic copyediting (typos, grammar issues, etc.) and "shortening of select sections, possibly with custom prompts."
Proposed solution
Add two files that allow users to 1) write custom prompts (this file is easily sharable with other users) and 2) define how prompts are applied to manuscript files (this file is specific to the repository and not intended to be shared). Both files are placed in the root folder of the manuscript repository.
ai_revision-prompts.yaml
- This file is a YAML file.
- This file has the custom prompts.
- The prompts defined here can access different pieces of information/metadata about the manuscript.
- This file is easily sharable with the community, so it doesn't have any manuscript/repository-specific information.
The file has the following structure:
# Potential future feature: variables and templating can be defined here (YAML anchors, etc).
# if we use "prompts_files" as the top-level key, the prompt names are interpreted as regex for file matching
# if we use "prompts" as the top-level key, they are meant to be referenced from the config file
prompts_files:
  prompt_name: |
    Prompt content that can access the {manuscript.title} or the {manuscript.keywords}
  another_prompt_name: |
    Another prompt definition that does not access any manuscript's metadata.
  \.md$: |
    This would be a default prompt.
Notes:
- Variables and templating are a work-in-progress feature and are not included in this iteration. They might come for free using YAML's anchors, but we will not test that now.
- Prompt names also act as regexes that can match file names. This is intended to make prompts more shareable without additional configuration. This feature is assessed per prompt and enabled only if a prompt goes unused in `ai_revision-config.yaml` (or if that file does not exist). If the feature is enabled for a prompt, that prompt is automatically applied to files whose names match the `prompt_name` regex. For example, a prompt named `abstract` will apply to all files containing `abstract` in their names.
- Each paragraph in the manuscript is always revised by only one prompt (or not revised at all if no default prompt is provided).
- Referencing `{manuscript.title}` returns a string with the manuscript's title.
- Referencing `{manuscript.keywords}` returns a string with keywords separated by `, ` (comma + space), such as `keyword1, keyword2, keyword3`.
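The placeholder behavior described in these notes could be sketched roughly as follows. This is only an illustration, not the tool's implementation: the function name `render_prompt` and its parameters are hypothetical, and it assumes the title and keywords have already been read from the manuscript metadata.

```python
import re


def render_prompt(template: str, title: str, keywords: list[str]) -> str:
    """Fill in the two manuscript placeholders (hypothetical helper)."""
    values = {
        "manuscript.title": title,
        # Keywords are joined with comma + space, as described in the notes.
        "manuscript.keywords": ", ".join(keywords),
    }
    # Replace only the known placeholders; any other braces are left untouched.
    return re.sub(
        r"\{(manuscript\.(?:title|keywords))\}",
        lambda m: values[m.group(1)],
        template,
    )


prompt = render_prompt(
    "Proofread this paragraph from '{manuscript.title}' (keywords: {manuscript.keywords}).",
    title="An example title",
    keywords=["keyword1", "keyword2", "keyword3"],
)
print(prompt)
```

Substituting only an explicit set of placeholders (rather than calling `str.format` on the whole prompt) is a design choice in this sketch: it keeps unrelated braces in a prompt from raising errors.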
ai_revision-config.yaml
- In this issue, this file will specify how prompts (defined in `ai_revision-prompts.yaml`) are applied to files.
- In the future, this file is intended to contain other configuration entries for the AI Revision workflow.
The file has the following structure:
files:
  matchings:
    # in-order list for matching. for each file, find the first entry that matches file(s) and
    # apply prompt(s).
    - files:
        # always interpreted as regex
        - abstract
        - 04\..*-supplement\.md
      prompt: prompt_name
  # default prompt for files not matched in list above. can also be omitted for no
  # fallback (file is ignored). also, regex matching above can accommodate
  # "quasi-defaults" for higher-level-granularity distinctions (maybe like .md files
  # and .txt files?), i.e. patterns that match many but not all files.
  default_prompt: some_fallback_prompt
  # file(s) to ignore (not revise). overrides `default_prompt` and `matchings`.
  ignore:
    - data
    - quote-that-shouldnt-be-revised
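The resolution order stated in the comments above (ignore first, then the first matching entry, then the default) might look like this in Python. This is a minimal sketch under stated assumptions: `resolve_prompt` is a hypothetical name, the YAML is assumed to be already parsed into a dict, and patterns are assumed to match anywhere in the file name (`re.search`).

```python
import re
from typing import Optional


def resolve_prompt(filename: str, files_cfg: dict) -> Optional[str]:
    """Return the prompt name for `filename`, or None if the file is not revised."""
    # 1. `ignore` overrides both `matchings` and `default_prompt`.
    for pattern in files_cfg.get("ignore", []):
        if re.search(pattern, filename):
            return None
    # 2. The first entry in `matchings` with a matching regex wins.
    for entry in files_cfg.get("matchings", []):
        if any(re.search(pattern, filename) for pattern in entry["files"]):
            return entry["prompt"]
    # 3. Otherwise fall back to `default_prompt`, which may be omitted.
    return files_cfg.get("default_prompt")


# The structure above, written as the dict a YAML parser would produce.
config = {
    "files": {
        "matchings": [
            {"files": ["abstract", r"04\..*-supplement\.md"], "prompt": "prompt_name"},
        ],
        "default_prompt": "some_fallback_prompt",
        "ignore": ["data", "quote-that-shouldnt-be-revised"],
    }
}

for name in ["01.abstract.md", "02.introduction.md", "data.md"]:
    print(name, "->", resolve_prompt(name, config["files"]))
```

Note that, with substring matching, the `data` pattern would also exclude a file such as `metadata.yaml`; anchoring patterns (for example `^data`) would be one way to narrow this.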
Full examples
Only `ai_revision-prompts.yaml` is defined
- Example based on the PhenoPLIER manuscript repository.
- File names here differ from those in the original manuscript to accommodate this case (no `ai_revision-config.yaml` file).
Files under the `content/` folder (file names modified from the original manuscript):
final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results.framework.md
04.05.01.results.crispr.md
04.15.results.drug_disease_prediction.md
04.20.00.results.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml
ai_revision-prompts.yaml
prompts_files:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    the research problem/question is clear,
    the solution proposed is clear,
    the text grammar is correct,
    spelling errors are fixed,
    and the text is in active voice and has a clear sentence structure
  introduction|discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    the research problem/question is clear,
    the solution proposed is clear,
    the text grammar is correct,
    spelling errors are fixed,
    and the text is in active voice and has a clear sentence structure
  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    most references to figures and tables are kept,
    the details are enough to clearly explain the outcomes,
    sentences are concise and to the point,
    the text minimizes the use of jargon,
    the text grammar is correct,
    spelling errors are fixed,
    and the text has a clear sentence structure
  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    most of the citations to other academic papers are kept,
    most of the technical details are kept,
    most references to equations (such as "Equation (@id)") are kept,
    all equations definitions (such as '*equation_definition') are included with newlines before and after,
    the most important symbols in equations are defined,
    the text grammar is correct,
    spelling errors are fixed,
    and the text has a clear sentence structure
  references: null
  \.md$: |
    Proofread the following paragraph
Notes:
- Note we use `prompts_files` as the top-level key name.
- The same prompt is used for files that contain the introduction or discussion sections.
ai_revision-config.yaml
This file does not exist in this example.
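To make the file-to-prompt mapping in this example concrete, here is a rough sketch of how prompt names could be matched against the file names listed above. It is an illustration only: `prompt_key_for` is a hypothetical helper, prompt bodies are shortened, and two behaviors are assumptions rather than settled design, namely that the first matching prompt in file order wins and that a `null` prompt means the file is not revised.

```python
import re
from typing import Optional

# Keys and their order follow the example above; bodies are shortened.
prompts_files = {
    "abstract": "Revise the following paragraph from the Abstract ...",
    "introduction|discussion": "Revise the following paragraph from the ...",
    "results": "Revise the following paragraph from the Results section ...",
    "methods": "Revise the paragraph(s) below from the Methods section ...",
    "references": None,
    r"\.md$": "Proofread the following paragraph",
}


def prompt_key_for(filename: str) -> Optional[str]:
    """Return the first prompt name whose regex matches, or None if not revised."""
    for name, body in prompts_files.items():
        if re.search(name, filename):
            # Assumption: a null prompt marks the file as "do not revise".
            return name if body is not None else None
    return None


for filename in [
    "01.abstract.md",
    "04.05.01.results.crispr.md",
    "10.references.md",
    "15.acknowledgements.md",
    "metadata.yaml",
]:
    print(filename, "->", prompt_key_for(filename))
```

Under these assumptions, `00.front-matter.md`, `15.acknowledgements.md`, and `50.00.supplementary_material.md` fall through to the `\.md$` prompt and would be proofread unless an empty prompt is added for them.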
Both `ai_revision-prompts.yaml` and `ai_revision-config.yaml` are defined
- This example follows exactly the same file names in the PhenoPLIER manuscript repository.
- The matching between prompts and files should be exactly the same as in the previous example, although here we manually specify all matchings using the `ai_revision-config.yaml` file.
Files under the `content/` folder:
final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results_framework.md
04.05.01.crispr.md
04.15.drug_disease_prediction.md
04.20.00.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml
ai_revision-prompts.yaml
prompts:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    the research problem/question is clear,
    the solution proposed is clear,
    the text grammar is correct,
    spelling errors are fixed,
    and the text is in active voice and has a clear sentence structure
  introduction_discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    the research problem/question is clear,
    the solution proposed is clear,
    the text grammar is correct,
    spelling errors are fixed,
    and the text is in active voice and has a clear sentence structure
  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    most references to figures and tables are kept,
    the details are enough to clearly explain the outcomes,
    sentences are concise and to the point,
    the text minimizes the use of jargon,
    the text grammar is correct,
    spelling errors are fixed,
    and the text has a clear sentence structure
  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
    most of the citations to other academic papers are kept,
    most of the technical details are kept,
    most references to equations (such as "Equation (@id)") are kept,
    all equations definitions (such as '*equation_definition') are included with newlines before and after,
    the most important symbols in equations are defined,
    the text grammar is correct,
    spelling errors are fixed,
    and the text has a clear sentence structure
  default: |
    Proofread the following paragraph
Notes:
- Note we use `prompts` as the top-level key name since prompts will be referenced from the config file below.
ai_revision-config.yaml
files:
  matchings:
    - files:
        - abstract
      prompt: abstract
    - files:
        - introduction
      prompt: introduction_discussion
    - files:
        - 04\..+\.md
      prompt: results
    - files:
        - discussion
      prompt: introduction_discussion
    - files:
        - methods
      prompt: methods
  default_prompt: default
  ignore:
    - front\-matter
    - acknowledgements
    - supplementary_material
    - references
Notes:
- This example is too verbose, and it clearly shows that having prompt names that can also be used as regexes for file matching in `ai_revision-prompts.yaml` (suggested by @vincerubinetti) is really convenient.
- This example could be converted easily to a mix between "prompts matching file names" and "files that need specific prompts matching" (like for the Results section, where not all files have "results" in their names).
Only a single, generic prompt is defined
- This example follows exactly the same file names in Daniel's article on connectivity search.
- Daniel only wanted to proofread the manuscript, not use section-specific prompts.
Files under the `content/` folder:
images/
media/
00.front-matter.md
01.abstract.md
05.main-text.md
90.back-matter.md
manual-references-2023-04-06.json
manual-references.yaml
metadata.yaml
response-to-reviewers.md
ai_revision-prompts.yaml
prompts:
  \.md$: |
    Proofread the following paragraph
ai_revision-config.yaml
files:
  ignore:
    - front\-matter
    - back\-matter
    - response\-to\-reviewers
Notes:
- This example could be written using only the `ai_revision-prompts.yaml` file with `prompts_files` as the top-level key instead of `prompts` and adding one "empty prompt" for each of the ignore list entries (`front\-matter: null`, etc.).
Testing
- New/updated unit tests that focus on the parsing of the new files and the correct revision of manuscript files.
  - Our unit tests currently have mock models that "revise" a paragraph by returning the same paragraph, randomly swapping characters, etc., which could be used here.
- Fork an existing Manubot-based manuscript to perform global testing (triggering the `ai_revision` workflow from the GitHub interface as a user would do). We could also ask for feedback from the manuscript's authors.
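As a rough idea of what such unit tests might check, the sketch below uses a mock model that returns each paragraph unchanged and records the prompt it received. Everything here is hypothetical and for discussion only: `RecordingModel`, `revise_file`, and the test names are not part of the existing code base, and the real mock models may have a different interface.

```python
class RecordingModel:
    """Hypothetical mock: returns the paragraph unchanged and records each prompt."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def revise(self, paragraph: str, prompt: str) -> str:
        self.calls.append(prompt)
        return paragraph


def revise_file(paragraphs, prompt, model):
    """Revise every paragraph with the file's single prompt; None skips the file."""
    if prompt is None:
        return list(paragraphs)
    return [model.revise(p, prompt) for p in paragraphs]


def test_ignored_file_is_not_sent_to_model():
    model = RecordingModel()
    revised = revise_file(["A paragraph."], None, model)
    assert revised == ["A paragraph."]
    assert model.calls == []


def test_each_paragraph_uses_the_same_prompt():
    model = RecordingModel()
    revise_file(["First.", "Second."], "Proofread the following paragraph", model)
    assert model.calls == ["Proofread the following paragraph"] * 2


test_ignored_file_is_not_sent_to_model()
test_each_paragraph_uses_the_same_prompt()
```

Recording the prompt rather than the revised text keeps the tests focused on the new behavior (which prompt is chosen for which file) and independent of any real language model.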