manubot-ai-editor Add support for custom prompts and files metadata via YAML

Open miltondp opened this issue 1 year ago • 15 comments

General

The status of this issue is work-in-progress (will be discussed in our next progress update meeting).
If you have any comments on this new functionality, feel free to comment on this issue.
Lines starting with comment: below represent internal comments for discussion with the software engineering team.

Problem

Currently, the Manubot AI Editor offers a fixed set of section-specific prompts for advanced manuscript revision. This set of prompts is automatically generated using the manuscript title, its keywords, and the section the text belongs to. However, these prompts are fixed and carry specific instructions to improve the text by following guidelines that might not be the ones a user expects. For example, the GitHub user @dhimmel tried to use our tool on one manuscript but reported aggressive rewriting, whereas he only needed basic copyediting (typos, grammar issues, etc.) and "shortening of select sections, possibly with custom prompts."

Proposed solution

Add two files that allow users to 1) write custom prompts (this file is easily sharable with other users) and 2) define how prompts are applied to manuscript files (this file is specific to the repository and not intended to be shared). Both files are placed in the root folder of the manuscript repository.

`ai_revision-prompts.yaml`

This file is a YAML file.
This file has the custom prompts.
The prompts defined here can access different pieces of information/metadata about the manuscript.
This file is easily sharable with the community, so it doesn't have any manuscript/repository-specific information.

The file has the following structure:

# Potential future feature: variables and templating can be defined here (YAML anchors, etc).

# if we use "prompts_files" as the top-level key, the prompt names are interpreted as regex for file matching
# if we use "prompts" as the top-level key, they are meant to be referenced from the config file
prompts_files:
  prompt_name: |
      Prompt content that can access the {manuscript.title} or the {manuscript.keywords}
  another_prompt_name: |
      Another prompt definition that does not access any manuscript's metadata.
  \.md$: |
    This would be a default prompt.

Notes:

Variables and templating are a work-in-progress feature and are not included in this iteration. They might come for free using YAML's anchors, but we are not going to test that now.
Prompt names also act as regexes that can match file names. This is intended to make prompts more shareable without additional configuration. This feature is assessed per prompt and enabled only if a prompt goes unused in ai_revision-config.yaml (or if that file does not exist). If the feature is enabled for a prompt, then the prompt is automatically used for filenames matching the prompt_name regex. For example, a prompt named abstract will apply to all files containing abstract in their names.
Each paragraph in the manuscript is always revised by only one prompt (or not revised at all if no default prompt is provided).
Referencing {manuscript.title} returns a string with the manuscript's title.
Referencing {manuscript.keywords} returns a string with keywords separated by , (comma + space), such as keyword1, keyword2, keyword3.
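To make the two placeholders above concrete, here is a minimal Python sketch of how they could be filled in. It assumes the prompt text is rendered with plain `str.format`; the function name `render_prompt` and its parameters are hypothetical and are not part of the proposal.

```python
from types import SimpleNamespace


def render_prompt(template: str, title: str, keywords: list[str]) -> str:
    """Fill {manuscript.title} and {manuscript.keywords} in a prompt template.

    Hypothetical helper: keywords are joined with ", " (comma + space),
    as described in the notes above.
    """
    manuscript = SimpleNamespace(
        title=title,
        keywords=", ".join(keywords),
    )
    # str.format resolves attribute lookups such as {manuscript.title}
    return template.format(manuscript=manuscript)


template = (
    "Revise this paragraph from a paper titled '{manuscript.title}' "
    "(keywords: {manuscript.keywords})."
)
print(render_prompt(template, "An example title", ["keyword1", "keyword2", "keyword3"]))
```

Note that plain `str.format` only resolves attribute and index lookups, so a placeholder that calls a method would need extra handling beyond this sketch.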

`ai_revision-config.yaml`

In this issue, this file will specify how prompts (defined in ai_revision-prompts.yaml) are applied to files.
In the future, this file is intended to contain other configuration entries for the AI Revision workflow.

The file has the following structure:

files:
  matchings:
    # in-order list for matching. for each file, find the first entry that matches file(s) and
    #  apply prompt(s).
    - files:
        # always interpreted as regex
        - abstract
        - 04\..*-supplement\.md
      prompt: prompt_name
  
  # default prompt for files not matched in list above. can also be omitted for no
  #  fallback (file is ignored). also, regex matching above can accommodate
  #  "quasi-defaults" for higher-level-granularity distinctions (maybe like .md files
  #  and .txt files?), i.e. patterns that match many but not all files.
  default_prompt: some_fallback_prompt
  
  # file(s) to ignore (not revise). overrides `default_prompt` and `matchings`.
  ignore:
    - data
    - quote-that-shouldnt-be-revised
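
The precedence described in the comments above (ignore first, then matchings in order, then default_prompt) could be sketched in Python as follows. This is only an illustration under stated assumptions: the config is already parsed into a dict, patterns are applied with `re.search` to the bare file name, and `resolve_prompt` is a hypothetical name.

```python
import re


def resolve_prompt(filename, config):
    """Return the prompt name to use for `filename`, or None if it is not revised."""
    files_cfg = config.get("files", {})

    # 1) `ignore` overrides both `matchings` and `default_prompt`
    for pattern in files_cfg.get("ignore", []):
        if re.search(pattern, filename):
            return None

    # 2) `matchings` is an in-order list: the first entry that matches wins
    for entry in files_cfg.get("matchings", []):
        if any(re.search(pattern, filename) for pattern in entry["files"]):
            return entry["prompt"]

    # 3) fall back to `default_prompt`; if it is omitted, the file is ignored
    return files_cfg.get("default_prompt")


# The structure shown above, written as the dict a YAML parser would produce
config = {
    "files": {
        "matchings": [
            {"files": ["abstract", r"04\..*-supplement\.md"], "prompt": "prompt_name"},
        ],
        "default_prompt": "some_fallback_prompt",
        "ignore": ["data", "quote-that-shouldnt-be-revised"],
    }
}

for name in ["01.abstract.md", "02.introduction.md", "metadata.yaml"]:
    print(name, "->", resolve_prompt(name, config))
```

Under these assumptions, a short ignore pattern such as `data` also matches `metadata.yaml`, since the pattern may match anywhere in the file name.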

Full examples

Only `ai_revision-prompts.yaml` is defined

Example based on the PhenoPLIER manuscript repository.
File names here differ from those in the original manuscript to accommodate this case (no ai_revision-config.yaml file).

Files under content/ folder (file names modified from the original manuscript):

final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results.framework.md
04.05.01.results.crispr.md
04.15.results.drug_disease_prediction.md
04.20.00.results.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml

`ai_revision-prompts.yaml`

prompts_files:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  introduction|discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      most references to figures and tables are kept,
      the details are enough to clearly explain the outcomes,
      sentences are concise and to the point,
      the text minimizes the use of jargon,
      the text grammar is correct,
      spelling errors are fixed,
      and the text has a clear sentence structure

  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
       most of the citations to other academic papers are kept,
       most of the technical details are kept,
       most references to equations (such as "Equation (@id)") are kept,
       all equations definitions (such as '*equation_definition') are included with newlines before and after,
       the most important symbols in equations are defined,
       the text grammar is correct,
       spelling errors are fixed,
       and the text has a clear sentence structure

  references: null

  \.md$: |
    Proofread the following paragraph

Notes:

Note we use prompts_files as the top-level key name.
The same prompt is used for files that contain the introduction or discussion sections.
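
As a rough check of how the prompt names above would map onto the listed files, here is a Python sketch. It assumes prompt names are tried in the order they appear in the file and that the first `re.search` match wins; that ordering is an assumption, since the proposal does not spell it out for prompts_files. The prompt texts are shortened and `match_prompt_name` is a hypothetical name.

```python
import re

# Prompt names from the example above; prompt texts are abbreviated here.
prompts_files = {
    "abstract": "Revise the following paragraph from the Abstract ...",
    "introduction|discussion": "Revise the following paragraph from the ...",
    "results": "Revise the following paragraph from the Results section ...",
    "methods": "Revise the paragraph(s) below from the Methods section ...",
    "references": None,  # null prompt: matching files are not revised
    r"\.md$": "Proofread the following paragraph",
}


def match_prompt_name(filename, prompts):
    """Return the first prompt name whose regex matches `filename`, else None."""
    for name in prompts:  # dicts preserve insertion order
        if re.search(name, filename):
            return name
    return None


for filename in [
    "01.abstract.md",
    "04.05.01.results.crispr.md",
    "05.discussion.md",
    "10.references.md",
    "15.acknowledgements.md",
    "metadata.yaml",
]:
    name = match_prompt_name(filename, prompts_files)
    revised = name is not None and prompts_files[name] is not None
    print(f"{filename}: prompt={name!r}, revised={revised}")
```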

`ai_revision-config.yaml`

This file does not exist in this example.

Both `ai_revision-prompts.yaml` and `ai_revision-config.yaml` are defined

This example follows exactly the same file names in the PhenoPLIER manuscript repository.
The matching between prompts and files should be exactly the same as in the previous example, although here, we manually specify all matchings using the ai_revision-config.yaml file.

Files under content/ folder:

final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results_framework.md
04.05.01.crispr.md
04.15.drug_disease_prediction.md
04.20.00.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml

`ai_revision-prompts.yaml`

prompts:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  introduction_discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      most references to figures and tables are kept,
      the details are enough to clearly explain the outcomes,
      sentences are concise and to the point,
      the text minimizes the use of jargon,
      the text grammar is correct,
      spelling errors are fixed,
      and the text has a clear sentence structure

  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
       most of the citations to other academic papers are kept,
       most of the technical details are kept,
       most references to equations (such as "Equation (@id)") are kept,
       all equations definitions (such as '*equation_definition') are included with newlines before and after,
       the most important symbols in equations are defined,
       the text grammar is correct,
       spelling errors are fixed,
       and the text has a clear sentence structure

  default: |
    Proofread the following paragraph

Notes:

Note we use prompts as the top-level key name since prompts will be referenced from the config file below.

`ai_revision-config.yaml`

files:
  matchings:
    - files:
        - abstract
      prompt: abstract
    - files:
        - introduction
      prompt: introduction_discussion
    - files:
        - 04\..+\.md
      prompt: results
    - files:
        - discussion
      prompt: introduction_discussion
    - files:
        - methods
      prompt: methods
  
  default_prompt: default
  
  ignore:
    - front\-matter
    - acknowledgements
    - supplementary_material
    - references

Notes:

This example is too verbose, and it clearly shows that having prompt names that can also be used as regex for file matching in ai_revision-prompts.yaml (suggested by @vincerubinetti) is really convenient.
This example could be converted easily to a mix between "prompts matching file names" and "files that need specific prompts matching" (like for the Results section where not all files have the "results" in their names).

Only a single, generic prompt is defined

This example follows exactly the same file names in Daniel's article on connectivity search.
Daniel only wanted to proofread the manuscript, not use section-specific prompts.

Files under content/ folder:

images/
media/
00.front-matter.md
01.abstract.md
05.main-text.md
90.back-matter.md
manual-references-2023-04-06.json
manual-references.yaml
metadata.yaml
response-to-reviewers.md

`ai_revision-prompts.yaml`

prompts:
  \.md$: |
    Proofread the following paragraph

`ai_revision-config.yaml`

files:
  ignore:
    - front\-matter
    - back\-matter
    - response\-to\-reviewers

Notes:

This example could be written using only the ai_revision-prompts.yaml file with prompts_files as the top-level key instead of prompts and adding one "empty prompt" for each of the ignore list entries (front\-matter: null, etc).

Testing

New/updated unit tests that focus on the parsing of the new files and the correct revision of manuscript files.
- our unit tests currently have mock models that "revise" a paragraph by returning the same paragraph, randomly swapping characters, etc, that could be used.
Fork an existing Manubot-based manuscript to perform global testing (triggering the ai_revision workflow from the GitHub interface as a user would do). We could also ask for feedback from the manuscript's authors.
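
For illustration, mock "models" of the kind mentioned above could look like the following Python sketch. The function names `identity_model` and `swap_model` are hypothetical and do not refer to the classes in the actual test suite; the seed parameter is an assumption added to keep tests reproducible.

```python
import random


def identity_model(paragraph: str) -> str:
    """Mock revision that returns the paragraph unchanged."""
    return paragraph


def swap_model(paragraph: str, seed: int = 0) -> str:
    """Mock revision that swaps two randomly chosen characters.

    A fixed seed makes the "revision" deterministic across test runs.
    """
    if len(paragraph) < 2:
        return paragraph
    rng = random.Random(seed)
    i, j = rng.sample(range(len(paragraph)), 2)  # two distinct positions
    chars = list(paragraph)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


text = "Proofread the following paragraph"
print(identity_model(text))
print(swap_model(text, seed=42))
```

Because such mocks never call an external service, unit tests can focus on whether each file was routed to the expected prompt (or skipped) rather than on the quality of the revision.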

Aug 01 '23 19:08 miltondp
