
Papers

shm007g opened this issue • 7 comments

shm007g · Apr 19, 2023

[Instruction tuning with GPT-4, Microsoft, 2023.04]

  • This paper aims to build the first self-instruct-tuned LLM trained on GPT-4 responses, based on LLaMA-7B (blog).
  • First, it collects a 52K English instruction-following dataset (plus 52K in Chinese via translation) by feeding the 52K prompts of the Alpaca dataset to GPT-4. It then performs supervised fine-tuning (as in Stanford Alpaca) on this data to obtain LLaMA-GPT4(-CN)(-7B).
  • Second, it trains a reward model based on OPT-1.3B. Because hand-labeling a comparison dataset is costly and GPT-4 judges quality well, it uses GPT-4 to assign each response a score from 1 to 10 for every prompt (a minimal sketch of this scoring setup follows this list).
  • Third, to evaluate the self-instruct-tuned models on unseen instructions, it chooses three instruction-following datasets: User-Oriented-Instructions-252, Vicuna-Instructions-80, and Unnatural Instructions.
    • It uses Amazon Mechanical Turk for human evaluation of model generations on User-Oriented-Instructions-252 under the 3H alignment criteria (helpful, honest, harmless). The instruction-tuned LLaMA-GPT4(-7B) performs quite comparably to the original GPT-4.
    • It uses GPT-4 for automatic evaluation of different SOTA models on Vicuna-Instructions-80: for each evaluation, GPT-4 rates the response quality of two models on a scale from 1 to 10. LLaMA-GPT4(-7B), fine-tuned on GPT-4 outputs, works better than Alpaca-13B (fine-tuned on ChatGPT outputs), showing that GPT-4 outputs are much better suited for instruction tuning.
    • Automatic evaluation on the Chinese Vicuna-Instructions-80 (translated by GPT-4); Vicuna-13B performs well there too.
    • It computes ROUGE-L on Unnatural Instructions, evaluated on 9K samples, and finds that LLaMA-GPT4-7B performs better when responses are long (a ROUGE-L scoring sketch also follows this list).
  • Most importantly, regarding the reward model: Figure 4 shows that the 1.3B regression reward model fine-tuned on the GPT-4-generated comparison data ranks responses about as well as the original GPT-4. This suggests a very promising way to perform RLHF and the full three-step fine-tuning pipeline (as in ChatGPT) in future work.
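
A minimal sketch of the GPT-4 scoring setup described above, assuming the openai>=1.0 Python client; the prompt wording and the rating-extraction regex are my own illustration, not the paper's exact template:

```python
import re
from openai import OpenAI  # pip install openai (>=1.0 client assumed)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_response(instruction: str, response: str) -> int:
    """Ask GPT-4 to score one response on a 1-10 scale, as in the paper's
    reward-model data collection."""
    prompt = (
        "Rate the quality of the following response to the instruction "
        "on a scale from 1 to 10. Reply with a single integer.\n\n"
        f"Instruction: {instruction}\n\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    match = re.search(r"\d+", out.choices[0].message.content)
    return int(match.group()) if match else 0
```

Scoring every candidate response this way yields the comparison data that the OPT-1.3B reward model is then fit on.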
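
And a sketch of ROUGE-L scoring as used on Unnatural Instructions, here via Google's rouge-score package (my choice of implementation; the paper does not name one):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(reference: str, prediction: str) -> float:
    """F-measure of the longest common subsequence between the reference
    answer and the model output."""
    return scorer.score(reference, prediction)["rougeL"].fmeasure

print(rouge_l_f1("The capital of France is Paris.",
                 "Paris is the capital of France."))
```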

shm007g · Apr 19, 2023

[PaLM 2 Technical Report, Google, 2023.05]

  • scaling law: compute-optimal scaling follows a power law with data and model size grown in roughly equal proportion (1:1), so data size is at least as important as model size; data selection and efficient architectures/objectives improve performance as well; it uses a more multilingual and diverse pre-training mixture extending across hundreds of languages and domains; it builds on the strong UL2 objective (a 20B model in the UL2 paper); the largest PaLM 2-L is significantly smaller than the largest PaLM (540B) but performs much better.
  • scaling law experiment: there is an optimal parameter size at each compute scale: 10.7B params at 10^22 FLOPs, 3.35B at 10^21, 1.04B at 10^20 (a power-law fit of these points is sketched after this list).
  • model size: three variants of PaLM 2: a Small (S), Medium (M), and Large (L) version, where "PaLM 2" refers to the Large version. The blog says there will be four sizes, from smallest to largest: Gecko, Otter, Bison, and Unicorn.
  • Evaluation: six high-level categories of academic benchmarks: classification and question answering, reasoning, coding, translation, and natural language generation; plus language proficiency exams as a human benchmark.
  • (1) Language proficiency exams (multilingual): PaLM 2 passes all six professional language proficiency exams at the C2 level (Chinese, Japanese, Italian, French, Spanish, German). The model received generic instruction fine-tuning without exam contents and passes the exams with zero-shot prompting, graded by native human evaluators.
  • (2) Classification and question answering: datasets commonly used in the LLM literature, plus multilingual capabilities.
  • (2.1) English QA and classification tasks (one-shot setting):
    • Open-domain closed-book question answering tasks: TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), and WebQuestions (Berant et al., 2013)
    • Cloze and completion tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), and StoryCloze (Mostafazadeh et al., 2016)
    • Winograd-style tasks: Winograd (Levesque et al., 2012) and WinoGrande (Sakaguchi et al., 2021)
    • Reading comprehension: SQuAD v2 (Rajpurkar et al., 2018) and RACE (Lai et al., 2017)
    • Common sense reasoning: PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and OpenBookQA (Mihaylov et al., 2018)
    • SuperGLUE (Wang et al., 2019)
    • Natural language inference: Adversarial NLI (ANLI; Nie et al., 2020)
  • (2.2) Multilingual QA (one-shot and no-context settings): TyDi QA (Clark et al., 2020)
  • (2.3) Multilingual toxicity classification
    • Toxicity classification with CivilComments
    • Multilingual toxicity classification with Jigsaw Multilingual
  • (3) Reasoning:
  • (3.1) representative reasoning datasets in a few-shot setting: WinoGrande (Sakaguchi et al., 2021), ARC-C (Clark et al., 2018), DROP (Dua et al., 2019), StrategyQA (Geva et al., 2021), CommonsenseQA (CSQA; Talmor et al., 2019), XCOPA (Ponti et al., 2020), and BIG-Bench (BB) Hard (Suzgun et al., 2022); competitive with GPT-4.
    • Multilingual common-sense reasoning: XCOPA
    • BIG-Bench (BB) Hard: 23 tasks out of 200+ on which LLMs performed below the average human, such as multi-step arithmetic problems (multistep_arithmetic)
  • (3.2) Mathematical reasoning: fine-tuned on the Flan dataset (1,800 tasks, at least 20 instruction templates per task)
    • MATH (Hendrycks et al., 2021), which contains 12,500 problems from high school competitions in 7 mathematics subject areas
    • GSM8K (Cobbe et al., 2021), a dataset of 8,500 grade school math word problems
    • MGSM (Shi et al., 2023), a multilingual version of GSM8K with translations of a subset of examples into ten typologically diverse languages.
  • (4) Coding: the PaLM 2-S model is trained on an extended, code-heavy, heavily multilingual data mixture, producing PaLM 2-S*.
    • Code generation: three coding datasets: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and ARCADE (Yin et al., 2022); PaLM 2-S* outperforms PaLM-540B-Coder on all benchmarks in a few-shot setting.
    • Multilingual evaluation: BabelCode (Orlanski et al., 2023), which translates HumanEval into a variety of other programming languages, including C++, Java, Go, Haskell, and Julia.
  • (5) Translation
    • WMT21 experimental setup: automatic metric using BLEURT; human metric using Multidimensional Quality Metrics (MQM) with hired professional translators
    • Regional translation experimental setup: FRMT benchmark
    • Potential misgendering harms
  • (6) Natural language generation: ROUGE in a 1-shot setting
    • XLSum (Hasan et al., 2021), which asks a model to summarize a news article
    • WikiLingua (Ladhak et al., 2020), which focuses on generating section headers for step-by-step instructions from WikiHow
    • XSum (Narayan et al., 2018), which tasks a model with generating a news article’s first sentence
    • Potential harms and bias: ParlAI Dialogue Safety, RealToxicityPrompts, BBQ Bias Benchmark for QA, Multilingual Representational Bias
    • Multilingual capabilities: explaining jokes, explaining translation ambiguities, translating into dialects, expanding abbreviations and fixing typos, converting formal text into colloquial chat text, transliterating into new scripts
  • (7) Memorization
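
As referenced in the scaling-law bullet above, here is a small sketch (my own illustration, not code from the paper) fitting a power law N_opt ∝ C^a to the three compute-optimal points quoted there:

```python
import numpy as np

# Compute-optimal points quoted above: (training FLOPs, optimal parameter count)
C = np.array([1e20, 1e21, 1e22])
N = np.array([1.04e9, 3.35e9, 10.7e9])

# Fit log10(N) = a * log10(C) + b, i.e. N_opt ≈ 10^b * C^a
a, b = np.polyfit(np.log10(C), np.log10(N), 1)
print(f"N_opt ~ C^{a:.2f}")  # exponent ≈ 0.5
print(f"extrapolated optimum at 1e23 FLOPs: {10**(a * 23 + b) / 1e9:.0f}B params")
```

An exponent near 0.5 is exactly the equal-proportion (1:1) scaling of parameters and data the report describes, since C ≈ 6ND implies both N and D grow as roughly the square root of compute.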

shm007g · May 12, 2023

[GPT-4 Technical Report, OpenAI, 2023.03]

  • no further details about architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
  • Multi-modal: accept image and text inputs and produce text outputs.
  • Academic and professional exams (designed for humans): exhibits human-level performance on the majority of these exams.
  • traditional NLP benchmarks: outperforms previous LLMs and systems on academic benchmarks (MMLU, HellaSwag, AI2 Reasoning Challenge (ARC), WinoGrande, HumanEval, DROP, GSM-8K).
  • HumanEval dataset: mean log pass rate is predictable from compute, just as the final loss is (a pass@k estimator sketch follows this list).
  • inverse scaling prize: on Hindsight Neglect, GPT-4 reverses the trend.
  • open sourcing OpenAI Evals: https://github.com/openai/evals
  • Visual input: behaves similarly to the text-only setting.
  • hallucinations: GPT-4 reduces hallucinations relative to GPT-3.5, scoring 19 percentage points higher on OpenAI's internal factuality evaluations (which cover learning, technology, writing, history, math, science, recommendation, code, and business).
  • TruthfulQA: the RLHF post-trained GPT-4 is much better than GPT-3.5; it lacks knowledge of events after September 2021, where the vast majority of its data cuts off.
  • not fully reliable: hallucinations, limited context window, does not learn from experience.
  • brings novel safety challenges.
  • developed infrastructure and optimization methods with predictable behavior across multiple scales.
  • GPT-4 System Card: more than half the length of the paper.
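
As referenced in the HumanEval bullet, a sketch of the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), which these reports rely on for code benchmarks:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from Chen et al. (2021): 1 - C(n-c, k) / C(n, k),
    computed in a numerically stable product form, where n = samples drawn
    per problem and c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=13, k=1))  # 0.065, i.e. c/n for k=1
```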

shm007g · May 15, 2023

[Sparks of Artificial General Intelligence: Early experiments with GPT-4, MSFT, 2023.04]

  • (1) Refinement: the experiments were refined over the span of a month.
  • (2) Multimodal and interdisciplinary composition: not only does it demonstrate a high level of proficiency in domains such as literature, medicine, law, mathematics, physical sciences, and programming, it is also able to combine skills; it understands image and text input and can manipulate text and images in genuine ways, not just copy them; it does not understand harmony in music.
  • (3) Code: it can reason about code execution, simulate the effects of instructions, and explain the results in natural language, even for pseudocode.
  • HumanEval, a description-to-code benchmark; LeetCode, 100 samples per difficulty level, solved within the first 5 attempts; real-world tasks: data visualization, front-end / game development, writing code for deep learning, interfacing with LaTeX.
  • understands existing code and reasons about code execution; executes Python code (plugin?).
  • (4) Mathematical abilities
  • GSM8K: an elementary-school math dataset of 8,000 questions on topics such as arithmetic, fractions, geometry, and word problems (an answer-checking sketch follows this list);
  • MATH: a high-school math dataset of 12,500 questions on topics such as algebra, calculus, trigonometry, and probability;
  • MMMLU-STEM: 2,000 multiple-choice questions covering high-school and college STEM topics;
  • Minerva, a specially fine-tuned math model, scores between text-davinci-003 and GPT-4; GPT-4 makes many mistakes on MATH due to arithmetic and calculation errors;
  • Fermi questions: require both quantitative thinking and general knowledge; models don't make much progress here;
  • Higher-level mathematics: problems from the 2022 International Mathematical Olympiad;
  • (5) Real-world interaction: tool use and embodied interaction;
  • (6) Interaction with humans: successfully passes the Sally-Anne test, a classic false-belief test; miscommunication and misunderstanding; explainability;
  • (7) Discriminative capabilities: across different aspects and situations; personally identifiable information (PII); text anonymization benchmark (TAB); TruthfulQA, for misconceptions and fact-checking;
  • (8) Limitations: lack of planning in arithmetic/reasoning problems; no long-term memory;
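
As referenced in the GSM8K bullet, a hedged sketch of how GSM8K answers are commonly checked: ground-truth solutions end with "#### <answer>", and the last number in the model's chain of thought is taken as its prediction (the extraction regex is my own simple heuristic):

```python
import re

def gsm8k_gold(solution: str) -> str:
    """GSM8K reference solutions end with '#### <answer>'."""
    return solution.split("####")[-1].strip().replace(",", "")

def final_number(model_output: str) -> str | None:
    """Naive heuristic: take the last number appearing in the model's output."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return nums[-1] if nums else None

gold = gsm8k_gold("Janet has 3 + 4 = 7 apples.\n#### 7")
pred = final_number("Adding 3 and 4 gives 7, so the answer is 7.")
print(pred == gold)  # True
```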

shm007g · May 15, 2023

OpenAI Research

InstructGPT: [Training language models to follow instructions with human feedback, OpenAI, 2022.03]
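
A minimal sketch of InstructGPT's pairwise reward-model objective, -log sigmoid(r(x, y_w) - r(x, y_l)) averaged over labeled comparison pairs; the reward values below are dummy numbers for illustration:

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    """InstructGPT reward-model loss: -log sigmoid(r(x, y_w) - r(x, y_l)),
    averaged over the batch of labeled comparison pairs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Scalar rewards the model assigned to 4 (chosen, rejected) completion pairs
loss = rm_pairwise_loss(torch.tensor([1.2, 0.3, 0.8, 2.0]),
                        torch.tensor([0.5, 0.1, 1.0, 0.4]))
print(loss.item())
```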

GPT3: [Language Models are Few-Shot Learners, OpenAI, 2020.05]

GPT2

GPT1

other research

https://openai.com/research/techniques-for-training-large-neural-networks
https://openai.com/research/sparse-transformer
https://openai.com/research/measuring-goodharts-law
https://openai.com/research/webgpt
https://openai.com/research

shm007g · May 25, 2023

Prompt Tuning

  • prompt tuning
  • prefix tuning
  • p-tuning
  • p-tuning-v2

[Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021/01, Stanford]

[The Power of Scale for Parameter-Efficient Prompt Tuning, 2021/09, Google]

  • conditioning a frozen model with soft prompts; outperforms GPT-3's few-shot learning with discrete text prompts on downstream tasks; benefits include robustness to domain transfer and efficient "prompt ensembling".
  • model tuning/fine-tuning: all model parameters are tuned; prompt design: a task description and examples given to a frozen big model; soft prompts perform much better than prompt design and reach performance comparable to model tuning as the parameter count grows (a minimal soft-prompt sketch follows this list);
  • other methods: automated prompt design, e.g. searching the discrete space of words; prefix-tuning, which backpropagates errors into prefix tensors/activations;
  • this paper: prompt tuning.
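
As referenced above, a minimal sketch of soft prompt tuning: k learnable prompt vectors are prepended to the input embeddings of a frozen model. It assumes a HuggingFace-style causal LM that accepts inputs_embeds; the wrapper class and names are my own:

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Prepend num_virtual_tokens trainable prompt vectors to a frozen LM."""

    def __init__(self, model: nn.Module, num_virtual_tokens: int = 20):
        super().__init__()
        self.model = model
        for p in self.model.parameters():
            p.requires_grad = False  # only the soft prompt is trained
        dim = model.get_input_embeddings().embedding_dim
        self.soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, dim) * 0.02)

    def forward(self, input_ids, attention_mask, labels=None):
        embeds = self.model.get_input_embeddings()(input_ids)
        batch = embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        embeds = torch.cat([prompt, embeds], dim=1)
        mask = torch.cat(
            [torch.ones(batch, prompt.size(1), device=attention_mask.device,
                        dtype=attention_mask.dtype), attention_mask], dim=1)
        if labels is not None:  # ignore the prompt positions in the loss
            pad = torch.full((batch, prompt.size(1)), -100,
                             dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.model(inputs_embeds=embeds, attention_mask=mask, labels=labels)
```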

[GPT Understands, Too, 2021/03, Tsinghua, Peking, BAAI]

[P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks, 2022/03, Tsinghua, BAAI]

shm007g · May 31, 2023

Google Research

T5, Flan-T5

Pathways, UL2, MoE


  • [LLMs are zero-shot rankers for recommender systems]
  • [Amazon, Text Is All You Need: learning language representations for sequential recommendation]
  • A new alternative to RLHF just dropped! https://twitter.com/rasbt/status/1663883300522295296
    • [Direct Preference Optimization: Your Language Model is Secretly a Reward Model, https://arxiv.org/abs/2305.18290] (a DPO loss sketch follows)
    • https://github.com/eric-mitchell/direct-preference-optimization
    • https://github.com/LAION-AI/Open-Assistant/discussions/3347
  • [Distilling Step-by-Step! Outperforming larger language models with less training data and smaller model sizes]
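
As referenced above, DPO replaces the reward-model + PPO loop of RLHF with a single classification-style loss on preference pairs. A minimal sketch of that loss (Eq. 7 of the paper); the per-sequence log-probabilities are assumed to be precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023):
    -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                         - (log pi(y_l|x) - log pi_ref(y_l|x))])"""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Summed token log-probs for a batch of 2 (chosen, rejected) preference pairs
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -15.5]),
                torch.tensor([-12.5, -15.2]), torch.tensor([-13.8, -15.1]))
print(loss.item())
```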

shm007g · May 31, 2023