Prompt Engineering 101 - Elevating Your Prompts to New Heights

The art of prompting (also called prompt engineering) stands as a gateway to unlocking the true potential of text generation tools like ChatGPT. A well-crafted prompt with just the right instructions can unlock not only answers but also completely new use cases.

Let’s establish a common ground: In the realm of textual data, a prompt is a user input, most often in natural language, functioning as the instruction that guides a large language model (LLM) to generate something or take a specific action. Here are four general tips to optimize the performance of these models for your needs:

  1. Understand Limitations: Familiarize yourself with ChatGPT’s operational constraints to tailor prompts for optimal performance.

    1. Keep it Short: While the processing capabilities of ChatGPT and its counterparts have soared to new heights (with GPT-4 Turbo handling up to 128k tokens, equivalent to about 250 pages), it’s essential to keep prompts concise. Very long texts, despite expanded capacities, may not yield optimal results (Liu et al., 2023).
    2. Stay Away from Current Events: Note that ChatGPT is trained with a fixed knowledge cutoff date. While some adaptations can pull in additional tools or internet sources through Retrieval-Augmented Generation (RAG, see Lewis et al. (2020)), answers about up-to-date information are usually not reliable.
    3. Mind Tokenization: Tasks like counting letters or words and some math problems will not work reliably, either because of tokenization or because of model restrictions.
  2. Instruct: Craft your prompt as a clear instruction, and separate the instruction from the information/example(s) with a delimiter such as ### or """. You can set the desired tone, specify the structure of the output, be it a table or bullet points, or leave blanks for the model to expand on (see the sketch after this list).

  3. Refine Iteratively: You have two options to explore: a) refine the initial prompt based on the model’s responses, or b) supplement instructions through additional chat prompts. Since b) uses both the model’s output and your inputs as memory/context, the results can differ considerably.

  4. Check Results: Before you delve deeper, always check the results. This step ensures that the generated output aligns with your expectations, and regularly assessing outcomes allows for quick adjustments and improvements. LLMs are prone to confabulations (also known as hallucinations), meaning they can come up with imaginary facts, links, papers, etc. There are two simple prompt additions that can reduce the number of these confabulations:

    1. “I don’t know”: Enhance output reliability by prompting ChatGPT to respond with “I don’t know” when uncertain. Note that this approach isn’t foolproof, as it doesn’t eliminate all hallucinations. Additionally, future ChatGPT updates may impact results.
    2. “What did I ask you to do?”: By asking ChatGPT to reproduce the original question, you can follow its train of thought and adapt if something was missed.
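
To make tips 2 and 4 concrete, here is a minimal sketch in Python using the official `openai` client (v1-style API; the model name and exact call may differ for your setup). The `llm` helper defined here is reused in the sketches further below.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm(prompt: str, temperature: float = 0.0) -> str:
    """Send a single-turn prompt and return the model's text response."""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # assumption: adjust to the model you use
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

article = "..."  # the text you want processed

# Instruction first, "I don't know" safeguard, then the ###-delimited input.
prompt = f"""Summarize the text below in three bullet points.
If the text does not contain enough information, answer "I don't know".

###
{article}
###"""

print(llm(prompt))
```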

Reasoning Prompts

This category encompasses prompting techniques designed to enhance ChatGPT’s reasoning abilities. The primary strategies involve breaking down complex tasks into subtasks and offering reasoned instructions. These approaches provide ChatGPT with additional time and guidance, aiding it in deriving more accurate results.

“Few-shot”

Few-shot learning (Brown et al., 2020) is a paradigm where a model can perform a task given only minimal examples, or “shots”, of data. Unlike traditional approaches that require large datasets, few-shot learning enables a model to generalize and make accurate predictions based on a small set of examples, often just a handful. Rather than stipulating the desired outcome, you provide explicit and representative examples that demonstrate the proper approach to solving the task. See my blog post for an example.
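
As a minimal sketch (task and examples invented for illustration, using the `llm` helper from above): a few-shot prompt simply prepends a handful of solved examples to the new input.

```python
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery died after two days."
Sentiment: negative

Review: "Crisp display and great sound."
Sentiment: positive

Review: "Setup took forever and the manual is useless."
Sentiment:"""

print(llm(few_shot_prompt))  # expected completion: negative
```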

“Chain-of-thought”

Chain-of-thought (CoT) prompting (Wei et al., 2022) enhances few-shot learning by incorporating, in each example, the reasoning process that leads to the answer. Unlike standard few-shot prompting, it provides not just questions and answers but also the logical steps for arriving at solutions. However, it’s important to note that chain-of-thought prompting may in some cases perform worse than the traditional few-shot approach. In such cases, employing “Self-consistency” prompting (see below) may offer a remedy.
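
A minimal sketch, reusing the `llm` helper from above; the worked example in the prompt is the classic tennis-ball problem from Wei et al. (2022). The difference from plain few-shot is that the example answer spells out the intermediate reasoning.

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""

print(llm(cot_prompt))  # the model now imitates the step-by-step answer format
```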

“Think Step-by-Step”

Kojima et al. (2022) describe using LLMs as zero-shot reasoners, meaning that the systems need no examples to solve complex problems. Their approach involves instructing ChatGPT to reason through a task sequentially (zero-shot chain of thought) in two steps: prompting ChatGPT to “think step by step” first generates a step-by-step description, which is then used as part of a second prompt to reach the final conclusion. The goal is to minimize the likelihood of missing crucial steps in the reasoning process.
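
A minimal sketch of the two-step procedure, again with the `llm` helper from above. The trigger phrase and the answer-extraction prompt follow Kojima et al. (2022); the question itself is just an illustrative example.

```python
question = ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

# Step 1: elicit the step-by-step reasoning.
reasoning = llm(f"Q: {question}\nA: Let's think step by step.")

# Step 2: feed the reasoning back in and extract the final answer.
answer = llm(
    f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
    "Therefore, the answer (arabic numerals) is"
)
print(answer)  # expected: 4
```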

“Least-to-most”

Least-to-most prompting (Zhou et al., 2022) involves breaking down complex tasks into smaller, more manageable subtasks. The strategy includes two stages (a minimal sketch follows the list):

  1. Decomposition: The prompt initially provides examples demonstrating how to decompose complex problems, followed by the specific question to be decomposed.

  2. Subproblem Solving: This stage’s prompts are created from three components: the initial problem description, a (potentially empty) list of previously answered subquestions and their generated solutions, and the next question to be addressed (which can be the initial question).
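
A minimal sketch of the two stages, reusing `llm` from above. In the paper both stages are driven by few-shot examples, which are omitted here for brevity; the decomposition format (one subquestion per line) is an assumption of this sketch.

```python
question = "..."  # the complex problem to solve

# Stage 1: decomposition into simpler subquestions.
subquestions = llm(
    "Break the following problem into a list of simpler subquestions, "
    f"one per line, ending with the original problem itself.\n\nProblem: {question}"
).splitlines()

# Stage 2: answer the subquestions in order, feeding earlier answers back in.
context = f"Problem: {question}"
answer = ""
for sub in subquestions:
    answer = llm(f"{context}\n\nNext question: {sub}\nAnswer:")
    context += f"\n\nQ: {sub}\nA: {answer}"

print(answer)  # the answer to the last subquestion, i.e. the original problem
```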

“Selection-inference” and “Faithful reasoning”

Creswell et al. (2022) introduced an extension of the chain-of-thought technique that breaks the generation of explanations and answers into smaller, modular parts. In this approach, the first prompt, the ‘selection prompt’, picks a relevant subset of facts from the text. A second prompt, the ‘inference prompt’, then deduces a conclusion from the selected, limited facts. These prompts are alternated in a loop, creating multiple steps of reasoning that lead to a final answer. This method was extended with faithful reasoning (Creswell and Shanahan, 2022), which adds ideas for when to halt the loop and further reduces hallucinations. Applying either technique is not straightforward, as it requires fine-tuning both a selection and an inference language model.
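
A rough sketch of the selection-inference loop with plain prompts and the `llm` helper from above. Note that this is only an approximation: Creswell et al. (2022) fine-tune dedicated selection and inference models rather than prompting a general chat model, and use a learned halting criterion instead of a fixed step count.

```python
facts = "..."     # context passage containing the relevant facts
question = "..."  # the question to answer

known = facts
for _ in range(3):  # fixed number of reasoning steps (the paper learns when to halt)
    # Selection prompt: pick the subset of facts relevant to the question.
    selected = llm(
        f"Facts:\n{known}\n\nQuestion: {question}\n"
        "List only the facts needed for the next reasoning step:"
    )
    # Inference prompt: deduce one new fact from the selected facts alone.
    new_fact = llm(f"Facts:\n{selected}\n\nState one new fact that follows from these:")
    known += "\n" + new_fact  # the inferred fact becomes available to later steps

print(llm(f"Facts:\n{known}\n\nQuestion: {question}\nAnswer:"))
```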

“Maieutic”

Jung et al. (2022) introduced maieutic prompting, which generates a tree of potential explanations, both correct and incorrect, and analyzes their relationships to deduce the correct set. This complex but innovative technique draws on the Socratic method of questioning to elicit ideas and determine logically integral explanations. For most use cases, however, it will be overkill and hard to apply.

Monte-Carlo Prompts

The techniques in this category improve reliability through repeated sampling, similar to a Monte Carlo simulation: the most likely outcome is estimated from many draws. Essentially, the model is called multiple times, and the collected answers are aggregated to derive a result. If applicable, it is recommended to set a higher temperature parameter so that the sampled results have higher variance. However, it’s crucial to be aware that these methods incur a higher cost, as the model is sampled repeatedly.

“Self-consistency”

With self-consistency (Wang et al., 2022), multiple diverse results are generated via chain-of-thought prompting, and the final result is determined by a majority vote or another metric: you pick the answer that occurs most often.
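
A minimal sketch combining the two ideas above (Monte-Carlo sampling at a higher temperature, then a majority vote), reusing `llm` and `cot_prompt` from earlier. The `extract_answer` parser is a hypothetical helper; in practice you would match whatever answer format your prompts elicit.

```python
from collections import Counter
import re

def extract_answer(completion: str) -> str:
    """Hypothetical parser: grab the text after 'The answer is'."""
    match = re.search(r"The answer is\s*([^.\n]+)", completion)
    return match.group(1).strip() if match else completion.strip()

# Sample the same chain-of-thought prompt several times with some randomness...
samples = [llm(cot_prompt, temperature=0.7) for _ in range(10)]

# ...then keep the answer that occurs most often.
answers = [extract_answer(s) for s in samples]
final_answer, votes = Counter(answers).most_common(1)[0]
print(final_answer, f"({votes}/10 votes)")
```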

Tree/Graph of Thought

These methods extend self-consistency by representing the reasoning steps (“thoughts”) as a tree (Tree of Thoughts, Yao et al. (2023)) or a graph (Graph of Thoughts, Besta et al. (2023)) and having the model self-evaluate nodes to guide tree/graph search algorithms. To use these advanced methods you will need some coding experience (see the GitHub repos for ToT and GoT).
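
For a flavor of the idea, here is a toy breadth-first Tree-of-Thoughts sketch built on the `llm` helper from above; the real implementations in the linked repos are considerably more elaborate. The scoring step assumes the model replies with a bare number, which a robust implementation would not rely on.

```python
def tree_of_thoughts(question: str, breadth: int = 3, depth: int = 2, beam: int = 2) -> str:
    """Toy ToT: propose several next thoughts, self-evaluate them, keep the best few."""
    states = [""]  # partial reasoning chains
    for _ in range(depth):
        scored = []
        for state in states:
            for _ in range(breadth):
                thought = llm(
                    f"Question: {question}\nReasoning so far:{state}\nNext step:",
                    temperature=0.7,
                )
                candidate = state + "\n" + thought
                # Self-evaluation: the model rates how promising the chain is.
                rating = llm(
                    f"On a scale of 1-10, rate how promising this reasoning is for "
                    f"solving '{question}'. Reply with a single number.\n{candidate}"
                )
                try:
                    score = float(rating.strip())
                except ValueError:
                    score = 0.0  # unparseable rating: treat as unpromising
                scored.append((score, candidate))
        # Beam search: keep only the highest-rated partial chains.
        states = [c for _, c in sorted(scored, key=lambda x: x[0], reverse=True)[:beam]]
    return llm(f"Question: {question}\nReasoning:{states[0]}\nFinal answer:")
```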

Outlook

Only a short while ago, we discussed the emerging role of the prompt engineer, responsible for crafting prompts for LLMs. However, recent advances such as Auto-CoT (Zhang et al., 2022), the Automatic Prompt Engineer (APE; Zhou et al., 2022), and Optimization by Prompting (OPRO; Yang et al., 2023) show that LLMs themselves excel at prompt engineering. These techniques not only match but often surpass human-level performance, demonstrating the capacity to guide models toward truthfulness and informativeness. This suggests that, in the future, the most effective prompts will be generated by LLMs.


Bibliography

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: solving elaborate problems with large language models. 2023. arXiv:2308.09687.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020. arXiv:2005.14165.

Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. 2022. arXiv:2208.14271.

Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: exploiting large language models for interpretable logical reasoning. 2022. arXiv:2205.09712.

Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: logically consistent reasoning with recursive explanations. 2022. arXiv:2205.11822.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. 2022. arXiv:2205.11916.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. 2020. arXiv:2005.11401.

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: how language models use long contexts. 2023. arXiv:2307.03172.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. 2022. arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. 2022. arXiv:2201.11903.

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. 2023. arXiv:2309.03409.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: deliberate problem solving with large language models. 2023. arXiv:2305.10601.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. 2022. arXiv:2210.03493.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. 2022. arXiv:2205.10625.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. 2022. arXiv:2211.01910.