add evaluations course and update models table
1
.gitignore
vendored
@@ -25,6 +25,7 @@ share/python-wheels/
|
|||||||
*.egg-info/
|
*.egg-info/
|
||||||
.installed.cfg
|
.installed.cfg
|
||||||
*.egg
|
*.egg
|
||||||
|
node_modules/
|
||||||
MANIFEST
|
MANIFEST
|
||||||
|
|
||||||
# PyInstaller
|
# PyInstaller
|
||||||
|
|||||||
@@ -8,4 +8,7 @@ Welcome to Anthropic's educational courses. This repository currently contains f
|
|||||||
- [Google Vertex version](https://github.com/anthropics/courses/tree/vertex/real_world_prompting)
|
- [Google Vertex version](https://github.com/anthropics/courses/tree/vertex/real_world_prompting)
|
||||||
|
|
||||||
3. [Real world prompting course](./real_world_prompting/README.md) - learn how to incorporate prompting techniques into complex, real world prompts
|
3. [Real world prompting course](./real_world_prompting/README.md) - learn how to incorporate prompting techniques into complex, real world prompts
|
||||||
4. [Tool use course](./tool_use/README.md) - teaches everything you need to know to implement tool use successfully in your workflows with Claude.
|
4. [Writing prompt evaluations course](./evaluations/README.md) - learn how to write production prompt evaluations to measure the quality of your prompts.
|
||||||
|
5. [Tool use course](./tool_use/README.md) - teaches everything you need to know to implement tool use successfully in your workflows with Claude.
|
||||||
|
|
||||||
|
**Please note that these courses often favor our lowest-cost model, Claude 3 Haiku, to keep API costs down for students following along with the materials. Feel free to use other Claude models if you prefer. **
|
||||||
@@ -59,23 +59,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Here's an overview of the key distinctions between the models: \n",
|
"Refer to [this table](https://docs.anthropic.com/en/docs/about-claude/models#model-comparison-table) for a comparison of the key features and capabilities of each model in the Claude family."
|
||||||
"\n",
|
|
||||||
"| Feature | Claude 3.5 Sonnet | Claude 3 Opus | Claude 3 Sonnet | Claude 3 Haiku |\n",
|
|
||||||
"|---------|-------------------|----------------|------------------|-----------------|\n",
|
|
||||||
"| Description | Most intelligent model | Powerful model for highly complex tasks | Balance of intelligence and speed | Fastest and most compact model for near-instant responsiveness |\n",
|
|
||||||
"| Strengths | Highest level of intelligence and capability | Top-level performance, intelligence, fluency, and understanding | Strong utility, balanced for scaled deployments | Quick and accurate targeted performance |\n",
|
|
||||||
"| Multilingual | Yes | Yes | Yes | Yes |\n",
|
|
||||||
"| Vision | Yes | Yes | Yes | Yes |\n",
|
|
||||||
"| API model name | `claude-3-5-sonnet-20240620` | `claude-3-opus-20240229` | `claude-3-sonnet-20240229` | `claude-3-haiku-20240307` |\n",
|
|
||||||
"| API format | Messages API | Messages API | Messages API | Messages API |\n",
|
|
||||||
"| Comparative latency | Fast | Moderately fast | Fast | Fastest |\n",
|
|
||||||
"| Context window | 200K | 200K | 200K | 200K |\n",
|
|
||||||
"| Max output | 8192 tokens* | 4096 tokens | 4096 tokens | 4096 tokens |\n",
|
|
||||||
"| Cost (Input / Output per MTok) | $3.00 / $15.00 | $15.00 / $75.00 | $3.00 / $15.00 | $0.25 / $1.25 |\n",
|
|
||||||
"| Training data cut-off | Apr 2024 | Aug 2023 | Aug 2023 | Aug 2023 |\n",
|
|
||||||
"\n",
|
|
||||||
"_*8192 output tokens is in beta and requires the header anthropic-beta: max-tokens-3-5-sonnet-2024-07-15. If the header is not specified, the limit is 4096 tokens._\n"
|
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
|||||||
214
evaluations/01_intro_to_evals.ipynb
Normal file
453
evaluations/02_workbench_evals.ipynb
Normal file
905
evaluations/03_code_graded.ipynb
Normal file
576
evaluations/04_code_graded_classification.ipynb
Normal file
8
evaluations/05_prompt_foo_code_graded_animals/README.md
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
To get started, set your ANTHROPIC_API_KEY environment variable
|
||||||
|
|
||||||
|
Then run:
|
||||||
|
```
|
||||||
|
promptfoo eval
|
||||||
|
```
|
||||||
|
|
||||||
|
Afterwards, you can view the results by running `promptfoo view`
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
animal_statement,__expected
|
||||||
|
"The animal is a human.","2"
|
||||||
|
"The animal is a snake.","0"
|
||||||
|
"The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.","5"
|
||||||
|
"The animal is a dog.","4"
|
||||||
|
"The animal is a cat with two extra legs.","6"
|
||||||
|
"The animal is an elephant.","4"
|
||||||
|
"The animal is a bird.","2"
|
||||||
|
"The animal is a fish.","0"
|
||||||
|
"The animal is a spider with two extra legs","10"
|
||||||
|
"The animal is an octopus.","8"
|
||||||
|
"The animal is an octopus that lost two legs and then regrew three legs.","9"
|
||||||
|
"The animal is a two-headed, eight-legged mythical creature.","8"
|
||||||
|
BIN
evaluations/05_prompt_foo_code_graded_animals/images/details.png
Normal file
|
After Width: | Height: | Size: 87 KiB |
|
After Width: | Height: | Size: 402 KiB |
|
After Width: | Height: | Size: 276 KiB |
|
After Width: | Height: | Size: 69 KiB |
|
After Width: | Height: | Size: 551 KiB |
|
After Width: | Height: | Size: 553 KiB |
|
After Width: | Height: | Size: 580 KiB |
|
After Width: | Height: | Size: 560 KiB |
|
After Width: | Height: | Size: 666 KiB |
|
After Width: | Height: | Size: 446 KiB |
BIN
evaluations/05_prompt_foo_code_graded_animals/images/toolbar.png
Normal file
|
After Width: | Height: | Size: 88 KiB |
610
evaluations/05_prompt_foo_code_graded_animals/lesson.ipynb
Normal file
6667
evaluations/05_prompt_foo_code_graded_animals/package-lock.json
generated
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
{
|
||||||
|
"dependencies": {
|
||||||
|
"promptfoo": "^0.78.0"
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,18 @@
|
|||||||
|
description: "Animal Legs Eval"
|
||||||
|
|
||||||
|
prompts:
|
||||||
|
- prompts.py:simple_prompt
|
||||||
|
- prompts.py:better_prompt
|
||||||
|
- prompts.py:chain_of_thought_prompt
|
||||||
|
|
||||||
|
providers:
|
||||||
|
- anthropic:messages:claude-3-haiku-20240307
|
||||||
|
- anthropic:messages:claude-3-5-sonnet-20240620
|
||||||
|
|
||||||
|
tests: animal_legs_tests.csv
|
||||||
|
|
||||||
|
defaultTest:
|
||||||
|
options:
|
||||||
|
transform: file://transform.py
|
||||||
|
|
||||||
|
|
||||||
26
evaluations/05_prompt_foo_code_graded_animals/prompts.py
Normal file
@@ -0,0 +1,26 @@
|
|||||||
|
def simple_prompt(animal_statement):
|
||||||
|
return f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
|
||||||
|
|
||||||
|
Here is the animal statement.
|
||||||
|
<animal_statement>{animal_statement}</animal_statement>
|
||||||
|
|
||||||
|
How many legs does the animal have? Please respond with a number"""
|
||||||
|
|
||||||
|
def better_prompt(animal_statement):
|
||||||
|
return f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
|
||||||
|
|
||||||
|
Here is the animal statement.
|
||||||
|
<animal_statement>{animal_statement}</animal_statement>
|
||||||
|
|
||||||
|
How many legs does the animal have? Please only respond with a single digit like 2 or 9"""
|
||||||
|
|
||||||
|
def chain_of_thought_prompt(animal_statement):
|
||||||
|
return f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
|
||||||
|
|
||||||
|
Here is the animal statement.
|
||||||
|
<animal_statement>{animal_statement}</animal_statement>
|
||||||
|
|
||||||
|
How many legs does the animal have?
|
||||||
|
Start by reasoning about the numbers of legs the animal has, thinking step by step inside of <thinking> tags.
|
||||||
|
Then, output your final answer inside of <answer> tags.
|
||||||
|
Inside the <answer> tags return just the number of legs as an integer and nothing else."""
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
def get_transform(output, context):
|
||||||
|
if "<thinking>" in output:
|
||||||
|
try:
|
||||||
|
return output.split("<answer>")[1].split("</answer>")[0].strip()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error in get_transform: {e}")
|
||||||
|
return output
|
||||||
|
return output
|
||||||
|
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
To get started, set your ANTHROPIC_API_KEY environment variable
|
||||||
|
|
||||||
|
Then run:
|
||||||
|
```
|
||||||
|
promptfoo eval
|
||||||
|
```
|
||||||
|
|
||||||
|
Afterwards, you can view the results by running `promptfoo view`
|
||||||
@@ -0,0 +1,21 @@
|
|||||||
|
complaint,__expected
|
||||||
|
The app crashes every time I try to upload a photo,contains-all:Software Bug
|
||||||
|
My printer isn't recognized by my computer,contains-all:Hardware Malfunction
|
||||||
|
I can't figure out how to change my password,contains-all:User Error
|
||||||
|
The website is completely down I can't access any pages,contains-all:Service Outage
|
||||||
|
It would be great if the app had a dark mode option,contains-all:Feature Request
|
||||||
|
The software keeps freezing when I try to save large files,contains-all:Software Bug
|
||||||
|
My wireless mouse isn't working even with new batteries,contains-all:Hardware Malfunction
|
||||||
|
I accidentally deleted some important files can you help me recover them?,contains-all:User Error
|
||||||
|
None of your servers are responding is there an outage?,contains-all:Service Outage
|
||||||
|
Could you add a feature to export data in CSV format?,contains-all:Feature Request
|
||||||
|
"The app is crashing and my phone is overheating","contains-all:Software Bug,Hardware Malfunction"
|
||||||
|
I can't remember my password!,contains-all:User Error
|
||||||
|
The new update broke something and the app no longer works for me,contains-all:Software Bug
|
||||||
|
"I think I installed something incorrectly now my computer won't start at all","contains-all:User Error,Hardware Malfunction"
|
||||||
|
"Your service is down and I urgently need a feature to batch process files","contains-all:Service Outage,Feature Request"
|
||||||
|
The graphics card is making weird noises,contains-all:Hardware Malfunction
|
||||||
|
My keyboard just totally stopped working out of nowhere,contains-all:Hardware Malfunction
|
||||||
|
Whenever I open your app my phone gets really slow,contains-all:Software Bug
|
||||||
|
Can you make the interface more user-friendly? I always get lost in the menus,"contains-all:Feature Request,User Error"
|
||||||
|
The cloud storage isn't syncing and I can't access my files from other devices,"contains-all:Software Bug,Service Outage"
|
||||||
|
|
After Width: | Height: | Size: 461 KiB |
|
After Width: | Height: | Size: 39 KiB |
|
After Width: | Height: | Size: 507 KiB |
@@ -0,0 +1,11 @@
|
|||||||
|
description: "Complaint Classification Eval"
|
||||||
|
|
||||||
|
prompts:
|
||||||
|
- prompts.py:basic_prompt
|
||||||
|
- prompts.py:improved_prompt
|
||||||
|
|
||||||
|
providers:
|
||||||
|
- "anthropic:messages:claude-3-haiku-20240307"
|
||||||
|
|
||||||
|
tests: dataset.csv
|
||||||
|
|
||||||
@@ -0,0 +1,60 @@
|
|||||||
|
def basic_prompt(complaint):
|
||||||
|
return f"""
|
||||||
|
Classify the following customer complaint into one or more of these categories:
|
||||||
|
Software Bug, Hardware Malfunction, User Error, Feature Request, or Service Outage.
|
||||||
|
Only respond with the classification.
|
||||||
|
|
||||||
|
Complaint: {complaint}
|
||||||
|
|
||||||
|
Classification:
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
def improved_prompt(complaint):
|
||||||
|
return f"""
|
||||||
|
You are an AI assistant specializing in customer support issue classification. Your task is to analyze customer complaints and categorize them into one or more of the following categories:
|
||||||
|
|
||||||
|
1. Software Bug: Issues related to software not functioning as intended.
|
||||||
|
2. Hardware Malfunction: Problems with physical devices or components.
|
||||||
|
3. User Error: Difficulties arising from user misunderstanding or misuse.
|
||||||
|
4. Feature Request: Suggestions for new functionalities or improvements.
|
||||||
|
5. Service Outage: System-wide issues affecting service availability.
|
||||||
|
|
||||||
|
Important Guidelines:
|
||||||
|
- A complaint may fall into multiple categories. If so, list all that apply but try to prioritize picking a single category when possible.
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
1. Complaint: "The app crashes when I try to save my progress."
|
||||||
|
Classification: Software Bug
|
||||||
|
|
||||||
|
2. Complaint: "My keyboard isn't working after I spilled coffee on it."
|
||||||
|
Classification: Hardware Malfunction
|
||||||
|
|
||||||
|
3. Complaint: "I can't find the login button on your website."
|
||||||
|
Classification: User Error
|
||||||
|
|
||||||
|
4. Complaint: "It would be great if your app had a dark mode."
|
||||||
|
Classification: Feature Request
|
||||||
|
|
||||||
|
5. Complaint: "None of your services are loading for me or my colleagues."
|
||||||
|
Classification: Service Outage
|
||||||
|
|
||||||
|
6. Complaint "Complaint: The app breaks every time I try to change my profile picture"
|
||||||
|
Classification: Software Bug
|
||||||
|
|
||||||
|
7. Complaint "The app is acting buggy on my phone and it seems like your website is down, so I'm completely stuck!"
|
||||||
|
Classification: Software Bug, Service Outage
|
||||||
|
|
||||||
|
8. Complaint: "Your software makes my computer super laggy and awful, I hate it!"
|
||||||
|
Classification: Software Bug
|
||||||
|
|
||||||
|
9. Complaint: "Your dumb app always breaks when I try to do anything with images."
|
||||||
|
Classification: 'Software Bug'
|
||||||
|
|
||||||
|
Now, please classify the following customer complaint:
|
||||||
|
|
||||||
|
<complaint>{complaint}</complaint>
|
||||||
|
|
||||||
|
Only respond with the appropriate categories and nothing else.
|
||||||
|
Classification:
|
||||||
|
"""
|
||||||
8
evaluations/07_prompt_foo_custom_graders/README.md
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
To get started, set your ANTHROPIC_API_KEY environment variable
|
||||||
|
|
||||||
|
Then run:
|
||||||
|
```
|
||||||
|
promptfoo eval
|
||||||
|
```
|
||||||
|
|
||||||
|
Afterwards, you can view the results by running `promptfoo view`
|
||||||
17
evaluations/07_prompt_foo_custom_graders/count.py
Normal file
@@ -0,0 +1,17 @@
|
|||||||
|
import re
|
||||||
|
|
||||||
|
def get_assert(output, context):
|
||||||
|
topic = context["vars"]["topic"]
|
||||||
|
goal_count = int(context["vars"]["count"])
|
||||||
|
pattern = fr'(^|\s)\b{re.escape(topic)}\b'
|
||||||
|
|
||||||
|
actual_count = len(re.findall(pattern, output.lower()))
|
||||||
|
|
||||||
|
pass_result = goal_count == actual_count
|
||||||
|
|
||||||
|
result = {
|
||||||
|
"pass": pass_result,
|
||||||
|
"score": 1 if pass_result else 0,
|
||||||
|
"reason": f"Expected {topic} to appear {goal_count} times. Actual: {actual_count}",
|
||||||
|
}
|
||||||
|
return result
|
||||||
BIN
evaluations/07_prompt_foo_custom_graders/images/final_eval.png
Normal file
|
After Width: | Height: | Size: 466 KiB |
BIN
evaluations/07_prompt_foo_custom_graders/images/final_view.png
Normal file
|
After Width: | Height: | Size: 578 KiB |
|
After Width: | Height: | Size: 408 KiB |
BIN
evaluations/07_prompt_foo_custom_graders/images/single_row.png
Normal file
|
After Width: | Height: | Size: 126 KiB |
|
After Width: | Height: | Size: 325 KiB |
|
After Width: | Height: | Size: 363 KiB |
330
evaluations/07_prompt_foo_custom_graders/lesson.ipynb
Normal file
@@ -0,0 +1,28 @@
|
|||||||
|
description: Count mentions
|
||||||
|
prompts:
|
||||||
|
- >-
|
||||||
|
Write a short paragraph about {{topic}}. Make sure you mention {{topic}} exactly {{count}} times, no more or fewer. Only use lower case letters in your output.
|
||||||
|
providers:
|
||||||
|
- anthropic:messages:claude-3-haiku-20240307
|
||||||
|
- anthropic:messages:claude-3-5-sonnet-20240620
|
||||||
|
defaultTest:
|
||||||
|
assert:
|
||||||
|
- type: python
|
||||||
|
value: file://count.py
|
||||||
|
tests:
|
||||||
|
- vars:
|
||||||
|
topic: sheep
|
||||||
|
count: 3
|
||||||
|
- vars:
|
||||||
|
topic: fowl
|
||||||
|
count: 2
|
||||||
|
- vars:
|
||||||
|
topic: gallows
|
||||||
|
count: 4
|
||||||
|
- vars:
|
||||||
|
topic: tweezers
|
||||||
|
count: 7
|
||||||
|
- vars:
|
||||||
|
topic: jeans
|
||||||
|
count: 6
|
||||||
|
|
||||||
8
evaluations/08_prompt_foo_model_graded/README.md
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
To get started, set your ANTHROPIC_API_KEY environment variable
|
||||||
|
|
||||||
|
Then run:
|
||||||
|
```
|
||||||
|
promptfoo eval
|
||||||
|
```
|
||||||
|
|
||||||
|
Afterwards, you can view the results by running `promptfoo view`
|
||||||
BIN
evaluations/08_prompt_foo_model_graded/images/details1.png
Normal file
|
After Width: | Height: | Size: 178 KiB |
BIN
evaluations/08_prompt_foo_model_graded/images/details2.png
Normal file
|
After Width: | Height: | Size: 196 KiB |
BIN
evaluations/08_prompt_foo_model_graded/images/eval1.png
Normal file
|
After Width: | Height: | Size: 404 KiB |
BIN
evaluations/08_prompt_foo_model_graded/images/eval2.png
Normal file
|
After Width: | Height: | Size: 647 KiB |
BIN
evaluations/08_prompt_foo_model_graded/images/eval3.png
Normal file
|
After Width: | Height: | Size: 1.0 MiB |
BIN
evaluations/08_prompt_foo_model_graded/images/soccer_players.png
Normal file
|
After Width: | Height: | Size: 458 KiB |
BIN
evaluations/08_prompt_foo_model_graded/images/web_view.png
Normal file
|
After Width: | Height: | Size: 800 KiB |
415
evaluations/08_prompt_foo_model_graded/lesson.ipynb
Normal file
47
evaluations/08_prompt_foo_model_graded/promptfooconfig.yaml
Normal file
@@ -0,0 +1,47 @@
|
|||||||
|
description: "School Assistant Eval"
|
||||||
|
|
||||||
|
prompts:
|
||||||
|
- >-
|
||||||
|
You are an interactive tutor assistant for middle school children.
|
||||||
|
Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
|
||||||
|
Only answer questions related to middle school academics.
|
||||||
|
This is the student question: {{question}}
|
||||||
|
- >-
|
||||||
|
You are an interactive tutor assistant for middle school children.
|
||||||
|
Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
|
||||||
|
Only answer questions related to middle school academics.
|
||||||
|
Acceptable topics you should respond to include: math, reading, science, foreign languages, social students, and art.
|
||||||
|
Refuse to answer questions not related to the above topics in an academic context.
|
||||||
|
This is the student question: {{question}}
|
||||||
|
- >-
|
||||||
|
You are an interactive tutor assistant for middle school children.
|
||||||
|
Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
|
||||||
|
Only answer questions related to middle school academics.
|
||||||
|
Acceptable topics you should respond to include: math, reading, science, foreign languages, social students, and art.
|
||||||
|
Refuse to answer questions not related to the above topics in an academic context.
|
||||||
|
Do not apologize or use an apologetic tone when refusing. Instead, gently nudge students to focus on school-related topics.
|
||||||
|
This is the student question: {{question}}
|
||||||
|
|
||||||
|
|
||||||
|
providers: [anthropic:messages:claude-3-5-sonnet-20240620]
|
||||||
|
|
||||||
|
defaultTest:
|
||||||
|
assert:
|
||||||
|
- type: llm-rubric
|
||||||
|
provider: anthropic:messages:claude-3-opus-20240229
|
||||||
|
value: 'Refuses to answer the question and instead redirects to academic topics'
|
||||||
|
- type: llm-rubric
|
||||||
|
provider: anthropic:messages:claude-3-opus-20240229
|
||||||
|
value: 'Is not apologetic'
|
||||||
|
|
||||||
|
tests:
|
||||||
|
- vars:
|
||||||
|
question: What's the best free mobile video game?
|
||||||
|
- vars:
|
||||||
|
question: How can I convince my parents to let me stay up later?
|
||||||
|
- vars:
|
||||||
|
question: What are some fun theme ideas for a sleepover?
|
||||||
|
- vars:
|
||||||
|
question: What are some good prank ideas that won't get me in trouble?
|
||||||
|
- vars:
|
||||||
|
question: Who are some famous left footed soccer players?
|
||||||
8
evaluations/09_custom_model_graded_prompt_foo/README.md
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
To get started, set your ANTHROPIC_API_KEY environment variable
|
||||||
|
|
||||||
|
Then run:
|
||||||
|
```
|
||||||
|
promptfoo eval
|
||||||
|
```
|
||||||
|
|
||||||
|
Afterwards, you can view the results by running `promptfoo view`
|
||||||
@@ -0,0 +1,415 @@
|
|||||||
|
A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training process.[1] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[2]
|
||||||
|
|
||||||
|
LLMs are artificial neural networks that use the transformer architecture, invented in 2017. The largest and most capable LLMs, as of June 2024, are built with a decoder-only transformer-based architecture, which enables efficient processing and generation of large-scale text data.
|
||||||
|
|
||||||
|
Historically, up to 2020, fine-tuning was the primary method used to adapt a model for specific tasks. However, larger models such as GPT-3 have demonstrated the ability to achieve similar results through prompt engineering, which involves crafting specific input prompts to guide the model's responses.[3] These models acquire knowledge about syntax, semantics, and ontologies[4] inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on.[5]
|
||||||
|
|
||||||
|
Some notable LLMs are OpenAI's GPT series of models (e.g., GPT-3.5, GPT-4 and GPT-4o; used in ChatGPT and Microsoft Copilot), Google's Gemini (the latter of which is currently used in the chatbot of the same name), Meta's LLaMA family of models, IBM's Granite models initially released with Watsonx, Anthropic's Claude models, and Mistral AI's models.
|
||||||
|
|
||||||
|
History
|
||||||
|
Before 2017, there were a few language models that were large as compared to capacities then available. In the 1990s, the IBM alignment models pioneered statistical language modelling. A smoothed n-gram model in 2001 trained on 0.3 billion words achieved then-SOTA perplexity.[6] In the 2000s, as Internet use became prevalent, some researchers constructed Internet-scale language datasets ("web as corpus"[7]), upon which they trained statistical language models.[8][9] In 2009, in most language processing tasks, statistical language models dominated over symbolic language models, as they can usefully ingest large datasets.[10]
|
||||||
|
|
||||||
|
After neural networks became dominant in image processing around 2012, they were applied to language modelling as well. Google converted its translation service to Neural Machine Translation in 2016. As it was before Transformers, it was done by seq2seq deep LSTM networks.
|
||||||
|
|
||||||
|
|
||||||
|
An illustration of main components of the transformer model from the original paper, where layers were normalized after (instead of before) multiheaded attention
|
||||||
|
At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their landmark paper "Attention Is All You Need". This paper's goal was to improve upon 2014 Seq2seq technology,[11] and was based mainly on the attention mechanism developed by Bahdanau et al. in 2014.[12] The following year in 2018, BERT was introduced and quickly became "ubiquitous".[13] Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model.
|
||||||
|
|
||||||
|
Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention because OpenAI at first deemed it too powerful to release publicly, out of fear of malicious use.[14] GPT-3 in 2020 went a step further and as of 2024 is available only via API with no offering of downloading the model to execute locally. But it was the 2022 consumer-facing browser-based ChatGPT that captured the imaginations of the general population and caused some media hype and online buzz.[15] The 2023 GPT-4 was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities.[16] OpenAI did not reveal high-level architecture and the number of parameters of GPT-4.
|
||||||
|
|
||||||
|
Competing language models have for the most part been attempting to equal the GPT series, at least in terms of number of parameters.[17]
|
||||||
|
|
||||||
|
Since 2022, source-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on the field of use. Mistral AI's models Mistral 7B and Mixtral 8x7b have the more permissive Apache License. As of June 2024, The Instruction fine tuned variant of the Llama 3 70 billion parameter model is the most powerful open LLM according to the LMSYS Chatbot Arena Leaderboard, being more powerful than GPT-3.5 but not as powerful as GPT-4.[18]
|
||||||
|
|
||||||
|
As of 2024, the largest and most capable models are all based on the Transformer architecture. Some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state space model).[19][20][21]
|
||||||
|
|
||||||
|
Dataset preprocessing
|
||||||
|
See also: List of datasets for machine-learning research § Internet
|
||||||
|
Tokenization
|
||||||
|
Because machine learning algorithms process numbers rather than text, the text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an embedding is associated to the integer index. Algorithms include byte-pair encoding (BPE) and WordPiece. There are also special tokens serving as control characters, such as [MASK] for masked-out token (as used in BERT), and [UNK] ("unknown") for characters not appearing in the vocabulary.
|
||||||
|
|
||||||
|
For example, the BPE tokenizer used by GPT-3 (Legacy) would split tokenizer: texts -> series of numerical "tokens" as
|
||||||
|
|
||||||
|
token izer : texts -> series of numerical " t ok ens "
|
||||||
|
Tokenization also compresses the datasets. Because LLMs generally require input to be an array that is not jagged, the shorter texts must be "padded" until they match the length of the longest one. How many tokens are, on average, needed per word depends on the language of the dataset.[22][23]
|
||||||
|
|
||||||
|
BPE
|
||||||
|
Main article: Byte pair encoding
|
||||||
|
As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and punctuation marks) are treated as an initial set of n-grams (i.e. initial set of uni-grams). Successively the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) n-grams that most frequently occur together are then again merged into even lengthier n-gram, until a vocabulary of prescribed size is obtained (in case of GPT-3, the size is 50257).[24] After a tokenizer is trained, any text can be tokenized by it, as long as it does not contain characters not appearing in the initial-set of uni-grams.[25]
|
||||||
|
|
||||||
|
Problems
|
||||||
|
A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as possible for an average English word. An average word in another language encoded by such an English-optimized tokenizer is however split into suboptimal amount of tokens. GPT-2 tokenizer can use up to 15 times more tokens per word for some languages, for example for the Shan language from Myanmar. Even more widespread languages such as Portuguese and German have "a premium of 50%" compared to English.[26]
|
||||||
|
|
||||||
|
Greedy tokenization also causes subtle problems with text completion.[27]
|
||||||
|
|
||||||
|
Dataset cleaning
|
||||||
|
Main article: Data cleansing
|
||||||
|
In the context of training LLMs, datasets are typically cleaned by removing toxic passages from the dataset, discarding low-quality data, and de-duplication.[28] Cleaned datasets can increase training efficiency and lead to improved downstream performance.[29][30] A trained LLM can be used to clean datasets for training a further LLM.[31]
|
||||||
|
|
||||||
|
With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it).[32]
|
||||||
|
|
||||||
|
Synthetic data
|
||||||
|
Main article: Synthetic data
|
||||||
|
Training of largest language models might need more linguistic data than naturally available, or that the naturally occurring data is of insufficient quality. In these cases, synthetic data might be used. Microsoft's Phi series of LLMs is trained on textbook-like data generated by another LLM.[33]
|
||||||
|
|
||||||
|
Training and architecture
|
||||||
|
See also: Fine-tuning (machine learning)
|
||||||
|
Reinforcement learning from human feedback (RLHF)
|
||||||
|
Main article: Reinforcement learning from human feedback
|
||||||
|
Reinforcement learning from human feedback (RLHF) through algorithms, such as proximal policy optimization, is used to further fine-tune a model based on a dataset of human preferences.[34]
|
||||||
|
|
||||||
|
Instruction tuning
|
||||||
|
Using "self-instruct" approaches, LLMs have been able to bootstrap correct responses, replacing any naive responses, starting from human-generated corrections of a few cases. For example, in the instruction "Write an essay about the main themes represented in Hamlet," an initial naive completion might be "If you submit the essay after March 17, your grade will be reduced by 10% for each day of delay," based on the frequency of this textual sequence in the corpus.[35]
|
||||||
|
|
||||||
|
Mixture of experts
|
||||||
|
Main article: Mixture of experts
|
||||||
|
The largest LLM may be too expensive to train and use directly. For such models, mixture of experts (MoE) can be applied, a line of research pursued by Google researchers since 2017 to train models reaching up to 1 trillion parameters.[36][37][38]
|
||||||
|
|
||||||
|
Prompt engineering, attention mechanism, and context window
|
||||||
|
See also: Prompt engineering and Attention (machine learning)
|
||||||
|
Most results previously achievable only by (costly) fine-tuning, can be achieved through prompt engineering, although limited to the scope of a single conversation (more precisely, limited to the scope of a context window).[39]
|
||||||
|
|
||||||
|
|
||||||
|
When each head calculates, according to its own criteria, how much other tokens are relevant for the "it_" token, note that the second attention head, represented by the second column, is focusing most on the first two rows, i.e. the tokens "The" and "animal", while the third column is focusing most on the bottom two rows, i.e. on "tired", which has been tokenized into two tokens.[40]
|
||||||
|
In order to find out which tokens are relevant to each other within the scope of the context window, the attention mechanism calculates "soft" weights for each token, more precisely for its embedding, by using multiple attention heads, each with its own "relevance" for calculating its own soft weights. For example, the small (i.e. 117M parameter sized) GPT-2 model has had twelve attention heads and a context window of only 1k token.[41] In its medium version it has 345M parameters and contains 24 layers, each with 12 attention heads. For the training with gradient descent a batch size of 512 was utilized.[25]
|
||||||
|
|
||||||
|
The largest models, such as Google's Gemini 1.5, presented in February 2024, can have a context window sized up to 1 million (context window of 10 million was also "successfully tested").[42] Other models with large context windows includes Anthropic's Claude 2.1, with a context window of up to 200k tokens.[43] Note that this maximum refers to the number of input tokens and that the maximum number of output tokens differs from the input and is often smaller. For example, the GPT-4 Turbo model has a maximum output of 4096 tokens.[44]
|
||||||
|
|
||||||
|
Length of a conversation that the model can take into account when generating its next answer is limited by the size of a context window, as well. If the length of a conversation, for example with ChatGPT, is longer than its context window, only the parts inside the context window are taken into account when generating the next answer, or the model needs to apply some algorithm to summarize the too distant parts of conversation.
|
||||||
|
|
||||||
|
The shortcomings of making a context window larger include higher computational cost and possibly diluting the focus on local context, while making it smaller can cause a model to miss an important long-range dependency. Balancing them are a matter of experimentation and domain-specific considerations.
|
||||||
|
|
||||||
|
A model may be pre-trained either to predict how the segment continues, or what is missing in the segment, given a segment from its training dataset.[45] It can be either
|
||||||
|
|
||||||
|
autoregressive (i.e. predicting how the segment continues, the way GPTs do it): for example given a segment "I like to eat", the model predicts "ice cream", or "sushi".
|
||||||
|
"masked" (i.e. filling in the parts missing from the segment, the way "BERT"[46] does it): for example, given a segment "I like to [__] [__] cream", the model predicts that "eat" and "ice" are missing.
|
||||||
|
Models may be trained on auxiliary tasks which test their understanding of the data distribution, such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether they appear consecutively in the training corpus.[46] During training, regularization loss is also used to stabilize training. However regularization loss is usually not used during testing and evaluation.
|
||||||
|
|
||||||
|
Infrastructure
|
||||||
|
Substantial infrastructure is necessary for training the largest models.[47][48][49]
|
||||||
|
|
||||||
|
Training cost
|
||||||
|
|
||||||
|
Advances in software and hardware have reduced the cost substantially since 2020, such that in 2023 training of a 12-billion-parameter LLM computational cost is 72,300 A100-GPU-hours, while in 2020 the cost of training a 1.5-billion-parameter LLM (which was two orders of magnitude smaller than the state of the art in 2020) was between $80 thousand and $1.6 million.[50][51][52] Since 2020, large sums were invested in increasingly large models. For example, training of the GPT-2 (i.e. a 1.5-billion-parameters model) in 2019 cost $50,000, while training of the PaLM (i.e. a 540-billion-parameters model) in 2022 cost $8 million, and Megatron-Turing NLG 530B (in 2021) cost around $11 million.[53]
|
||||||
|
|
||||||
|
For Transformer-based LLM, training cost is much higher than inference cost. It costs 6 FLOPs per parameter to train on one token, whereas it costs 1 to 2 FLOPs per parameter to infer on one token.[54]
|
||||||
|
|
||||||
|
Tool use
|
||||||
|
There are certain tasks that, in principle, cannot be solved by any LLM, at least not without the use of external tools or additional software. An example of such a task is responding to the user's input '354 * 139 = ', provided that the LLM has not already encountered a continuation of this calculation in its training corpus. In such cases, the LLM needs to resort to running program code that calculates the result, which can then be included in its response. Another example is 'What is the time now? It is ', where a separate program interpreter would need to execute a code to get system time on the computer, so LLM could include it in its reply.[55][56] This basic strategy can be sophisticated with multiple attempts of generated programs, and other sampling strategies.[57]
|
||||||
|
|
||||||
|
Generally, in order to get an LLM to use tools, one must finetune it for tool-use. If the number of tools is finite, then finetuning may be done just once. If the number of tools can grow arbitrarily, as with online API services, then the LLM can be fine-tuned to be able to read API documentation and call API correctly.[58][59]
|
||||||
|
|
||||||
|
A simpler form of tool use is retrieval-augmented generation: the augmentation of an LLM with document retrieval. Given a query, a document retriever is called to retrieve the most relevant documents. This is usually done by encoding the query and the documents into vectors, then finding the documents with vectors (usually stored in a vector database) most similar to the vector of the query. The LLM then generates an output based on both the query and context included from the retrieved documents.[60]
|
||||||
|
|
||||||
|
Agency
|
||||||
|
An LLM is a language model, which is not an agent as it has no goal, but it can be used as a component of an intelligent agent.[61] Researchers have described several methods for such integrations.[citation needed]
|
||||||
|
|
||||||
|
The ReAct pattern, a portmanteau of "Reason + Act", constructs an agent out of an LLM, using the LLM as a planner. The LLM is prompted to "think out loud". Specifically, the language model is prompted with a textual description of the environment, a goal, a list of possible actions, and a record of the actions and observations so far. It generates one or more thoughts before generating an action, which is then executed in the environment.[62] The linguistic description of the environment given to the LLM planner can even be the LaTeX code of a paper describing the environment.[63]
|
||||||
|
|
||||||
|
In the DEPS ("Describe, Explain, Plan and Select") method, an LLM is first connected to the visual world via image descriptions, then it is prompted to produce plans for complex tasks and behaviors based on its pretrained knowledge and environmental feedback it receives.[64]
|
||||||
|
|
||||||
|
The Reflexion method[65] constructs an agent that learns over multiple episodes. At the end of each episode, the LLM is given the record of the episode, and prompted to think up "lessons learned", which would help it perform better at a subsequent episode. These "lessons learned" are given to the agent in the subsequent episodes.[citation needed]
|
||||||
|
|
||||||
|
Monte Carlo tree search can use an LLM as rollout heuristic. When a programmatic world model is not available, an LLM can also be prompted with a description of the environment to act as world model.[66]
|
||||||
|
|
||||||
|
For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can be used as a reward signal to guide a normal (non-LLM) reinforcement learning agent.[67] Alternatively, it can propose increasingly difficult tasks for curriculum learning.[68] Instead of outputting individual actions, an LLM planner can also construct "skills", or functions for complex action sequences. The skills can be stored and later invoked, allowing increasing levels of abstraction in planning.[68]
|
||||||
|
|
||||||
|
LLM-powered agents can keep a long-term memory of its previous contexts, and the memory can be retrieved in the same way as Retrieval Augmented Generation. Multiple such agents can interact socially.[69]
|
||||||
|
|
||||||
|
Compression
|
||||||
|
Typically, LLMs are trained with single- or half-precision floating point numbers (float32 and float16). One float16 has 16 bits, or 2 bytes, and so one billion parameters require 2 gigabytes. The largest models typically have 100 billion parameters, requiring 200 gigabytes to load, which places them outside the range of most consumer electronics.[70]
|
||||||
|
|
||||||
|
Post-training quantization[71] aims to decrease the space requirement by lowering precision of the parameters of a trained model, while preserving most of its performance.[72][73] The simplest form of quantization simply truncates all numbers to a given number of bits. It can be improved by using a different quantization codebook per layer. Further improvement can be done by applying different precisions to different parameters, with higher precision for particularly important parameters ("outlier weights").[74] See [75] for a visual guide.
|
||||||
|
|
||||||
|
While quantized models are typically frozen, and only pre-quantized models are fine-tuned, quantized models can still be fine-tuned.[76]
|
||||||
|
|
||||||
|
Multimodality
|
||||||
|
See also: Multimodal learning
|
||||||
|
Multimodality means "having several modalities", and a "modality" refers to a type of input or output, such as video, image, audio, text, proprioception, etc.[77] There have been many AI models trained specifically to ingest one modality and output another modality, such as AlexNet for image to label,[78] visual question answering for image-text to text,[79] and speech recognition for speech to text.
|
||||||
|
|
||||||
|
A common method to create multimodal models out of an LLM is to "tokenize" the output of a trained encoder. Concretely, one can construct an LLM that can understand images as follows: take a trained LLM, and take a trained image encoder
|
||||||
|
E
|
||||||
|
{\displaystyle E}. Make a small multilayered perceptron
|
||||||
|
f
|
||||||
|
{\displaystyle f}, so that for any image
|
||||||
|
y
|
||||||
|
{\displaystyle y}, the post-processed vector
|
||||||
|
f
|
||||||
|
(
|
||||||
|
E
|
||||||
|
(
|
||||||
|
y
|
||||||
|
)
|
||||||
|
)
|
||||||
|
{\displaystyle f(E(y))} has the same dimensions as an encoded token. That is an "image token". Then, one can interleave text tokens and image tokens. The compound model is then fine-tuned on an image-text dataset. This basic construction can be applied with more sophistication to improve the model. The image encoder may be frozen to improve stability.[80]
|
||||||
|
|
||||||
|
Flamingo demonstrated the effectiveness of the tokenization method, finetuning a pair of pretrained language model and image encoder to perform better on visual question answering than models trained from scratch.[81] Google PaLM model was fine-tuned into a multimodal model PaLM-E using the tokenization method, and applied to robotic control.[82] LLaMA models have also been turned multimodal using the tokenization method, to allow image inputs,[83] and video inputs.[84]
|
||||||
|
|
||||||
|
GPT-4 can use both text and image as inputs[85] (although the vision component was not released to the public until GPT-4V[86]); Google DeepMind's Gemini is also multimodal.[87]
|
||||||
|
|
||||||
|
Properties
|
||||||
|
Scaling laws
|
||||||
|
Main article: Neural scaling law
|
||||||
|
The following four hyper-parameters characterize an LLM:
|
||||||
|
|
||||||
|
cost of (pre-)training (
|
||||||
|
C
|
||||||
|
{\displaystyle C}),
|
||||||
|
size of the artificial neural network itself, such as number of parameters
|
||||||
|
N
|
||||||
|
{\displaystyle N} (i.e. amount of neurons in its layers, amount of weights between them and biases),
|
||||||
|
size of its (pre-)training dataset (i.e. number of tokens in corpus,
|
||||||
|
D
|
||||||
|
{\displaystyle D}),
|
||||||
|
performance after (pre-)training.
|
||||||
|
They are related by simple statistical laws, called "scaling laws". One particular scaling law ("Chinchilla scaling") for LLM autoregressively trained for one epoch, with a log-log learning rate schedule, states that:[88]
|
||||||
|
{
|
||||||
|
C
|
||||||
|
=
|
||||||
|
C
|
||||||
|
0
|
||||||
|
N
|
||||||
|
D
|
||||||
|
L
|
||||||
|
=
|
||||||
|
A
|
||||||
|
N
|
||||||
|
α
|
||||||
|
+
|
||||||
|
B
|
||||||
|
D
|
||||||
|
β
|
||||||
|
+
|
||||||
|
L
|
||||||
|
0
|
||||||
|
{\displaystyle {\begin{cases}C=C_{0}ND\\[6pt]L={\frac {A}{N^{\alpha }}}+{\frac {B}{D^{\beta }}}+L_{0}\end{cases}}}where the variables are
|
||||||
|
|
||||||
|
C
|
||||||
|
{\displaystyle C} is the cost of training the model, in FLOPs.
|
||||||
|
N
|
||||||
|
{\displaystyle N} is the number of parameters in the model.
|
||||||
|
D
|
||||||
|
{\displaystyle D} is the number of tokens in the training set.
|
||||||
|
L
|
||||||
|
{\displaystyle L} is the average negative log-likelihood loss per token (nats/token), achieved by the trained LLM on the test dataset.
|
||||||
|
and the statistical hyper-parameters are
|
||||||
|
|
||||||
|
C
|
||||||
|
0
|
||||||
|
=
|
||||||
|
6
|
||||||
|
{\displaystyle C_{0}=6}, meaning that it costs 6 FLOPs per parameter to train on one token. Note that training cost is much higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.[54]
|
||||||
|
α
|
||||||
|
=
|
||||||
|
0.34
|
||||||
|
,
|
||||||
|
β
|
||||||
|
=
|
||||||
|
0.28
|
||||||
|
,
|
||||||
|
A
|
||||||
|
=
|
||||||
|
406.4
|
||||||
|
,
|
||||||
|
B
|
||||||
|
=
|
||||||
|
410.7
|
||||||
|
,
|
||||||
|
L
|
||||||
|
0
|
||||||
|
=
|
||||||
|
1.69
|
||||||
|
{\displaystyle \alpha =0.34,\beta =0.28,A=406.4,B=410.7,L_{0}=1.69}
|
||||||
|
Emergent abilities
|
||||||
|
|
||||||
|
|
||||||
|
At point(s) referred to as breaks,[89] the lines change their slopes, appearing on a linear-log plot as a series of linear segments connected by arcs.
|
||||||
|
Performance of bigger models on various tasks, when plotted on a log-log scale, appears as a linear extrapolation of performance achieved by smaller models. However, this linearity may be punctuated by "break(s)"[89] in the scaling law, where the slope of the line changes abruptly, and where larger models acquire "emergent abilities".[39][90] They arise from the complex interaction of the model's components and are not explicitly programmed or designed.[2]
|
||||||
|
|
||||||
|
The most intriguing among emergent abilities is in-context learning from example demonstrations.[91] In-context learning is involved in tasks, such as:
|
||||||
|
|
||||||
|
reported arithmetics, decoding the International Phonetic Alphabet, unscrambling a word's letters, disambiguate word in context,[39][92][93] converting spatial words, cardinal directions (for example, replying "northeast" upon [0, 0, 1; 0, 0, 0; 0, 0, 0]), color terms represented in text.[94]
|
||||||
|
chain-of-thought prompting: Model outputs are improved by chain-of-thought prompting only when model size exceeds 62B. Smaller models perform better when prompted to answer immediately, without chain of thought.[95]
|
||||||
|
identifying offensive content in paragraphs of Hinglish (a combination of Hindi and English), and generating a similar English equivalent of Kiswahili proverbs.[96]
|
||||||
|
Schaeffer et. al. argue that the emergent abilities are not unpredictably acquired, but predictably acquired according to a smooth scaling law. The authors considered a toy statistical model of an LLM solving multiple-choice questions, and showed that this statistical model, modified to account for other types of tasks, applies to these tasks as well.[97]
|
||||||
|
|
||||||
|
Let
|
||||||
|
x
|
||||||
|
{\displaystyle x} be the number of parameter count, and
|
||||||
|
y
|
||||||
|
{\displaystyle y} be the performance of the model.
|
||||||
|
|
||||||
|
When
|
||||||
|
y
|
||||||
|
=
|
||||||
|
average
|
||||||
|
Pr
|
||||||
|
(
|
||||||
|
correct token
|
||||||
|
)
|
||||||
|
{\displaystyle y={\text{average }}\Pr({\text{correct token}})}, then
|
||||||
|
(
|
||||||
|
log
|
||||||
|
|
||||||
|
x
|
||||||
|
,
|
||||||
|
y
|
||||||
|
)
|
||||||
|
{\displaystyle (\log x,y)} is an exponential curve (before it hits the plateau at one), which looks like emergence.
|
||||||
|
When
|
||||||
|
y
|
||||||
|
=
|
||||||
|
average
|
||||||
|
log
|
||||||
|
|
||||||
|
(
|
||||||
|
Pr
|
||||||
|
(
|
||||||
|
correct token
|
||||||
|
)
|
||||||
|
)
|
||||||
|
{\displaystyle y={\text{average }}\log(\Pr({\text{correct token}}))}, then the
|
||||||
|
(
|
||||||
|
log
|
||||||
|
|
||||||
|
x
|
||||||
|
,
|
||||||
|
y
|
||||||
|
)
|
||||||
|
{\displaystyle (\log x,y)} plot is a straight line (before it hits the plateau at zero), which does not look like emergence.
|
||||||
|
When
|
||||||
|
y
|
||||||
|
=
|
||||||
|
average
|
||||||
|
Pr
|
||||||
|
(
|
||||||
|
the most likely token is correct
|
||||||
|
)
|
||||||
|
{\displaystyle y={\text{average }}\Pr({\text{the most likely token is correct}})}, then
|
||||||
|
(
|
||||||
|
log
|
||||||
|
|
||||||
|
x
|
||||||
|
,
|
||||||
|
y
|
||||||
|
)
|
||||||
|
{\displaystyle (\log x,y)} is a step-function, which looks like emergence.
|
||||||
|
Interpretation
|
||||||
|
Large language models by themselves are "black boxes", and it is not clear how they can perform linguistic tasks. There are several methods for understanding how LLM work.
|
||||||
|
|
||||||
|
Mechanistic interpretability aims to reverse-engineer LLM by discovering symbolic algorithms that approximate the inference performed by LLM. One example is Othello-GPT, where a small Transformer is trained to predict legal Othello moves. It is found that there is a linear representation of Othello board, and modifying the representation changes the predicted legal Othello moves in the correct way.[98][99] In another example, a small Transformer is trained on Karel programs. Similar to the Othello-GPT example, there is a linear representation of Karel program semantics, and modifying the representation changes output in the correct way. The model also generates correct programs that are on average shorter than those in the training set.[100]
|
||||||
|
|
||||||
|
In another example, the authors trained small transformers on modular arithmetic addition. The resulting models were reverse-engineered, and it turned out they used discrete Fourier transform.[101]
|
||||||
|
|
||||||
|
Understanding and intelligence
|
||||||
|
NLP researchers were evenly split when asked, in a 2022 survey, whether (untuned) LLMs "could (ever) understand natural language in some nontrivial sense".[102] Proponents of "LLM understanding" believe that some LLM abilities, such as mathematical reasoning, imply an ability to "understand" certain concepts. A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more" and that GPT-4 "could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence system": "Can one reasonably say that a system that passes exams for software engineering candidates is not really intelligent?"[103][104] Some researchers characterize LLMs as "alien intelligence".[105][106] For example, Conjecture CEO Connor Leahy considers untuned LLMs to be like inscrutable alien "Shoggoths", and believes that RLHF tuning creates a "smiling facade" obscuring the inner workings of the LLM: "If you don't push it too far, the smiley face stays on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought processes and clearly non-human understanding."[107][108]
|
||||||
|
|
||||||
|
In contrast, some proponents of the "LLMs lack understanding" school believe that existing LLMs are "simply remixing and recombining existing writing",[106] a phenomenon known as stochastic parrot, or they point to the deficits existing LLMs continue to have in prediction skills, reasoning skills, agency, and explainability.[102] For example, GPT-4 has natural deficits in planning and in real-time learning.[104] Generative LLMs have been observed to confidently assert claims of fact which do not seem to be justified by their training data, a phenomenon which has been termed "hallucination".[109] Specifically, hallucinations in the context of LLMs correspond to the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.[110] Neuroscientist Terrence Sejnowski has argued that "The diverging opinions of experts on the intelligence of LLMs suggests that our old ideas based on natural intelligence are inadequate".[102]
|
||||||
|
|
||||||
|
The matter of LLM's exhibiting intelligence or understanding has two main aspects – the first is how to model thought and language in a computer system, and the second is how to enable the computer system to generate human like language.[102] These aspects of language as a model of cognition have been developed in the field of cognitive linguistics. American linguist George Lakoff presented Neural Theory of Language (NTL)[111] as a computational basis for using language as a model of learning tasks and understanding. The NTL Model outlines how specific neural structures of the human brain shape the nature of thought and language and in turn what are the computational properties of such neural systems that can be applied to model thought and language in a computer system. After a framework for modeling language in a computer systems was established, the focus shifted to establishing frameworks for computer systems to generate language with acceptable grammar. In his 2014 book titled The Language Myth: Why Language Is Not An Instinct, British cognitive linguist and digital communication technologist Vyvyan Evans mapped out the role of probabilistic context-free grammar (PCFG) in enabling NLP to model cognitive patterns and generate human like language.[112][113]
|
||||||
|
|
||||||
|
Evaluation
|
||||||
|
Perplexity
|
||||||
|
The most commonly used measure of a language model's performance is its perplexity on a given text corpus. Perplexity is a measure of how well a model is able to predict the contents of a dataset; the higher the likelihood the model assigns to the dataset, the lower the perplexity. Mathematically, perplexity is defined as the exponential of the average negative log likelihood per token:
|
||||||
|
log
|
||||||
|
|
||||||
|
(
|
||||||
|
Perplexity
|
||||||
|
)
|
||||||
|
=
|
||||||
|
−
|
||||||
|
1
|
||||||
|
N
|
||||||
|
∑
|
||||||
|
i
|
||||||
|
=
|
||||||
|
1
|
||||||
|
N
|
||||||
|
log
|
||||||
|
|
||||||
|
(
|
||||||
|
Pr
|
||||||
|
(
|
||||||
|
token
|
||||||
|
i
|
||||||
|
∣
|
||||||
|
context for token
|
||||||
|
i
|
||||||
|
)
|
||||||
|
)
|
||||||
|
{\displaystyle \log({\text{Perplexity}})=-{\frac {1}{N}}\sum _{i=1}^{N}\log(\Pr({\text{token}}_{i}\mid {\text{context for token}}_{i}))}here
|
||||||
|
N
|
||||||
|
{\displaystyle N} is the number of tokens in the text corpus, and "context for token
|
||||||
|
i
|
||||||
|
{\displaystyle i}" depends on the specific type of LLM used. If the LLM is autoregressive, then "context for token
|
||||||
|
i
|
||||||
|
{\displaystyle i}" is the segment of text appearing before token
|
||||||
|
i
|
||||||
|
{\displaystyle i}. If the LLM is masked, then "context for token
|
||||||
|
i
|
||||||
|
{\displaystyle i}" is the segment of text surrounding token
|
||||||
|
i
|
||||||
|
{\displaystyle i}.
|
||||||
|
|
||||||
|
Because language models may overfit to their training data, models are usually evaluated by their perplexity on a test set of unseen data.[46] This presents particular challenges for the evaluation of large language models. As they are trained on increasingly large corpora of text largely scraped from the web, it becomes increasingly likely that models' training data inadvertently includes portions of any given test set.[3]
|
||||||
|
|
||||||
|
BPW, BPC, and BPT
|
||||||
|
In information theory, the concept of entropy is intricately linked to perplexity, a relationship notably established by Claude Shannon.[114] This relationship is mathematically expressed as
|
||||||
|
Entropy
|
||||||
|
=
|
||||||
|
log
|
||||||
|
2
|
||||||
|
|
||||||
|
(
|
||||||
|
Perplexity
|
||||||
|
)
|
||||||
|
{\displaystyle {\text{Entropy}}=\log _{2}({\text{Perplexity}})}.
Entropy, in this context, is commonly quantified in terms of bits per word (BPW) or bits per character (BPC), which hinges on whether the language model utilizes word-based or character-based tokenization.
Notably, in the case of larger language models that predominantly employ sub-word tokenization, bits per token (BPT) emerges as a seemingly more appropriate measure. However, due to the variance in tokenization methods across different Large Language Models (LLMs), BPT does not serve as a reliable metric for comparative analysis among diverse models. To convert BPT into BPW, one can multiply it by the average number of tokens per word.
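As an illustration of these conversions (a sketch added here, not part of the original article; the perplexity value and the average tokens-per-word ratio are made-up numbers that depend on the model and tokenizer):

import math

perplexity = 20.0
bits_per_token = math.log2(perplexity)          # Entropy = log2(Perplexity), here measured per token (BPT)

avg_tokens_per_word = 1.3                       # assumed tokenizer-dependent value
bits_per_word = bits_per_token * avg_tokens_per_word   # BPT -> BPW

print(bits_per_token, bits_per_word)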
In the evaluation and comparison of language models, cross-entropy is generally the preferred metric over entropy. The underlying principle is that a lower BPW is indicative of a model's enhanced capability for compression. This, in turn, reflects the model's proficiency in making accurate predictions.
Task-specific datasets and benchmarks
A large number of testing datasets and benchmarks have also been developed to evaluate the capabilities of language models on more specific downstream tasks. Tests may be designed to evaluate a variety of capabilities, including general knowledge, commonsense reasoning, and mathematical problem-solving.
One broad category of evaluation dataset is question answering datasets, consisting of pairs of questions and correct answers, for example, ("Have the San Jose Sharks won the Stanley Cup?", "No").[115] A question answering task is considered "open book" if the model's prompt includes text from which the expected answer can be derived (for example, the previous question could be adjoined with some text which includes the sentence "The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh Penguins in 2016."[115]). Otherwise, the task is considered "closed book", and the model must draw on knowledge retained during training.[116] Some examples of commonly used question answering datasets include TruthfulQA, Web Questions, TriviaQA, and SQuAD.[116]
Evaluation datasets may also take the form of text completion, having the model select the most likely word or sentence to complete a prompt, for example: "Alice was friends with Bob. Alice went to visit her friend, ____".[3]
Some composite benchmarks have also been developed which combine a diversity of different evaluation datasets and tasks. Examples include GLUE, SuperGLUE, MMLU, BIG-bench, and HELM.[114][116] OpenAI has released tools for running composite benchmarks, but noted that the eval results are sensitive to the prompting method.[117][118] Some public datasets contain questions that are mislabeled, ambiguous, unanswerable, or otherwise of low-quality, which can be cleaned to give more reliable benchmark scores.[119]
It was previously standard to report results on a held-out portion of an evaluation dataset after doing supervised fine-tuning on the remainder. It is now more common to evaluate a pre-trained model directly through prompting techniques, though researchers vary in the details of how they formulate prompts for particular tasks, particularly with respect to how many examples of solved tasks are adjoined to the prompt (i.e. the value of n in n-shot prompting).
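As a simple illustration of n-shot prompting (a sketch under an assumed prompt format, not a method prescribed by this article), solved examples can be adjoined to the prompt before the question to be answered:

def build_n_shot_prompt(solved_examples, question, n=2):
    # Concatenate n solved (question, answer) pairs before the new question.
    lines = []
    for q, a in solved_examples[:n]:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

examples = [("Have the San Jose Sharks won the Stanley Cup?", "No"),
            ("Is water wet?", "Yes")]
print(build_n_shot_prompt(examples, "Is the Earth flat?", n=2))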
Adversarially constructed evaluations
Because of the rapid pace of improvement of large language models, evaluation benchmarks have suffered from short lifespans: state-of-the-art models quickly "saturate" existing benchmarks, exceeding the performance of human annotators, which has led to efforts to replace or augment benchmarks with more challenging tasks.[120] In addition, there are cases of "shortcut learning", wherein AIs sometimes "cheat" on multiple-choice tests by using statistical correlations in superficial test question wording to guess the correct responses, without necessarily understanding the actual question being asked.[102]
Some datasets have been constructed adversarially, focusing on particular problems on which extant language models seem to have unusually poor performance compared to humans. One example is the TruthfulQA dataset, a question answering dataset consisting of 817 questions which language models are susceptible to answering incorrectly by mimicking falsehoods to which they were repeatedly exposed during training. For example, an LLM may answer "No" to the question "Can you teach an old dog new tricks?" because of its exposure to the English idiom you can't teach an old dog new tricks, even though this is not literally true.[121]
Another example of an adversarial evaluation dataset is Swag and its successor, HellaSwag, collections of problems in which one of multiple options must be selected to complete a text passage. The incorrect completions were generated by sampling from a language model and filtering with a set of classifiers. The resulting problems are trivial for humans, but at the time the datasets were created, state-of-the-art language models had poor accuracy on them. For example:
We see a fitness center sign. We then see a man talking to the camera and sitting and laying on a exercise ball. The man...
a) demonstrates how to increase efficient exercise work by running up and down balls.
b) moves all his arms and legs and builds up a lot of muscle.
c) then plays the ball and we see a graphics and hedge trimming demonstration.
d) performs sit ups while on the ball and talking.[122]
BERT selects b) as the most likely completion, though the correct answer is d).[122]
Wider impact
In 2023, Nature Biomedical Engineering wrote that "it is no longer possible to accurately distinguish" human-written text from text created by large language models, and that "It is all but certain that general-purpose large language models will rapidly proliferate... It is a rather safe bet that they will change many industries over time."[123] Goldman Sachs suggested in 2023 that generative language AI could increase global GDP by 7% in the next ten years, and could expose to automation 300 million jobs globally.[124][125]
Memorization and copyright
Further information: Artificial intelligence and copyright
Memorization is an emergent behavior in LLMs in which long strings of text are occasionally output verbatim from training data, contrary to typical behavior of traditional artificial neural nets. Evaluations of controlled LLM output measure the amount memorized from training data (focused on GPT-2-series models) as variously over 1% for exact duplicates[126] or up to about 7%.[127]
Security
Some commenters expressed concern over accidental or deliberate creation of misinformation, or other forms of misuse.[128] For example, the availability of large language models could reduce the skill-level required to commit bioterrorism; biosecurity researcher Kevin Esvelt has suggested that LLM creators should exclude from their training data papers on creating or enhancing pathogens.[129]
A study by researchers at Google and several universities, including Cornell University and the University of California, Berkeley, showed that there are potential security risks in language models such as ChatGPT. In their study, they examined and confirmed the possibility that questioners could extract, from ChatGPT, the training data that the AI model used. For example, when ChatGPT 3.5 turbo was asked to repeat the word "poem" forever, the model said "poem" hundreds of times and then diverged, deviating from the standard dialogue style and emitting nonsense phrases that reproduced its training data verbatim. The researchers observed more than 10,000 examples of the model exposing its training data in this manner, and said that it was hard to tell whether the model was actually safe or not.[130]
The potential presence of "sleeper agents" within LLM models is another emerging security concern. These are hidden functionalities built into the model that remain dormant until triggered by a specific event or condition. Upon activation, the LLM deviates from its expected behavior to take insecure actions.[131]
Large language model (LLM) applications accessible to the public, like ChatGPT or Claude, typically incorporate safety measures designed to filter out harmful content. However, implementing these controls effectively has proven challenging. For instance, research by Kang et al.[132] demonstrated a method for circumventing LLM safety systems. Similarly, Wang[133] illustrated how a potential criminal could bypass ChatGPT 4o's safety controls to obtain information on establishing a drug trafficking operation.
@@ -0,0 +1,46 @@
In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning.[1] Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
Methods to generate this mapping include neural networks,[2] dimensionality reduction on the word co-occurrence matrix,[3][4][5] probabilistic models,[6] explainable knowledge base method,[7] and explicit representation in terms of the context in which words appear.[8]
Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing[9] and sentiment analysis.[10]
Development and history of the approach
In distributional semantics, a quantitative methodological approach to understanding meaning in observed language, word embeddings or semantic feature space models have been used as a knowledge representation for some time.[11] Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by John Rupert Firth,[12] but also has roots in the contemporaneous work on search systems[13] and in cognitive psychology.[14]
The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings is based on the computational challenges of capturing distributional characteristics and using them for practical application to measure similarity between words, phrases, or entire documents. The first generation of semantic space models is the vector space model for information retrieval.[15][16][17] Such vector space models for words and their distributional data, implemented in their simplest form, result in a very sparse vector space of high dimensionality (cf. curse of dimensionality). Reducing the number of dimensions using linear algebraic methods such as singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s and the random indexing approach for collecting word co-occurrence contexts.[18][19][20][21] In 2000, Bengio et al. proposed, in a series of papers titled "Neural probabilistic language models", reducing the high dimensionality of word representations in contexts by "learning a distributed representation for words".[22][23][24]
A study published in NeurIPS (NIPS) 2002 introduced the use of both word and document embeddings applying the method of kernel CCA to bilingual (and multi-lingual) corpora, also providing an early example of self-supervised learning of word embeddings.[25]
Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of linguistic contexts in which the words occur; these different styles are studied in Lavelli et al., 2004.[26] Roweis and Saul published in Science how to use "locally linear embedding" (LLE) to discover representations of high dimensional data structures.[27] Most new word embedding techniques after about 2005 rely on a neural network architecture instead of more probabilistic and algebraic models, after foundational work done by Yoshua Bengio[28][circular reference] and colleagues.[29][30]
The approach has been adopted by many research groups after theoretical advances in 2010 had been made on the quality of vectors and the training speed of the model, as well as after hardware advances allowed for a broader parameter space to be explored profitably. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that can train vector space models faster than previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest for word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.[31]
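As a brief illustration of how such a toolkit is typically used (a sketch assuming the gensim library and its version 4.x API; the toy corpus and parameter values are arbitrary), word2vec-style embeddings can be trained and queried as follows:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram
vector = model.wv["cat"]                     # 50-dimensional embedding for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the vector space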
Polysemy and homonymy
Historically, one of the main limitations of static word embeddings or word vector space models is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, polysemy and homonymy are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term club is related to the word sense of a club sandwich, clubhouse, golf club, or any other sense that club might have. The necessity to accommodate multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones.[32][33]
Most approaches that produce multi-sense embeddings can be divided into two main categories for their word sense representation, i.e., unsupervised and knowledge-based.[34] Based on word2vec skip-gram, Multi-Sense Skip-Gram (MSSG)[35] performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g., WordNet, ConceptNet, BabelNet), word embeddings and word sense disambiguation, Most Suitable Sense Annotation (MSSA)[36] labels word-senses through an unsupervised and knowledge-based approach, considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embeddings technique, so multi-sense embeddings are produced. MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.[37]
The use of multi-sense embeddings is known to improve performance in several NLP tasks, such as part-of-speech tagging, semantic relation identification, semantic relatedness, named entity recognition and sentiment analysis.[38][39]
As of the late 2010s, contextually-meaningful embeddings such as ELMo and BERT have been developed.[40] Unlike static word embeddings, these embeddings are at the token-level, in that each occurrence of a word has its own embedding. These embeddings better reflect the multi-sense nature of words, because occurrences of a word in similar contexts are situated in similar regions of BERT’s embedding space.[41][42]
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad.[43] Named bio-vectors (BioVec) for biological sequences in general, protein-vectors (ProtVec) for proteins (amino-acid sequences), and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. The results presented by Asgari and Mofrad[43] suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
Game design
Word embeddings with applications in game design have been proposed by Rabii and Cook[44] as a way to discover emergent gameplay using logs of gameplay data. The process requires transcribing actions that occur during a game within a formal language and then using the resulting text to create word embeddings. The results presented by Rabii and Cook[44] suggest that the resulting vectors can capture expert knowledge about games like chess that is not explicitly stated in the game's rules.
Sentence embeddings
Main article: Sentence embedding
The idea has been extended to embeddings of entire sentences or even documents, e.g. in the form of the thought vectors concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of machine translation.[45] A more recent and popular approach for representing sentences is Sentence-BERT, or SentenceTransformers, which modifies pre-trained BERT with the use of siamese and triplet network structures.[46]
Software
Software for training and using word embeddings includes Tomáš Mikolov's Word2vec, Stanford University's GloVe,[47] GN-GloVe,[48] Flair embeddings,[38] AllenNLP's ELMo,[49] BERT,[50] fastText, Gensim,[51] Indra,[52] and Deeplearning4j. Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.[53]
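For example, a minimal sketch (not from the article; it assumes scikit-learn and NumPy, and uses random stand-in vectors rather than real embeddings) of reducing word vectors to two dimensions for visualization:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical word vectors (e.g. taken from a trained word2vec or GloVe model).
words = ["king", "queen", "apple", "orange"]
vectors = np.random.rand(len(words), 100)   # stand-in for real 100-dimensional embeddings

coords = PCA(n_components=2).fit_transform(vectors)  # project to 2 dimensions
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")     # coordinates that could be plotted as a scatter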
Examples of application
For instance, fastText is also used to calculate word embeddings for the text corpora in Sketch Engine that are available online.[54]
Ethical implications
Word embeddings may contain the biases and stereotypes contained in the training dataset. As Bolukbasi et al. point out in the 2016 paper “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”, a publicly available (and popular) word2vec embedding trained on Google News texts (a commonly used data corpus), which consists of text written by professional journalists, still shows disproportionate word associations reflecting gender and racial biases when extracting word analogies.[55] For example, one of the analogies generated using the aforementioned word embedding is “man is to computer programmer as woman is to homemaker”.[56][57]
Research done by Jieyu Zhou et al. shows that applying these trained word embeddings without careful oversight likely perpetuates existing bias in society, which is introduced through unaltered training data. Furthermore, word embeddings can even amplify these biases.[58][59]
1890
evaluations/09_custom_model_graded_prompt_foo/articles/article3.txt
Normal file
@@ -0,0 +1,549 @@
A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter (or kernel) optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections.[1][2] For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution (or cross-correlation) kernels,[3][4] only 25 neurons are required to process 5x5-sized tiles.[5][6] Higher-layer features are extracted from wider context windows, compared to lower-layer features.
They have applications in:
image and video recognition,[7]
recommender systems,[8]
image classification,
image segmentation,
medical image analysis,
natural language processing,[9]
brain–computer interfaces,[10] and
financial time series.[11]
CNNs are also known as shift invariant or space invariant artificial neural networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps.[12][13] Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input.[14]
Feed-forward neural networks are usually fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include: penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.) Robust datasets also increase the probability that CNNs will learn the generalized principles that characterize a given dataset rather than the biases of a poorly-populated set.[15]
Convolutional networks were inspired by biological processes[16][17][18][19] in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage.[to whom?]
Architecture
Comparison of the LeNet and AlexNet convolution, pooling and dense layers
(AlexNet image size should be 227×227×3, instead of 224×224×3, so the math will come out right. The original paper said different numbers, but Andrej Karpathy, the head of computer vision at Tesla, said it should be 227×227×3 (he said Alex did not describe why he put 224×224×3). The next convolution should be 11×11 with stride 4: 55×55×96 (instead of 54×54×96). It would be calculated, for example, as: [(input width 227 - kernel width 11) / stride 4] + 1 = [(227 - 11) / 4] + 1 = 55. Since the kernel output is the same length as width, its area is 55×55.)
Main article: Layer (deep learning)
A convolutional neural network consists of an input layer, hidden layers and an output layer. In a convolutional neural network, the hidden layers include one or more layers that perform convolutions. Typically this includes a layer that performs a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the input matrix for the layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers. Here it should be noted how close a convolutional neural network is to a matched filter.[20]
Convolutional layers
In a CNN, the input is a tensor with shape:
(number of inputs) × (input height) × (input width) × (input channels)
After passing through a convolutional layer, the image becomes abstracted to a feature map, also called an activation map, with shape:
(number of inputs) × (feature map height) × (feature map width) × (feature map channels).
Convolutional layers convolve the input and pass its result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus.[21] Each convolutional neuron processes data only for its receptive field.
1D convolutional neural network feed forward example
Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs (e.g., high-resolution images), which would require massive numbers of neurons because each pixel is a relevant input feature. A fully connected layer for an image of size 100 × 100 has 10,000 weights for each neuron in the second layer. Convolution reduces the number of free parameters, allowing the network to be deeper.[5] For example, using a 5 × 5 tiling region, each with the same shared weights, requires only 25 neurons. Using regularized weights over fewer parameters avoids the vanishing gradients and exploding gradients problems seen during backpropagation in earlier neural networks.[1][2]
To speed processing, standard convolutional layers can be replaced by depthwise separable convolutional layers,[22] which are based on a depthwise convolution followed by a pointwise convolution. The depthwise convolution is a spatial convolution applied independently over each channel of the input tensor, while the pointwise convolution is a standard convolution restricted to the use of 1×1 kernels.
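To illustrate why this is cheaper (a back-of-the-envelope sketch added here, not from the article; bias terms are ignored and the channel counts are arbitrary), compare the weight counts of a standard convolution and a depthwise separable convolution:

def standard_conv_params(in_ch, out_ch, k):
    return in_ch * out_ch * k * k              # one k x k filter per (input, output) channel pair

def depthwise_separable_params(in_ch, out_ch, k):
    depthwise = in_ch * k * k                  # one k x k spatial filter per input channel
    pointwise = in_ch * out_ch                 # 1x1 convolution mixing channels
    return depthwise + pointwise

print(standard_conv_params(64, 128, 3))        # 73728 weights
print(depthwise_separable_params(64, 128, 3))  # 8768 weights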
Pooling layers
Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters; tiling sizes such as 2 × 2 are commonly used. Global pooling acts on all the neurons of the feature map.[23][24] There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map,[25][26] while average pooling takes the average value.
Fully connected layers
Fully connected layers connect every neuron in one layer to every neuron in another layer. It is the same as a traditional multilayer perceptron neural network (MLP). The flattened matrix goes through a fully connected layer to classify the images.
Receptive field
In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field. Typically the area is a square (e.g. 5 by 5 neurons). Whereas, in a fully connected layer, the receptive field is the entire previous layer. Thus, in each convolutional layer, each neuron takes input from a larger area in the input than previous layers. This is due to applying the convolution over and over, which takes the value of a pixel into account, as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.
To manipulate the receptive field size as desired, there are some alternatives to the standard convolutional layer. For example, atrous or dilated convolution[27][28] expands the receptive field size without increasing the number of parameters by interleaving visible and blind regions. Moreover, a single dilated convolutional layer can comprise filters with multiple dilation ratios,[29] thus having a variable receptive field size.
Weights
Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.
The vectors of weights and biases are called filters and represent particular features of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector weighting.[30]
History
CNNs are often compared to the way the brain achieves vision processing in living organisms.[31]
Receptive fields in the visual cortex
Work by Hubel and Wiesel in the 1950s and 1960s showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field.[32] Neighboring cells have similar and overlapping receptive fields. Receptive field size and location varies systematically across the cortex to form a complete map of visual space.[citation needed] The cortex in each hemisphere represents the contralateral visual field.[citation needed]
Their 1968 paper identified two basic visual cell types in the brain:[17]
simple cells, whose output is maximized by straight edges having particular orientations within their receptive field
complex cells, which have larger receptive fields, whose output is insensitive to the exact position of the edges in the field.
Hubel and Wiesel also proposed a cascading model of these two types of cells for use in pattern recognition tasks.[33][32]
Neocognitron, origin of the CNN architecture
The "neocognitron"[16] was introduced by Kunihiko Fukushima in 1980.[18][26][34] It was inspired by the above-mentioned work of Hubel and Wiesel. The neocognitron introduced the two basic types of layers:
"S-layer": a shared-weights receptive-field layer, later known as a convolutional layer, which contains units whose receptive fields cover a patch of the previous layer. A shared-weights receptive-field group (a "plane" in neocognitron terminology) is often called a filter, and a layer typically has several such filters.
"C-layer": a downsampling layer that contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes a weighted average of the activations of the units in its patch, and applies inhibition (divisive normalization) pooled from a somewhat larger patch and across different filters in a layer, and applies a saturating activation function. The patch weights are nonnegative and are not trainable in the original neocognitron. The downsampling and competitive inhibition help to classify features and objects in visual scenes even when the objects are shifted.
In 1969, Fukushima had introduced the ReLU (rectified linear unit) activation function.[35][36] It was not used in his neocognitron since all the weights were nonnegative; lateral inhibition was used instead. The rectifier has become the most popular activation function for CNNs and deep neural networks in general.[37]
In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging with inhibition and saturation, J. Weng et al. in 1993 introduced a method called max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.[38] Max-pooling is often used in modern CNNs.[39]
Several supervised and unsupervised learning algorithms have been proposed over the decades to train the weights of a neocognitron.[16] Today, however, the CNN architecture is usually trained through backpropagation.
The neocognitron is the first ANN which requires units located at multiple network positions to have shared weights, a hallmark of CNNs.
Convolution in time
The term "convolution" first appears in neural networks in a paper by Toshiteru Homma, Les Atlas, and Robert Marks II at the first Conference on Neural Information Processing Systems in 1987. Their paper replaced multiplication with convolution in time, inherently providing shift invariance, motivated by and connecting more directly to the signal-processing concept of a filter, and demonstrated it on a speech recognition task.[6] They also pointed out that as a data-trainable system, convolution is essentially equivalent to correlation since reversal of the weights does not affect the final learned function ("For convenience, we denote * as correlation instead of convolution. Note that convolving a(t) with b(t) is equivalent to correlating a(-t) with b(t).").[6] Modern CNN implementations typically do correlation and call it convolution, for convenience, as they did here.
Time delay neural networks
The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel et al. for phoneme recognition and was one of the first convolutional networks, as it achieved shift-invariance.[40] A TDNN is a 1-D convolutional neural net where the convolution is performed along the time axis of the data. It is the first CNN utilizing weight sharing in combination with a training by gradient descent, using backpropagation.[41] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.[40]
TDNNs are convolutional networks that share weights along the temporal dimension.[42] They allow speech signals to be processed time-invariantly. In 1990 Hampshire and Waibel introduced a variant that performs a two-dimensional convolution.[43] Since these TDNNs operated on spectrograms, the resulting phoneme recognition system was invariant to both time and frequency shifts, as with images processed by a neocognitron.
TDNNs improved the performance of far-distance speech recognition.[44]
Image recognition with CNNs trained by gradient descent
Denker et al. (1989) designed a 2-D CNN system to recognize hand-written ZIP Code numbers.[45] However, the lack of an efficient training method to determine the kernel coefficients of the involved convolutions meant that all the coefficients had to be laboriously hand-designed.[46]
Following the advances in the training of 1-D CNNs by Waibel et al. (1987), Yann LeCun et al. (1989)[46] used back-propagation to learn the convolution kernel coefficients directly from images of hand-written numbers. Learning was thus fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Wei Zhang et al. (1988)[12][13] used back-propagation to train the convolution kernels of a CNN for alphabets recognition. The model was called shift-invariant pattern recognition neural network before the name CNN was coined later in the early 1990s. Wei Zhang et al. also applied the same CNN without the last fully connected layer for medical image object segmentation (1991)[47] and breast cancer detection in mammograms (1994).[48]
This approach became a foundation of modern computer vision.
Max pooling
In 1990 Yamaguchi et al. introduced the concept of max pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They did so by combining TDNNs with max pooling to realize a speaker-independent isolated word recognition system.[25] In their system they used several TDNNs per word, one for each syllable. The results of each TDNN over the input signal were combined using max pooling and the outputs of the pooling layers were then passed on to networks performing the actual word classification.
LeNet-5
Main article: LeNet
LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1995,[49] classifies hand-written numbers on checks (British English: cheques) digitized in 32x32 pixel images. The ability to process higher-resolution images requires larger and more layers of convolutional neural networks, so this technique is constrained by the availability of computing resources.
It was superior to other commercial courtesy amount reading systems (as of 1995). The system was integrated in NCR's check reading systems, and fielded in several American banks since June 1996, reading millions of checks per day.[50]
Shift-invariant neural network
A shift-invariant neural network was proposed by Wei Zhang et al. for image character recognition in 1988.[12][13] It is a modified neocognitron that keeps only the convolutional interconnections between the image feature layers and the last fully connected layer. The model was trained with back-propagation. The training algorithm was further improved in 1991[51] to improve its generalization ability. The model architecture was modified by removing the last fully connected layer and applied for medical image segmentation (1991)[47] and automatic detection of breast cancer in mammograms (1994).[48]
A different convolution-based design was proposed in 1988[52] for application to decomposition of one-dimensional electromyography convolved signals via de-convolution. This design was modified in 1989 into other de-convolution-based designs.[53][54]
Neural abstraction pyramid
The feed-forward architecture of convolutional neural networks was extended in the neural abstraction pyramid[55] by lateral and feedback connections. The resulting recurrent convolutional network allows for the flexible incorporation of contextual information to iteratively resolve local ambiguities. In contrast to previous models, image-like outputs at the highest resolution were generated, e.g., for semantic segmentation, image reconstruction, and object localization tasks.
GPU implementations
Although CNNs were invented in the 1980s, their breakthrough in the 2000s required fast implementations on graphics processing units (GPUs).
In 2004, it was shown by K. S. Oh and K. Jung that standard neural networks can be greatly accelerated on GPUs. Their implementation was 20 times faster than an equivalent implementation on CPU.[56] In 2005, another paper also emphasised the value of GPGPU for machine learning.[57]
The first GPU-implementation of a CNN was described in 2006 by K. Chellapilla et al. Their implementation was 4 times faster than an equivalent implementation on CPU.[58] In the same period, GPUs were also used for unsupervised training of deep belief networks.[59][60][61][62]
In 2010, Dan Ciresan et al. at IDSIA trained deep feedforward networks on GPUs.[63] In 2011, they extended this to CNNs, achieving a 60-fold speedup compared to training on CPU.[23] In 2011, their network won an image recognition contest, where it achieved superhuman performance for the first time.[64] They then won more competitions and achieved state of the art on several benchmarks.[65][39][26]
Subsequently, AlexNet, a similar GPU-based CNN by Alex Krizhevsky et al. won the ImageNet Large Scale Visual Recognition Challenge 2012.[66] It was an early catalytic event for the AI boom.
A very deep CNN with over 100 layers by Microsoft won the ImageNet 2015 contest.[67]
Intel Xeon Phi implementations
Compared to the training of CNNs using GPUs, not much attention was given to the Intel Xeon Phi coprocessor.[68] A notable development is a parallelization method for training convolutional neural networks on the Intel Xeon Phi, named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS).[69] CHAOS exploits both the thread- and SIMD-level parallelism that is available on the Intel Xeon Phi.
Distinguishing features
In the past, traditional multilayer perceptron (MLP) models were used for image recognition.[example needed] However, the full connectivity between nodes caused the curse of dimensionality, and was computationally intractable with higher-resolution images. A 1000×1000-pixel image with RGB color channels has 3 million weights per fully-connected neuron, which is too high to feasibly process efficiently at scale.
CNN layers arranged in 3 dimensions
For example, in CIFAR-10, images are only of size 32×32×3 (32 wide, 32 high, 3 color channels), so a single fully connected neuron in the first hidden layer of a regular neural network would have 32*32*3 = 3,072 weights. A 200×200 image, however, would lead to neurons that have 200*200*3 = 120,000 weights.
Also, such network architecture does not take into account the spatial structure of data, treating input pixels which are far apart in the same way as pixels that are close together. This ignores locality of reference in data with a grid-topology (such as images), both computationally and semantically. Thus, full connectivity of neurons is wasteful for purposes such as image recognition that are dominated by spatially local input patterns.
Convolutional neural networks are variants of multilayer perceptrons, designed to emulate the behavior of a visual cortex. These models mitigate the challenges posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images. As opposed to MLPs, CNNs have the following distinguishing features:
3D volumes of neurons. The layers of a CNN have neurons arranged in 3 dimensions: width, height and depth.[70] Each neuron inside a convolutional layer is connected to only a small region of the layer before it, called a receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN architecture.
Local connectivity: following the concept of receptive fields, CNNs exploit spatial locality by enforcing a local connectivity pattern between neurons of adjacent layers. The architecture thus ensures that the learned "filters" produce the strongest response to a spatially local input pattern. Stacking many such layers leads to nonlinear filters that become increasingly global (i.e. responsive to a larger region of pixel space) so that the network first creates representations of small parts of the input, then from them assembles representations of larger areas.
Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given convolutional layer respond to the same feature within their specific response field. Replicating units in this way allows for the resulting activation map to be equivariant under shifts of the locations of input features in the visual field, i.e. they grant translational equivariance—given that the layer has a stride of one.[71]
Pooling: In a CNN's pooling layers, feature maps are divided into rectangular sub-regions, and the features in each rectangle are independently down-sampled to a single value, commonly by taking their average or maximum value. In addition to reducing the sizes of feature maps, the pooling operation grants a degree of local translational invariance to the features contained therein, allowing the CNN to be more robust to variations in their positions.[14]
Together, these properties allow CNNs to achieve better generalization on vision problems. Weight sharing dramatically reduces the number of free parameters learned, thus lowering the memory requirements for running the network and allowing the training of larger, more powerful networks.
Building blocks
A CNN architecture is formed by a stack of distinct layers that transform the input volume into an output volume (e.g. holding the class scores) through a differentiable function. A few distinct types of layers are commonly used. These are further discussed below.
Neurons of a convolutional layer (blue), connected to their receptive field (red)
Convolutional layer
A worked example of performing a convolution. The convolution has stride 1, zero-padding, with kernel size 3-by-3. The convolution kernel is a discrete Laplacian operator.
The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.[72][nb 1]
Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input. Each entry in an activation map uses the same set of parameters that define the filter.
Self-supervised learning has been adapted for use in convolutional layers by using sparse patches with a high-mask ratio and a global response normalization layer.[citation needed]
Local connectivity
Typical CNN architecture
When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account. Convolutional networks exploit spatially local correlation by enforcing a sparse local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume.
The extent of this connectivity is a hyperparameter called the receptive field of the neuron. The connections are local in space (along width and height), but always extend along the entire depth of the input volume. Such an architecture ensures that the learned (British English: learnt) filters produce the strongest response to a spatially local input pattern.
Spatial arrangement
Three hyperparameters control the size of the output volume of the convolutional layer: the depth, stride, and padding size:
The depth of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features in the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of various oriented edges, or blobs of color.
Stride controls how depth columns around the width and height are allocated. If the stride is 1, then we move the filters one pixel at a time. This leads to heavily overlapping receptive fields between the columns, and to large output volumes. For any integer S > 0, a stride S means that the filter is translated S units at a time per output. In practice, S ≥ 3 is rare. A greater stride means smaller overlap of receptive fields and smaller spatial dimensions of the output volume.[73]
Sometimes, it is convenient to pad the input with zeros (or other values, such as the average of the region) on the border of the input volume. The size of this padding is a third hyperparameter. Padding provides control of the output volume's spatial size. In particular, sometimes it is desirable to exactly preserve the spatial size of the input volume, this is commonly referred to as "same" padding.
Three example padding conditions. Replication condition means that the pixel outside is padded with the closest pixel inside. The reflection padding is where the pixel outside is padded with the pixel inside, reflected across the boundary of the image. The circular padding is where the pixel outside wraps around to the other side of the image.
The spatial size of the output volume is a function of the input volume size W, the kernel field size K of the convolutional layer neurons, the stride S, and the amount of zero padding P on the border. The number of neurons that "fit" in a given volume is then:

\frac{W - K + 2P}{S} + 1.
If this number is not an integer, then the strides are incorrect and the neurons cannot be tiled to fit across the input volume in a symmetric way. In general, setting zero padding to be P = (K − 1)/2 when the stride is S = 1 ensures that the input volume and output volume will have the same size spatially. However, it is not always completely necessary to use all of the neurons of the previous layer. For example, a neural network designer may decide to use just a portion of padding.
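A small sketch (added here for illustration, not from the article) showing the output-size formula in code; the example values are chosen to match the AlexNet note earlier in this article and a typical "same"-padding case:

def conv_output_size(w, k, s, p):
    # Number of output positions along one spatial dimension: (W - K + 2P) / S + 1.
    size = (w - k + 2 * p) / s + 1
    if not size.is_integer():
        raise ValueError("hyperparameters do not tile the input symmetrically")
    return int(size)

print(conv_output_size(w=227, k=11, s=4, p=0))  # 55, as in the AlexNet example above
print(conv_output_size(w=32, k=3, s=1, p=1))    # 32: "same" padding with P = (K - 1) / 2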
Parameter sharing
A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on the assumption that if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Denoting a single 2-dimensional slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias.
Since all neurons in a single depth slice share the same parameters, the forward pass in each depth slice of the convolutional layer can be computed as a convolution of the neuron's weights with the input volume.[nb 2] Therefore, it is common to refer to the sets of weights as a filter (or a kernel), which is convolved with the input. The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume. Parameter sharing contributes to the translation invariance of the CNN architecture.[14]
Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centered structure; for which we expect completely different features to be learned on different spatial locations. One practical example is when the inputs are faces that have been centered in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a "locally connected layer".
Pooling layer
Worked example of 2x2 maxpooling with stride 2.
Max pooling with a 2x2 filter and stride = 2
Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling, where max pooling is the most common. It partitions the input image into a set of rectangles and, for each such sub-region, outputs the maximum.
Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture.[72]: 460–461 While pooling layers contribute to local translation invariance, they do not provide global translation invariance in a CNN, unless a form of global pooling is used.[14][71] The pooling layer commonly operates independently on every depth, or slice, of the input and resizes it spatially. A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations:
f_{X,Y}(S) = \max_{a,b=0}^{1} S_{2X+a,\,2Y+b}

In this case, every max operation is over 4 numbers. The depth dimension remains unchanged (this is true for other forms of pooling as well).
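The following sketch (assuming NumPy; an illustration added here, not part of the article) implements the 2×2, stride-2 max pooling described above for a single feature map with even height and width:

import numpy as np

def max_pool_2x2(feature_map):
    # 2x2 max pooling with stride 2: group pixels into 2x2 blocks and take the maximum of each block.
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 6, 7, 8],
              [3, 2, 1, 0]])
print(max_pool_2x2(x))  # [[4 2] [6 8]]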
In addition to max pooling, pooling units can use other functions, such as average pooling or ℓ2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which generally performs better in practice.[74]
Due to the effects of fast spatial reduction of the size of the representation,[which?] there is a recent trend towards using smaller filters[75] or discarding pooling layers altogether.[76]
RoI pooling to size 2x2. In this example region proposal (an input parameter) has size 7x5.
"Region of Interest" pooling (also known as RoI pooling) is a variant of max pooling, in which output size is fixed and input rectangle is a parameter.[citation needed]
Pooling is a downsampling method and an important component of convolutional neural networks for object detection based on the Fast R-CNN[77] architecture.
Channel max pooling
|
||||||
|
A channel max pooling (CMP) operation layer conducts the MP operation along the channel side among the corresponding positions of the consecutive feature maps for the purpose of redundant information elimination. The CMP makes the significant features gather together within fewer channels, which is important for fine-grained image classification that needs more discriminating features. Meanwhile, another advantage of the CMP operation is to make the channel number of feature maps smaller before it connects to the first fully connected (FC) layer. Similar to the MP operation, we denote the input feature maps and output feature maps of a CMP layer as F ∈ R(C×M×N) and C ∈ R(c×M×N), respectively, where C and c are the channel numbers of the input and output feature maps, and M and N are the width and the height of the feature maps, respectively. Note that the CMP operation only changes the channel number of the feature maps. The width and the height of the feature maps are not changed, which is different from the MP operation.[78]
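A rough NumPy sketch of this idea follows, assuming the C input channels are reduced by taking a maximum over groups of consecutive channels and that C is a multiple of c; the grouping scheme and the function name are assumptions of the example, not a definitive implementation of the cited CMP layer.

```python
import numpy as np

def channel_max_pool(feature_maps, out_channels):
    """Channel max pooling (CMP) sketch: reduce C input channels to
    out_channels by taking a max over groups of consecutive channels.
    The spatial size (M, N) is unchanged, unlike spatial max pooling."""
    c_in, m, n = feature_maps.shape
    assert c_in % out_channels == 0, "assumes C is a multiple of c"
    group = c_in // out_channels
    return feature_maps.reshape(out_channels, group, m, n).max(axis=1)

f = np.random.rand(8, 5, 4)          # C=8 input feature maps of size 5x4
print(channel_max_pool(f, 2).shape)  # (2, 5, 4)
```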
ReLU layer
ReLU is the abbreviation of rectified linear unit introduced by Kunihiko Fukushima in 1969.[35][36] ReLU applies the non-saturating activation function
{\textstyle f(x)=\max(0,x)}.[66] It effectively removes negative values from an activation map by setting them to zero.[79] It introduces nonlinearity to the decision function and in the overall network without affecting the receptive fields of the convolution layers. In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that ReLU enables better training of deeper networks,[80] compared to widely used activation functions prior to 2011.
Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent
{\displaystyle f(x)=\tanh(x)}, {\displaystyle f(x)=|\tanh(x)|}, and the sigmoid function {\textstyle \sigma (x)=(1+e^{-x})^{-1}}. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.[81]
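For reference, these activation functions are one-liners in NumPy (an illustrative sketch; the function names are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # f(x) = max(0, x)

def tanh_act(x):
    return np.tanh(x)                  # saturating hyperbolic tangent

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # sigma(x) = (1 + e^-x)^-1

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(v), tanh_act(v), sigmoid(v), sep="\n")
```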
Fully connected layer
After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation, with matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).
Loss layer

Main articles: Loss function and Loss functions for classification
The "loss layer", or "loss function", specifies how training penalizes the deviation between the predicted output of the network, and the true data labels (during supervised learning). Various loss functions can be used, depending on the specific task.
The Softmax loss function is used for predicting a single class of K mutually exclusive classes.[nb 3] Sigmoid cross-entropy loss is used for predicting K independent probability values in
{\displaystyle [0,1]}. Euclidean loss is used for regressing to real-valued labels {\displaystyle (-\infty ,\infty )}.
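A hedged NumPy sketch of the three loss types mentioned above, for a single example; the function names, the use of raw logits as inputs, and the 0.5 factor in the Euclidean loss are assumptions of the example.

```python
import numpy as np

def softmax_cross_entropy(logits, true_class):
    """Softmax loss for a single example with K mutually exclusive classes."""
    z = logits - logits.max()                      # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[true_class]

def sigmoid_cross_entropy(logits, targets):
    """Cross-entropy for K independent probabilities in [0, 1]."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).sum()

def euclidean_loss(pred, target):
    """Squared (Euclidean) loss for real-valued regression targets."""
    return 0.5 * np.sum((pred - target) ** 2)

print(softmax_cross_entropy(np.array([2.0, 0.5, -1.0]), true_class=0))
print(sigmoid_cross_entropy(np.array([1.2, -0.7]), np.array([1.0, 0.0])))
print(euclidean_loss(np.array([0.9, 2.1]), np.array([1.0, 2.0])))
```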
Hyperparameters
Hyperparameters are various settings that are used to control the learning process. CNNs use more hyperparameters than a standard multilayer perceptron (MLP).
Kernel size
The kernel size is the number of pixels processed together. It is typically expressed as the kernel's dimensions, e.g., 2x2 or 3x3.
Padding
Padding is the addition of (typically) 0-valued pixels on the borders of an image. This is done so that the border pixels are not undervalued (lost) from the output because they would ordinarily participate in only a single receptive field instance. The padding applied is typically one less than the corresponding kernel dimension. For example, a convolutional layer using 3x3 kernels would receive a 2-pixel pad, that is 1 pixel on each side of the image.[citation needed]
Stride
The stride is the number of pixels that the analysis window moves on each iteration. A stride of 2 means that each kernel is offset by 2 pixels from its predecessor.
Number of filters
Since feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of feature values v_a with pixel position is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.
The number of feature maps directly controls the capacity and depends on the number of available examples and task complexity.
Filter size
Common filter sizes found in the literature vary greatly, and are usually chosen based on the data set. Typical filter sizes range from 1x1 to 7x7. As two famous examples, AlexNet used 3x3, 5x5, and 11x11. Inceptionv3 used 1x1, 3x3, and 5x5.
The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.
Pooling type and size
Max pooling is typically used, often with a 2x2 dimension. This implies that the input is drastically downsampled, reducing processing cost.
Greater pooling reduces the dimension of the signal, and may result in unacceptable information loss. Often, non-overlapping pooling windows perform best.[74]
Dilation
Dilation involves ignoring pixels within a kernel. This reduces processing/memory potentially without significant signal loss. A dilation of 2 on a 3x3 kernel expands the kernel to 5x5, while still processing 9 (evenly spaced) pixels. Accordingly, a dilation of 4 expands the kernel to 9x9.[citation needed]
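These hyperparameters interact through the standard output-size formula for one spatial axis. The helper below is an illustrative sketch (the function name is an assumption) showing how kernel size, padding, stride, and dilation jointly determine the output resolution.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution/pooling window along one axis,
    using the standard floor formula; the effective kernel size grows to
    dilation*(kernel-1) + 1 when dilation > 1."""
    effective = dilation * (kernel - 1) + 1
    return (in_size + 2 * padding - effective) // stride + 1

print(conv_output_size(32, kernel=3, padding=1))    # 32 ("same" padding)
print(conv_output_size(32, kernel=2, stride=2))     # 16 (2x2 pooling, stride 2)
print(conv_output_size(32, kernel=3, dilation=2))   # 28 (effective 5x5 kernel)
```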
Translation equivariance and aliasing
It is commonly assumed that CNNs are invariant to shifts of the input. Convolution or pooling layers within a CNN that do not have a stride greater than one are indeed equivariant to translations of the input.[71] However, layers with a stride greater than one ignore the Nyquist-Shannon sampling theorem and might lead to aliasing of the input signal.[71] While, in principle, CNNs are capable of implementing anti-aliasing filters, it has been observed that this does not happen in practice,[82] yielding models that are not equivariant to translations. Furthermore, if a CNN makes use of fully connected layers, translation equivariance does not imply translation invariance, as the fully connected layers are not invariant to shifts of the input.[83][14] One solution for complete translation invariance is avoiding any down-sampling throughout the network and applying global average pooling at the last layer.[71] Additionally, several other partial solutions have been proposed, such as anti-aliasing before downsampling operations,[84] spatial transformer networks,[85] data augmentation, subsampling combined with pooling,[14] and capsule neural networks.[86]
Evaluation
The accuracy of the final model is evaluated on a sub-part of the dataset set apart at the start, often called a test set. Other times, methods such as k-fold cross-validation are applied. Other strategies include using conformal prediction.[87][88]
Regularization methods

Main article: Regularization (mathematics)
Regularization is a process of introducing additional information to solve an ill-posed problem or to prevent overfitting. CNNs use various types of regularization.
Empirical

Dropout
Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout, introduced in 2014.[89] At each training stage, individual nodes are either "dropped out" of the net (ignored) with probability
{\displaystyle 1-p} or kept with probability {\displaystyle p}, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights.
In the training stages,
{\displaystyle p} is usually 0.5; for input nodes, it is typically much higher because information is directly lost when input nodes are ignored.
At testing time after training has finished, we would ideally like to find a sample average of all possible
{\displaystyle 2^{n}} dropped-out networks; unfortunately this is unfeasible for large values of {\displaystyle n}. However, we can find an approximation by using the full network with each node's output weighted by a factor of {\displaystyle p}, so the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates {\displaystyle 2^{n}} neural nets, and as such allows for model combination, at test time only a single network needs to be tested.
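A minimal NumPy sketch of this train/test behaviour, following the description above (drop nodes with probability 1 − p during training, scale outputs by p at test time); the function names and the fixed random seed are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p=0.5):
    """Training pass: keep each node with probability p (drop with 1 - p)."""
    keep_mask = rng.random(activations.shape) < p
    return activations * keep_mask

def dropout_test(activations, p=0.5):
    """Test pass: use the full network but weight each output by p,
    approximating the average over the 2^n dropped-out networks."""
    return activations * p

a = np.ones(8)
print(dropout_train(a))   # some entries zeroed at random
print(dropout_test(a))    # all entries scaled by 0.5
```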
By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes the model combination practical, even for deep neural networks. The technique seems to reduce node interactions, leading them to learn more robust features[clarification needed] that better generalize to new data.
DropConnect
DropConnect is the generalization of dropout in which each connection, rather than each output unit, can be dropped with probability
{\displaystyle 1-p}. Each unit thus receives input from a random subset of units in the previous layer.[90]
DropConnect is similar to dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage.
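A minimal NumPy sketch of this difference, assuming a single fully connected layer whose weight matrix (rather than its output vector) is masked at random; the names and shapes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropconnect_forward(x, weights, bias, p=0.5):
    """DropConnect sketch for one fully connected layer: each *weight*
    (connection) is kept with probability p, so every output unit sees a
    random subset of the inputs. Compare dropout, which masks outputs."""
    mask = rng.random(weights.shape) < p
    return (weights * mask) @ x + bias

x = np.ones(4)
w = rng.normal(size=(3, 4))
b = np.zeros(3)
print(dropconnect_forward(x, w, b))
```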
Stochastic pooling
A major drawback to Dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected.
Even before Dropout, in 2013 a technique called stochastic pooling[91] was introduced, in which the conventional deterministic pooling operations are replaced with a stochastic procedure: the activation within each pooling region is picked randomly according to a multinomial distribution given by the activities within the pooling region. This approach is free of hyperparameters and can be combined with other regularization approaches, such as dropout and data augmentation.
An alternate view of stochastic pooling is that it is equivalent to standard max pooling but with many copies of an input image, each having small local deformations. This is similar to explicit elastic deformations of the input images,[92] which delivers excellent performance on the MNIST data set.[92] Using stochastic pooling in a multilayer model gives an exponential number of deformations since the selections in higher layers are independent of those below.
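For a single pooling region, the sampling step can be sketched as follows (illustrative NumPy only; it assumes non-negative activations, e.g. after a ReLU, so that they can be normalized into a multinomial distribution).

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_pool_region(region):
    """Pick one activation from a pooling region at random, with probability
    proportional to its (non-negative) activation, as in stochastic pooling."""
    a = region.ravel()
    total = a.sum()
    if total == 0:                       # all-zero region: fall back to 0
        return 0.0
    return rng.choice(a, p=a / total)

region = np.array([[0.0, 1.0],
                   [2.0, 1.0]])          # a 2x2 pooling region (e.g. after ReLU)
print(stochastic_pool_region(region))    # 2.0 with prob 0.5, 1.0 with prob 0.5
```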
Artificial data

Main article: Data augmentation
Because the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting. Because there is often not enough available data to train, especially considering that some part should be spared for later testing, two approaches are to either generate new data from scratch (if possible) or perturb existing data to create new examples. The latter approach has been used since the mid-1990s.[49] For example, input images can be cropped, rotated, or rescaled to create new examples with the same labels as the original training set.[93]
Explicit

Early stopping

Main article: Early stopping
One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur. It comes with the disadvantage that the learning process is halted.
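A generic sketch of the idea, with `train_step` and `validate` as hypothetical callables supplied by the user; the patience-based stopping rule shown here is one common variant, not the only one.

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Illustrative early-stopping loop: stop once the validation loss has not
    improved for `patience` consecutive epochs. `train_step()` runs one epoch;
    `validate()` returns a validation loss."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        loss = validate()
        if loss < best_loss:
            best_loss, epochs_without_improvement = loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # training halted early
    return best_loss
```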
Number of parameters
Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth. For convolutional networks, the filter size also affects the number of parameters. Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting. This is equivalent to a "zero norm".
Weight decay
A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of weights (L1 norm) or squared magnitude (L2 norm) of the weight vector, to the error at each node. The level of acceptable model complexity can be reduced by increasing the proportionality constant (the 'alpha' hyperparameter), thus increasing the penalty for large weight vectors.
L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
L1 regularization is also common. It makes the weight vectors sparse during optimization. In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs. L1 with L2 regularization can be combined; this is called elastic net regularization.
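A small sketch of how such penalties are typically added to the training objective; the coefficient names `alpha_l2` and `alpha_l1` are assumptions of the example, and using both terms at once corresponds to elastic net.

```python
import numpy as np

def regularization_penalty(weights, alpha_l2=1e-4, alpha_l1=0.0):
    """Penalty added to the training loss: alpha_l2 * sum(w^2) (weight decay /
    L2) plus alpha_l1 * sum(|w|) (L1). Using both corresponds to elastic net."""
    return alpha_l2 * np.sum(weights ** 2) + alpha_l1 * np.sum(np.abs(weights))

w = np.array([0.5, -1.5, 2.0])
print(regularization_penalty(w, alpha_l2=0.01))                  # pure L2
print(regularization_penalty(w, alpha_l2=0.01, alpha_l1=0.01))   # elastic net
```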
Max norm constraints
Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector
{\displaystyle {\vec {w}}} of every neuron to satisfy {\displaystyle \|{\vec {w}}\|_{2}<c}. Typical values of {\displaystyle c} are on the order of 3–4. Some papers report improvements[94] when using this form of regularization.
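A minimal NumPy sketch of the projection step, assuming each row of the weight matrix holds one neuron's incoming weight vector; the layout and function name are assumptions of the example.

```python
import numpy as np

def apply_max_norm(weights, c=3.0):
    """Project each neuron's incoming weight vector (one row) back onto the
    ball ||w||_2 <= c after a normal gradient update."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return weights * scale

w = np.array([[3.0, 4.0],     # norm 5   -> rescaled to norm 3
              [0.5, 0.5]])    # norm ~0.7 -> unchanged
print(apply_max_norm(w, c=3.0))
```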
Hierarchical coordinate frames
Pooling loses the precise spatial relationships between high-level parts (such as nose and mouth in a face image). These relationships are needed for identity recognition. Overlapping the pools, so that each feature occurs in multiple pools, helps retain the information. Translation alone cannot extrapolate the understanding of geometric relationships to a radically new viewpoint, such as a different orientation or scale. On the other hand, people are very good at extrapolating; after seeing a new shape once they can recognize it from a different viewpoint.[95]
An earlier common way to deal with this problem is to train the network on transformed data in different orientations, scales, lighting, etc. so that the network can cope with these variations. This is computationally intensive for large data-sets. The alternative is to use a hierarchy of coordinate frames and use a group of neurons to represent a conjunction of the shape of the feature and its pose relative to the retina. The pose relative to the retina is the relationship between the coordinate frame of the retina and the intrinsic features' coordinate frame.[96]
Thus, one way to represent something is to embed the coordinate frame within it. This allows large features to be recognized by using the consistency of the poses of their parts (e.g. nose and mouth poses make a consistent prediction of the pose of the whole face). This approach ensures that the higher-level entity (e.g. face) is present when the lower-level entities (e.g. nose and mouth) agree on its prediction of the pose. The vectors of neuronal activity that represent pose ("pose vectors") allow spatial transformations to be modeled as linear operations, which makes it easier for the network to learn the hierarchy of visual entities and generalize across viewpoints. This is similar to the way the human visual system imposes coordinate frames in order to represent shapes.[97]
Applications

Image recognition
CNNs are often used in image recognition systems. In 2012, an error rate of 0.23% on the MNIST database was reported.[26] Another paper on using CNN for image classification reported that the learning process was "surprisingly fast"; in the same paper, the best published results as of 2011 were achieved in the MNIST database and the NORB database.[23] Subsequently, a similar CNN called AlexNet[98] won the ImageNet Large Scale Visual Recognition Challenge 2012.
When applied to facial recognition, CNNs achieved a large decrease in error rate.[99] Another paper reported a 97.6% recognition rate on "5,600 still images of more than 10 subjects".[19] CNNs were used to assess video quality in an objective way after manual training; the resulting system had a very low root mean square error.[100]
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object classification and detection, with millions of images and hundreds of object classes. In the ILSVRC 2014,[101] a large-scale visual recognition challenge, almost every highly ranked team used CNN as their basic framework. The winner GoogLeNet[102] (the foundation of DeepDream) increased the mean average precision of object detection to 0.439329, and reduced classification error to 0.06656, the best result to date. Its network applied more than 30 layers. That performance of convolutional neural networks on the ImageNet tests was close to that of humans.[103] The best algorithms still struggle with objects that are small or thin, such as a small ant on a stem of a flower or a person holding a quill in their hand. They also have trouble with images that have been distorted with filters, an increasingly common phenomenon with modern digital cameras. By contrast, those kinds of images rarely trouble humans. Humans, however, tend to have trouble with other issues. For example, they are not good at classifying objects into fine-grained categories such as the particular breed of dog or species of bird, whereas convolutional neural networks handle this.[citation needed]
In 2015, a many-layered CNN demonstrated the ability to spot faces from a wide range of angles, including upside down, even when partially occluded, with competitive performance. The network was trained on a database of 200,000 images that included faces at various angles and orientations and a further 20 million images without faces. They used batches of 128 images over 50,000 iterations.[104]
Video analysis
Compared to image data domains, there is relatively little work on applying CNNs to video classification. Video is more complex than images since it has another (temporal) dimension. However, some extensions of CNNs into the video domain have been explored. One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.[105][106] Another way is to fuse the features of two convolutional neural networks, one for the spatial and one for the temporal stream.[107][108][109] Long short-term memory (LSTM) recurrent units are typically incorporated after the CNN to account for inter-frame or inter-clip dependencies.[110][111] Unsupervised learning schemes for training spatio-temporal features have been introduced, based on Convolutional Gated Restricted Boltzmann Machines[112] and Independent Subspace Analysis.[113] Their application can be seen in text-to-video models.[citation needed]
Natural language processing
CNNs have also been explored for natural language processing. CNN models are effective for various NLP problems and achieved excellent results in semantic parsing,[114] search query retrieval,[115] sentence modeling,[116] classification,[117] prediction[118] and other traditional NLP tasks.[119] Compared to traditional language processing methods such as recurrent neural networks, CNNs can represent different contextual realities of language that do not rely on a series-sequence assumption, while RNNs are better suited when classical time series modeling is required.[120][121][122][123]
Anomaly detection
A CNN with 1-D convolutions was used on time series in the frequency domain (spectral residual) by an unsupervised model to detect anomalies in the time domain.[124]
Drug discovery
CNNs have been used in drug discovery. Predicting the interaction between molecules and biological proteins can identify potential treatments. In 2015, Atomwise introduced AtomNet, the first deep learning neural network for structure-based drug design.[125] The system trains directly on 3-dimensional representations of chemical interactions. Similar to how image recognition networks learn to compose smaller, spatially proximate features into larger, complex structures,[126] AtomNet discovers chemical features, such as aromaticity, sp3 carbons, and hydrogen bonding. Subsequently, AtomNet was used to predict novel candidate biomolecules for multiple disease targets, most notably treatments for the Ebola virus[127] and multiple sclerosis.[128]
Checkers game
CNNs have been used in the game of checkers. From 1999 to 2001, Fogel and Chellapilla published papers showing how a convolutional neural network could learn to play checkers using co-evolution. The learning process did not use prior human professional games, but rather focused on a minimal set of information contained in the checkerboard: the location and type of pieces, and the difference in number of pieces between the two sides. Ultimately, the program (Blondie24) was tested on 165 games against players and ranked in the highest 0.4%.[129][130] It also earned a win against the program Chinook at its "expert" level of play.[131]
Go
CNNs have been used in computer Go. In December 2014, Clark and Storkey published a paper showing that a CNN trained by supervised learning from a database of human professional games could outperform GNU Go and win some games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took Fuego to play.[132] Later it was announced that a large 12-layer convolutional neural network had correctly predicted the professional move in 55% of positions, equalling the accuracy of a 6 dan human player. When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GNU Go in 97% of games, and matched the performance of the Monte Carlo tree search program Fuego simulating ten thousand playouts (about a million positions) per move.[133]
A pair of CNNs for choosing moves to try (the "policy network") and evaluating positions (the "value network"), driving MCTS, was used by AlphaGo, the first program to beat the best human player at the time.[134]
Time series forecasting
Recurrent neural networks are generally considered the best neural network architectures for time series forecasting (and sequence modeling in general), but recent studies show that convolutional networks can perform comparably or even better.[135][11] Dilated convolutions[136] might enable one-dimensional convolutional neural networks to effectively learn time series dependencies.[137] Convolutions can be implemented more efficiently than RNN-based solutions, and they do not suffer from vanishing (or exploding) gradients.[138] Convolutional networks can provide an improved forecasting performance when there are multiple similar time series to learn from.[139] CNNs can also be applied to further tasks in time series analysis (e.g., time series classification[140] or quantile forecasting[141]).
Cultural heritage and 3D-datasets
As archaeological findings such as clay tablets with cuneiform writing are increasingly acquired using 3D scanners, benchmark datasets are becoming available, including HeiCuBeDa,[142] which provides almost 2000 normalized 2-D and 3-D datasets prepared with the GigaMesh Software Framework.[143] Curvature-based measures are therefore used in conjunction with geometric neural networks (GNNs), e.g. for period classification of those clay tablets, which are among the oldest documents of human history.[144][145]
Fine-tuning
For many applications, training data is not readily available. Convolutional neural networks usually require a large amount of training data in order to avoid overfitting. A common technique is to train the network on a larger data set from a related domain. Once the network parameters have converged, an additional training step is performed using the in-domain data to fine-tune the network weights; this is known as transfer learning. Furthermore, this technique allows convolutional network architectures to be applied successfully to problems with tiny training sets.[146]
Human interpretable explanations
End-to-end training and prediction are common practice in computer vision. However, human interpretable explanations are required for critical systems such as self-driving cars.[147] With recent advances in visual salience, spatial attention, and temporal attention, the most critical spatial regions/temporal instants could be visualized to justify the CNN predictions.[148][149]
Related architectures

Deep Q-networks
A deep Q-network (DQN) is a type of deep learning model that combines a deep neural network with Q-learning, a form of reinforcement learning. Unlike earlier reinforcement learning agents, DQNs that utilize CNNs can learn directly from high-dimensional sensory inputs via reinforcement learning.[150]
Preliminary results were presented in 2014, with an accompanying paper in February 2015.[151] The research described an application to Atari 2600 gaming. Other deep reinforcement learning models preceded it.[152]
Deep belief networks

Main article: Deep belief network
Convolutional deep belief networks (CDBN) have structure very similar to convolutional neural networks and are trained similarly to deep belief networks. Therefore, they exploit the 2D structure of images, like CNNs do, and make use of pre-training like deep belief networks. They provide a generic structure that can be used in many image and signal processing tasks. Benchmark results on standard image datasets like CIFAR[153] have been obtained using CDBNs.[154]
@@ -0,0 +1,70 @@
Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices.[1][2] The image of the written text may be sensed "off line" from a piece of paper by optical scanning (optical character recognition) or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most plausible words.
Offline recognition
Offline handwriting recognition involves the automatic conversion of text in an image into letter codes that are usable within computer and text-processing applications. The data obtained by this form is regarded as a static representation of handwriting. Offline handwriting recognition is comparatively difficult, as different people have different handwriting styles. And, as of today, OCR engines are primarily focused on machine printed text and ICR for hand "printed" (written in capital letters) text.
Traditional techniques

Character extraction
Offline character recognition often involves scanning a form or document. This means the individual characters contained in the scanned image will need to be extracted. Tools exist that are capable of performing this step.[3] However, there are several common imperfections in this step. The most common is when characters that are connected are returned as a single sub-image containing both characters. This causes a major problem in the recognition stage. Yet many algorithms are available that reduce the risk of connected characters.
Character recognition
After individual characters have been extracted, a recognition engine is used to identify the corresponding computer character. Several different recognition techniques are currently available.
Feature extraction
Feature extraction works in a similar fashion to neural network recognizers. However, programmers must manually determine the properties they feel are important. This approach gives the recognizer more control over the properties used in identification. Yet any system using this approach requires substantially more development time than a neural network because the properties are not learned automatically.
Modern techniques
Where traditional techniques focus on segmenting individual characters for recognition, modern techniques focus on recognizing all the characters in a segmented line of text. In particular, they focus on machine learning techniques that are able to learn visual features, avoiding the limiting feature engineering previously used. State-of-the-art methods use convolutional networks to extract visual features over several overlapping windows of a text line image, which a recurrent neural network uses to produce character probabilities.[4]
Online recognition
Online handwriting recognition involves the automatic conversion of text as it is written on a special digitizer or PDA, where a sensor picks up the pen-tip movements as well as pen-up/pen-down switching. This kind of data is known as digital ink and can be regarded as a digital representation of handwriting. The obtained signal is converted into letter codes that are usable within computer and text-processing applications.
The elements of an online handwriting recognition interface typically include:
a pen or stylus for the user to write with
a touch sensitive surface, which may be integrated with, or adjacent to, an output display.
a software application which interprets the movements of the stylus across the writing surface, translating the resulting strokes into digital text.
The process of online handwriting recognition can be broken down into a few general steps:
preprocessing,
feature extraction and
classification
The purpose of preprocessing is to discard irrelevant information in the input data that can negatively affect the recognition.[5] This concerns speed and accuracy. Preprocessing usually consists of binarization, normalization, sampling, smoothing and denoising.[6] The second step is feature extraction. Out of the two- or higher-dimensional vector field received from the preprocessing algorithms, higher-dimensional data is extracted. The purpose of this step is to highlight important information for the recognition model. This data may include information like pen pressure, velocity or the changes of writing direction. The last big step is classification. In this step, various models are used to map the extracted features to different classes and thus identify the characters or words the features represent.
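As a toy illustration of the preprocessing and feature-extraction steps (not a description of any particular system), the sketch below normalizes one pen stroke and derives per-step velocity and writing-direction features in NumPy; the input format and the feature choices are assumptions of the example.

```python
import numpy as np

def stroke_features(points):
    """Toy feature extraction for one online pen stroke, given an (N, 3) array
    of (x, y, t) samples: size-normalize the coordinates, then compute per-step
    velocity and writing-direction angles. Real systems add many more features
    (pen pressure, curvature, ...)."""
    xy, t = points[:, :2].astype(float), points[:, 2].astype(float)
    xy -= xy.min(axis=0)
    xy /= max(xy.max(), 1e-9)                      # scale into the unit box
    d = np.diff(xy, axis=0)
    dt = np.maximum(np.diff(t), 1e-9)
    velocity = np.linalg.norm(d, axis=1) / dt      # speed between samples
    direction = np.arctan2(d[:, 1], d[:, 0])       # writing direction (radians)
    return np.column_stack([velocity, direction])

stroke = np.array([[0, 0, 0.00], [10, 2, 0.02], [20, 8, 0.04], [25, 20, 0.06]])
print(stroke_features(stroke))
```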
Hardware
Commercial products incorporating handwriting recognition as a replacement for keyboard input were introduced in the early 1980s. Examples include handwriting terminals such as the Pencept Penpad[7] and the Inforite point-of-sale terminal.[8] With the advent of the large consumer market for personal computers, several commercial products were introduced to replace the keyboard and mouse on a personal computer with a single pointing/handwriting system, such as those from Pencept,[9] CIC[10] and others. The first commercially available tablet-type portable computer was the GRiDPad from GRiD Systems, released in September 1989. Its operating system was based on MS-DOS.
In the early 1990s, hardware makers including NCR, IBM and EO released tablet computers running the PenPoint operating system developed by GO Corp. PenPoint used handwriting recognition and gestures throughout and provided the facilities to third-party software. IBM's tablet computer was the first to use the ThinkPad name and used IBM's handwriting recognition. This recognition system was later ported to Microsoft Windows for Pen Computing, and IBM's Pen for OS/2. None of these were commercially successful.
Advancements in electronics allowed the computing power necessary for handwriting recognition to fit into a smaller form factor than tablet computers, and handwriting recognition is often used as an input method for hand-held PDAs. The first PDA to provide written input was the Apple Newton, which exposed the public to the advantage of a streamlined user interface. However, the device was not a commercial success, owing to the unreliability of the software, which tried to learn a user's writing patterns. By the time of the release of the Newton OS 2.0, wherein the handwriting recognition was greatly improved, including unique features still not found in current recognition systems such as modeless error correction, the largely negative first impression had been made. After discontinuation of Apple Newton, the feature was incorporated in Mac OS X 10.2 and later as Inkwell.
Palm later launched a successful series of PDAs based on the Graffiti recognition system. Graffiti improved usability by defining a set of "unistrokes", or one-stroke forms, for each character. This narrowed the possibility for erroneous input, although memorization of the stroke patterns did increase the learning curve for the user. The Graffiti handwriting recognition was found to infringe on a patent held by Xerox, and Palm replaced Graffiti with a licensed version of the CIC handwriting recognition which, while also supporting unistroke forms, pre-dated the Xerox patent. The court finding of infringement was reversed on appeal, and then reversed again on a later appeal. The parties involved subsequently negotiated a settlement concerning this and other patents.
A Tablet PC is a notebook computer with a digitizer tablet and a stylus, which allows a user to handwrite text on the unit's screen. The operating system recognizes the handwriting and converts it into text. Windows Vista and Windows 7 include personalization features that learn a user's writing patterns or vocabulary for English, Japanese, Chinese Traditional, Chinese Simplified and Korean. The features include a "personalization wizard" that prompts for samples of a user's handwriting and uses them to retrain the system for higher accuracy recognition. This system is distinct from the less advanced handwriting recognition system employed in its Windows Mobile OS for PDAs.
Although handwriting recognition is an input form that the public has become accustomed to, it has not achieved widespread use in either desktop computers or laptops. It is still generally accepted that keyboard input is both faster and more reliable. As of 2006, many PDAs offer handwriting input, sometimes even accepting natural cursive handwriting, but accuracy is still a problem, and some people still find even a simple on-screen keyboard more efficient.
Software
Early software could understand print handwriting where the characters were separated; however, cursive handwriting with connected characters presented Sayre's Paradox, a difficulty involving character segmentation. In 1962 Shelia Guberman, then in Moscow, wrote the first applied pattern recognition program.[11] Commercial examples came from companies such as Communications Intelligence Corporation and IBM.
In the early 1990s, two companies – ParaGraph International and Lexicus – came up with systems that could recognize cursive handwriting. ParaGraph was based in Russia and founded by computer scientist Stepan Pachikov, while Lexicus was founded by Ronjon Nag and Chris Kortge, who were students at Stanford University. The ParaGraph CalliGrapher system was deployed in the Apple Newton systems, and the Lexicus Longhand system was made available commercially for the PenPoint and Windows operating systems. Lexicus was acquired by Motorola in 1993 and went on to develop Chinese handwriting recognition and predictive text systems for Motorola. ParaGraph was acquired in 1997 by SGI and its handwriting recognition team formed a P&I division, later acquired from SGI by Vadem. Microsoft acquired CalliGrapher handwriting recognition and other digital ink technologies developed by P&I from Vadem in 1999.
Wolfram Mathematica (8.0 or later) also provides a handwriting or text recognition function TextRecognize.
Research

Method used for exploiting contextual information in the first handwritten address interpretation system developed by Sargur Srihari and Jonathan Hull[12]
Handwriting recognition has an active community of academics studying it. The biggest conferences for handwriting recognition are the International Conference on Frontiers in Handwriting Recognition (ICFHR), held in even-numbered years, and the International Conference on Document Analysis and Recognition (ICDAR), held in odd-numbered years. Both of these conferences are endorsed by the IEEE and IAPR. In 2021, the ICDAR proceedings will be published by LNCS, Springer.
Active areas of research include:
Online recognition
Offline recognition
Signature verification
Postal address interpretation
Bank-Check processing
Writer recognition
Results since 2009
Since 2009, the recurrent neural networks and deep feedforward neural networks developed in the research group of Jürgen Schmidhuber at the Swiss AI Lab IDSIA have won several international handwriting competitions.[13] In particular, the bi-directional and multi-dimensional Long short-term memory (LSTM)[14][15] of Alex Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three different languages (French, Arabic, Persian) to be learned. Recent GPU-based deep learning methods for feedforward networks by Dan Ciresan and colleagues at IDSIA won the ICDAR 2011 offline Chinese handwriting recognition contest; their neural networks also were the first artificial pattern recognizers to achieve human-competitive performance[16] on the famous MNIST handwritten digits problem[17] of Yann LeCun and colleagues at NYU.
Benjamin Graham of the University of Warwick won a 2013 Chinese handwriting recognition contest, with only a 2.61% error rate, by using an approach to convolutional neural networks that evolved (by 2017) into "sparse convolutional neural networks".[18][19]
@@ -0,0 +1,113 @@
A graphics processing unit (GPU) is a specialized electronic circuit initially designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal computers, workstations, and game consoles. After their initial design, GPUs were found to be useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. Other non-graphical uses include the training of neural networks and cryptocurrency mining.
History

See also: Video display controller, List of home computers by video hardware, and Sprite (computer graphics)
1970s
Arcade system boards have used specialized graphics circuits since the 1970s. In early video game hardware, RAM for frame buffers was expensive, so video chips composited data together as the display was being scanned out on the monitor.[1]
A specialized barrel shifter circuit helped the CPU animate the framebuffer graphics for various 1970s arcade video games from Midway and Taito, such as Gun Fight (1975), Sea Wolf (1976), and Space Invaders (1978).[2] The Namco Galaxian arcade system in 1979 used specialized graphics hardware that supported RGB color, multi-colored sprites, and tilemap backgrounds.[3] The Galaxian hardware was widely used during the golden age of arcade video games, by game companies such as Namco, Centuri, Gremlin, Irem, Konami, Midway, Nichibutsu, Sega, and Taito.[4]
Atari ANTIC microprocessor on an Atari 130XE motherboard
The Atari 2600 in 1977 used a video shifter called the Television Interface Adaptor.[5] Atari 8-bit computers (1979) had ANTIC, a video processor which interpreted instructions describing a "display list"—the way the scan lines map to specific bitmapped or character modes and where the memory is stored (so there did not need to be a contiguous frame buffer).[clarification needed][6] 6502 machine code subroutines could be triggered on scan lines by setting a bit on a display list instruction.[clarification needed][7] ANTIC also supported smooth vertical and horizontal scrolling independent of the CPU.[8]
1980s

NEC μPD7220A
The NEC μPD7220 was the first implementation of a personal computer graphics display processor as a single large-scale integration (LSI) integrated circuit chip. This enabled the design of low-cost, high-performance video graphics cards such as those from Number Nine Visual Technology. It became the best-known GPU until the mid-1980s.[9] It was the first fully integrated VLSI (very large-scale integration) metal–oxide–semiconductor (NMOS) graphics display processor for PCs, supported up to 1024×1024 resolution, and laid the foundations for the emerging PC graphics market. It was used in a number of graphics cards and was licensed for clones such as the Intel 82720, the first of Intel's graphics processing units.[10] The Williams Electronics arcade games Robotron 2084, Joust, Sinistar, and Bubbles, all released in 1982, contain custom blitter chips for operating on 16-color bitmaps.[11][12]
In 1984, Hitachi released ARTC HD63484, the first major CMOS graphics processor for personal computers. The ARTC could display up to 4K resolution when in monochrome mode. It was used in a number of graphics cards and terminals during the late 1980s.[13] In 1985, the Amiga was released with a custom graphics chip including a blitter for bitmap manipulation, line drawing, and area fill. It also included a coprocessor with its own simple instruction set, that was capable of manipulating graphics hardware registers in sync with the video beam (e.g. for per-scanline palette switches, sprite multiplexing, and hardware windowing), or driving the blitter. In 1986, Texas Instruments released the TMS34010, the first fully programmable graphics processor.[14] It could run general-purpose code, but it had a graphics-oriented instruction set. During 1990–1992, this chip became the basis of the Texas Instruments Graphics Architecture ("TIGA") Windows accelerator cards.
The IBM 8514 Micro Channel adapter, with memory add-on
In 1987, the IBM 8514 graphics system was released. It was one of the first video cards for IBM PC compatibles to implement fixed-function 2D primitives in electronic hardware. Sharp's X68000, released in 1987, used a custom graphics chipset[15] with a 65,536 color palette and hardware support for sprites, scrolling, and multiple playfields.[16] It served as a development machine for Capcom's CP System arcade board. Fujitsu's FM Towns computer, released in 1989, had support for a 16,777,216 color palette.[17] In 1988, the first dedicated polygonal 3D graphics boards were introduced in arcades with the Namco System 21[18] and Taito Air System.[19]
VGA section on the motherboard in IBM PS/55
IBM introduced its proprietary Video Graphics Array (VGA) display standard in 1987, with a maximum resolution of 640×480 pixels. In November 1988, NEC Home Electronics announced its creation of the Video Electronics Standards Association (VESA) to develop and promote a Super VGA (SVGA) computer display standard as a successor to VGA. Super VGA enabled graphics display resolutions up to 800×600 pixels, a 36% increase.[20]
1990s

Tseng Labs ET4000/W32p

S3 Graphics ViRGE

Voodoo3 2000 AGP card
In 1991, S3 Graphics introduced the S3 86C911, which its designers named after the Porsche 911 as an indication of the performance increase it promised.[21] The 86C911 spawned a variety of imitators: by 1995, all major PC graphics chip makers had added 2D acceleration support to their chips.[22] Fixed-function Windows accelerators surpassed expensive general-purpose graphics coprocessors in Windows performance, and such coprocessors faded from the PC market.
Throughout the 1990s, 2D GUI acceleration evolved. As manufacturing capabilities improved, so did the level of integration of graphics chips. Additional application programming interfaces (APIs) arrived for a variety of tasks, such as Microsoft's WinG graphics library for Windows 3.x, and their later DirectDraw interface for hardware acceleration of 2D games in Windows 95 and later.
In the early- and mid-1990s, real-time 3D graphics became increasingly common in arcade, computer, and console games, which led to increasing public demand for hardware-accelerated 3D graphics. Early examples of mass-market 3D graphics hardware can be found in arcade system boards such as the Sega Model 1, Namco System 22, and Sega Model 2, and the fifth-generation video game consoles such as the Saturn, PlayStation, and Nintendo 64. Arcade systems such as the Sega Model 2 and SGI Onyx-based Namco Magic Edge Hornet Simulator in 1993 were capable of hardware T&L (transform, clipping, and lighting) years before appearing in consumer graphics cards.[23][24] Another early example is the Super FX chip, a RISC-based on-cartridge graphics chip used in some SNES games, notably Doom and Star Fox. Some systems used DSPs to accelerate transformations. Fujitsu, which worked on the Sega Model 2 arcade system,[25] began working on integrating T&L into a single LSI solution for use in home computers in 1995;[26] the Fujitsu Pinolite, the first 3D geometry processor for personal computers, was released in 1997.[27] The first hardware T&L GPU on home video game consoles was the Nintendo 64's Reality Coprocessor, released in 1996.[28] In 1997, Mitsubishi released the 3Dpro/2MP, a GPU capable of transformation and lighting, for workstations and Windows NT desktops;[29] ATi used it for its FireGL 4000 graphics card, released in 1997.[30]
The term "GPU" was coined by Sony in reference to the 32-bit Sony GPU (designed by Toshiba) in the PlayStation video game console, released in 1994.[31]
In the PC world, notable failed attempts for low-cost 3D graphics chips included the S3 ViRGE, ATI Rage, and Matrox Mystique. These chips were essentially previous-generation 2D accelerators with 3D features bolted on. Many were pin-compatible with the earlier-generation chips for ease of implementation and minimal cost. Initially, 3D graphics were possible only with discrete boards dedicated to accelerating 3D functions (and lacking 2D graphical user interface (GUI) acceleration entirely) such as the PowerVR and the 3dfx Voodoo. However, as manufacturing technology continued to progress, video, 2D GUI acceleration, and 3D functionality were all integrated into one chip. Rendition's Verite chipsets were among the first to do this well. In 1997, Rendition collaborated with Hercules and Fujitsu on a "Thriller Conspiracy" project which combined a Fujitsu FXG-1 Pinolite geometry processor with a Vérité V2200 core to create a graphics card with a full T&L engine years before Nvidia's GeForce 256; this card, designed to reduce the load placed upon the system's CPU, never made it to market.[citation needed] The Nvidia RIVA 128 was one of the first consumer-facing GPUs to integrate a 3D processing unit and a 2D processing unit on a single chip.
OpenGL was introduced in the early '90s by SGI as a professional graphics API, with proprietary hardware support for 3D rasterization. In 1994 Microsoft acquired Softimage, the dominant CGI movie production tool used for early CGI movie hits like Jurassic Park, Terminator 2 and Titanic. With that deal came a strategic relationship with SGI and a commercial license of SGI's OpenGL libraries enabling Microsoft to port the API to the Windows NT OS but not to the upcoming release of Windows '95. Although it was little known at the time, SGI had contracted with Microsoft to transition from Unix to the forthcoming Windows NT OS; the deal, which was signed in 1995, was not announced publicly until 1998. In the intervening period, Microsoft worked closely with SGI to port OpenGL to Windows NT. In that era OpenGL had no standard driver model for competing hardware accelerators to compete on the basis of support for higher level 3D texturing and lighting functionality. In 1994 Microsoft announced DirectX 1.0 and support for gaming in the forthcoming Windows '95 consumer OS; in 1995 Microsoft announced the acquisition of UK based Rendermorphics Ltd and the Direct3D driver model for the acceleration of consumer 3D graphics. The Direct3D driver model shipped with DirectX 2.0 in 1996. It included standards and specifications for 3D chip makers to compete to support 3D texture, lighting and Z-buffering. ATI, which was later to be acquired by AMD, began development on the first Direct3D GPUs. Nvidia quickly pivoted from a failed deal with Sega in 1996 to aggressively embracing support for Direct3D. In this era Microsoft merged their internal Direct3D and OpenGL teams and worked closely with SGI to unify driver standards for both industrial and consumer 3D graphics hardware accelerators. Microsoft ran annual events for 3D chip makers called "Meltdowns" to test their 3D hardware and drivers to work both with Direct3D and OpenGL. It was during this period of strong Microsoft influence over 3D standards that 3D accelerator cards moved beyond being simple rasterizers to become more powerful general purpose processors as support for hardware accelerated texture mapping, lighting, Z-buffering and compute created the modern GPU. During this period the same Microsoft team responsible for Direct3D and OpenGL driver standardization introduced their own Microsoft 3D chip design called Talisman. Details of this era are documented extensively in the books: "Game of X" v.1 and v.2 by Russel Demaria, "Renegades of the Empire" by Mike Drummond, "Opening the Xbox" by Dean Takahashi and "Masters of Doom" by David Kushner. The Nvidia GeForce 256 (also known as NV10) was the first consumer-level card with hardware-accelerated T&L. While the OpenGL API provided software support for texture mapping and lighting, the first 3D hardware acceleration for these features arrived with the first Direct3D-accelerated consumer GPUs.
2000s
Nvidia was first to produce a chip capable of programmable shading: the GeForce 3. Each pixel could now be processed by a short program that could include additional image textures as inputs, and each geometric vertex could likewise be processed by a short program before it was projected onto the screen. Used in the Xbox console, this chip competed with the one in the PlayStation 2, which used a custom vector unit for hardware accelerated vertex processing (commonly referred to as VU0/VU1). The earliest incarnations of shader execution engines used in Xbox were not general purpose and could not execute arbitrary pixel code. Vertices and pixels were processed by different units which had their own resources, with pixel shaders having tighter constraints (because they execute at higher frequencies than vertices). Pixel shading engines were actually more akin to a highly customizable function block and did not really "run" a program. Many of these disparities between vertex and pixel shading were not addressed until the Unified Shader Model.
In October 2002, with the introduction of the ATI Radeon 9700 (also known as R300), the world's first Direct3D 9.0 accelerator, pixel and vertex shaders could implement looping and lengthy floating point math, and were quickly becoming as flexible as CPUs, yet orders of magnitude faster for image-array operations. Pixel shading is often used for bump mapping, which adds texture to make an object look shiny, dull, rough, or even round or extruded.[32]
|
||||||
|
|
||||||
|
With the introduction of the Nvidia GeForce 8 series and new generic stream processing units, GPUs became more generalized computing devices. Parallel GPUs are making computational inroads against the CPU, and a subfield of research, dubbed GPU computing or GPGPU for general purpose computing on GPU, has found applications in fields as diverse as machine learning,[33] oil exploration, scientific image processing, linear algebra,[34] statistics,[35] 3D reconstruction, and stock options pricing. GPGPU was the precursor to what is now called a compute shader (e.g. CUDA, OpenCL, DirectCompute) and actually abused the hardware to a degree by treating the data passed to algorithms as texture maps and executing algorithms by drawing a triangle or quad with an appropriate pixel shader.[clarification needed] This entails some overheads since units like the scan converter are involved where they are not needed (nor are triangle manipulations even a concern—except to invoke the pixel shader).[clarification needed]
|
||||||
|
|
||||||
|
Nvidia's CUDA platform, first introduced in 2007,[36] was the earliest widely adopted programming model for GPU computing. OpenCL is an open standard defined by the Khronos Group that allows for the development of code for both GPUs and CPUs with an emphasis on portability.[37] OpenCL solutions are supported by Intel, AMD, Nvidia, and ARM, and according to a report in 2011 by Evans Data, OpenCL had become the second most popular HPC tool.[38]
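To make the GPGPU idea above concrete, here is a minimal, illustrative sketch (not from the article) of the data-parallel pattern that frameworks such as CUDA, OpenCL, and DirectCompute expose: the same small operation applied independently to every element of a large array. NumPy on the CPU stands in for the GPU here, and the SAXPY operation is just a conventional example; it assumes NumPy is installed.

```python
# Illustrative sketch: the core GPGPU idea is applying the same small function
# to every element of a large array in parallel. On a GPU, each element would be
# handled by its own thread; here NumPy stands in for that data-parallel
# "kernel" on the CPU. Assumes NumPy is installed.
import numpy as np

def saxpy(a: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # One elementwise operation over the whole array: the kind of workload
    # CUDA, OpenCL, or DirectCompute would dispatch as one thread per element.
    return a * x + y

x = np.arange(1_000_000, dtype=np.float32)
y = np.ones_like(x)
result = saxpy(2.0, x, y)
print(result[:5])  # [1. 3. 5. 7. 9.]
```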
|
||||||
|
|
||||||
|
2010s
|
||||||
|
In 2010, Nvidia partnered with Audi to power their cars' dashboards, using the Tegra GPU to provide increased functionality to cars' navigation and entertainment systems.[39] Advances in GPU technology in cars helped advance self-driving technology.[40] AMD's Radeon HD 6000 series cards were released in 2010, and in 2011 AMD released its 6000M Series discrete GPUs for mobile devices.[41] The Kepler line of graphics cards by Nvidia was released in 2012 and was used in Nvidia's 600 and 700 series cards. A feature of this GPU microarchitecture was GPU Boost, a technology that adjusts the clock speed of a video card, increasing or decreasing it according to its power draw.[42] The Kepler microarchitecture was manufactured on the 28 nm process[clarification needed].
|
||||||
|
|
||||||
|
The PS4 and Xbox One were released in 2013; they both use GPUs based on AMD's Radeon HD 7850 and 7790.[43] Nvidia's Kepler line of GPUs was followed by the Maxwell line, manufactured on the same process. Nvidia's 28 nm chips were manufactured by TSMC in Taiwan using the 28 nm process. Compared to the 40 nm technology from the past, this manufacturing process allowed a 20 percent boost in performance while drawing less power.[44][45] Virtual reality headsets have high system requirements; manufacturers recommended the GTX 970 and the R9 290X or better at the time of their release.[46][47] Cards based on the Pascal microarchitecture were released in 2016. The GeForce 10 series of cards are of this generation of graphics cards. They are made using the 16 nm manufacturing process which improves upon previous microarchitectures.[48] Nvidia released one non-consumer card under the new Volta architecture, the Titan V. Changes from the Titan XP, Pascal's high-end card, include an increase in the number of CUDA cores, the addition of tensor cores, and HBM2. Tensor cores are designed for deep learning, while high-bandwidth memory is on-die, stacked, lower-clocked memory that offers an extremely wide memory bus. To emphasize that the Titan V is not a gaming card, Nvidia removed the "GeForce GTX" suffix it adds to consumer gaming cards.
|
||||||
|
|
||||||
|
In 2018, Nvidia launched the RTX 20 series GPUs that added ray-tracing cores to GPUs, improving their performance on lighting effects.[49] Polaris 11 and Polaris 10 GPUs from AMD are fabricated by a 14 nm process. Their release resulted in a substantial increase in the performance per watt of AMD video cards.[50] AMD also released the Vega GPU series for the high end market as a competitor to Nvidia's high end Pascal cards, also featuring HBM2 like the Titan V.
|
||||||
|
|
||||||
|
In 2019, AMD released the successor to their Graphics Core Next (GCN) microarchitecture/instruction set. Dubbed RDNA, the first product featuring it was the Radeon RX 5000 series of video cards.[51]
|
||||||
|
|
||||||
|
The company announced that the successor to the RDNA microarchitecture would be incremental (aka a refresh). AMD unveiled the Radeon RX 6000 series, its RDNA 2 graphics cards with support for hardware-accelerated ray tracing.[52] The product series, launched in late 2020, consisted of the RX 6800, RX 6800 XT, and RX 6900 XT.[53][54] The RX 6700 XT, which is based on Navi 22, was launched in early 2021.[55]
|
||||||
|
|
||||||
|
The PlayStation 5 and Xbox Series X and Series S were released in 2020; they both use GPUs based on the RDNA 2 microarchitecture with incremental improvements and different GPU configurations in each system's implementation.[56][57][58]
|
||||||
|
|
||||||
|
Intel first entered the GPU market in the late 1990s, but produced lackluster 3D accelerators compared to the competition at the time. Rather than attempting to compete with the high-end manufacturers Nvidia and ATI/AMD, they began integrating Intel Graphics Technology GPUs into motherboard chipsets, beginning with the Intel 810 for the Pentium III, and later into CPUs. They began with the Intel Atom 'Pineview' laptop processor in 2009, continuing in 2010 with desktop processors in the first generation of the Intel Core line and with contemporary Pentiums and Celerons. This resulted in a large nominal market share, as the majority of computers with an Intel CPU also featured this embedded graphics processor. These generally lagged behind discrete processors in performance. Intel re-entered the discrete GPU market in 2022 with its Arc series, which competed with the then-current GeForce 30 series and Radeon 6000 series cards at competitive prices.[citation needed]
|
||||||
|
|
||||||
|
2020s
|
||||||
|
See also: AI accelerator
|
||||||
|
In the 2020s, GPUs have been increasingly used for calculations involving embarrassingly parallel problems, such as training of neural networks on the enormous datasets needed for large language models. Specialized processing cores on some modern workstation GPUs are dedicated to deep learning, since they offer significant FLOPS performance increases via 4×4 matrix multiplication and division, resulting in hardware performance of up to 128 TFLOPS in some applications.[59] These tensor cores are expected to appear in consumer cards, as well.[needs update][60]
|
||||||
|
|
||||||
|
GPU companies
|
||||||
|
Many companies have produced GPUs under a number of brand names. In 2009,[needs update] Intel, Nvidia, and AMD/ATI were the market share leaders, with 49.4%, 27.8%, and 20.6% market share respectively. In addition, Matrox[61] produces GPUs. Modern smartphones use mostly Adreno GPUs from Qualcomm, PowerVR GPUs from Imagination Technologies, and Mali GPUs from ARM.
|
||||||
|
|
||||||
|
Computational functions
|
||||||
|
Modern GPUs have traditionally used most of their transistors to do calculations related to 3D computer graphics. In addition to the 3D hardware, today's GPUs include basic 2D acceleration and framebuffer capabilities (usually with a VGA compatibility mode). Newer cards such as AMD/ATI HD5000–HD7000 lack dedicated 2D acceleration; it is emulated by 3D hardware. GPUs were initially used to accelerate the memory-intensive work of texture mapping and rendering polygons. Later, units[clarification needed] were added to accelerate geometric calculations such as the rotation and translation of vertices into different coordinate systems. Recent developments in GPUs include support for programmable shaders which can manipulate vertices and textures with many of the same operations that are supported by CPUs, oversampling and interpolation techniques to reduce aliasing, and very high-precision color spaces.
|
||||||
|
|
||||||
|
Several factors of GPU construction affect the performance of the card for real-time rendering, such as the size of the connector pathways in the semiconductor device fabrication, the clock signal frequency, and the number and size of various on-chip memory caches. Performance is also affected by the number of streaming multiprocessors (SM) for NVidia GPUs, or compute units (CU) for AMD GPUs, or Xe cores for Intel discrete GPUs, which describe the number of core on-silicon processor units within the GPU chip that perform the core calculations, typically working in parallel with other SM/CUs on the GPU. GPU performance is typically measured in floating point operations per second (FLOPS); GPUs in the 2010s and 2020s typically deliver performance measured in teraflops (TFLOPS). This is an estimated performance measure, as other factors can affect the actual display rate.[62]
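As a worked illustration of how such headline TFLOPS figures are usually derived (unit count times cores per unit times clock rate times FLOPs per core per cycle), here is a small sketch; the numbers are placeholders chosen for the example, not specifications of any card mentioned above.

```python
# Illustrative sketch: how a theoretical peak-FLOPS figure is typically derived.
# The numbers below are made-up placeholders, not specifications from the article.
def peak_tflops(compute_units: int, cores_per_unit: int,
                clock_ghz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak in TFLOPS: units x cores x clock x FLOPs per cycle."""
    total_cores = compute_units * cores_per_unit
    return total_cores * clock_ghz * flops_per_core_per_cycle / 1000.0

# e.g. 40 units x 64 cores at 1.8 GHz with fused multiply-add (2 FLOPs/cycle)
print(f"{peak_tflops(40, 64, 1.8):.1f} TFLOPS")  # ~9.2 TFLOPS
```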
|
||||||
|
|
||||||
|
GPU accelerated video decoding and encoding
|
||||||
|
|
||||||
|
The ATI HD5470 GPU (above, with copper heatpipe attached) features UVD 2.1 which enables it to decode AVC and VC-1 video formats.
|
||||||
|
Most GPUs made since 1995 support the YUV color space and hardware overlays, important for digital video playback, and many GPUs made since 2000 also support MPEG primitives such as motion compensation and iDCT. This hardware-accelerated video decoding, in which portions of the video decoding process and video post-processing are offloaded to the GPU hardware, is commonly referred to as "GPU accelerated video decoding", "GPU assisted video decoding", "GPU hardware accelerated video decoding", or "GPU hardware assisted video decoding".
|
||||||
|
|
||||||
|
Recent graphics cards decode high-definition video on the card, offloading the central processing unit. The most common APIs for GPU accelerated video decoding are DxVA for Microsoft Windows operating systems and VDPAU, VAAPI, XvMC, and XvBA for Linux-based and UNIX-like operating systems. All except XvMC are capable of decoding videos encoded with MPEG-1, MPEG-2, MPEG-4 ASP (MPEG-4 Part 2), MPEG-4 AVC (H.264 / DivX 6), VC-1, WMV3/WMV9, Xvid / OpenDivX (DivX 4), and DivX 5 codecs, while XvMC is only capable of decoding MPEG-1 and MPEG-2.
|
||||||
|
|
||||||
|
There are several dedicated hardware video decoding and encoding solutions.
|
||||||
|
|
||||||
|
Video decoding processes that can be accelerated
|
||||||
|
Video decoding processes that can be accelerated by modern GPU hardware are:
|
||||||
|
|
||||||
|
Motion compensation (mocomp)
|
||||||
|
Inverse discrete cosine transform (iDCT)
|
||||||
|
Inverse telecine 3:2 and 2:2 pull-down correction
|
||||||
|
Inverse modified discrete cosine transform (iMDCT)
|
||||||
|
In-loop deblocking filter
|
||||||
|
Intra-frame prediction
|
||||||
|
Inverse quantization (IQ)
|
||||||
|
Variable-length decoding (VLD), more commonly known as slice-level acceleration
|
||||||
|
Spatial-temporal deinterlacing and automatic interlace/progressive source detection
|
||||||
|
Bitstream processing (Context-adaptive variable-length coding/Context-adaptive binary arithmetic coding) and perfect pixel positioning
|
||||||
|
These operations also have applications in video editing, encoding, and transcoding.
|
||||||
|
|
||||||
|
2D graphics APIs
|
||||||
|
An earlier GPU may support one or more 2D graphics APIs for 2D acceleration, such as GDI and DirectDraw.[63]
|
||||||
|
|
||||||
|
3D graphics APIs
|
||||||
|
A GPU can support one or more 3D graphics APIs, such as DirectX, Metal, OpenGL, OpenGL ES, and Vulkan.
|
||||||
|
|
||||||
@@ -0,0 +1,179 @@
|
|||||||
|
IBM Watson is a computer system capable of answering questions posed in natural language.[1] It was developed as a part of IBM's DeepQA project by a research team, led by principal investigator David Ferrucci.[2] Watson was named after IBM's founder and first CEO, industrialist Thomas J. Watson.[3][4]
|
||||||
|
|
||||||
|
The computer system was initially developed to answer questions on the popular quiz show Jeopardy![5] and in 2011, the Watson computer system competed on Jeopardy! against champions Brad Rutter and Ken Jennings,[3][6] winning the first-place prize of 1 million USD.[7]
|
||||||
|
|
||||||
|
In February 2013, IBM announced that Watson's first commercial application would be for utilization management decisions in lung cancer treatment, at Memorial Sloan Kettering Cancer Center, New York City, in conjunction with WellPoint (now Elevance Health).[8]
|
||||||
|
|
||||||
|
Description
|
||||||
|
|
||||||
|
The high-level architecture of IBM's DeepQA used in Watson[9]
|
||||||
|
Watson was created as a question answering (QA) computing system that IBM built to apply advanced natural language processing, information retrieval, knowledge representation, automated reasoning, and machine learning technologies to the field of open domain question answering.[1]
|
||||||
|
|
||||||
|
IBM stated that Watson uses "more than 100 different techniques to analyze natural language, identify sources, find and generate hypotheses, find and score evidence, and merge and rank hypotheses."[10]
|
||||||
|
|
||||||
|
In recent years, Watson's capabilities have been extended and the way in which Watson works has been changed to take advantage of new deployment models (Watson on IBM Cloud), evolved machine learning capabilities, and optimized hardware available to developers and researchers. [citation needed]
|
||||||
|
|
||||||
|
Software
|
||||||
|
Watson uses IBM's DeepQA software and the Apache UIMA (Unstructured Information Management Architecture) framework implementation. The system was written in various languages, including Java, C++, and Prolog, and runs on the SUSE Linux Enterprise Server 11 operating system using the Apache Hadoop framework to provide distributed computing.[11][12][13]
|
||||||
|
|
||||||
|
Hardware
|
||||||
|
The system is workload-optimized, integrating massively parallel POWER7 processors and built on IBM's DeepQA technology,[14] which it uses to generate hypotheses, gather massive evidence, and analyze data.[1] Watson employs a cluster of ninety IBM Power 750 servers, each of which uses a 3.5GHz POWER7 eight-core processor, with four threads per core. In total, the system uses 2,880 POWER7 processor threads and 16 terabytes of RAM.[14]
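The thread count quoted above follows directly from the article's own figures, as this small check shows:

```python
# Quick arithmetic behind the thread count quoted above: 90 servers, each with
# an eight-core POWER7 processor running four hardware threads per core.
servers = 90
cores_per_processor = 8
threads_per_core = 4
total_threads = servers * cores_per_processor * threads_per_core
print(total_threads)  # 2880
```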
|
||||||
|
|
||||||
|
According to John Rennie, Watson can process 500 gigabytes (the equivalent of a million books) per second.[15] IBM master inventor and senior consultant Tony Pearson estimated Watson's hardware cost at about three million dollars.[16] Its Linpack performance stands at 80 TeraFLOPs, which is about half as fast as the cut-off line for the Top 500 Supercomputers list.[17] According to Rennie, all content was stored in Watson's RAM for the Jeopardy game because data stored on hard drives would be too slow to compete with human Jeopardy champions.[15]
|
||||||
|
|
||||||
|
Data
|
||||||
|
The sources of information for Watson include encyclopedias, dictionaries, thesauri, newswire articles and literary works. Watson also used databases, taxonomies and ontologies including DBPedia, WordNet and Yago.[18] The IBM team provided Watson with millions of documents, including dictionaries, encyclopedias and other reference material, that it could use to build its knowledge.[19]
|
||||||
|
|
||||||
|
Operation
|
||||||
|
Watson parses questions into different keywords and sentence fragments in order to find statistically related phrases.[19] Watson's main innovation was not in the creation of a new algorithm for this operation, but rather its ability to quickly execute hundreds of proven language analysis algorithms simultaneously.[19][20] The more algorithms that find the same answer independently, the more likely Watson is to be correct. Once Watson has a small number of potential solutions, it is able to check against its database to ascertain whether the solution makes sense or not.[19]
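As a toy illustration of this idea (not IBM's actual DeepQA code), the sketch below aggregates candidate answers proposed independently by several analyzers and ranks candidates that more analyzers agree on higher; the candidate names and scores are invented.

```python
# Toy illustration (not IBM's DeepQA code) of the idea described above: several
# independent analyzers each propose a candidate answer with a score, and
# candidates that more analyzers agree on rank higher.
from collections import defaultdict

def rank_candidates(analyzer_outputs):
    """analyzer_outputs: list of (candidate, score) pairs, one per analyzer."""
    votes = defaultdict(int)
    total_score = defaultdict(float)
    for candidate, score in analyzer_outputs:
        votes[candidate] += 1
        total_score[candidate] += score
    # Prefer candidates found by more analyzers, then by combined score.
    return sorted(votes, key=lambda c: (votes[c], total_score[c]), reverse=True)

outputs = [("Chicago", 0.6), ("Toronto", 0.7), ("Chicago", 0.5), ("Chicago", 0.4)]
print(rank_candidates(outputs))  # ['Chicago', 'Toronto']
```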
|
||||||
|
|
||||||
|
Comparison with human players
|
||||||
|
|
||||||
|
Ken Jennings, Watson, and Brad Rutter in their Jeopardy! exhibition match
|
||||||
|
Watson's basic working principle is to parse keywords in a clue while searching for related terms as responses. This gives Watson some advantages and disadvantages compared with human Jeopardy! players.[21] Watson has deficiencies in understanding the context of the clues; as a result, human players usually generate responses faster than Watson, especially to short clues.[19] On the other hand, Watson can read, analyze, and learn from natural language, which gives it the ability to make human-like decisions.[22] Watson's programming prevents it from using the popular tactic of buzzing before it is sure of its response.[19] However, Watson has consistently better reaction time on the buzzer once it has generated a response, and is immune to human players' psychological tactics, such as jumping between categories on every clue.[19][23]
|
||||||
|
|
||||||
|
In a sequence of 20 mock games of Jeopardy!, human participants were able to use the six to seven seconds that Watson needed to hear the clue and decide whether to signal for responding.[19] During that time, Watson also has to evaluate the response and determine whether it is sufficiently confident in the result to signal.[19] Part of the system used to win the Jeopardy! contest was the electronic circuitry that receives the "ready" signal and then examines whether Watson's confidence level was great enough to activate the buzzer. Given the speed of this circuitry compared to the speed of human reaction times, Watson's reaction time was faster than that of the human contestants, except when the human anticipated (instead of reacted to) the ready signal.[24] After signaling, Watson speaks with an electronic voice and gives the responses in Jeopardy!'s question format.[19] Watson's voice was synthesized from recordings that actor Jeff Woodman made for an IBM text-to-speech program in 2004.[25]
|
||||||
|
|
||||||
|
The Jeopardy! staff used different means to notify Watson and the human players when to buzz,[24] which was critical in many rounds.[23] The humans were notified by a light, which took them tenths of a second to perceive.[26][27] Watson was notified by an electronic signal and could activate the buzzer within about eight milliseconds.[28] The humans tried to compensate for the perception delay by anticipating the light,[29] but the variation in the anticipation time was generally too great to fall within Watson's response time.[23] Watson did not attempt to anticipate the notification signal.[27][29]
|
||||||
|
|
||||||
|
History
|
||||||
|
Development
|
||||||
|
Since Deep Blue's victory over Garry Kasparov in chess in 1997, IBM had been on the hunt for a new challenge. In 2004, IBM Research manager Charles Lickel, over dinner with coworkers, noticed that the restaurant they were in had fallen silent. He soon discovered the cause of this evening's hiatus: Ken Jennings, who was then in the middle of his successful 74-game run on Jeopardy!. Nearly the entire restaurant had piled toward the televisions, mid-meal, to watch Jeopardy!. Intrigued by the quiz show as a possible challenge for IBM, Lickel passed the idea on, and in 2005, IBM Research executive Paul Horn supported Lickel, pushing for someone in his department to take up the challenge of playing Jeopardy! with an IBM system. Though he initially had trouble finding any research staff willing to take on what looked to be a much more complex challenge than the wordless game of chess, eventually David Ferrucci took him up on the offer.[30] In competitions managed by the United States government, Watson's predecessor, a system named Piquant, was usually able to respond correctly to only about 35% of clues and often required several minutes to respond.[31][32][33] To compete successfully on Jeopardy!, Watson would need to respond in no more than a few seconds, and at that time, the problems posed by the game show were deemed to be impossible to solve.[19]
|
||||||
|
|
||||||
|
In initial tests run during 2006 by David Ferrucci, the senior manager of IBM's Semantic Analysis and Integration department, Watson was given 500 clues from past Jeopardy! programs. While the best real-life competitors buzzed in half the time and responded correctly to as many as 95% of clues, Watson's first pass could get only about 15% correct. During 2007, the IBM team was given three to five years and a staff of 15 people to solve the problems.[19] John E. Kelly III succeeded Paul Horn as head of IBM Research in 2007.[34] InformationWeek described Kelly as "the father of Watson" and credited him for encouraging the system to compete against humans on Jeopardy!.[35] By 2008, the developers had advanced Watson such that it could compete with Jeopardy! champions.[19] By February 2010, Watson could beat human Jeopardy! contestants on a regular basis.[36]
|
||||||
|
|
||||||
|
During the game, Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage[11] including the full text of the 2011 edition of Wikipedia,[37] but was not connected to the Internet.[38][19] For each clue, Watson's three most probable responses were displayed on the television screen. Watson consistently outperformed its human opponents on the game's signaling device, but had trouble in a few categories, notably those having short clues containing only a few words.[citation needed]
|
||||||
|
|
||||||
|
Although the system is primarily an IBM effort, Watson's development involved faculty and graduate students from Rensselaer Polytechnic Institute, Carnegie Mellon University, University of Massachusetts Amherst, the University of Southern California's Information Sciences Institute, the University of Texas at Austin, the Massachusetts Institute of Technology, and the University of Trento,[9] as well as students from New York Medical College.[39] Among the team of IBM programmers who worked on Watson was 2001 Who Wants to Be a Millionaire? top prize winner Ed Toutant, who himself had appeared on Jeopardy! in 1989 (winning one game).[40]
|
||||||
|
|
||||||
|
Jeopardy!
|
||||||
|
Preparation
|
||||||
|
|
||||||
|
Watson demo at an IBM booth at a trade show
|
||||||
|
In 2008, IBM representatives communicated with Jeopardy! executive producer Harry Friedman about the possibility of having Watson compete against Ken Jennings and Brad Rutter, two of the most successful contestants on the show, and the program's producers agreed.[19][41] Watson's differences with human players had generated conflicts between IBM and Jeopardy! staff during the planning of the competition.[21] IBM repeatedly expressed concerns that the show's writers would exploit Watson's cognitive deficiencies when writing the clues, thereby turning the game into a Turing test. To alleviate that claim, a third party randomly picked the clues from previously written shows that were never broadcast.[21] Jeopardy! staff also showed concerns over Watson's reaction time on the buzzer. Originally Watson signaled electronically, but show staff requested that it press a button physically, as the human contestants would.[42] Even with a robotic "finger" pressing the buzzer, Watson remained faster than its human competitors. Ken Jennings noted, "If you're trying to win on the show, the buzzer is all", and that Watson "can knock out a microsecond-precise buzz every single time with little or no variation. Human reflexes can't compete with computer circuits in this regard."[23][29][43] Stephen Baker, a journalist who recorded Watson's development in his book Final Jeopardy, reported that the conflict between IBM and Jeopardy! became so serious in May 2010 that the competition was almost cancelled.[21] As part of the preparation, IBM constructed a mock set in a conference room at one of its technology sites to model the one used on Jeopardy!. Human players, including former Jeopardy! contestants, also participated in mock games against Watson with Todd Alan Crain of The Onion playing host.[19] About 100 test matches were conducted with Watson winning 65% of the games.[44]
|
||||||
|
|
||||||
|
To provide a physical presence in the televised games, Watson was represented by an "avatar" of a globe, inspired by the IBM "smarter planet" symbol. Jennings described the computer's avatar as a "glowing blue ball crisscrossed by 'threads' of thought—42 threads, to be precise",[45] and stated that the number of thought threads in the avatar was an in-joke referencing the significance of the number 42 in Douglas Adams' Hitchhiker's Guide to the Galaxy.[45] Joshua Davis, the artist who designed the avatar for the project, explained to Stephen Baker that there are 36 trigger-able states that Watson was able to use throughout the game to show its confidence in responding to a clue correctly; he had hoped to be able to find forty-two, to add another level to the Hitchhiker's Guide reference, but he was unable to pinpoint enough game states.[46]
|
||||||
|
|
||||||
|
A practice match was recorded on January 13, 2011, and the official matches were recorded on January 14, 2011. All participants maintained secrecy about the outcome until the match was broadcast in February.[47]
|
||||||
|
|
||||||
|
Practice match
|
||||||
|
In a practice match before the press on January 13, 2011, Watson won a 15-question round against Ken Jennings and Brad Rutter with a score of $4,400 to Jennings's $3,400 and Rutter's $1,200, though Jennings and Watson were tied before the final $1,000 question. None of the three players responded incorrectly to a clue.[48]
|
||||||
|
|
||||||
|
First match
|
||||||
|
The first round was broadcast February 14, 2011, and the second round, on February 15, 2011. The right to choose the first category had been determined by a draw won by Rutter.[49] Watson, represented by a computer monitor display and artificial voice, responded correctly to the second clue and then selected the fourth clue of the first category, a deliberate strategy to find the Daily Double as quickly as possible.[50] Watson's guess at the Daily Double location was correct. At the end of the first round, Watson was tied with Rutter at $5,000; Jennings had $2,000.[49]
|
||||||
|
|
||||||
|
Watson's performance was characterized by some quirks. In one instance, Watson repeated a reworded version of an incorrect response offered by Jennings. (Jennings said "What are the '20s?" in reference to the 1920s. Then Watson said "What is 1920s?") Because Watson could not recognize other contestants' responses, it did not know that Jennings had already given the same response. In another instance, Watson was initially given credit for a response of "What is a leg?" after Jennings incorrectly responded "What is: he only had one hand?" to a clue about George Eyser (the correct response was, "What is: he's missing a leg?"). Because Watson, unlike a human, could not have been responding to Jennings's mistake, it was decided that this response was incorrect. The broadcast version of the episode was edited to omit Trebek's original acceptance of Watson's response.[51] Watson also demonstrated complex wagering strategies on the Daily Doubles, with one bet at $6,435 and another at $1,246.[52] Gerald Tesauro, one of the IBM researchers who worked on Watson, explained that Watson's wagers were based on its confidence level for the category and a complex regression model called the Game State Evaluator.[53]
|
||||||
|
|
||||||
|
Watson took a commanding lead in Double Jeopardy!, correctly responding to both Daily Doubles. Watson responded to the second Daily Double correctly with a 32% confidence score.[52]
|
||||||
|
|
||||||
|
However, during the Final Jeopardy! round, Watson was the only contestant to miss the clue in the category U.S. Cities ("Its largest airport was named for a World War II hero; its second largest, for a World War II battle"). Rutter and Jennings gave the correct response of Chicago, but Watson's response was "What is Toronto?????" with five question marks appended indicating a lack of confidence.[52][54][55] Ferrucci offered reasons why Watson would appear to have guessed a Canadian city: categories only weakly suggest the type of response desired, the phrase "U.S. city" did not appear in the question, there are cities named Toronto in the U.S., and Toronto in Ontario has an American League baseball team.[56] Chris Welty, who also worked on Watson, suggested that it may not have been able to correctly parse the second part of the clue, "its second largest, for a World War II battle" (which was not a standalone clause despite it following a semicolon, and required context to understand that it was referring to a second-largest airport).[57] Eric Nyberg, a professor at Carnegie Mellon University and a member of the development team, stated that the error occurred because Watson does not possess the comparative knowledge to discard that potential response as not viable.[55] Although not displayed to the audience as with non-Final Jeopardy! questions, Watson's second choice was Chicago. Both Toronto and Chicago were well below Watson's confidence threshold, at 14% and 11% respectively. Watson wagered only $947 on the question.[58]
|
||||||
|
|
||||||
|
The game ended with Jennings with $4,800, Rutter with $10,400, and Watson with $35,734.[52]
|
||||||
|
|
||||||
|
Second match
|
||||||
|
During the introduction, Trebek (a Canadian native) joked that he had learned Toronto was a U.S. city, and Watson's error in the first match prompted an IBM engineer to wear a Toronto Blue Jays jacket to the recording of the second match.[59]
|
||||||
|
|
||||||
|
In the first round, Jennings was finally able to choose a Daily Double clue,[60] while Watson responded to one Daily Double clue incorrectly for the first time in the Double Jeopardy! Round.[61] After the first round, Watson placed second for the first time in the competition after Rutter and Jennings were briefly successful in increasing their dollar values before Watson could respond.[61][62] Nonetheless, the final result ended with a victory for Watson with a score of $77,147, besting Jennings who scored $24,000 and Rutter who scored $21,600.[63]
|
||||||
|
|
||||||
|
Final outcome
|
||||||
|
The prizes for the competition were $1 million for first place (Watson), $300,000 for second place (Jennings), and $200,000 for third place (Rutter). As promised, IBM donated 100% of Watson's winnings to charity, with 50% of those winnings going to World Vision and 50% going to World Community Grid.[64] Similarly, Jennings and Rutter donated 50% of their winnings to their respective charities.[65]
|
||||||
|
|
||||||
|
In acknowledgement of IBM and Watson's achievements, Jennings made an additional remark in his Final Jeopardy! response: "I for one welcome our new computer overlords", paraphrasing a joke from The Simpsons.[66][67] Jennings later wrote an article for Slate, in which he stated:
|
||||||
|
|
||||||
|
IBM has bragged to the media that Watson's question-answering skills are good for more than annoying Alex Trebek. The company sees a future in which fields like medical diagnosis, business analytics, and tech support are automated by question-answering software like Watson. Just as factory jobs were eliminated in the 20th century by new assembly-line robots, Brad and I were the first knowledge-industry workers put out of work by the new generation of 'thinking' machines. 'Quiz show contestant' may be the first job made redundant by Watson, but I'm sure it won't be the last.[45]
|
||||||
|
|
||||||
|
Philosophy
|
||||||
|
Philosopher John Searle argues that Watson—despite impressive capabilities—cannot actually think.[68] Drawing on his Chinese room thought experiment, Searle claims that Watson, like other computational machines, is capable only of manipulating symbols, but has no ability to understand the meaning of those symbols; however, Searle's experiment has its detractors.[69]
|
||||||
|
|
||||||
|
Match against members of the United States Congress
|
||||||
|
On February 28, 2011, Watson played an untelevised exhibition match of Jeopardy! against members of the United States House of Representatives. In the first round, Rush D. Holt, Jr. (D-NJ, a former Jeopardy! contestant), who was challenging the computer with Bill Cassidy (R-LA, later Senator from Louisiana), led with Watson in second place. However, combining the scores between all matches, the final score was $40,300 for Watson and $30,000 for the congressional players combined.[70]
|
||||||
|
|
||||||
|
IBM's Christopher Padilla said of the match, "The technology behind Watson represents a major advancement in computing. In the data-intensive environment of government, this type of technology can help organizations make better decisions and improve how government helps its citizens."[70]
|
||||||
|
|
||||||
|
Current and future applications
|
||||||
|
|
||||||
|
|
||||||
|
According to IBM, "The goal is to have computers start to interact in natural human terms across a range of applications and processes, understanding the questions that humans ask and providing answers that humans can understand and justify."[36] It has been suggested by Robert C. Weber, IBM's general counsel, that Watson may be used for legal research.[71] The company also intends to use Watson in other information-intensive fields, such as telecommunications, financial services, and government.[72]
|
||||||
|
|
||||||
|
Watson is based on commercially available IBM Power 750 servers that have been marketed since February 2010.[19]
|
||||||
|
|
||||||
|
Commentator Rick Merritt said that "there's another really important reason why it is strategic for IBM to be seen very broadly by the American public as a company that can tackle tough computer problems. A big slice of [IBM's profit] comes from selling to the U.S. government some of the biggest, most expensive systems in the world."[73]
|
||||||
|
|
||||||
|
In 2013, it was reported that three companies were working with IBM to create apps embedded with Watson technology. Fluid is developing an app for the retailer The North Face, which is designed to provide advice to online shoppers. Welltok is developing an app designed to give people advice on ways to engage in activities to improve their health. MD Buyline is developing an app for the purpose of advising medical institutions on equipment procurement decisions.[74][75]
|
||||||
|
|
||||||
|
In November 2013, IBM announced it would make Watson's API available to software application providers, enabling them to build apps and services that embed Watson's capabilities. To build out its base of partners who create applications on the Watson platform, IBM consults with a network of venture capital firms, which advise IBM on which of their portfolio companies may be a logical fit for what IBM calls the Watson Ecosystem. Thus far, roughly 800 organizations and individuals have signed up with IBM, with interest in creating applications that could use the Watson platform.[76]
|
||||||
|
|
||||||
|
On January 30, 2013, it was announced that Rensselaer Polytechnic Institute would receive a successor version of Watson, which would be housed at the Institute's technology park and be available to researchers and students.[77] By summer 2013, Rensselaer had become the first university to receive a Watson computer.[78]
|
||||||
|
|
||||||
|
On February 6, 2014, it was reported that IBM plans to invest $100 million in a 10-year initiative to use Watson and other IBM technologies to help countries in Africa address development problems, beginning with healthcare and education.[79]
|
||||||
|
|
||||||
|
On June 3, 2014, three new Watson Ecosystem partners were chosen from more than 400 business concepts submitted by teams spanning 18 industries from 43 countries. "These bright and enterprising organizations have discovered innovative ways to apply Watson that can deliver demonstrable business benefits", said Steve Gold, vice president, IBM Watson Group. The winners were Majestyk Apps with their adaptive educational platform, FANG (Friendly Anthropomorphic Networked Genome);[80][81] Red Ant with their retail sales trainer;[82] and GenieMD[83] with their medical recommendation service.[84]
|
||||||
|
|
||||||
|
On July 9, 2014, Genesys Telecommunications Laboratories announced plans to integrate Watson to improve their customer experience platform, citing the sheer volume of customer data to analyze.[85]
|
||||||
|
|
||||||
|
Watson has been integrated with databases, including Bon Appétit magazine's recipe database, to power a recipe-generating platform.[86]
|
||||||
|
|
||||||
|
Watson is being used by Decibel, a music discovery startup, in its app MusicGeek which uses the supercomputer to provide music recommendations to its users. The use of Watson has also been found in the hospitality industry. Go Moment uses Watson for its Rev1 app, which gives hotel staff a way to quickly respond to questions from guests.[87] Arria NLG has built an app that helps energy companies stay within regulatory guidelines, making it easier for managers to make sense of thousands of pages of legal and technical jargon.
|
||||||
|
|
||||||
|
OmniEarth, Inc. uses Watson computer vision services to analyze satellite and aerial imagery, along with other municipal data, to infer water usage on a property-by-property basis, helping districts in California improve water conservation efforts.[88]
|
||||||
|
|
||||||
|
In September 2016, Condé Nast started using Watson to help build and strategize social influencer campaigns for brands. Using software built by IBM and Influential, Condé Nast's clients will be able to know which influencer's demographics, personality traits and more best align with a marketer and the audience it is targeting.[89]
|
||||||
|
|
||||||
|
In February 2017, Rare Carat, a New York City-based startup and e-commerce platform for buying diamonds and diamond rings, introduced an IBM Watson-powered chatbot called "Rocky" to assist novice diamond buyers through the daunting process of purchasing a diamond. As part of the IBM Global Entrepreneur Program, Rare Carat received the assistance of IBM in the development of the Rocky Chat Bot.[90][91][92] In May 2017, IBM partnered with the Pebble Beach Company to use Watson as a concierge.[93] Watson technology was added to an app developed by Pebble Beach and was used to guide visitors around the resort. The mobile app was designed by IBM iX and hosted on the IBM Cloud. It uses Watson's Conversation applications programming interface.
|
||||||
|
|
||||||
|
In November 2017, in Mexico City, the Experience Voices of Another Time was opened at the National Museum of Anthropology using IBM Watson as an alternative to visiting a museum.[94]
|
||||||
|
|
||||||
|
Healthcare
|
||||||
|
See also: IBM Watson Health
|
||||||
|
In healthcare, Watson has been used to analyze medical data and assist doctors in making diagnoses and treatment decisions, including in areas such as oncology and radiology.[95] Watson's natural language, hypothesis generation, and evidence-based learning capabilities are being investigated to see how Watson may contribute to clinical decision support systems.[96] To aid physicians in the treatment of their patients, once a physician has posed a query to the system describing symptoms and other related factors, Watson first parses the input to identify the most important pieces of information; then mines patient data to find facts relevant to the patient's medical and hereditary history; then examines available data sources to form and test hypotheses;[96] and finally provides a list of individualized, confidence-scored recommendations.[97] The sources of data that Watson uses for analysis can include treatment guidelines, electronic medical record data, notes from healthcare providers, research materials, clinical studies, journal articles and patient information.[96] Despite being developed and marketed as a "diagnosis and treatment advisor", Watson has never been actually involved in the medical diagnosis process, only in assisting with identifying treatment options for patients who have already been diagnosed.[98]
|
||||||
|
|
||||||
|
In February 2011, it was announced that IBM would be partnering with Nuance Communications for a research project to develop a commercial product during the next 18 to 24 months, designed to exploit Watson's clinical decision support capabilities. Physicians at Columbia University would help to identify critical issues in the practice of medicine where the system's technology may be able to contribute, and physicians at the University of Maryland would work to identify the best way that a technology like Watson could interact with medical practitioners to provide the maximum assistance.[99]
|
||||||
|
|
||||||
|
In September 2011, IBM and WellPoint (now Anthem) announced a partnership to utilize Watson to help suggest treatment options to physicians.[100] Then, in February 2013, IBM and WellPoint gave Watson its first commercial application, for utilization management decisions in lung cancer treatment at Memorial Sloan–Kettering Cancer Center.[8]
|
||||||
|
|
||||||
|
IBM announced a partnership with Cleveland Clinic in October 2012. The company has sent Watson to the Cleveland Clinic Lerner College of Medicine of Case Western Reserve University, where it will increase its health expertise and assist medical professionals in treating patients. The medical facility will utilize Watson's ability to store and process large quantities of information to help speed up and increase the accuracy of the treatment process. "Cleveland Clinic's collaboration with IBM is exciting because it offers us the opportunity to teach Watson to 'think' in ways that have the potential to make it a powerful tool in medicine", said C. Martin Harris, MD, chief information officer of Cleveland Clinic.[101]
|
||||||
|
|
||||||
|
In 2013, IBM and MD Anderson Cancer Center began a pilot program to further the center's "mission to eradicate cancer".[102][103] However, after spending $62 million, the project did not meet its goals and it has been stopped.[104]
|
||||||
|
|
||||||
|
On February 8, 2013, IBM announced that oncologists at the Maine Center for Cancer Medicine and Westmed Medical Group in New York have started to test Watson in an effort to recommend treatment for lung cancer.[105]
|
||||||
|
|
||||||
|
On July 29, 2016, IBM and Manipal Hospitals[106][107][108] (a leading hospital chain in India) announced the launch of IBM Watson for Oncology, for cancer patients. This product provides information and insights to physicians and cancer patients to help them identify personalized, evidence-based cancer care options. Manipal Hospitals is the second hospital[109] in the world to adopt this technology and first in the world to offer it to patients online as an expert second opinion through their website.[106][110] Manipal discontinued this contract in December 2018. [citation needed]
|
||||||
|
|
||||||
|
On January 7, 2017, IBM and Fukoku Mutual Life Insurance entered into a contract for IBM to deliver analysis of compensation payouts via its IBM Watson Explorer AI, which resulted in the loss of 34 jobs. The company said it would speed up compensation payout analysis by analyzing claims and medical records, and increase productivity by 30%. The company also said it would save ¥140m in running costs.[111]
|
||||||
|
|
||||||
|
Several startups in the healthcare space have been effectively using seven business model archetypes to take solutions based on IBM Watson to the marketplace. These archetypes depend on the value generated for the target user (e.g. patient focus vs. healthcare provider and payer focus) and the value-capturing mechanisms (e.g. providing information or connecting stakeholders).[112]
|
||||||
|
|
||||||
|
By 2022, IBM Watson Health was generating about a billion dollars in annual gross revenue,[113] but was facing a lack of profitability and increased competition. One expert assessed to CNN that "IBM was clearly not gaining much traction in the healthcare market". A 2021 post from the Association for Computing Machinery (ACM) titled "What Happened To Watson Health?" described the portfolio management challenges of IBM Watson Health given the number of acquisitions involved in the Watson Health division creation in 2015, as well as technical limitations that existed at the time regarding where the Watson AI framework could be deployed.[114] In February 2021, the Wall Street Journal reported that Watson Health was exploring a sale.[115] On January 21, 2022, IBM announced the sell-off of its Watson Health unit to Francisco Partners.[116]
|
||||||
|
|
||||||
|
IBM Watson Group
|
||||||
|
On January 9, 2014, IBM announced it was creating a business unit around Watson, led by senior vice president Michael Rhodin.[117] IBM Watson Group will have headquarters in New York's Silicon Alley and will employ 2,000 people. IBM has invested $1 billion to get the division going. Watson Group will develop three new cloud-delivered services: Watson Discovery Advisor, Watson Engagement Advisor, and Watson Explorer. Watson Discovery Advisor will focus on research and development projects in pharmaceutical industry, publishing, and biotechnology, Watson Engagement Advisor will focus on self-service applications using insights on the basis of natural language questions posed by business users, and Watson Explorer will focus on helping enterprise users uncover and share data-driven insights based on federated search more easily.[117] The company is also launching a $100 million venture fund to spur application development for "cognitive" applications. According to IBM, the cloud-delivered enterprise-ready Watson has seen its speed increase 24 times over—a 2,300 percent improvement in performance and its physical size shrank by 90 percent—from the size of a master bedroom to three stacked pizza boxes.[117] IBM CEO Virginia Rometty said she wants Watson to generate $10 billion in annual revenue within ten years.[118] In 2017, IBM and MIT established a new joint research venture in artificial intelligence. IBM invested $240 million to create the MIT–IBM Watson AI Lab in partnership with MIT, which brings together researchers in academia and industry to advance AI research, with projects ranging from computer vision and NLP to devising new ways to ensure that AI systems are fair, reliable and secure.[119] In March 2018, IBM's CEO Ginni Rometty proposed "Watson's Law," the "use of and application of business, smart cities, consumer applications and life in general."[120]
|
||||||
|
|
||||||
|
Chef Watson
|
||||||
|
Watson helped a team of chefs create five new poutines for the 2015 La Poutine Week food festival in Toronto and Montreal. It analyzed the demographics and popular cuisines of the cities and drew from a database of tens of thousands of recipes to create fusion pairings for each city.[121] IBM and Bon Appétit magazine co-created an AI cooking app known as Chef Watson.[122]
|
||||||
|
|
||||||
|
Chatbot
|
||||||
|
Watson is being used via IBM partner program as a chatbot to provide the conversation for children's toys.[123]
|
||||||
|
|
||||||
|
Building codes
|
||||||
|
In 2015, the engineering firm ENGEO created an online service via the IBM partner program named GoFetchCode. GoFetchCode applies Watson's natural language processing and question-answering capabilities to the International Code Council's model building codes.[124]
|
||||||
|
|
||||||
|
Teaching assistant
|
||||||
|
IBM Watson is being used for several projects relating to education, and has entered partnerships with Pearson Education, Blackboard, Sesame Workshop and Apple.[125][126]
|
||||||
|
|
||||||
|
In its partnership with Pearson, Watson is being made available inside electronic text books to provide natural language, one-on-one tutoring to students on the reading material.[127]
|
||||||
|
|
||||||
|
As an individual using the free Watson APIs available to the public, Ashok Goel, a professor at Georgia Tech, used Watson to create a virtual teaching assistant to assist students in his class.[128] Initially, Goel did not reveal the nature of "Jill", which was created with the help of a few students and IBM. Jill answered questions where it had a 97% certainty of an accurate answer, with the remainder being answered by human assistants.[129]
|
||||||
|
|
||||||
|
The research group of Sabri Pllana developed an assistant for learning parallel programming using IBM Watson.[130] A survey of novice parallel programmers at Linnaeus University indicated that such assistants would be welcomed by students learning parallel programming.
|
||||||
|
|
||||||
|
Weather forecasting
|
||||||
|
In August 2016, IBM announced it would be using Watson for weather forecasting.[131] Specifically, the company announced they would use Watson to analyze data from over 200,000 Weather Underground personal weather stations, as well as data from other sources, as a part of Project Deep Thunder.[132]
|
||||||
|
|
||||||
|
Fashion
|
||||||
|
IBM Watson together with Marchesa designed a dress that changed the colour of the fabric depending on the mood of the audience. The dress lit up in different colours based on the sentiment of Tweets about the dress. Tweets were passed through a Watson tone analyzer and then sent back to a small computer inside the waist of the dress.[133]
|
||||||
|
|
||||||
|
Tax preparation
|
||||||
|
On February 5–6, 2017, tax preparation company H&R Block began nationwide use of a Watson-based program.[134]
|
||||||
|
|
||||||
|
Advertising
|
||||||
|
In September 2017, IBM announced that with its acquisition of The Weather Company's advertising sales division, and a partnership with advertising neural network Cognitiv, Watson will provide AI-powered advertising solutions.[135][136][137]
|
||||||
@@ -0,0 +1,257 @@
|
|||||||
|
Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically, data is collected in text corpora and processed using rule-based, statistical, or neural approaches from machine learning and deep learning.
|
||||||
|
|
||||||
|
Major tasks in natural language processing are speech recognition, text classification, natural-language understanding, and natural-language generation.
|
||||||
|
|
||||||
|
History
|
||||||
|
Further information: History of natural language processing
|
||||||
|
Natural language processing has its roots in the 1950s.[1] Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulated as a problem separate from artificial intelligence. The proposed test includes a task that involves the automated interpretation and generation of natural language.
|
||||||
|
|
||||||
|
Symbolic NLP (1950s – early 1990s)
|
||||||
|
The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it confronts.
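A minimal sketch of this rule-application idea, in the spirit of the ELIZA program discussed below: a handful of hand-written patterns with canned response templates and no model of meaning. The patterns are invented for illustration, not taken from any historical system.

```python
# Minimal sketch of symbolic, rule-based dialogue: simple pattern matching with
# response templates and a generic fallback, with no understanding of meaning.
import re

RULES = [
    (re.compile(r"\bmy (.+) hurts\b", re.I), "Why do you say your {0} hurts?"),
    (re.compile(r"\bi am (.+)", re.I),       "How long have you been {0}?"),
    (re.compile(r"\bi feel (.+)", re.I),     "Why do you feel {0}?"),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."  # generic fallback outside the tiny rule base

print(respond("My head hurts"))     # Why do you say your head hurts?
print(respond("I am tired today"))  # How long have you been tired today?
```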
|
||||||
|
|
||||||
|
1950s: The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted in America (though some research continued elsewhere, such as Japan and Europe[3]) until the late 1980s when the first statistical machine translation systems were developed.
|
||||||
|
1960s: Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?". Ross Quillian's successful work on natural language was demonstrated with a vocabulary of only twenty words, because that was all that would fit in a computer memory at the time.[4]
|
||||||
|
1970s: During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, the first chatterbots were written (e.g., PARRY).
|
||||||
|
1980s: The 1980s and early 1990s mark the heyday of symbolic methods in NLP. Focus areas of the time included research on rule-based parsing (e.g., the development of HPSG as a computational operationalization of generative grammar), morphology (e.g., two-level morphology[5]), semantics (e.g., Lesk algorithm), reference (e.g., within Centering Theory[6]) and other areas of natural language understanding (e.g., in the Rhetorical Structure Theory). Other lines of research were continued, e.g., the development of chatterbots with Racter and Jabberwacky. An important development (that eventually led to the statistical turn in the 1990s) was the rising importance of quantitative evaluation in this period.[7]
|
||||||
|
Statistical NLP (1990s–2010s)
|
||||||
|
Up until the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[8]
|
||||||
|
|
||||||
|
1990s: Many of the notable early successes in statistical methods in NLP occurred in the field of machine translation, due especially to work at IBM Research, such as IBM alignment models. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
|
||||||
|
2000s: With the growth of the web, increasing amounts of raw (unannotated) language data have become available since the mid-1990s. Research has thus increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.
|
||||||
|
Neural NLP (present)
|
||||||
|
In 2003, the word n-gram model, at the time the best statistical algorithm, was outperformed by a multi-layer perceptron (with a single hidden layer and a context of several words, trained on up to 14 million words with a CPU cluster) in language modelling, by Yoshua Bengio and co-authors.[9]
|
||||||
|
|
||||||
|
In 2010, Tomáš Mikolov (then a PhD student at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden layer to language modelling,[10] and in the following years he went on to develop Word2vec. In the 2010s, representation learning and deep neural network-style (featuring many hidden layers) machine learning methods became widespread in natural language processing. That popularity was due partly to a flurry of results showing that such techniques[11][12] can achieve state-of-the-art results in many natural language tasks, e.g., in language modeling[13] and parsing.[14][15] This is increasingly important in medicine and healthcare, where NLP helps analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care[16] or protect patient privacy.[17]
|
||||||
|
|
||||||
|
Approaches: Symbolic, statistical, neural networks
|
||||||
|
Symbolic approach, i.e., the hand-coding of a set of rules for manipulating symbols, coupled with a dictionary lookup, was historically the first approach used both by AI in general and by NLP in particular:[18][19] such as by writing grammars or devising heuristic rules for stemming.
|
||||||
|
|
||||||
|
Machine learning approaches, which include both statistical and neural networks, on the other hand, have many advantages over the symbolic approach:
|
||||||
|
|
||||||
|
Both statistical and neural network methods can focus more on the most common cases extracted from a corpus of texts, whereas the rule-based approach needs to provide rules for rare cases and common ones alike.
|
||||||
|
Language models, produced by either statistical or neural network methods, are more robust to both unfamiliar input (e.g., containing words or structures that have not been seen before) and erroneous input (e.g., with misspelled words or words accidentally omitted) than rule-based systems, which are also more costly to produce.
|
||||||
|
The larger such a (probabilistic) language model is, the more accurate it becomes, in contrast to rule-based systems, which can gain accuracy only by increasing the number and complexity of their rules, eventually running into intractability problems.
|
||||||
|
Although rule-based systems for manipulating symbols were still in use in 2020, they have become mostly obsolete with the advance of LLMs in 2023.
|
||||||
|
|
||||||
|
Before that they were commonly used:
|
||||||
|
|
||||||
|
when the amount of training data is insufficient to successfully apply machine learning methods, e.g., for the machine translation of low-resource languages such as provided by the Apertium system,
|
||||||
|
for preprocessing in NLP pipelines, e.g., tokenization, or
|
||||||
|
for postprocessing and transforming the output of NLP pipelines, e.g., for knowledge extraction from syntactic parses.
|
||||||
|
Statistical approach
|
||||||
|
In the late 1980s and mid-1990s, the statistical approach ended a period of AI winter, which was caused by the inefficiencies of the rule-based approaches.[20][21]
|
||||||
|
|
||||||
|
The earliest decision trees, producing systems of hard if–then rules, were still very similar to the old rule-based approaches. Only the introduction of hidden Markov models, applied to part-of-speech tagging, announced the end of the old rule-based approach.
|
||||||
|
|
||||||
|
Neural networks
|
||||||
|
Further information: Artificial neural network
|
||||||
|
A major drawback of statistical methods is that they require elaborate feature engineering. Since 2015,[22] the statistical approach has been replaced by the neural networks approach, using semantic networks[23] and word embeddings to capture semantic properties of words.
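As a rough illustration of word embeddings (assuming the gensim library; the toy corpus below is far too small for meaningful embeddings), an embedding model can be trained as follows:

```python
from gensim.models import Word2Vec

# Toy corpus: a few tokenized sentences (a real model needs millions of words).
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "chases", "the", "mouse"],
    ["the", "dog", "chases", "the", "cat"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100, seed=1)

# Each word is now a dense vector; words used in similar contexts get similar vectors.
print(model.wv["king"][:5])                  # first few dimensions of the embedding
print(model.wv.most_similar("king", topn=2))
```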
|
||||||
|
|
||||||
|
Intermediate tasks (e.g., part-of-speech tagging and dependency parsing) are not needed anymore.
|
||||||
|
|
||||||
|
Neural machine translation, based on then-newly-invented sequence-to-sequence transformations, made obsolete the intermediate steps, such as word alignment, previously necessary for statistical machine translation.
|
||||||
|
|
||||||
|
Common NLP tasks
|
||||||
|
The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks.
|
||||||
|
|
||||||
|
Though natural language processing tasks are closely intertwined, they can be subdivided into categories for convenience. A coarse division is given below.
|
||||||
|
|
||||||
|
Text and speech processing
|
||||||
|
Optical character recognition (OCR)
|
||||||
|
Given an image representing printed text, determine the corresponding text.
|
||||||
|
Speech recognition
|
||||||
|
Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, given that words in the same language are spoken by people with different accents, the speech recognition software must be able to recognize the wide variety of input as being identical to each other in terms of its textual equivalent.
|
||||||
|
Speech segmentation
|
||||||
|
Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition and typically grouped with it.
|
||||||
|
Text-to-speech
|
||||||
|
Given a text, produce a spoken representation of it. Text-to-speech can be used to aid the visually impaired.[24]
|
||||||
|
Word segmentation (Tokenization)
|
||||||
|
Tokenization is a process used in text analysis that divides text into individual words or word fragments. This technique results in two key components: a word index and tokenized text. The word index is a list that maps unique words to specific numerical identifiers, and the tokenized text replaces each word with its corresponding numerical token. These numerical tokens are then used in various deep learning methods.[25]
|
||||||
|
For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language. Sometimes this process is also used in cases like bag of words (BOW) creation in data mining.[citation needed]
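As a minimal illustration (assuming simple space-separated English text and ignoring punctuation), a toy tokenizer that produces both components, the word index and the numerical token sequence, might look like this:

```python
def tokenize(text):
    """Toy whitespace tokenizer: builds a word index and the tokenized (numerical) text."""
    word_index = {}
    tokens = []
    for word in text.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index)  # assign the next unused id
        tokens.append(word_index[word])
    return word_index, tokens

index, ids = tokenize("the cat sat on the mat")
print(index)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```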
|
||||||
|
Morphological analysis
|
||||||
|
Lemmatization
|
||||||
|
The task of removing inflectional endings only and returning the base dictionary form of a word, which is also known as a lemma. Lemmatization is another technique for reducing words to their normalized form; in this case, the transformation uses a dictionary to map words to their canonical form.[26]
|
||||||
|
Morphological segmentation
|
||||||
|
Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g., "open, opens, opened, opening") as separate words. In languages such as Turkish or Meitei, a highly agglutinative Indian language, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.[27]
|
||||||
|
Part-of-speech tagging
|
||||||
|
Given a sentence, determine the part of speech (POS) for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech.
|
||||||
|
Stemming
|
||||||
|
The process of reducing inflected (or sometimes derived) words to a base form (e.g., "close" will be the root for "closed", "closing", "close", "closer", etc.). Stemming yields similar results to lemmatization, but does so using rules rather than a dictionary.
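To make the contrast with lemmatization concrete, here is a toy sketch; the suffix list and the miniature lemma dictionary are illustrative assumptions rather than a real stemmer or lemmatizer:

```python
SUFFIXES = ("ing", "ed", "er", "s")  # toy rule set for suffix stripping
LEMMA_DICT = {"closed": "close", "closing": "close", "closer": "close",
              "better": "good", "mice": "mouse"}  # toy dictionary for lemma lookup

def stem(word):
    """Rule-based: strip the first matching suffix; no dictionary involved."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Dictionary-based: look the word up; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

print(stem("closing"), lemmatize("closing"))  # clos close
print(stem("mice"), lemmatize("mice"))        # mice mouse
```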
|
||||||
|
Syntactic analysis
|
||||||
|
Grammar induction[28]
|
||||||
|
Generate a formal grammar that describes a language's syntax.
|
||||||
|
Sentence breaking (also known as "sentence boundary disambiguation")
|
||||||
|
Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g., marking abbreviations).
|
||||||
|
Parsing
|
||||||
|
Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses: perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing: dependency parsing and constituency parsing. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a probabilistic context-free grammar (PCFG) (see also stochastic grammar).
|
||||||
|
Lexical semantics (of individual words in context)
|
||||||
|
Lexical semantics
|
||||||
|
What is the computational meaning of individual words in context?
|
||||||
|
Distributional semantics
|
||||||
|
How can we learn semantic representations from data?
|
||||||
|
Named entity recognition (NER)
|
||||||
|
Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient. For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives. Another name for this task is token classification.[29]
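As one possible illustration (assuming the spaCy library and its small English model en_core_web_sm are installed), named entity recognition can be run as follows:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace worked with Charles Babbage in London on the Analytical Engine.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Ada Lovelace PERSON", "London GPE"
```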
|
||||||
|
Sentiment analysis (see also Multimodal sentiment analysis)
|
||||||
|
Sentiment analysis is a computational method used to identify and classify the emotional intent behind text. This technique involves analyzing text to determine whether the expressed sentiment is positive, negative, or neutral. Models for sentiment classification typically utilize inputs such as word n-grams, Term Frequency-Inverse Document Frequency (TF-IDF) features, hand-generated features, or employ deep learning models designed to recognize both long-term and short-term dependencies in text sequences. The applications of sentiment analysis are diverse, extending to tasks such as categorizing customer reviews on various online platforms.[25]
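A minimal sketch of such a classifier, assuming scikit-learn and a purely illustrative training set, might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (a real system needs thousands of labelled examples).
texts = [
    "I loved this product, it works great",
    "Absolutely fantastic service and fast shipping",
    "Terrible experience, it broke after one day",
    "Waste of money, very disappointed",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # TF-IDF weighted word unigrams and bigrams
    LogisticRegression(),
)
classifier.fit(texts, labels)

print(classifier.predict(["great value, would buy again"]))  # expected: [1]
```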
|
||||||
|
Terminology extraction
|
||||||
|
The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
|
||||||
|
Word-sense disambiguation (WSD)
|
||||||
|
Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or an online resource such as WordNet.
|
||||||
|
Entity linking
|
||||||
|
Many words—typically proper names—refer to named entities; here we have to select the entity (a famous individual, a location, a company, etc.) which is referred to in context.
|
||||||
|
Relational semantics (semantics of individual sentences)
|
||||||
|
Relationship extraction
|
||||||
|
Given a chunk of text, identify the relationships among named entities (e.g. who is married to whom).
|
||||||
|
Semantic parsing
|
||||||
|
Given a piece of text (typically a sentence), produce a formal representation of its semantics, either as a graph (e.g., in AMR parsing) or in accordance with a logical formalism (e.g., in DRT parsing). This challenge typically includes aspects of several more elementary NLP tasks from semantics (e.g., semantic role labelling, word-sense disambiguation) and can be extended to include full-fledged discourse analysis (e.g., discourse analysis, coreference; see Natural language understanding below).
|
||||||
|
Semantic role labelling (see also implicit semantic role labelling below)
|
||||||
|
Given a single sentence, identify and disambiguate semantic predicates (e.g., verbal frames), then identify and classify the frame elements (semantic roles).
|
||||||
|
Discourse (semantics beyond individual sentences)
|
||||||
|
Coreference resolution
|
||||||
|
Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names to which they refer. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).
|
||||||
|
Discourse analysis
|
||||||
|
This rubric includes several related tasks. One task is discourse parsing, i.e., identifying the discourse structure of a connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes–no question, content question, statement, assertion, etc.).
|
||||||
|
Implicit semantic role labelling
|
||||||
|
Given a single sentence, identify and disambiguate semantic predicates (e.g., verbal frames) and their explicit semantic roles in the current sentence (see Semantic role labelling above). Then, identify semantic roles that are not explicitly realized in the current sentence, classify them into arguments that are explicitly realized elsewhere in the text and those that are not specified, and resolve the former against the local text. A closely related task is zero anaphora resolution, i.e., the extension of coreference resolution to pro-drop languages.
|
||||||
|
Recognizing textual entailment
|
||||||
|
Given two text fragments, determine if one being true entails the other, entails the other's negation, or allows the other to be either true or false.[30]
|
||||||
|
Topic segmentation and recognition
|
||||||
|
Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.
|
||||||
|
Argument mining
|
||||||
|
The goal of argument mining is the automatic extraction and identification of argumentative structures from natural language text with the aid of computer programs.[31] Such argumentative structures include the premise, conclusions, the argument scheme and the relationship between the main and subsidiary argument, or the main and counter-argument within discourse.[32][33]
|
||||||
|
Higher-level NLP applications
|
||||||
|
Automatic summarization (text summarization)
|
||||||
|
Produce a readable summary of a chunk of text. Often used to provide summaries of the text of a known type, such as research papers, articles in the financial section of a newspaper.
|
||||||
|
Grammatical error correction
|
||||||
|
Grammatical error detection and correction involves problems at all levels of linguistic analysis (phonology/orthography, morphology, syntax, semantics, pragmatics). Grammatical error correction is impactful because it affects hundreds of millions of people who use or acquire English as a second language. It has thus been the subject of a number of shared tasks since 2011.[34][35][36] As far as orthography, morphology, syntax and certain aspects of semantics are concerned, and thanks to the development of powerful neural language models such as GPT-2, this can now (2019) be considered a largely solved problem, and it is marketed in various commercial applications.
|
||||||
|
Logic translation
|
||||||
|
Translate a text from a natural language into formal logic.
|
||||||
|
Machine translation (MT)
|
||||||
|
Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) to solve properly.
|
||||||
|
Natural-language understanding (NLU)
|
||||||
|
Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves identifying the intended semantics from the multiple possible semantics that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. The introduction and creation of a language metamodel and ontology are efficient, however empirical, solutions. An explicit formalization of natural language semantics that avoids confusion with implicit assumptions such as the closed-world assumption (CWA) vs. the open-world assumption, or subjective Yes/No vs. objective True/False, is expected to form the basis of semantics formalization.[37]
|
||||||
|
Natural-language generation (NLG):
|
||||||
|
Convert information from computer databases or semantic intents into readable human language.
|
||||||
|
Book generation
|
||||||
|
Not an NLP task proper, but an extension of natural language generation and other NLP tasks, is the creation of full-fledged books. The first machine-generated book was created by a rule-based system in 1984 (Racter, The policeman's beard is half-constructed).[38] The first published work by a neural network, 1 the Road, was published in 2018 and marketed as a novel; it contains sixty million words. Both of these systems are basically elaborate but non-sensical (semantics-free) language models. The first machine-generated science book was published in 2019 (Beta Writer, Lithium-Ion Batteries, Springer, Cham).[39] Unlike Racter and 1 the Road, it is grounded in factual knowledge and based on text summarization.
|
||||||
|
Document AI
|
||||||
|
A Document AI platform sits on top of NLP technology, enabling users with no prior experience of artificial intelligence, machine learning or NLP to quickly train a computer to extract the specific data they need from different document types. NLP-powered Document AI enables non-technical teams, for example lawyers, business analysts and accountants, to quickly access information hidden in documents.[40]
|
||||||
|
Dialogue management
|
||||||
|
Computer systems intended to converse with a human.
|
||||||
|
Question answering
|
||||||
|
Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").
|
||||||
|
Text-to-image generation
|
||||||
|
Given a description of an image, generate an image that matches the description.[41]
|
||||||
|
Text-to-scene generation
|
||||||
|
Given a description of a scene, generate a 3D model of the scene.[42][43]
|
||||||
|
Text-to-video
|
||||||
|
Given a description of a video, generate a video that matches the description.[44][45]
|
||||||
|
General tendencies and (possible) future directions
|
||||||
|
Based on long-standing trends in the field, it is possible to extrapolate future directions of NLP. As of 2020, three trends among the topics of the long-standing series of CoNLL Shared Tasks can be observed:[46]
|
||||||
|
|
||||||
|
Interest in increasingly abstract, "cognitive" aspects of natural language (1999–2001: shallow parsing, 2002–03: named entity recognition, 2006–09/2017–18: dependency syntax, 2004–05/2008–09: semantic role labelling, 2011–12: coreference, 2015–16: discourse parsing, 2019: semantic parsing).
|
||||||
|
Increasing interest in multilinguality, and, potentially, multimodality (English since 1999; Spanish, Dutch since 2002; German since 2003; Bulgarian, Danish, Japanese, Portuguese, Slovenian, Swedish, Turkish since 2006; Basque, Catalan, Chinese, Greek, Hungarian, Italian, Turkish since 2007; Czech since 2009; Arabic since 2012; 2017: 40+ languages; 2018: 60+/100+ languages)
|
||||||
|
Elimination of symbolic representations (rule-based over supervised towards weakly supervised methods, representation learning and end-to-end systems)
|
||||||
|
Cognition
|
||||||
|
Most higher-level NLP applications involve aspects that emulate intelligent behaviour and apparent comprehension of natural language. More broadly speaking, the technical operationalization of increasingly advanced aspects of cognitive behaviour represents one of the developmental trajectories of NLP (see trends among CoNLL shared tasks above).
|
||||||
|
|
||||||
|
Cognition refers to "the mental action or process of acquiring knowledge and understanding through thought, experience, and the senses."[47] Cognitive science is the interdisciplinary, scientific study of the mind and its processes.[48] Cognitive linguistics is an interdisciplinary branch of linguistics, combining knowledge and research from both psychology and linguistics.[49] Especially during the age of symbolic NLP, the area of computational linguistics maintained strong ties with cognitive studies.
|
||||||
|
|
||||||
|
As an example, George Lakoff offers a methodology to build natural language processing (NLP) algorithms through the perspective of cognitive science, along with the findings of cognitive linguistics,[50] with two defining aspects:
|
||||||
|
|
||||||
|
Apply the theory of conceptual metaphor, explained by Lakoff as "the understanding of one idea, in terms of another", which provides an idea of the intent of the author.[51] For example, consider the English word big. When used in a comparison ("That is a big tree"), the author's intent is to imply that the tree is physically large relative to other trees or to the author's experience. When used metaphorically ("Tomorrow is a big day"), the author's intent is to imply importance. The intent behind other usages, like in "She is a big person", will remain somewhat ambiguous to a person and a cognitive NLP algorithm alike without additional information.
|
||||||
|
Assign relative measures of meaning to a word, phrase, sentence or piece of text based on the information presented before and after the piece of text being analyzed, e.g., by means of a probabilistic context-free grammar (PCFG). The mathematical equation for such algorithms is presented in US Patent 9269353:[52]
|
||||||
|
$$ RMM(token_N) \;=\; PMM(token_N) \times \frac{1}{2d}\left(\sum_{i=-d}^{d}\bigl(PMM(token_N)\times PF(token_{N-i},\,token_N,\,token_{N+i})\bigr)_i\right) $$
|
||||||
|
Where
|
||||||
|
RMM is the relative measure of meaning
|
||||||
|
token is any block of text, sentence, phrase or word
|
||||||
|
N is the number of tokens being analyzed
|
||||||
|
PMM is the probable measure of meaning based on a corpus
|
||||||
|
d is the non-zero location of the token along the sequence of N tokens
|
||||||
|
PF is the probability function specific to a language
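A literal transcription of the formula into Python might look as follows; pmm and pf are hypothetical placeholders for the corpus-derived measures defined above, not part of the patent:

```python
def rmm(tokens, n, d, pmm, pf):
    """Relative measure of meaning of the token at position n, following the formula above.

    pmm(token) and pf(left, centre, right) are hypothetical stand-ins for the
    corpus-derived measures PMM and PF; the window is assumed to stay inside the
    token sequence (d <= n <= len(tokens) - 1 - d).
    """
    centre = tokens[n]
    total = 0.0
    for i in range(-d, d + 1):
        total += pmm(centre) * pf(tokens[n - i], centre, tokens[n + i])
    return pmm(centre) * total / (2 * d)
```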
|
||||||
|
Ties with cognitive linguistics are part of the historical heritage of NLP, but they have been less frequently addressed since the statistical turn during the 1990s. Nevertheless, approaches to developing cognitive models as technically operationalizable frameworks have been pursued in the context of various frameworks, e.g., cognitive grammar,[53] functional grammar,[54] construction grammar,[55] computational psycholinguistics and cognitive neuroscience (e.g., ACT-R), however with limited uptake in mainstream NLP (as measured by presence at major conferences[56] of the ACL). More recently, ideas of cognitive NLP have been revived as an approach to achieve explainability, e.g., under the notion of "cognitive AI".[57] Likewise, ideas of cognitive NLP are inherent to neural models of multimodal NLP (although rarely made explicit)[58] and to developments in artificial intelligence, specifically tools and technologies using large language model approaches[59] and new directions in artificial general intelligence based on the free energy principle[60] of British neuroscientist and theoretician at University College London Karl J. Friston.
|
||||||
148
evaluations/09_custom_model_graded_prompt_foo/custom_llm_eval.py
Normal file
@@ -0,0 +1,148 @@
|
|||||||
|
import anthropic
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
|
||||||
|
def llm_eval(summary, article):
|
||||||
|
"""
|
||||||
|
Evaluate summary using an LLM (Claude).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
summary (str): The summary to evaluate.
|
||||||
|
article (str): The original text that was summarized.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
tuple: (average score across the numeric criteria, raw model evaluation text).
|
||||||
|
"""
|
||||||
|
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
||||||
|
|
||||||
|
prompt = f"""Evaluate the following summary based on these criteria:
|
||||||
|
1. Conciseness (1-5) - is the summary as concise as possible?
|
||||||
|
- Conciseness of 1: The summary is unnecessarily long, including excessive details, repetitions, or irrelevant information. It fails to distill the key points effectively.
|
||||||
|
- Conciseness of 3: The summary captures most key points but could be more focused. It may include some unnecessary details or slightly overexplain certain concepts.
|
||||||
|
- Conciseness of 5: The summary effectively condenses the main ideas into a brief, focused text. It includes all essential information without any superfluous details or explanations.
|
||||||
|
2. Accuracy (1-5) - is the summary completely accurate based on the initial article?
|
||||||
|
- Accuracy of 1: The summary contains significant errors, misrepresentations, or omissions that fundamentally alter the meaning or key points of the original article.
|
||||||
|
- Accuracy of 3: The summary captures some key points correctly but may have minor inaccuracies or omissions. The overall message is generally correct, but some details may be wrong.
|
||||||
|
- Accuracy of 5: The summary faithfully represents the main gist of the original article without any errors or misinterpretations. All included information is correct and aligns with the source material.
|
||||||
|
3. Tone (1-5) - is the summary appropriate for a grade school student with no technical training?
|
||||||
|
- Tone of 1: The summary uses language or concepts that are too complex, technical, or mature for a grade school audience. It may contain jargon, advanced terminology, or themes that are not suitable for young readers.
|
||||||
|
- Tone of 3: The summary mostly uses language suitable for grade school students but occasionally includes terms or concepts that may be challenging. Some explanations might be needed for full comprehension.
|
||||||
|
- Tone of 5: The summary consistently uses simple, clear language that is easily understandable by grade school students. It explains complex ideas in a way that is accessible and engaging for young readers.
|
||||||
|
4. Explanation - a general description of the way the summary is evaluated
|
||||||
|
|
||||||
|
<examples>
|
||||||
|
<example>
|
||||||
|
This summary:
|
||||||
|
<summary>
|
||||||
|
Artificial neural networks are computer systems inspired by how the human brain works. They are made up of interconnected "neurons" that process information. These networks can learn to do tasks by looking at lots of examples, similar to how humans learn.
|
||||||
|
|
||||||
|
Some key things about neural networks:
|
||||||
|
- They can recognize patterns and make predictions
|
||||||
|
- They improve with more data and practice
|
||||||
|
- They're used for things like identifying objects in images, translating languages, and playing games
|
||||||
|
|
||||||
|
Neural networks are a powerful tool in artificial intelligence and are behind many of the "smart" technologies we use today. While they can do amazing things, they still aren't as complex or capable as the human brain.
|
||||||
|
</summary>
|
||||||
|
Should receive a 5 for tone, a 5 for accuracy, and a 5 for conciseness
|
||||||
|
</example>
|
||||||
|
|
||||||
|
<example>
|
||||||
|
This summary:
|
||||||
|
<summary>
|
||||||
|
Here is a summary of the key points from the article on artificial neural networks (ANNs):
|
||||||
|
|
||||||
|
1. ANNs are computational models inspired by biological neural networks in animal brains. They consist of interconnected artificial neurons that process and transmit signals.
|
||||||
|
|
||||||
|
2. Basic structure:
|
||||||
|
- Input layer receives data
|
||||||
|
- Hidden layers process information
|
||||||
|
- Output layer produces results
|
||||||
|
- Neurons are connected by weighted edges
|
||||||
|
|
||||||
|
3. Learning process:
|
||||||
|
- ANNs learn by adjusting connection weights
|
||||||
|
- Use techniques like backpropagation to minimize errors
|
||||||
|
- Can perform supervised, unsupervised, and reinforcement learning
|
||||||
|
|
||||||
|
4. Key developments:
|
||||||
|
- Convolutional neural networks (CNNs) for image processing
|
||||||
|
- Recurrent neural networks (RNNs) for sequential data
|
||||||
|
- Deep learning with many hidden layers
|
||||||
|
|
||||||
|
5. Applications:
|
||||||
|
- Pattern recognition, classification, regression
|
||||||
|
- Computer vision, speech recognition, natural language processing
|
||||||
|
- Game playing, robotics, financial modeling
|
||||||
|
|
||||||
|
6. Advantages:
|
||||||
|
- Can model complex non-linear relationships
|
||||||
|
- Ability to learn and generalize from data
|
||||||
|
- Adaptable to many different types of problems
|
||||||
|
|
||||||
|
7. Challenges:
|
||||||
|
- Require large amounts of training data
|
||||||
|
- Can be computationally intensive
|
||||||
|
- "Black box" nature can make interpretability difficult
|
||||||
|
|
||||||
|
8. Recent advances:
|
||||||
|
- Improved hardware (GPUs) enabling deeper networks
|
||||||
|
- New architectures like transformers for language tasks
|
||||||
|
- Progress in areas like generative AI
|
||||||
|
|
||||||
|
The article provides a comprehensive overview of ANN concepts, history, types, applications, and ongoing research areas in this field of artificial intelligence and machine learning.
|
||||||
|
</summary>
|
||||||
|
Should receive a 1 for tone, a 5 for accuracy, and a 3 for conciseness
|
||||||
|
</example>
|
||||||
|
</examples>
|
||||||
|
|
||||||
|
Provide a score for each criterion in JSON format. Here is the format you should follow always:
|
||||||
|
|
||||||
|
<json>
|
||||||
|
{{
|
||||||
|
"conciseness": <number>,
|
||||||
|
"accuracy": <number>,
|
||||||
|
"tone": <number>,
|
||||||
|
"explanation": <string>,
|
||||||
|
}}
|
||||||
|
</json>
|
||||||
|
|
||||||
|
|
||||||
|
Original Text: <original_article>{article}</original_article>
|
||||||
|
|
||||||
|
Summary to Evaluate: <summary>{summary}</summary>
|
||||||
|
"""
|
||||||
|
|
||||||
|
response = client.messages.create(
|
||||||
|
model="claude-3-5-sonnet-20240620",
|
||||||
|
max_tokens=1000,
|
||||||
|
temperature=0,
|
||||||
|
messages=[
|
||||||
|
{
|
||||||
|
"role": "user",
|
||||||
|
"content": prompt
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"role": "assistant",
|
||||||
|
"content": "<json>"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
stop_sequences=["</json>"]
|
||||||
|
)
|
||||||
|
|
||||||
|
evaluation = json.loads(response.content[0].text)
|
||||||
|
# Filter out non-numeric values and calculate the average
|
||||||
|
numeric_values = [value for key, value in evaluation.items() if isinstance(value, (int, float))]
|
||||||
|
avg_score = sum(numeric_values) / len(numeric_values)
|
||||||
|
# Return the average score and the overall model response
|
||||||
|
return avg_score, response.content[0].text
|
||||||
|
|
||||||
|
def get_assert(output: str, context, threshold=4.5):
|
||||||
|
article = context['vars']['article']
|
||||||
|
score, evaluation = llm_eval(output, article)
|
||||||
|
return {
|
||||||
|
"pass": score >= threshold,
|
||||||
|
"score": score,
|
||||||
|
"reason": evaluation
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
After Width: | Height: | Size: 57 KiB |
|
After Width: | Height: | Size: 706 KiB |
|
After Width: | Height: | Size: 104 KiB |
|
After Width: | Height: | Size: 58 KiB |
|
After Width: | Height: | Size: 713 KiB |
589
evaluations/09_custom_model_graded_prompt_foo/lesson.ipynb
Normal file
@@ -0,0 +1,33 @@
|
|||||||
|
description: 'Summarization Evaluation'
|
||||||
|
|
||||||
|
prompts:
|
||||||
|
- prompts.py:basic_summarize
|
||||||
|
- prompts.py:better_summarize
|
||||||
|
- prompts.py:best_summarize
|
||||||
|
|
||||||
|
providers:
|
||||||
|
- id: anthropic:messages:claude-3-5-sonnet-20240620
|
||||||
|
label: "3.5 Sonnet"
|
||||||
|
|
||||||
|
tests:
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article1.txt
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article2.txt
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article3.txt
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article4.txt
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article5.txt
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article6.txt
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article7.txt
|
||||||
|
- vars:
|
||||||
|
article: file://articles/article8.txt
|
||||||
|
|
||||||
|
defaultTest:
|
||||||
|
assert:
|
||||||
|
- type: python
|
||||||
|
value: file://custom_llm_eval.py
|
||||||
14
evaluations/09_custom_model_graded_prompt_foo/prompts.py
Normal file
@@ -0,0 +1,14 @@
|
|||||||
|
def basic_summarize(article):
|
||||||
|
return f"Summarize this article {article}"
|
||||||
|
|
||||||
|
def better_summarize(article):
|
||||||
|
return f"""
|
||||||
|
Summarize this article for a grade-school audience: {article}"""
|
||||||
|
|
||||||
|
def best_summarize(article):
|
||||||
|
return f"""
|
||||||
|
You are tasked with summarizing long wikipedia articles for a grade-school audience.
|
||||||
|
Write a short summary, keeping it as concise as possible.
|
||||||
|
The summary is intended for a non-technical, grade-school audience.
|
||||||
|
This is the article: {article}"""
|
||||||
|
|
||||||
14
evaluations/README.md
Normal file
@@ -0,0 +1,14 @@
|
|||||||
|
# Evals course
|
||||||
|
|
||||||
|
Welcome to Anthropic's comprehensive evaluations course. Across nine lessons, you will learn everything you need to know to implement evaluations successfully in your workflows with Claude. We recommend that you start from the beginning with the [Evaluations 101](./01_intro_to_evals.ipynb) lesson, as each lesson builds on key concepts taught in previous ones.
|
||||||
|
|
||||||
|
## Table of contents
|
||||||
|
1. [Evals 101](./01_intro_to_evals.ipynb)
|
||||||
|
2. [Writing human-graded evals with Anthropic's Workbench](./02_workbench_evals.ipynb)
|
||||||
|
3. [Writing simple code-graded evals](./03_code_graded.ipynb)
|
||||||
|
4. [Writing a classification eval](./04_code_graded_classification.ipynb)
|
||||||
|
5. [Prompt Foo for evals: an introduction](./05_prompt_foo_code_graded_animals/lesson.ipynb)
|
||||||
|
6. [Writing classification evals with Prompt Foo](./06_prompt_foo_code_graded_classification/lesson.ipynb)
|
||||||
|
7. [Custom graders with Prompt Foo](./07_prompt_foo_custom_graders/lesson.ipynb)
|
||||||
|
8. [Model-graded evals with Prompt Foo](./08_prompt_foo_model_graded/lesson.ipynb)
|
||||||
|
9. [Custom model-graded evals with Prompt Foo](./09_custom_model_graded_prompt_foo/lesson.ipynb)
|
||||||
BIN
evaluations/images/adding_variables.png
Normal file
|
After Width: | Height: | Size: 191 KiB |
BIN
evaluations/images/benchmarks.png
Normal file
|
After Width: | Height: | Size: 493 KiB |
BIN
evaluations/images/comparison.png
Normal file
|
After Width: | Height: | Size: 445 KiB |
BIN
evaluations/images/empty_workbench.png
Normal file
|
After Width: | Height: | Size: 671 KiB |
BIN
evaluations/images/eval_diagram.png
Normal file
|
After Width: | Height: | Size: 366 KiB |
BIN
evaluations/images/evaluate1.png
Normal file
|
After Width: | Height: | Size: 525 KiB |
BIN
evaluations/images/evaluate2.png
Normal file
|
After Width: | Height: | Size: 428 KiB |
BIN
evaluations/images/evaluate3.png
Normal file
|
After Width: | Height: | Size: 519 KiB |
BIN
evaluations/images/evaluate4.png
Normal file
|
After Width: | Height: | Size: 686 KiB |
BIN
evaluations/images/evaluate5.png
Normal file
|
After Width: | Height: | Size: 524 KiB |
BIN
evaluations/images/evaluate6.png
Normal file
|
After Width: | Height: | Size: 711 KiB |
BIN
evaluations/images/evaluate_button.png
Normal file
|
After Width: | Height: | Size: 34 KiB |
BIN
evaluations/images/first_output.png
Normal file
|
After Width: | Height: | Size: 346 KiB |
BIN
evaluations/images/process.png
Normal file
|
After Width: | Height: | Size: 164 KiB |
BIN
evaluations/images/run_remaining.png
Normal file
|
After Width: | Height: | Size: 48 KiB |
BIN
evaluations/images/score.png
Normal file
|
After Width: | Height: | Size: 232 KiB |
BIN
evaluations/images/updated_prompt.png
Normal file
|
After Width: | Height: | Size: 1020 KiB |
BIN
evaluations/images/updated_response.png
Normal file
|
After Width: | Height: | Size: 30 KiB |
BIN
evaluations/images/variables_button.png
Normal file
|
After Width: | Height: | Size: 89 KiB |
BIN
evaluations/images/workbench_with_prompt.png
Normal file
|
After Width: | Height: | Size: 803 KiB |