
The AI Literacy Guide

From the origins of artificial intelligence to the concepts behind today's models. History, key terms in plain English, and a cheatsheet all in one place.


The History of AI

An interactive timeline of the breakthroughs that brought us here

1950

The Turing Test

Alan Turing publishes "Computing Machinery and Intelligence", proposing the Turing Test — can a machine fool a human into thinking it's a person?

1956

AI Gets Its Name

The Dartmouth Conference takes place. The term "Artificial Intelligence" is coined for the first time, launching AI as a formal field of study.

1966

ELIZA Chatbot

ELIZA, created at MIT, becomes the first conversational AI. It simulates a psychotherapist and convinces some users it truly understands them.

1997

Deep Blue vs. Kasparov

IBM's Deep Blue defeats world chess champion Garry Kasparov, the first time a computer beats a reigning champion under standard tournament rules.

2011

Watson & Siri

IBM Watson wins Jeopardy! against human champions. Apple launches Siri, bringing voice assistants to millions of pockets worldwide.

2012

Deep Learning Breakthrough

AlexNet wins the ImageNet competition by a massive margin, proving deep neural networks can dramatically outperform traditional methods at image recognition.

2014

GANs Invented

Ian Goodfellow invents Generative Adversarial Networks (GANs) — two neural networks that compete to generate increasingly realistic images, audio, and more.

2016

AlphaGo Stuns the World

Google DeepMind's AlphaGo defeats world Go champion Lee Sedol 4–1. Go was long considered too complex for computers, with more possible board positions than atoms in the observable universe.

2017

Attention Is All You Need

Google publishes the landmark paper introducing the Transformer architecture. This single innovation becomes the foundation for virtually all modern AI models.

2018

GPT-1 & BERT

OpenAI releases GPT-1, the first Generative Pre-trained Transformer. Google releases BERT. The era of large language models begins.

2020

GPT-3: Emergent Abilities

OpenAI releases GPT-3 with 175 billion parameters. It demonstrates surprising emergent abilities — few-shot learning, code generation, and creative writing — simply from scale.

2022

AI Goes Mainstream

ChatGPT launches in November and reaches 100 million users in two months. Stable Diffusion open-sources image generation. AI becomes a household word.

2023

The Cambrian Explosion

GPT-4 released. Anthropic launches Claude 2. Meta open-sources Llama 2. AI coding tools boom. Every major tech company races to build and deploy AI products.

2024

Open Source Catches Up

Claude 3.5 Sonnet, GPT-4o, and Gemini push the frontier. Open-source models rival proprietary ones. AI agents begin to emerge as a new paradigm.

2025

The Age of Agents

Claude 4 (Opus) arrives. AI coding agents (Claude Code, Codex CLI) ship. Agentic AI becomes mainstream — AI systems that can plan, use tools, and take actions autonomously.

Key AI Concepts — In Plain English

Every term explained with one-sentence definitions, real-world analogies, and optional technical depth

1

Artificial Intelligence (AI)

In one sentence:

Making computers do things that normally require human intelligence — recognizing images, understanding language, making decisions.

Think of it like:

Teaching a calculator to not just add numbers, but understand what numbers mean.

AI is an umbrella term covering many subfields: machine learning, natural language processing, computer vision, robotics, and more. Modern AI is dominated by machine learning approaches, particularly deep learning. The term was coined in 1956 at the Dartmouth Conference.
2

Machine Learning (ML)

In one sentence:

Instead of programming rules, you show the computer examples and it figures out the rules itself.

Think of it like:

Learning to cook by tasting 10,000 dishes vs. reading a recipe. You eventually figure out what makes food taste good.

ML algorithms learn patterns from data through optimization. Key types include supervised learning (labeled examples), unsupervised learning (finding structure in unlabeled data), and reinforcement learning (learning from rewards). Common algorithms: linear regression, decision trees, SVMs, and neural networks.
3

Neural Network

In one sentence:

A system inspired by the human brain, made of layers of connected nodes (neurons) that process information.

Think of it like:

A team of people where each person checks one thing, and they all vote on the answer. With enough people and layers, they can recognize anything.

A neural network consists of an input layer, one or more hidden layers, and an output layer. Each neuron applies a weighted sum followed by a non-linear activation function (ReLU, sigmoid, etc.). The network learns by adjusting weights via gradient descent and backpropagation to minimize a loss function.
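The "weighted sum plus activation" idea above fits in a few lines of Python. This is only a toy sketch: the weights and biases are made-up numbers, whereas a real network learns them via gradient descent and backpropagation.

```python
import math

def relu(x):
    # Non-linear activation: negative inputs become 0
    return max(0.0, x)

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, then the activation function
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(total)

# A tiny "layer" of two neurons processing three inputs.
# These weights are arbitrary illustrative values; training
# would adjust them to minimize a loss function.
inputs = [0.5, -1.0, 2.0]
layer = [
    neuron(inputs, [0.4, -0.3, 0.2], bias=0.1),
    neuron(inputs, [0.6, -0.9, 0.5], bias=0.0),
]
print(layer)
```

Stacking many such layers, each feeding its outputs to the next, gives the "deep" networks described in the next concept.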
4

Deep Learning

In one sentence:

Machine learning using artificial neural networks with many layers, capable of learning incredibly complex patterns.

Think of it like:

ML with a much bigger brain — it can spot patterns humans never could, like recognizing faces from millions of pixels.

Deep learning uses neural networks with many hidden layers (hence "deep"). The key breakthrough was backpropagation combined with GPUs for fast training. Architectures include CNNs (images), RNNs/LSTMs (sequences), and Transformers (attention-based). The 2012 AlexNet result proved deep learning's superiority for computer vision.
5

Transformer

In one sentence:

The breakthrough architecture (2017) that made modern AI possible. Uses "attention" to understand which words in a sentence relate to each other.

Think of it like:

Reading a sentence and highlighting which words are connected, instead of reading left to right. In "The cat sat on the mat because it was tired" — the Transformer knows "it" refers to "cat."

The Transformer architecture uses self-attention (scaled dot-product attention) to process all tokens in parallel, unlike RNNs which process sequentially. Key components: multi-head attention, positional encoding, layer normalization, and feed-forward networks. The "Attention Is All You Need" paper by Vaswani et al. (2017) introduced this architecture, replacing recurrence entirely.
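The scaled dot-product attention formula, softmax(QKᵀ/√d)·V, can be sketched in plain Python. The vectors below are tiny made-up examples; real models use hundreds of dimensions and many attention heads in parallel.

```python
import math

def softmax(xs):
    # Turn raw scores into probabilities that sum to 1
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(K[0])  # dimension of each key vector
    out = []
    for q in Q:
        # How similar is this query to every key? (scaled dot products)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)        # attention weights
        # Output is a weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two tokens attending over each other (made-up 2-d vectors)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))
```

Each query token ends up attending mostly to the key it matches, which is exactly the "highlighting which words are connected" intuition above.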
6

Training

In one sentence:

The process of feeding data to a model so it learns patterns. For large models, this costs millions of dollars and takes weeks on thousands of GPUs.

Think of it like:

Going to school for 12+ years. It's expensive and time-consuming, but once you've graduated, you can answer questions quickly.

Pre-training involves next-token prediction on trillions of tokens. GPT-4 reportedly cost $100M+ to train. Training uses distributed computing across thousands of GPUs/TPUs with techniques like data parallelism, tensor parallelism, and pipeline parallelism. The training process involves multiple epochs over the dataset, with learning rate scheduling and gradient accumulation.
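Next-token prediction can be illustrated with a drastically simplified stand-in: counting which word follows which in a tiny corpus. Real LLMs learn neural-network weights by gradient descent over trillions of tokens, but the objective is the same in spirit: predict what comes next.

```python
from collections import Counter, defaultdict

# A miniature "training run": accumulate evidence about which word
# follows which. This bigram counting is NOT how LLMs train, but it
# shows learning-from-examples rather than hand-written rules.
corpus = "the cat sat on the mat the cat ran".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1          # "training": count co-occurrences

def predict_next(word):
    # "Inference": return the most frequently observed continuation
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in the corpus
```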
7

Parameters

In one sentence:

The internal "knobs" a model learns during training — each one is a number that controls how the model processes information. More parameters generally means more capability.

Think of it like:

A mixing board in a music studio with billions of sliders. During training, the AI adjusts every slider until the music sounds right. A bigger board (more parameters) can produce richer, more nuanced music.

Parameters are the learnable weights and biases in a neural network. GPT-3 has 175 billion parameters, Llama 3 comes in 8B, 70B, and 405B variants. Parameter count is a rough proxy for model capability but architecture and training data quality also matter enormously. Techniques like quantization (reducing precision from float32 to int4) allow large models to run on consumer hardware by shrinking memory requirements at a small quality cost.
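The quantization idea mentioned above is simple to sketch: store each weight as a small integer plus one shared scale factor. This toy version uses 8-bit integers on a handful of made-up weights.

```python
def quantize(weights, bits=8):
    # Map floats onto a small integer grid: store ints + one scale factor
    qmax = 2 ** (bits - 1) - 1               # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # Recover approximate floats; the small rounding error is the
    # "small quality cost" mentioned above
    return [qi * scale for qi in q]

weights = [0.12, -0.5, 0.33, 0.07]           # illustrative values
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)
print(restored)
```

Storing an int8 instead of a float32 cuts memory by 4x, which is why quantized models fit on consumer GPUs.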
8

Tokens

In one sentence:

The pieces that AI breaks text into before processing. Not always whole words — sometimes parts of words, punctuation, or spaces.

Think of it like:

Lego bricks. You build sentences from smaller pieces. The AI reads and writes one piece at a time.

"The quick brown fox jumps"
→ tokenized into: The | quick | brown | fox | jumps — 5 tokens

"I can't believe it's not butter"
→ tokenized into: I | can | 't | believe | it | 's | not | butter — 8 tokens

Tokenization uses algorithms like Byte-Pair Encoding (BPE) or SentencePiece to break text into subword units. Common words are single tokens, while rare words are split into pieces. GPT-4 uses a ~100K-token vocabulary. One token is roughly 4 characters or 0.75 words in English. Tokenization differs across languages — non-Latin scripts typically require more tokens per word.
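The "roughly 4 characters per token" rule of thumb gives a quick way to estimate token counts without a real tokenizer. This is only a ballpark heuristic for English text, not an actual BPE implementation.

```python
def estimate_tokens(text):
    # Rule of thumb from above: ~4 characters per English token.
    # Real tokenizers (BPE / SentencePiece) split on learned subwords,
    # so actual counts vary; treat this as a ballpark figure only.
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps"))        # estimates 6 (true count above: 5)
print(estimate_tokens("I can't believe it's not butter"))  # estimates 8
```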
9

Large Language Model (LLM)

In one sentence:

A neural network trained on massive amounts of text that can generate, understand, and reason about language.

Think of it like:

Someone who has read every book, article, and website ever written, and can write in any style — from Shakespeare to code to legal contracts.

LLMs are typically Transformer-based models with billions of parameters, trained on trillions of tokens using self-supervised learning (next-token prediction). They develop emergent capabilities at scale, including reasoning, translation, and code generation. Examples: GPT-4, Claude, Llama, Gemini. Post-training steps like RLHF align them with human preferences.
10

GPT (Generative Pre-trained Transformer)

In one sentence:

A specific type of LLM architecture by OpenAI. G = generates text, P = pre-trained on lots of data, T = uses the Transformer architecture.

Think of it like:

A specific brand of car engine. "LLM" is the category (all car engines), "GPT" is a particular design (like a V8 from a specific manufacturer).

GPT models are autoregressive, decoder-only Transformers. GPT-1 (2018) had 117M parameters, GPT-2 (2019) had 1.5B, GPT-3 (2020) had 175B, and GPT-4 (2023) is rumored to be a mixture-of-experts model. Each version showed dramatic capability improvements, largely from scaling data, compute, and parameters.
11

Context Window

In one sentence:

How much text the AI can "see" at once — both your input and its output must fit within this window.

Think of it like:

The size of the AI's desk. A bigger desk means it can spread out more documents and see more information at once.

Context window sizes:
GPT-3 (2020): 4K tokens — roughly 3 pages of text
GPT-4 (2023): 128K tokens — roughly a 300-page book
Claude 3.5 (2024): 200K tokens — ~150K words, an entire novel
Gemini 1.5 (2024): 1M+ tokens — multiple books at once
ChatGPT 5.4 (2025): 1M+ tokens — OpenAI's latest flagship model
Claude Opus 4.6 (2025): 200K tokens — enhanced reasoning over full context
Gemini 3 (2025): 1M+ tokens — multimodal with massive context
Context window is limited by the self-attention mechanism, which has O(n²) memory complexity. Longer contexts require techniques like sparse attention, sliding window attention, or ALiBi positional encodings. The context includes both the prompt and the completion. Models like Claude use extended context efficiently but may show degraded performance for information in the middle ("lost in the middle" phenomenon).
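The O(n²) cost is easy to see: self-attention compares every token with every other token, so doubling the context quadruples the number of comparisons.

```python
def attention_scores(n_tokens):
    # Self-attention compares every token against every other token,
    # so the score matrix has n * n entries: the O(n^2) cost.
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {attention_scores(n):,} pairwise scores")
```

Going from 1K to 100K tokens multiplies the score count by 10,000, which is why long-context models need tricks like sparse or sliding-window attention.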
12

Inference

In one sentence:

When the AI actually generates a response. The "using" phase as opposed to the "learning" phase.

Think of it like:

Training is studying for the exam. Inference is taking the exam.

Inference involves a forward pass through the network. For autoregressive models like GPT and Claude, tokens are generated one at a time, each conditioned on all previous tokens. Inference costs scale with context length and output length. Techniques like KV caching, speculative decoding, and quantization reduce inference costs and latency.
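The autoregressive loop can be sketched with a stand-in for the model: generate a token, append it to the context, repeat. Here a hard-coded lookup table plays the role of the real forward pass.

```python
def toy_next_token(context):
    # Stand-in for a real model's forward pass: a hypothetical
    # hard-coded lookup keyed on the last token only.
    table = {"the": "cat", "cat": "sat", "sat": "down"}
    return table.get(context[-1], "<end>")

def generate(prompt, max_tokens=5):
    # Autoregressive decoding: each generated token is appended to
    # the context and fed back in to produce the next one.
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = toy_next_token(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the"))  # "the cat sat down"
```

Because each step depends on the previous one, generation is inherently sequential, which is what KV caching and speculative decoding try to speed up.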
13

Prompt

In one sentence:

The text you send to an AI — your question, instruction, or input. Everything the model sees before generating a response.

Think of it like:

The question on an exam paper. The clearer and more specific the question, the better the answer you'll get.

Prompts can include system messages (setting behavior), user messages, and assistant messages (for multi-turn conversations). The prompt is tokenized and processed through the model. Prompt construction significantly affects output quality — techniques like few-shot examples, role-playing instructions, and structured output formats improve results.
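A multi-turn prompt in the system/user/assistant shape described above might look like this in Python. The role-and-content structure is typical of modern chat APIs, though exact field names vary by provider.

```python
# A generic chat-style prompt: a system message sets behavior, then
# alternating user/assistant turns. Everything here is tokenized
# together and becomes the model's input.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A small chunk of text a model reads or writes."},
    {"role": "user", "content": "And a context window?"},
]

prompt_chars = sum(len(m["content"]) for m in messages)
print(f"{len(messages)} messages, {prompt_chars} characters of prompt")
```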
14

Prompt Engineering

In one sentence:

The art of writing prompts that get the best results from AI — including techniques like giving examples, assigning roles, and structuring instructions.

Think of it like:

Knowing how to ask the right question to get the right answer. "Tell me about dogs" vs. "Compare 3 hypoallergenic dog breeds for apartments, in a table."

Key techniques: zero-shot (no examples), few-shot (providing examples), chain-of-thought (asking for step-by-step reasoning), self-consistency (generating multiple answers and picking the consensus), and tree-of-thought (exploring multiple reasoning paths). System prompts set model behavior, and structured output formats (JSON, XML) improve parsing reliability.
15

Chain of Thought (CoT)

In one sentence:

Getting AI to show its reasoning step by step instead of jumping straight to an answer, which dramatically improves accuracy on complex problems.

Think of it like:

Showing your work in math class. The AI "thinks out loud" and catches mistakes along the way.

CoT prompting (Wei et al., 2022) dramatically improves performance on math, logic, and reasoning tasks. Techniques include "Let's think step by step" (zero-shot CoT), providing worked examples (few-shot CoT), and extended thinking modes where models reason in a scratchpad before answering. Models like Claude use extended thinking natively for complex problems.
16

Temperature

In one sentence:

A setting that controls how creative/random vs. deterministic the AI's output is. Low temperature = predictable, high temperature = creative and varied.

Think of it like:

A dial between "play it safe" and "get creative." Temperature 0 always picks the most likely word; temperature 1+ takes more risks.

Temperature scales the logits (raw model output scores) before applying softmax to create a probability distribution. Temperature 0 makes the distribution sharp (argmax/greedy decoding). Temperature 1 uses the model's natural distribution. Higher values flatten the distribution, increasing randomness. Related parameters: top-p (nucleus sampling) and top-k limit which tokens are considered.
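Here is the logit-scaling step in plain Python: divide the raw scores by the temperature before softmax, and watch the distribution sharpen or flatten.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax.
    # T < 1 sharpens the distribution (safer picks);
    # T > 1 flattens it (more randomness).
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # raw scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T=0.2 nearly all probability lands on the top token; at T=2.0 the three options become much closer to equally likely.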
17

Hallucination

In one sentence:

When AI confidently generates incorrect or made-up information — citing fake sources, inventing facts, or getting details wrong.

Think of it like:

A student who doesn't know the answer but writes something convincing anyway. It sounds right, but it's completely fabricated.

Hallucinations occur because LLMs are trained to generate plausible-sounding text, not to verify truth. They arise from training data gaps, distributional patterns, and the model's inability to distinguish knowledge from pattern completion. Mitigation strategies include RAG (grounding in documents), fine-tuning on factual data, calibration training, and teaching models to say "I don't know."
18

Fine-tuning

In one sentence:

Taking a pre-trained model and training it further on specific data for a specific task — like specializing after a general education.

Think of it like:

A doctor who went to medical school (pre-training) and then specialized in cardiology (fine-tuning). Same brain, focused expertise.

Fine-tuning updates model weights using a smaller, task-specific dataset. Techniques include full fine-tuning (updating all parameters), LoRA (Low-Rank Adaptation, updating small adapter matrices), and QLoRA (quantized LoRA). Instruction tuning is a form of fine-tuning that teaches models to follow instructions. RLHF is another post-training technique.
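The LoRA trick can be shown with toy matrices: instead of updating a full weight matrix W, learn two skinny matrices B and A and use W + BA. The numbers here are illustrative; real adapters use rank 8–64 on matrices with thousands of rows.

```python
def matmul(A, B):
    # Plain nested-list matrix multiply (no numpy, for self-containment)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Full fine-tuning of this 4x4 W would update 16 numbers; the rank-1
# adapter B (4x1) and A (1x4) has only 8, yet still shifts W's behavior.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [0.0], [0.0]]   # learned during fine-tuning
A = [[0.0, 0.2, 0.0, 0.0]]         # learned during fine-tuning

W_adapted = matadd(W, matmul(B, A))
print(W_adapted[0])  # only row 0 changed: its second entry is now 0.1
```

The base weights W stay frozen, so one model can swap between many small task-specific adapters.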
19

Reinforcement Learning (RL)

In one sentence:

Training AI by giving it rewards for good behavior and penalties for bad behavior, so it learns what to do through trial and error.

Think of it like:

Training a dog — treat for sitting, no treat for barking. Over time, the dog figures out what earns rewards.

RL involves an agent taking actions in an environment to maximize cumulative reward. Key concepts: states, actions, rewards, policies, and value functions. Algorithms include Q-learning, policy gradient methods (PPO, A2C), and actor-critic methods. RL powers game-playing AI (AlphaGo, Atari) and is used in RLHF for language model alignment.
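Tabular Q-learning, the simplest relative of the algorithms listed above, fits in a short script. This toy world has four states in a row, with a reward for reaching the last one; the agent learns action values purely from trial and error.

```python
import random

# States 0..3 in a row; reach state 3 for a reward of 1.
random.seed(0)                            # deterministic for illustration
n_states, actions = 4, (-1, 1)            # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2     # learning rate, discount, exploration

for _ in range(500):                      # 500 episodes of trial and error
    s = 0
    while s != 3:
        # Explore sometimes, otherwise exploit the best-known action
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda act: Q[(s, act)]))
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == 3 else 0.0       # reward only at the goal
        # Q-learning update: nudge the estimate toward reward + future value
        best_next = max(Q[(s2, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print(round(Q[(2, 1)], 2))  # moving right next to the goal is highly valued
```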
20

RLHF (Reinforcement Learning from Human Feedback)

In one sentence:

Humans rate AI outputs as good or bad, and the AI learns from that feedback to produce more helpful, harmless, and honest responses.

Think of it like:

A chef getting reviews from food critics and adjusting recipes based on what diners actually enjoy.

RLHF involves three steps: (1) supervised fine-tuning on demonstration data, (2) training a reward model from human comparisons of outputs, (3) optimizing the language model against the reward model using PPO or similar algorithms. Anthropic uses Constitutional AI (CAI), where AI principles guide the feedback process alongside human input.
21

Embedding

In one sentence:

Converting text (or images, audio) into numbers (vectors) so the AI can understand meaning and compare similarity.

Think of it like:

Giving every word a GPS coordinate in "meaning space." Similar words like "happy" and "joyful" end up close together on the map.

Embeddings map discrete tokens to continuous vector spaces (typically 768–4096 dimensions). They capture semantic relationships: king - man + woman = queen. Used for semantic search, clustering, recommendation systems, and as input layers in neural networks. Popular embedding models: OpenAI's text-embedding-3, Cohere Embed, and open-source models like E5 and BGE.
22

RAG (Retrieval-Augmented Generation)

In one sentence:

Letting the AI search a knowledge base before answering, so it has access to up-to-date or private information it wasn't trained on.

Think of it like:

An open-book exam vs. a closed-book exam. The AI looks up information in your documents before answering, instead of relying only on memory.

RAG pipelines: (1) chunk documents into passages, (2) embed them into vectors, (3) store in a vector database (Pinecone, Weaviate, ChromaDB), (4) at query time, embed the query, retrieve relevant chunks via similarity search, (5) inject retrieved chunks into the prompt as context. This reduces hallucination and enables domain-specific knowledge without fine-tuning.
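The five-step pipeline above can be run in miniature, with word-count vectors standing in for learned embeddings and an in-memory list standing in for the vector database. The documents and query are invented examples.

```python
import math
import re
from collections import Counter

documents = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our office is open Monday to Friday, nine to five.",
    "Shipping to Europe usually takes seven business days.",
]

def embed(text):
    # Toy "embedding": bag-of-words counts (real systems use a neural model)
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def similarity(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

# Steps 1-3: chunk (one sentence per chunk), embed, store
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    # Steps 4: embed the query and rank chunks by similarity
    q = embed(query)
    ranked = sorted(index, key=lambda d: similarity(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Step 5: inject the retrieved chunk into the prompt as context
query = "How many days do I have for returns?"
context = retrieve(query)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

The model now answers from the retrieved passage instead of from memory, which is the "open-book exam" in action.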
23

Multimodal

In one sentence:

AI that can understand and generate multiple types of content — text, images, audio, video — not just one format.

Think of it like:

A person who can read, see photos, listen to music, and watch videos — not just read text. They understand the world through multiple senses.

Multimodal models process different input types through modality-specific encoders (vision transformers for images, audio encoders for speech) that project into a shared embedding space. Examples: GPT-4o (text + image + audio), Claude 3 (text + image), Gemini (text + image + audio + video). This enables tasks like describing images, analyzing charts, and understanding screenshots.
24

Agent

In one sentence:

An AI system that can take actions, use tools, and make decisions autonomously — not just answer questions, but actually do things.

Think of it like:

Regular AI answers questions. An agent can actually DO things — browse the web, write and run code, create files, send emails, and plan multi-step workflows.

AI agents use an observation-thought-action loop. The LLM serves as a "reasoning engine" that decides which tools to call (code execution, web search, file I/O, APIs) and how to chain actions together. Frameworks include Claude Code (terminal agent), OpenAI Codex CLI, LangChain, AutoGPT, and CrewAI. Key challenges: reliability, safety, and knowing when to ask for human confirmation.
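A minimal observation-thought-action loop looks like this. The fake_llm function is a hypothetical rule-based stand-in: a real agent asks a language model to choose the next tool call based on everything observed so far.

```python
import datetime

def tool_calculator(expression):
    # Evaluates a trusted, hard-coded arithmetic string (illustration only)
    return str(eval(expression, {"__builtins__": {}}))

def tool_today(_arg=None):
    return datetime.date.today().isoformat()

TOOLS = {"calculator": tool_calculator, "today": tool_today}

def fake_llm(task, observations):
    # Stand-in "reasoning engine": decide the next action from what
    # has been observed so far. A real agent would query an LLM here.
    if not observations:
        return ("calculator", "6 * 7")   # thought: compute the product first
    return ("finish", f"The answer is {observations[-1]}.")

def run_agent(task):
    observations = []
    while True:
        action, arg = fake_llm(task, observations)   # think
        if action == "finish":
            return arg
        result = TOOLS[action](arg)                  # act
        observations.append(result)                  # observe

print(run_agent("What is 6 times 7?"))  # The answer is 42.
```

The loop structure is the same in real frameworks; the hard parts the text mentions (reliability, safety, asking for confirmation) live inside the decision step.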
25

Open Source vs. Closed Source Models

In one sentence:

Open source means the model's code and weights are public and free to use (Llama, Mistral). Closed source means you can only access it via an API (GPT-4, Claude).

Think of it like:

A recipe you can see, modify, and cook at home vs. a secret sauce you can only order at the restaurant.

Open-source models (Llama 3, Mistral, Qwen) release model weights for download and local inference. Benefits: privacy, customization, no API costs. Tradeoffs: requires your own GPU infrastructure, may lack safety training. Closed-source models (GPT-4, Claude) offer superior capabilities and safety but require API access and usage fees. "Open weights" models release weights but not training code/data.
26

MCP (Model Context Protocol)

In one sentence:

A universal standard that lets AI assistants connect to external tools and data sources — like apps on your phone, but for AI.

Think of it like:

USB for AI. Before USB, every device needed a different cable. MCP is the one standard plug that lets any AI tool connect to any service — GitHub, Slack, databases, cloud providers, and more.

MCP (Model Context Protocol) was created by Anthropic and open-sourced as a standard for AI-tool integration. It uses a client-server architecture: the AI app (Claude Code, Codex) is the client, and each integration is an MCP server that exposes tools and resources. Servers can run locally or remotely and communicate via JSON-RPC. This replaces custom API integrations with a single protocol. See the MCP page for connectors and setup.
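An MCP tool call travels as a JSON-RPC 2.0 request. The sketch below follows the general shape of the protocol's tools/call method; the tool name and arguments are hypothetical, and the MCP specification is the authoritative source for exact fields.

```python
import json

# The general shape of an MCP client asking a server to run a tool.
# "search_issues" and its arguments are invented for illustration;
# consult the MCP specification for the authoritative schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_issues",
        "arguments": {"query": "open bugs", "limit": 5},
    },
}

payload = json.dumps(request)
print(payload)  # this is what travels between client and server
```

Because every integration speaks this same request/response shape, the AI app needs one client implementation rather than a custom adapter per service.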

AI Cheatsheet

Quick reference for models, companies, numbers, and buzzwords

🤖 Models to Know

ChatGPT (5.4): OpenAI's flagship multimodal model; fast, capable, widely used
Claude (Opus 4.6): Anthropic's model family; known for safety, long context, and coding
Gemini 3: Google DeepMind's multimodal model; massive context window (1M+ tokens)
Llama 3: Meta's open-source model; strong performance you can run locally
Mistral: European open-source models; punch above their weight in performance per parameter
Stable Diffusion: Open-source image generation model; runs locally, huge community

🏢 Companies to Know

OpenAI: Created the GPT series and ChatGPT; popularized AI for consumers
Anthropic: Created Claude; focused on AI safety and responsible development
Google DeepMind: Created AlphaGo, Gemini, and the Transformer architecture; deep research powerhouse
Meta AI: Open-sourced Llama; major contributor to open AI research
Mistral AI: French startup; efficient open-source models rivaling larger competitors
Stability AI: Created Stable Diffusion; democratized image generation
Hugging Face: The "GitHub of ML" — hosts models, datasets, and tools for the community

📊 Key Numbers

Training cost: $2M–$100M+ for frontier models; GPT-4 rumored at $100M+
GPT-3 params: 175 billion parameters (the "knobs" the model learned to tune)
Llama 3 params: 8B, 70B, and 405B parameter variants available
Training data: Trillions of tokens — much of the public internet
1 token: Roughly 4 characters or 0.75 words in English
ChatGPT users: 100M users in 2 months (the fastest-growing consumer app at the time)

💡 Buzzwords Decoded

AGI (Artificial General Intelligence): AI that can do any intellectual task a human can
ASI (Artificial Super Intelligence): hypothetical AI smarter than all humans combined
Alignment: Making sure AI systems do what humans actually want them to do
Safety: Ensuring AI doesn't cause harm — from misinformation to misuse
Guardrails: Built-in limits that prevent AI from generating harmful or inappropriate content
Jailbreaking: Tricking AI into bypassing its safety guardrails through clever prompting
Built with care for the Teach Me Dev project