From paradigms & equations to Python demos, mock data, and real-world applications — a complete RL presentation.
| Feature | Supervised SL | Unsupervised UL | Reinforcement RL |
|---|---|---|---|
| Data | Labeled (X, y) | Unlabeled raw data | Trajectories (State, Action, Reward) |
| Feedback | Immediate error (Loss) | No explicit feedback | Delayed scalar Reward |
| Goal | Predict labels | Find hidden structure | Maximize cumulative reward |
| Decision | Static (i.i.d) | Static (patterns) | Sequential (actions affect future) |
| Analogy | Teacher with answer key | Librarian organizing books | Training a dog with treats |
# Pseudocode
def build_tree(data, features):
if pure or max_depth: return leaf
best = select_best_split(data, features)
tree = {best: {}}
for val in unique(data[best]):
sub = data[data[best] == val]
tree[best][val] = build_tree(sub, features - {best})
return tree
# Python (sklearn)
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
# Pseudocode
def random_forest(data, features, n):
forest = []
for i in range(n):
boot = sample_with_replacement(data)
tree = build_tree(boot, random_subset(features))
forest.append(tree)
return majority_vote(forest)
# Python (sklearn)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
# Pseudocode
def kmeans(data, k):
centroids = random_k_points(data, k)
while not converged:
clusters = assign_to_nearest(data, centroids)
new_centroids = mean(clusters)
if new_centroids == centroids: break
centroids = new_centroids
return centroids, clusters
# Python (sklearn)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_
Learn Q-values directly from experience.
# Q-Learning Pseudocode
Initialize Q(s,a) arbitrarily
For each episode:
Initialize state s
For each step:
Choose a from s using ε-greedy(Q)
Take a, observe r, s'
Q(s,a) += α * (r + γ * max Q(s',a') - Q(s,a))
s = s'
Until s is terminal
Learn policy directly, often with a Critic for variance reduction.
# Policy Gradient (REINFORCE) Pseudocode
Initialize policy π_θ
For each episode:
Generate trajectory {s_t, a_t, r_t}
For each step t:
G = Σ γ^(k-t) r_k # Return
θ += α * ∇θ log π(a_t|s_t) * G
Update policy
Q-Learning on a 4x4 grid. 🍬 Candy +10 🕳️ Pit -10
Optimal Path: (0,0)→(1,0)→(2,0)→(2,1)→(2,2)→(3,2)→(3,3)
import numpy as np
import random
# 4x4 Grid: 0=Empty, 1=Candy, -1=Pit
grid = [[0,0,0,0],[0,-1,0,0],[0,0,0,0],[0,0,0,1]]
goal = (3,3); actions = [(-1,0),(1,0),(0,-1),(0,1)]
Q = np.zeros((16,4)); alpha, gamma, eps = 0.8, 0.95, 0.1
def step(s, a):
r,c = divmod(s,4); dr,dc = actions[a]
nr,nc = max(0,min(3,r+dr)), max(0,min(3,c+dc))
ns = nr*4+nc
reward = -1
if grid[nr][nc] == 1: reward = 10
if grid[nr][nc] == -1: reward = -10
return ns, reward, grid[nr][nc] != 0
for ep in range(500):
s = 0; done = False
while not done:
a = random.randint(0,3) if random.random() < eps else np.argmax(Q[s])
ns, r, done = step(s, a)
Q[s,a] += alpha * (r + gamma * np.max(Q[ns]) - Q[s,a])
s = ns
print("Trained Q-Table (first 5 states):\n", Q[:5])
# Test: Greedy path
s, path = 0, [(0,0)]
while s != 15:
a = np.argmax(Q[s]); s, _, _ = step(s, a)
path.append(divmod(s,4))
print("Path:", path) # Optimal route!
On‑policy SARSA and Deep Q‑Learning with neural network on the same Candy Maze.
# SARSA update (replace Q-learning update)
# Choose a' using same ε-greedy policy
a_prime = epsilon_greedy(Q, s_prime, eps)
Q[s,a] += alpha * (r + gamma * Q[s_prime, a_prime] - Q[s,a])
# Full training loop (same environment)
for ep in range(500):
s = 0; a = epsilon_greedy(Q, s, eps)
done = False
while not done:
ns, r, done = step(s, a)
a_next = epsilon_greedy(Q, ns, eps)
Q[s,a] += alpha * (r + gamma * Q[ns, a_next] - Q[s,a])
s, a = ns, a_next
# Path will be slightly different due to on-policy nature.
import torch, torch.nn as nn, torch.optim as optim
import numpy as np
from collections import deque
# Simple NN: input 16 (one-hot state) + 4 actions? Better: state -> 4 Q-values
class DQN(nn.Module):
def __init__(self):
super().__init__()
self.fc = nn.Sequential(
nn.Linear(16, 64), nn.ReLU(),
nn.Linear(64, 64), nn.ReLU(),
nn.Linear(64, 4)
)
def forward(self, x): return self.fc(x)
# State representation: one-hot of state index (16)
def state_to_tensor(s):
x = np.zeros(16); x[s]=1; return torch.tensor(x, dtype=torch.float32)
model = DQN(); target = DQN(); target.load_state_dict(model.state_dict())
optimizer = optim.Adam(model.parameters(), lr=0.001)
replay = deque(maxlen=2000); batch_size=32; gamma=0.95; eps=0.1
for ep in range(500):
s = 0; done = False
while not done:
# ε-greedy using model
if np.random.rand() < eps:
a = np.random.randint(4)
else:
with torch.no_grad():
q_vals = model(state_to_tensor(s))
a = torch.argmax(q_vals).item()
ns, r, done = step(s, a)
replay.append((s, a, r, ns, done))
s = ns
# Train if enough samples
if len(replay) >= batch_size:
batch = np.random.choice(len(replay), batch_size, replace=False)
# ... compute loss (MSE) and update model periodically
# Target network update every 10 episodes
# Update target network every 10 episodes
if ep % 10 == 0: target.load_state_dict(model.state_dict())
# After training, greedy path with model
Both algorithms solve the maze; DQN is essential when state space is large (e.g., raw pixels). SARSA is on‑policy, more conservative.
| # | Domain | State | Action | Reward | Next State / Info |
|---|---|---|---|---|---|
| 1 | Maze | State 5 (row=1, col=1) | Down | -1 | State 9 (row=2, col=1) |
| 2 | Maze | State 10 (row=2, col=2) | Right | +10 | Goal! (Candy found) |
| 3 | Self-Driving | Steering=0.2, Speed=12, Lane=0.1 | Throttle 0.3 | +5.2 | Lane=0.05, Speed=13 |
| 4 | FX Trader | RSI=72, Price=1.1050, Vol=12 | SELL (size 0.5) | +$23.40 | Price drops to 1.1042 |
| 5 | 3x3 Puzzle | Board [1,2,3,4,0,5,6,7,8] | Blank Down | -1 | Board [1,2,3,4,7,5,6,0,8] |
These samples represent typical transitions logged during RL training — used by Q‑Learning, SARSA, and DQN.
RL is sequential decision-making under uncertainty, trading off exploration vs. exploitation with delayed feedback.
Every problem (Maze, Puzzle, Car, Trader) reduces to the same Agent, Environment, State, Action, Reward loop. The math (Bellman) stays constant; only the function approximator scales.