Topic 7A - Reinforcement Learning

When you cannot check every possibility, you climb toward the best answer you can find — or let evolution find it for you.

Learning Objectives

By the end of this topic, you should be able to:

Learning Activities

To help you meet the learning objectives, we have prepared three readings.

Readings

These readings intentionally build on each other, so please complete them in order.

Checking for Understanding

Review the Learning Objectives at the top of this page. The questions below will help you check your understanding before moving on to Topic 7B.

Optimization and Fitness Landscapes

  1. A school district wants to assign 40 teachers to 40 classrooms to minimize travel time between rooms for team-teaching pairs. There are more than 8 × 10⁴⁷ possible assignments. Why is exhaustive search impossible here? What does this tell us about the kind of algorithm needed?
  2. Describe a fitness landscape for the classroom assignment problem. What does a "peak" represent? What does a "valley" represent? What would it mean for two solutions to be "neighbors"?
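To make the scale of the problem concrete, here is a minimal sketch of the assignment problem at toy size. The travel-time matrix, the team-teaching pairs, and the cost function are all hypothetical stand-ins, but they show why exhaustive search works for 4 teachers and collapses for 40:

```python
import itertools

# Hypothetical miniature version: 4 teachers, 4 rooms, and a travel-time
# matrix between rooms. TRAVEL[i][j] = minutes to walk from room i to room j.
TRAVEL = [
    [0, 2, 5, 9],
    [2, 0, 3, 7],
    [5, 3, 0, 4],
    [9, 7, 4, 0],
]
PAIRS = [(0, 1), (2, 3)]  # indices of teachers who team-teach together

def cost(assignment):
    """Total walking time for all team-teaching pairs,
    where assignment[t] is the room given to teacher t."""
    return sum(TRAVEL[assignment[a]][assignment[b]] for a, b in PAIRS)

# With 4 teachers, exhaustive search is trivial: only 4! = 24 assignments.
# With 40 teachers there are 40! > 8 * 10**47 -- far too many to enumerate.
best = min(itertools.permutations(range(4)), key=cost)
print(best, cost(best))  # -> (0, 1, 2, 3) 6
```

In fitness-landscape terms, each permutation is a point on the landscape, low cost is a peak, and two assignments are neighbors if one swap of two teachers' rooms turns one into the other.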

Hill Climbing

  1. Trace hill climbing on the following simplified scenario: a teacher is trying to find the best order for five topics in a unit, scored by student engagement ratings from a pilot. She starts with order [A, B, C, D, E] which scores 62. Swapping B and C produces a score of 71. Swapping A and B from the new arrangement produces 68. Swapping D and E produces 74. Swapping C and D produces 73. What does hill climbing do at each step? Where does it stop?
  2. A hill climbing agent searching for a good weekly staff meeting schedule keeps trying small changes but every change it considers makes the score worse. It stops and reports the current schedule as optimal. A staff member points out that a completely different schedule structure — moving the meeting to a different day entirely — would be dramatically better. Which hill climbing pathology does this illustrate? Why couldn't the agent find the better solution?
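The trace in question 1 can be sketched as code. This is a minimal greedy hill climber over orderings, using a toy scoring function and adjacent-pair swaps as the neighbor move; the scoring function is a hypothetical stand-in for the engagement ratings in the question:

```python
def hill_climb(start, score, neighbors, max_steps=1000):
    """Greedy hill climbing: repeatedly move to the best-scoring
    neighbor, and stop when no neighbor improves the score."""
    current, current_score = start, score(start)
    for _ in range(max_steps):
        best, best_score = current, current_score
        for n in neighbors(current):
            s = score(n)
            if s > best_score:
                best, best_score = n, s
        if best == current:  # no neighbor is better: a (local) optimum
            return current, current_score
        current, current_score = best, best_score
    return current, current_score

# Toy scoring function (hypothetical): 10 points for each topic
# already in its position in some ideal ordering.
TARGET = list("ABCDE")

def score(order):
    return sum(10 for got, want in zip(order, TARGET) if got == want)

def neighbors(order):
    """All orderings reachable by swapping one adjacent pair of topics."""
    for i in range(len(order) - 1):
        n = list(order)
        n[i], n[i + 1] = n[i + 1], n[i]
        yield n

best, s = hill_climb(list("BACDE"), score, neighbors)
print(best, s)  # -> ['A', 'B', 'C', 'D', 'E'] 50
```

Note that the stopping condition is exactly the pathology in question 2: the climber halts as soon as no *neighbor* improves, even if a distant restructuring would score far higher.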

Genetic Algorithms

  1. A genetic algorithm is being used to optimize a school bus route. The population consists of ten candidate routes. After evaluating fitness, the two best-performing routes are selected as parents. Their routes are split at the midpoint and recombined to create two offspring routes. In one offspring, two stops are then randomly swapped.
    • Which step is selection? Which is crossover? Which is mutation?
    • Why might the offspring routes be better than either parent?
    • Why might they be worse? What happens in the next generation if they are?
  2. Explain in your own words why genetic algorithms are better than hill climbing at escaping local optima. What specific mechanism allows a genetic algorithm to explore parts of the solution space that hill climbing would never reach?
  3. Both hill climbing and genetic algorithms improve solutions through a reward signal rather than labeled training data. In one or two sentences, explain how this connects them to the reinforcement learning category from Topic 6D.

It is completely fine to revisit the readings as you work through these questions.

Extend Your Learning

These optional topics go beyond the core learning goals but are rich avenues for deeper understanding.

  • Simulated annealing
    • A variant of hill climbing that sometimes accepts a worse solution deliberately — like cooling metal slowly to reach a more stable structure. This controlled randomness helps it escape local optima while still converging on a good solution.
  • Evolutionary programming
    • When genetic algorithms are applied specifically to evolving programs rather than solutions — letting code itself evolve toward a goal rather than being explicitly written. The textbook's BACON and AUTOCLASS systems are early examples of this direction.
  • Neuroevolution
    • Using genetic algorithms to evolve the structure and weights of neural networks, rather than training them by gradient descent. An active research area that combines Topics 7A and 7B in interesting ways.
  • The traveling salesman problem
    • One of the classic benchmarks for optimization algorithms. Both hill climbing and genetic algorithms have been applied to it extensively, and comparing their performance on this problem illustrates their respective strengths and limits concretely.
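For readers curious about the simulated annealing idea above, here is a minimal sketch. The one-dimensional landscape, temperatures, and cooling rate are all hypothetical choices for illustration; the key line is the one that sometimes accepts a *worse* neighbor, with a probability that shrinks as the temperature cools:

```python
import math
import random

random.seed(1)

def simulated_annealing(start, score, neighbor, t0=10.0, cooling=0.95, steps=200):
    """Like hill climbing, but a worse neighbor is accepted with
    probability exp(delta / t), which falls as temperature t cools."""
    current, current_score = start, score(start)
    best, best_score = current, current_score
    t = t0
    for _ in range(steps):
        cand = neighbor(current)
        cand_score = score(cand)
        delta = cand_score - current_score
        if delta > 0 or random.random() < math.exp(delta / t):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        t *= cooling  # cool down: downhill moves become rarer over time
    return best, best_score

# Toy 1-D landscape (hypothetical): a local peak of height 5 at x = 2
# and a taller global peak of height 9 at x = 8.
def score(x):
    return max(5 - (x - 2) ** 2, 9 - (x - 8) ** 2)

best, s = simulated_annealing(0, score, lambda x: x + random.choice([-1, 1]))
print(best, s)
```

Plain hill climbing started at x = 0 would stop at the local peak at x = 2; the early high-temperature phase gives the annealer a chance to cross the valley toward x = 8 before the downhill moves are frozen out.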