Topic 7C - Unsupervised Learning and Pattern Discovery

No labels. No right answers. Just data — and the structure hiding inside it.

Learning Objectives

By the end of this topic, you should be able to:

  • Explain how K-means clustering works step by step, and what the human contributes when interpreting the clusters it finds.
  • Describe what dimensionality reduction (such as PCA) preserves and discards, and why a lower-dimensional representation can still be useful.
  • Explain how autoencoders and next-word-prediction pretraining learn from unlabeled data, and why the abundance of unlabeled data matters for modern AI systems.

Learning Activities

To help you meet the learning objectives, we have prepared three readings.

Readings

These readings intentionally build on each other, so please complete them in order.

Checking for Understanding

Review the Learning Objectives at the top of this page. The questions below will help you check your understanding before moving on to Topic 7D.

Clustering

  1. Explain in your own words what K-means clustering is doing at each step. What is a centroid? What does it mean for a data point to be assigned to a cluster? What makes the algorithm stop?
  2. A school counselor uses K-means on three years of student data and discovers four clusters. She labels them herself: "academically engaged," "socially connected but academically disengaged," "quietly struggling," and "broadly disengaged." What did the algorithm contribute to this result, and what did the human contribute? Could the algorithm have produced the labels on its own?
  3. K-means requires you to specify the number of clusters K before the algorithm runs. Why is this a limitation? What might go wrong if K is too small? Too large?
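To make the assignment step, update step, and stopping rule concrete, here is a minimal sketch of K-means on made-up 2-D points (the data and the deterministic initialization are simplifications for illustration; real implementations initialize randomly, often with several restarts):

```python
def kmeans(points, k, iters=100):
    """Toy K-means on 2-D points. We take the first k points as the
    starting centroids so the run is deterministic; real implementations
    use random (and repeated) initialization."""
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # stopping rule: no centroid moved, so assignments
            break             # cannot change either -- the algorithm has converged
        centroids = new
    return centroids, clusters

# Two well-separated blobs of points.
data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids, clusters = kmeans(data, k=2)
```

With K = 2 the centroids settle at the means of the two blobs; rerunning with K = 1 or K = 4 on the same data is a quick way to see what "too small" and "too large" look like in question 3.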

Dimensionality Reduction

  1. A dataset of student performance records has 40 attributes per student. After PCA, the data is represented in 3 dimensions. What has been preserved? What has been lost? Why might the 3-dimensional version be more useful for some purposes despite having less information?
  2. Explain why visualizing 40-dimensional data directly is impossible for humans, and how dimensionality reduction makes visualization possible. What is the risk of drawing conclusions from a 2D visualization of 40-dimensional data?
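The preserved/lost trade-off in question 1 can be seen directly in code. This sketch uses synthetic stand-in data (not real student records): 40 attributes that secretly vary along only 3 hidden directions, plus a little noise. PCA is computed here via the SVD of the centered data, which is one standard way to do it:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-in for 200 student records with 40 attributes each,
# where most variation actually lies along 3 hidden factors.
hidden = rng.normal(size=(200, 3))                        # 3 underlying factors
mixing = rng.normal(size=(3, 40))                         # how factors appear in attributes
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 40))   # small measurement noise

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of total variance per component

Z = Xc @ Vt[:3].T   # the 3-D representation: project onto the top 3 components
print(Z.shape)                 # (200, 3)
print(explained[:3].sum())     # close to 1.0: the top 3 directions keep most variance
```

What is preserved is the directions of largest variance; what is lost is everything orthogonal to them, which here is mostly noise but in real data could include meaningful detail, hence the caution in question 2.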

Autoencoders and Pretraining

  1. An autoencoder is trained to reconstruct its inputs. It has no labels and no "correct answers" in the usual sense. In what sense is it still learning something? What is it learning?
  2. Large language models are pretrained on hundreds of billions of words of text without any labels. The training task is simply: predict the next word. Why does successfully learning to predict the next word require the model to learn something much deeper than word sequences?
  3. Supervised learning requires labeled data, which is expensive to produce. Unsupervised pretraining uses unlabeled data, which is abundant. Explain why this difference matters for what kinds of AI systems are now possible to build.
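A tiny linear autoencoder makes question 1 concrete: the network is graded only on how well its output matches its own input, yet to score well it must discover the structure of the data. The data here is made up (8-dimensional inputs that secretly vary along only 2 directions), and the gradient-descent loop is a bare-bones sketch, not a production training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 8-dimensional inputs that secretly vary along 2 directions.
codes = rng.normal(size=(500, 2))
basis = rng.normal(size=(2, 8))
X = codes @ basis

# Minimal linear autoencoder: encode 8 numbers down to 2, decode 2 back to 8.
# The only training signal is reconstruction error -- no labels anywhere.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))
lr = 0.02
mse_start = np.mean((X @ W_enc @ W_dec - X) ** 2)
for _ in range(3000):
    Z = X @ W_enc           # bottleneck: the 2-number summary of each input
    err = Z @ W_dec - X     # reconstruction error against the input itself
    # Gradients of mean squared reconstruction error.
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
mse_end = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

After training, reconstruction error is a small fraction of what it was at the start: to squeeze 8 numbers through 2 and get them back, the encoder has to learn the 2 directions the data actually varies along. That is what it "learned" without ever seeing a label.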

It is completely fine to revisit the readings as you work through these questions.

Extend Your Learning

These optional topics go beyond the core learning goals but are rich avenues for deeper understanding.

  • Generative adversarial networks (GANs)
    • A powerful unsupervised approach where two networks compete: one generates fake data, the other tries to distinguish real from fake. Through competition, both improve — producing the photorealistic synthetic images that made "deepfakes" possible.
  • Self-supervised learning
    • A variant of unsupervised learning where the training signal is derived from the data itself — masking parts of an image and predicting the missing pieces, for example. Most modern foundation models use self-supervised approaches.
  • Word embeddings (Word2Vec, GloVe)
    • Early unsupervised techniques that learned dense vector representations of words from large text corpora — capturing semantic relationships (king − man + woman ≈ queen) purely from co-occurrence patterns. The conceptual ancestor of modern language model representations.
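The famous analogy can be checked with vector arithmetic. The 2-D "embeddings" below are hand-made for illustration (one axis loosely meaning royalty, one meaning gender), not learned from a corpus the way Word2Vec or GloVe vectors are:

```python
# Hand-made toy "embeddings" (not learned): axis 0 ~ royalty, axis 1 ~ gender.
vec = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

def add(a, b): return (a[0] + b[0], a[1] + b[1])
def sub(a, b): return (a[0] - b[0], a[1] - b[1])

# king - man + woman
result = add(sub(vec["king"], vec["man"]), vec["woman"])

# Nearest word to the result, by Euclidean distance.
nearest = min(vec, key=lambda w: (vec[w][0] - result[0]) ** 2
                               + (vec[w][1] - result[1]) ** 2)
print(nearest)  # queen
```

Real embedding spaces have hundreds of dimensions and the analogy holds only approximately, but the mechanism is the same: directions in the space come to encode relationships.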