Topic 7C - Unsupervised Learning and Pattern Discovery

No labels. No right answers. Just data — and the structure hiding inside it.

Learning Objectives

By the end of this topic, you should be able to:

  • Explain how K-means clustering works step by step, and what the human contributes when interpreting the clusters it finds.
  • Describe what dimensionality reduction (such as PCA) preserves and discards, and why a lower-dimensional representation can still be useful.
  • Explain how autoencoders and next-word-prediction pretraining learn from unlabeled data, and why the abundance of unlabeled data matters for modern AI systems.

Learning Activities

To help you meet the learning objectives, we have prepared three readings.

Readings

These readings intentionally build on each other, so please complete them in order.

Checking for Understanding

Review the Learning Objectives at the top of this page. The questions below will help you check your understanding before moving on to Topic 7D.

Clustering

  1. Explain in your own words what K-means clustering is doing at each step. What is a centroid? What does it mean for a data point to be assigned to a cluster? What makes the algorithm stop?
  2. A school counselor uses K-means on three years of student data and discovers four clusters. She labels them herself: "academically engaged," "socially connected but academically disengaged," "quietly struggling," and "broadly disengaged." What did the algorithm contribute to this result, and what did the human contribute? Could the algorithm have produced the labels on its own?
  3. K-means requires you to specify the number of clusters K before the algorithm runs. Why is this a limitation? What might go wrong if K is too small? Too large?
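To make the assignment step, update step, and stopping rule concrete, here is a minimal sketch of K-means on made-up 2-D points (the data and the deterministic initialization are simplifications for illustration; real implementations initialize randomly, often with several restarts):

```python
def kmeans(points, k, iters=100):
    """Toy K-means on 2-D points. We take the first k points as the
    starting centroids so the run is deterministic; real implementations
    use random (and repeated) initialization."""
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # stopping rule: no centroid moved, so assignments
            break             # cannot change either -- the algorithm has converged
        centroids = new
    return centroids, clusters

# Two well-separated blobs of points.
data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids, clusters = kmeans(data, k=2)
```

With K = 2 the centroids settle at the means of the two blobs; rerunning with K = 1 or K = 4 on the same data is a quick way to see what "too small" and "too large" look like in question 3.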

Dimensionality Reduction

  1. A dataset of student performance records has 40 attributes per student. After PCA, the data is represented in 3 dimensions. What has been preserved? What has been lost? Why might the 3-dimensional version be more useful for some purposes despite having less information?
  2. Explain why visualizing 40-dimensional data directly is impossible for humans, and how dimensionality reduction makes visualization possible. What is the risk of drawing conclusions from a 2D visualization of 40-dimensional data?
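The preserved/lost trade-off in question 1 can be seen directly in code. This sketch uses synthetic stand-in data (not real student records): 40 attributes that secretly vary along only 3 hidden directions, plus a little noise. PCA is computed here via the SVD of the centered data, which is one standard way to do it:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-in for 200 student records with 40 attributes each,
# where most variation actually lies along 3 hidden factors.
hidden = rng.normal(size=(200, 3))                        # 3 underlying factors
mixing = rng.normal(size=(3, 40))                         # how factors appear in attributes
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 40))   # small measurement noise

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of total variance per component

Z = Xc @ Vt[:3].T   # the 3-D representation: project onto the top 3 components
print(Z.shape)                 # (200, 3)
print(explained[:3].sum())     # close to 1.0: the top 3 directions keep most variance
```

What is preserved is the directions of largest variance; what is lost is everything orthogonal to them, which here is mostly noise but in real data could include meaningful detail, hence the caution in question 2.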

Autoencoders and Pretraining

  1. An autoencoder is trained to reconstruct its inputs. It has no labels and no "correct answers" in the usual sense. In what sense is it still learning something? What is it learning?
  2. Large language models are pretrained on hundreds of billions of words of text without any labels. The training task is simply: predict the next word. Why does successfully learning to predict the next word require the model to learn something much deeper than word sequences?
  3. Supervised learning requires labeled data, which is expensive to produce. Unsupervised pretraining uses unlabeled data, which is abundant. Explain why this difference matters for what kinds of AI systems are now possible to build.
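A tiny linear autoencoder makes question 1 concrete: the network is graded only on how well its output matches its own input, yet to score well it must discover the structure of the data. The data here is made up (8-dimensional inputs that secretly vary along only 2 directions), and the gradient-descent loop is a bare-bones sketch, not a production training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 8-dimensional inputs that secretly vary along 2 directions.
codes = rng.normal(size=(500, 2))
basis = rng.normal(size=(2, 8))
X = codes @ basis

# Minimal linear autoencoder: encode 8 numbers down to 2, decode 2 back to 8.
# The only training signal is reconstruction error -- no labels anywhere.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))
lr = 0.02
mse_start = np.mean((X @ W_enc @ W_dec - X) ** 2)
for _ in range(3000):
    Z = X @ W_enc           # bottleneck: the 2-number summary of each input
    err = Z @ W_dec - X     # reconstruction error against the input itself
    # Gradients of mean squared reconstruction error.
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
mse_end = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

After training, reconstruction error is a small fraction of what it was at the start: to squeeze 8 numbers through 2 and get them back, the encoder has to learn the 2 directions the data actually varies along. That is what it "learned" without ever seeing a label.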

It is completely fine to revisit the readings as you work through these questions.

Extend Your Learning

These optional topics go beyond the core learning goals but are rich avenues for deeper understanding.

  • Generative adversarial networks (GANs)
    • A powerful unsupervised approach where two networks compete: one generates fake data, the other tries to distinguish real from fake. Through competition, both improve — producing the photorealistic synthetic images that made "deepfakes" possible.
  • Self-supervised learning
    • A variant of unsupervised learning where the training signal is derived from the data itself — masking parts of an image and predicting the missing pieces, for example. Most modern foundation models use self-supervised approaches.
  • Word embeddings (Word2Vec, GloVe)
    • Early unsupervised techniques that learned dense vector representations of words from large text corpora — capturing semantic relationships (king − man + woman ≈ queen) purely from co-occurrence patterns. The conceptual ancestor of modern language model representations.
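The famous analogy can be checked with vector arithmetic. The 2-D "embeddings" below are hand-made for illustration (one axis loosely meaning royalty, one meaning gender), not learned from a corpus the way Word2Vec or GloVe vectors are:

```python
# Hand-made toy "embeddings" (not learned): axis 0 ~ royalty, axis 1 ~ gender.
vec = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

def add(a, b): return (a[0] + b[0], a[1] + b[1])
def sub(a, b): return (a[0] - b[0], a[1] - b[1])

# king - man + woman
result = add(sub(vec["king"], vec["man"]), vec["woman"])

# Nearest word to the result, by Euclidean distance.
nearest = min(vec, key=lambda w: (vec[w][0] - result[0]) ** 2
                               + (vec[w][1] - result[1]) ** 2)
print(nearest)  # queen
```

Real embedding spaces have hundreds of dimensions and the analogy holds only approximately, but the mechanism is the same: directions in the space come to encode relationships.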