Unsupervised Learning

k-means, dimensionality reduction...

Posted by Hao Xu on October 1, 2018

$k$-means Clustering

Description1

Given a set of observations $\{x_1, x_2, \ldots, x_n\}$, where each observation is a $d$-dimensional real vector, $k$-means clustering aims to partition the $n$ observations into $k$ sets $S = \{S_1, S_2, \ldots, S_k\}$ ($k \leq n$) so as to minimise the within-cluster sum of squares (WCSS), i.e. the within-cluster variance:

\[\arg\min_S \sum_{i=1}^k \sum_{x\in S_i} \| x-\mu_i \|^2 = \arg\min_S \sum_{i=1}^k |S_i|\, \operatorname{Var} S_i\]

where $\mu_i$ is the mean of the points in $S_i$.
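To make the objective concrete, here is a minimal NumPy sketch that evaluates the WCSS for a given partition; the function name `wcss` and the array layout are my own choices for illustration, not part of the definition above.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares for a given partition.

    X         : (n, d) array of observations x_1, ..., x_n
    labels    : (n,) array of cluster indices in {0, ..., k-1}
    centroids : (k, d) array of cluster means mu_i
    """
    total = 0.0
    for j, mu in enumerate(centroids):
        members = X[labels == j]              # points assigned to cluster j
        total += np.sum((members - mu) ** 2)  # squared distances to mu_j
    return total
```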

The Algorithm2

In the clustering problem, we are given a training set $x^{(1)}, \ldots, x^{(m)}$ and want to group the data into a few cohesive “clusters.” Here, we are given feature vectors $x^{(i)} \in \mathbb{R}^n$ for each data point as usual, but no labels $y^{(i)}$ (making this an unsupervised learning problem). Our goal is to predict $k$ centroids and a label $c^{(i)}$ for each data point. The $k$-means clustering algorithm is as follows:

  1. Initialise cluster centroids $\mu_1, \mu_2, …, \mu_k \in \mathbb{R}^n$ randomly.
  2. Repeat until convergence:

$\qquad$ for every $i$, set

\[c^{(i)} := \arg\min_j \| x^{(i)} - \mu_j \|\]

$\qquad$ for every $j$, set

\[\mu_j := \frac{\sum_{i=1}^m 1\{c^{(i)}=j\}\, x^{(i)}}{\sum_{i=1}^m 1\{c^{(i)}=j\}}\]

Comment: $1\{c^{(i)}=j\}$ is an indicator variable, equal to 1 when $c^{(i)}=j$ and 0 otherwise.
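Putting the two steps together, a minimal NumPy sketch of the algorithm might look as follows; the initialisation by sampling $k$ training points, the fixed iteration cap, and the handling of empty clusters are my own assumptions, not prescribed by the notes above.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Plain k-means on an (m, n) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialise centroids by picking k distinct training points.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: c^(i) = argmin_j ||x^(i) - mu_j||
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: mu_j = mean of the points currently assigned to cluster j
        # (if a cluster is empty, keep its previous centroid)
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # centroids stopped moving: converged
            break
        mu = new_mu
    return mu, c
```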

Dimensionality Reduction - Principal Component Analysis (PCA)3

PCA is a technique for taking a dataset consisting of a set of tuples representing points in a high-dimensional space and finding the directions along which the tuples line up best. The idea is to treat the set of tuples as a matrix $M$ and find the eigenvectors for $MM^\top$ or $M^\top M$. The matrix of these eigenvectors can be thought of as a rigid rotation in a high-dimensional space. When you apply this transformation to the original data, the axis corresponding to the principal eigenvector is the one along which the points are most “spread out.” More precisely, this axis is the one along which the variance of the data is maximized. Put another way, the points can best be viewed as lying along this axis, with small deviations from it. Likewise, the axis corresponding to the second eigenvector (the eigenvector corresponding to the second-largest eigenvalue) is the axis along which the variance of distances from the first axis is greatest, and so on.
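As a rough illustration of the eigenvector view, here is a small NumPy sketch that diagonalises $M^\top M$ and projects the data onto the leading eigenvectors; the mean-centring step and the function signature are my own additions and are not spelled out in the description above.

```python
import numpy as np

def pca(M, n_components):
    """PCA via the eigenvectors of M^T M, where M is an (n, d) data matrix.

    Rows of M are the tuples/points; columns are mean-centred first
    (a standard preprocessing step, assumed here).
    """
    M_centred = M - M.mean(axis=0)
    # Eigen-decomposition of the symmetric d x d matrix M^T M.
    eigvals, eigvecs = np.linalg.eigh(M_centred.T @ M_centred)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]       # principal directions (columns)
    projected = M_centred @ components   # data expressed along the new axes
    return components, projected
```

The first column of `components` is the principal eigenvector, i.e. the axis along which the projected data has the largest variance, and each subsequent column captures the largest remaining variance orthogonal to the ones before it.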