These are my notes on CS168 at Stanford University
Dimensionality Reduction
Want to reduce a large dimension to a small one (usually independent of the input size) while preserving distances.
Distance
distance or similarity
exact distance: \(\mathbb{I}[x == y]\)
similarity of sets (or multisets): Jaccard similarity \(J(S,T) = \frac{|S \bigcap T|}{|S\bigcup T|}\)
\(l_{p}\) distance.
framework
reduce dimension while preserving the quantity in expectation.
Then use independent trials.
exact distance
use the universal hash function assumption
assume \(h: U \rightarrow \{1, \ldots, b\}\)
\(k\) hash functions \(h_{1}, h_{2}, \ldots, h_{k}\)
\(\epsilon\)-Heavy Hitters
use Markov's inequality
to achieve an \((\epsilon, \delta)\) guarantee, let \(b = \frac{2}{\epsilon}\); Markov gives a failure probability of at most \(\frac{1}{2}\) per trial.
Then use \(p\) independent trials with \(\frac{1}{2^{p}} < \delta\), i.e. \(p = \log_{2}\frac{1}{\delta}\).
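The buckets-plus-independent-trials scheme above is essentially a count-min sketch. A minimal sketch in Python, assuming Python's built-in `hash` salted per row can stand in for a universal hash family (the class name and parameters are my own choices):

```python
import random

class CountMin:
    """Count-min sketch: p independent rows of b counters each.
    A sketch, assuming (salt, hash) pairs behave like universal hashing."""
    def __init__(self, b, p, seed=0):
        rng = random.Random(seed)
        self.b, self.p = b, p
        self.salts = [rng.getrandbits(64) for _ in range(p)]
        self.rows = [[0] * b for _ in range(p)]

    def _bucket(self, salt, x):
        return hash((salt, x)) % self.b

    def add(self, x):
        for salt, row in zip(self.salts, self.rows):
            row[self._bucket(salt, x)] += 1

    def estimate(self, x):
        # each row only ever overestimates, so the minimum is the best guess
        return min(row[self._bucket(salt, x)]
                   for salt, row in zip(self.salts, self.rows))

cm = CountMin(b=200, p=5)
for _ in range(1000):          # one heavy hitter
    cm.add("heavy")
for i in range(500):           # many light items
    cm.add(f"light{i}")
```

The estimate for `"heavy"` is never below its true count of 1000 and, with \(b = 200\) buckets, rarely much above it.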
Jaccard similarity
pick a random permutation of \(U\).
hash each set to its smallest element under the permutation.
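A small MinHash experiment, assuming explicit random permutations of a toy universe (feasible only because the universe is tiny); the fraction of permutations on which the two minima agree estimates the Jaccard similarity:

```python
import random

def minhash_signature(s, perms):
    # hash the set to its minimum element under each random permutation
    return [min(p[x] for x in s) for p in perms]

def estimate_jaccard(sig_a, sig_b):
    # fraction of coordinates where the two signatures agree
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

universe = list(range(100))
rng = random.Random(0)
perms = []
for _ in range(200):
    order = universe[:]
    rng.shuffle(order)
    perms.append({x: rank for rank, x in enumerate(order)})

S = set(range(0, 60))    # |S ∩ T| = 30, |S ∪ T| = 90, so J(S, T) = 1/3
T = set(range(30, 90))
est = estimate_jaccard(minhash_signature(S, perms),
                       minhash_signature(T, perms))
```

With 200 permutations the estimate lands close to the true value of 1/3.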
\(l_{2}\) distance: the Johnson-Lindenstrauss transform
motivation: random projection (inner product with a vector generated by standard Gaussian distribution)
random vector \(\textbf{r} \sim N(0, 1)^{n}\); let \(X = \left \langle \textbf{r}, x\right \rangle\).
Then \(\mathbb{E}[X^{2}] = \|x\|_{2}^{2}\), which means that \(X^{2}\) is an unbiased estimator of \(\|x\|^{2}_{2}\).
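A quick numerical check of the unbiasedness claim: project a vector onto many independent Gaussian directions and average the squared inner products; the average should approach \(\|x\|_{2}^{2}\). The number of projections (`k = 400`) is an arbitrary choice:

```python
import random

def jl_project(x, R):
    # one estimator per row r: X = <r, x>, with E[X^2] = ||x||_2^2
    return [sum(r_i * x_i for r_i, x_i in zip(r, x)) for r in R]

rng = random.Random(1)
n, k = 500, 400
R = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]

x = [1.0] * n                        # ||x||_2^2 = 500
proj = jl_project(x, R)
est = sum(v * v for v in proj) / k   # average of k unbiased estimators
```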
other distances ...
what about cosine similarity, edit distance, and Wasserstein distance?
Learning (Binary classification here)
Assume data are samples from a prior distribution \(D\) on the universe \(U\).
each element has its own label.
we mainly care about the sample complexity.
finite, well-separated case
finite: there is a function set \(\{f_{1}, ..., f_{h}\}\) including the ground truth \(f^{*}\).
well-separated: if \(f_{i} \neq f^{*}\), then its generalization error is at least \(\epsilon\).
Theorem: if \(n \geq \frac{1}{\epsilon}(\ln h + \ln \frac{1}{\delta})\),
then with probability \(1 - \delta\), the output \(f = f^{*}\).
PROOF:
\(\Pr[f =^{X} f^{*} \mid f \neq f^{*}] \leq (1-\epsilon)^{n} \leq e^{-\epsilon n}\)
\(\Pr[\exists f_{i} \neq f^{*} \text{ with } f_{i} =^{X} f^{*}] \leq (h-1)e^{-\epsilon n} \leq he^{-\epsilon n}\)
\(he^{-\epsilon n} \leq \delta\)
\(n \geq \frac{1}{\epsilon}(\ln h + \ln \frac{1}{\delta})\)
Q.E.D
finite case
Theorem: if \(n \geq \frac{1}{\epsilon}(\ln h + \ln \frac{1}{\delta})\),
then with probability \(1 - \delta\), the output \(f\)'s generalization error is less than \(\epsilon\).
easy to derive from the previous theorem.
Linear Classifiers
Drop the assumption of a finite function set.
However, we still assume the ground truth has the form of a linear classifier.
Theorem: there is a constant \(c\) such that if \(n \geq \frac{c}{\epsilon}(d + \ln \frac{1}{\delta})\),
then with probability \(1 - \delta\), the generalization error is less than \(\epsilon\).
proof motivation: the curse of dimensionality (approximation).
note the number of samples is linear to the dimension.
(because in theory we need about \(e^{d}\) functions to approximate, but we usually just compute a hyperplane to express the function)
Non-Zero Training Error and the ERM Algorithm
ERM (empirical risk minimization)
just output the function with minimum training error.
Theorem (Uniform Convergence):
if
then for every linear classifier, it holds that
generalization error \(\leq\) training error \(+\ \epsilon\)
Theorem (PAC for nonzero training error):
easy to derive from the previous theorem
Increasing Dimensionality
according to the previous section, when the number of samples is larger than (or linear in) the dimension, we get a good fit.
but for the non-zero error case, we do not know the right dimension exactly.
So we need to walk the line between overfitting and generalization.
when \(n \ll d\), we need to increase the dimension.
Polynomial embedding
just add cross-products as extra coordinates in the higher dimension.
It works better when the individual dimensions are meaningful.
JL transform
actually we need to apply some nonlinear operator to each dimension,
because the JL transform is just a linear combination of the dimensions.
In the real world, kernel functions are a good way to implement this.
Regularization
Regularization states that you have some preference of your model.
There are usually two views about the effect of regularization.
Bayesian view and frequentist view
Bayesian view
Here the regularization comes naturally from the likelihood.
For example, we assume the model is \(y = \left \langle x, a\right \rangle + z\)
\(z \sim N(0, 1)\), \(a_{i} \sim N(0, \sigma^2)\)
we assume that \(x\)'s are fixed for simplicity.
The likelihood is \(\propto \prod_{i=1}^{d} e^{-\frac{a_{i}^{2}}{2\sigma^{2}}} \cdot \prod_{i=1}^{n} e^{-\frac{(y_{i} - \left \langle x_{i}, a\right \rangle)^{2}}{2}}\)
maximizing this likelihood means minimizing \(\sum_{i=1}^{d}\frac{a_{i}^{2}}{2\sigma^{2}} + \frac{1}{2}\sum_{i=1}^{n} (y_{i} - \left \langle x_{i}, a\right \rangle)^2\)
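A sketch of the resulting MAP estimate (ridge regression) on synthetic data drawn from the assumed model, minimizing the penalized objective by plain gradient descent; the step size, iteration count, and the true coefficients `a_true` are arbitrary choices of mine:

```python
import random

def ridge_gd(X, y, sigma2, steps=500, lr=0.002):
    # minimize sum_j a_j^2/(2*sigma2) + 0.5 * sum_i (y_i - <x_i, a>)^2
    d = len(X[0])
    a = [0.0] * d
    for _ in range(steps):
        grad = [a_j / sigma2 for a_j in a]            # gradient of the prior term
        for x_i, y_i in zip(X, y):
            r = sum(xj * aj for xj, aj in zip(x_i, a)) - y_i
            for j in range(d):
                grad[j] += r * x_i[j]                 # gradient of the data term
        a = [aj - lr * gj for aj, gj in zip(a, grad)]
    return a

# synthetic data from the assumed model y = <x, a> + z, z ~ N(0, 1)
rng = random.Random(0)
a_true = [1.0, -2.0]    # hypothetical ground-truth coefficients
X = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(200)]
y = [sum(xj * aj for xj, aj in zip(x, a_true)) + rng.gauss(0, 1) for x in X]
a_hat = ridge_gd(X, y, sigma2=10.0)
```

With 200 samples the recovered coefficients land close to `a_true`; shrinking `sigma2` (a stronger prior) pulls them toward zero.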
The first part is regularization.
Also it shows that the regularization may depend on the hypothesis of model.
Frequentist view
The given example is about \(l_{0}\) regularization. (define \(0^0 = 0\))
\(l_{0}\) regularization encourages sparsity.
\(l_{1}\) regularization
it can be a proxy of \(l_{0}\) regularization.
Principal Component Analysis
The motivation is that we want to map the data into a \(d\)-dimensional subspace.
Somehow we want to preserve \(\Pi_{S}v \approx v\).
Luckily, we know that a \(d\)-dimensional subspace can be interpreted as the span of \(d\) orthonormal vectors \(v_{1}, \ldots, v_{d}\).
And the objective function is \(\max \sum_{i=1}^{n} \sum_{j=1}^{d} \left \langle x_{i}, v_{j} \right \rangle ^{2}\).
why max? Because of the Pythagorean theorem: maximizing the projection length is equivalent to minimizing the distance to the subspace.
Usually we pack the data into a matrix \(A\) whose columns stand for attributes and whose rows stand for data points.
We create a new matrix \(X\) of vectors \(v\) whose \(i\) th column is \(v_{i}\), in order to model the objective function as matrix operation.
The objective function is \((AX)^{T}AX = X^{T}A^{T}AX = X^{T}U^{T}DUX\) according to the spectral theorem.
\(U\) is orthogonal matrix and \(D\) is diagonal matrix.
Basically \(X\) is variable that we can choose.
If \(d=1\), denote \(u = Ux\); the objective function becomes \(u^{T}Du\).
Assume the elements of \(D\) are sorted in decreasing order; the maximizer is \(u = e_{1}\).
So \(x = U^{T}e_{1}\), i.e., \(x\) is the first column of \(U^{T}\).
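A \(d = 1\) sanity check: for 2-dimensional data stretched along the direction \((1, 1)\), the top eigenvector of \(A^{T}A\) (computed in closed form for the 2x2 case) should be roughly \((1, 1)/\sqrt{2}\), and it should beat other directions on the objective. The synthetic data generation is my own toy choice:

```python
import math, random

# 2-D data stretched along (1, 1): top component should be ~ (1, 1)/sqrt(2)
rng = random.Random(0)
pts = []
for _ in range(500):
    t, s = rng.gauss(0, 3), rng.gauss(0, 0.5)   # big variance along (1,1), small along (1,-1)
    pts.append(((t + s) / math.sqrt(2), (t - s) / math.sqrt(2)))

def objective(v):
    # PCA objective for d = 1: sum of squared projection lengths onto v
    return sum((x * v[0] + y * v[1]) ** 2 for x, y in pts)

# entries of the 2x2 matrix A^T A
xx = sum(x * x for x, _ in pts)
xy = sum(x * y for x, y in pts)
yy = sum(y * y for _, y in pts)

# top eigenvalue/eigenvector of [[xx, xy], [xy, yy]] in closed form
lam = (xx + yy + math.sqrt((xx - yy) ** 2 + 4 * xy * xy)) / 2
v = (xy, lam - xx)                    # unnormalized eigenvector for lam
norm = math.hypot(v[0], v[1])
v = (v[0] / norm, v[1] / norm)
```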
Implementation (The Power Iteration)
The key computational problems of PCA are eigenvalues and eigenvectors (equivalently, singular values and singular vectors).
The characteristic polynomial is hard to find roots of ...
Maybe there are some ways to find reductions.
But here the idea is to pick a random vector \(x\) and apply the operator to it many times.
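A minimal power-iteration sketch for a symmetric matrix: repeated multiplication plus renormalization rotates the iterate toward the top eigenvector (assuming the random start is not orthogonal to it, which holds almost surely):

```python
import math, random

def power_iteration(mat, iters=100, seed=0):
    """Apply a symmetric matrix to a random start vector many times,
    renormalizing each step; converges to the top eigenpair."""
    rng = random.Random(seed)
    n = len(mat)
    u = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        u = [sum(mat[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    # Rayleigh quotient u^T M u approximates the top eigenvalue
    Mu = [sum(mat[i][j] * u[j] for j in range(n)) for i in range(n)]
    lam = sum(a * x for a, x in zip(Mu, u))
    return lam, u

# symmetric test matrix with eigenvalues 3 (eigenvector (1,1)/sqrt(2)) and 1
M = [[2.0, 1.0], [1.0, 2.0]]
lam, u = power_iteration(M)
```

The convergence rate depends on the ratio of the top two eigenvalues, which is exactly the spectral-gap condition in the theorem below.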
Theorem: for any \(\delta, \epsilon > 0\), letting \(v_{1}\) denote the top eigenvector of \(A\), with probability at least \(1-\delta\) over the choice of \(u_{0}\),
provided
where \(\frac{\lambda_{1}}{\lambda_{2}}\) is the spectral gap.
proof
let \(v_{1}, \cdots, v_{k}\) be k orthonormal vectors.
So, let this < \(\epsilon\).
Someone told me that \(\langle u_{0}, v_{1} \rangle > \frac{\delta}{2\sqrt{d}}\) with probability \(1 - \delta\). (I do not know how to prove this.)
so \(\log \frac{1}{\langle u_{0}, v_{1} \rangle} < \log d + \log \frac{1}{\delta}\) which completes the proof.
Low-rank Matrix Decomposition
SVD
The key method is the SVD (Singular Value Decomposition), which states that every matrix \(A\) can be written as \(USV^{T}\)
I am not willing to include the proof here, because the constructive proof shows little motivation.
Some motivations are here
According to PCA, \(V\) contains eigenvectors of \(A^{T}A\).
The same holds for \(U\) with \(AA^{T}\).
Actually the SVD links the eigenvectors of \(A^{T}A\) and \(AA^{T}\)
Also these facts help us to calculate the SVD (just use power iteration).
Low-rank Matrix
Recall that a matrix with rank \(k\) can be written as \(\sum_{i=1}^{k}\textbf{u}_{i} \textbf{v}_{i}^{T}\).
Under the Frobenius norm, it can be shown that the SVD gives the best rank-\(k\) approximation.
Frobenius norm
for any matrix \(A\), let \(A_{k}\) be its rank-\(k\) approximation from the SVD.
Then for any rank-\(k\) matrix \(B\),
\(\|A - A_{k}\|_{F} \leq \|A - B\|_{F}\)
Although we can approximate the matrix under the Frobenius norm, the decomposition is not unique.
But if we extend matrices to tensors, something interesting happens.
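A small numerical illustration of the Frobenius-norm optimality: for a matrix built with a known SVD (\(\sigma_{1} = 5, \sigma_{2} = 1\)), the rank-1 SVD truncation achieves error exactly \(\sigma_{2} = 1\), and no random rank-1 matrix does better. The factor vectors are my own toy choices:

```python
import math, random

# A = 5*u1 v1^T + 1*u2 v2^T with orthonormal u's and v's, so its SVD is known
u1, u2 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
v1 = (1 / math.sqrt(2), 1 / math.sqrt(2))
v2 = (1 / math.sqrt(2), -1 / math.sqrt(2))
A = [[5 * u1[i] * v1[j] + 1 * u2[i] * v2[j] for j in range(2)] for i in range(3)]
A1 = [[5 * u1[i] * v1[j] for j in range(2)] for i in range(3)]  # rank-1 SVD truncation

def frob_err(B):
    # Frobenius distance from A to B
    return math.sqrt(sum((A[i][j] - B[i][j]) ** 2
                         for i in range(3) for j in range(2)))

# compare against random rank-1 matrices a b^T
rng = random.Random(0)
errs = []
for _ in range(50):
    a = [rng.gauss(0, 1) for _ in range(3)]
    b = [rng.gauss(0, 1) for _ in range(2)]
    errs.append(frob_err([[a[i] * b[j] for j in range(2)] for i in range(3)]))
```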
Low-rank Tensor Decomposition
A rank-\(k\) \(n \times n \times n\) tensor \(A\) can be written as \(\sum_{i=1}^{k}\textbf{u}_{i} \otimes \textbf{v}_{i} \otimes \textbf{w}_{i}\)
ATTENTION: \((u_{1}, \cdots, u_{k})\) must be linearly independent, and the same for the \(v\)'s and \(w\)'s
And we use the notation \(\otimes\) for the outer product.
Theorem: Given a 3-tensor with rank \(k\), the decomposition is unique (up to scaling by a constant)
Jennrich's algorithm

pick two random unit vectors \(x, y \in \mathbb{R}^{n}\)
define \(A_{x} = \sum_{i=1}^{k} \left \langle w_{i}, x \right \rangle \textbf{u}_{i}\textbf{v}_{i}^{T}\), and \(A_{y}\) analogously
compute the eigendecompositions of \(A_{x}A_{y}^{-1}\) and \(A_{x}^{-1}A_{y}\)
with probability 1, the eigenvectors of \(A_{x}A_{y}^{-1}\) are \((u_{1}, u_{2}, \cdots, u_{k})\) and the eigenvectors of \((A_{x}^{-1}A_{y})^{T}\) are \((v_{1}, v_{2}, \cdots, v_{k})\)
solve the linear system to compute \((w_{1}, w_{2}, \cdots, w_{k})\)
correctness:
if \(x, y\) are picked randomly, then with probability 1 the eigenvalues of \(A_{x}A_{y}^{-1}\) and \(A^{-1}_{x}A_{y}\) are distinct, so we can distinguish the different eigenvectors.
And because the eigenvalues of the two matrices are reciprocals of each other, we can pair up the corresponding eigenvectors correctly.
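A toy run of the key step for a 2 x 2 x 2 rank-2 tensor, small enough that the 2x2 inverses and eigenvector checks fit in a few lines; the factor vectors are hypothetical choices satisfying the linear-independence condition. We verify only that the eigenvectors of \(A_{x}A_{y}^{-1}\) recover the \(u_{i}\):

```python
import random

# hypothetical rank-2 factors, each pair linearly independent
u = [(1.0, 0.0), (1.0, 1.0)]
v = [(1.0, 2.0), (0.0, 1.0)]
w = [(1.0, 0.0), (0.0, 1.0)]

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2)]
y = [rng.gauss(0, 1) for _ in range(2)]

def slice_mix(z):
    # A_z = sum_i <w_i, z> u_i v_i^T
    A = [[0.0, 0.0], [0.0, 0.0]]
    for ui, vi, wi in zip(u, v, w):
        c = wi[0] * z[0] + wi[1] * z[1]
        for a in range(2):
            for b in range(2):
                A[a][b] += c * ui[a] * vi[b]
    return A

def inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def mul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

M = mul2(slice_mix(x), inv2(slice_mix(y)))  # eigenvectors of M are the u_i
```

Each \(u_{i}\) is an eigenvector of `M`: applying `M` to it gives back a parallel vector.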
Spectral Graph Theory
The magic of representing a graph as a matrix.
Denote the degree matrix as \(D\) and the adjacency matrix as \(A\).
Define the Laplacian matrix as \(L = D - A\).
Note \(L\) is a symmetric matrix.
now consider \(v^{T}Lv = \sum_{(i,j) \in E} (v_{i} - v_{j})^{2}\)
So if we put all the vertices on the real number line, \(v^{T}Lv\) is the sum of squared lengths of all the edges.
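A quick check of the identity \(v^{T}Lv = \sum_{(i,j)\in E}(v_{i} - v_{j})^{2}\) on a 3-vertex path graph:

```python
# path graph 0 - 1 - 2: build L = D - A and compare the two sides
edges = [(0, 1), (1, 2)]
n = 3
deg = [0] * n
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1
    deg[i] += 1
    deg[j] += 1
L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

v = [0.5, -1.0, 2.0]                     # arbitrary placement on the number line
quad = sum(v[i] * L[i][j] * v[j] for i in range(n) for j in range(n))
edge_sum = sum((v[i] - v[j]) ** 2 for i, j in edges)
```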
Eigenvalues and Eigenvectors
\(L\) has \(n\) nonnegative real eigenvalues because \(L\) is a symmetric matrix and \(v^{T}Lv \geq 0\).
The fact is that the minimum eigenvalue is \(0\): let \(v = (\frac{1}{\sqrt{n}}, \cdots, \frac{1}{\sqrt{n}})\); then \(Lv = 0\).
Theorem: The number of zero eigenvalues of the Laplacian matrix equals the number of connected components of the graph.
proof
we first show that # connected components \(\leq\) # zero eigenvalues
Let \(S_{i}\) be a maximal connected component; construct a vector \(v^{(i)}\) with \(v^{(i)}_{x} = \frac{1}{\sqrt{|S_{i}|}} \mathbb{I}[x \in S_{i}]\)
So we can form (# connected components) orthonormal vectors, each with \(v^{T}Lv = 0\).
For the other side, recall \(v^{T}Lv\): if \(v_{k+1}\) is orthogonal to \(v_{1}, \cdots, v_{k}\) and \(Lv_{k+1} = 0\), then whenever \(v_{k+1, j} \neq 0\), all positions in the same maximal connected component must also be nonzero (and equal), so \(v_{k+1}\) cannot be orthogonal to all the vectors above.
\(\blacksquare\)
Conductance and isoperimeter
Some intuition: for a unit eigenvector, \(v^{T}Lv = \lambda v^{T}v = \sum_{i \sim j} (v_{i} - v_{j})^{2}\), so the eigenvector of the 2nd smallest eigenvalue somehow places vertices ``near'' each other.
This fact can be used to embed vertices into a low-dimensional space: map vertex \(i\) to \((v_{2}(i), v_{3}(i))\), making adjacent vertices near. Choosing the eigenvectors of the 2 largest eigenvalues instead drives them apart.
This also leads to applications in clustering and graph coloring.
Definition: isoperimeter of a cut \(\theta(S)\)
The isoperimeter ratio of the graph \(G\) is
This number captures the connectivity of the graph.
And the second smallest eigenvalue \(\lambda_{2}\) (maybe \(\lambda_{2} = 0\)) gives some insight into this number, as in the intuitions above.
Theorem: Given a graph \(G = (V, E)\) and any set \(S \subset V\), it holds that
proof:
We assume that \(|S| \leq |V \setminus S|\).
We can express \(|\delta(S)|\) via \(v^{T}Lv\).
Define \(v\) as: for \(i \notin S\), \(v(i) = -\frac{|S|}{|V|}\); for \(i \in S\), \(v(i) = 1 - \frac{|S|}{|V|}\).
Recall that if \(\lambda_{1}, \cdots, \lambda_{n}\) and \(v_{1}, \cdots, v_{n}\) are eigenvalues and eigenvectors of \(L\).
So if we extract the eigenvalue \(0\) and corresponding eigenvector \((\frac{1}{\sqrt{n}}, \cdots, \frac{1}{\sqrt{n}})\).
So
So
\(\blacksquare\)
Cheeger's Theorem
If \(\lambda_{2}\) is the second smallest eigenvalue of a \(d\)-regular graph, then there exists a set \(S \subset V\) such that
Diffusion model and Random walks
Let \(w: V \rightarrow \mathbb{R}\); define the process \(w_{i+1}(x) = \frac{1}{\deg(x)}\sum_{y \sim x}w_{i}(y)\)
So we know that \(w_{i+1} = D^{-1}Aw_{i}\)
This is actually the power iteration.
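A small simulation of the diffusion \(w_{i+1} = D^{-1}Aw_{i}\) on a connected, non-bipartite graph (a triangle with a pendant vertex, my own toy example): the iterates converge to a constant vector, and the degree-weighted sum \(\sum_{x}\deg(x)w(x)\) is conserved at every step:

```python
# diffusion w_{t+1} = D^{-1} A w_t: each vertex averages its neighbors
edges = [(0, 1), (1, 2), (0, 2), (0, 3)]   # triangle 0-1-2 plus pendant 3
n = 4
nbrs = [[] for _ in range(n)]
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)

w = [1.0, 0.0, 0.0, 0.0]                   # all mass starts at vertex 0
for _ in range(200):
    w = [sum(w[y] for y in nbrs[x]) / len(nbrs[x]) for x in range(n)]
```

The odd cycle makes the chain aperiodic, so the averaging converges rather than oscillating (as it would on a bipartite graph).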
Sampling and estimation
Reservoir Sampling
want to sample \(k\) data points from \(N\) total uniformly.
Just recall how to implement a random permutation in C++:
uniformly at random, swap the incoming datum with an existing element.
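A sketch of reservoir sampling built on the swap idea above, plus a frequency check that each of 10 stream items lands in a size-3 sample about 30% of the time:

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep the first k items; the i-th item (i > k) replaces a
    uniformly chosen slot with probability k/i."""
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)
        else:
            j = rng.randrange(i)      # uniform in [0, i)
            if j < k:
                sample[j] = x
    return sample

rng = random.Random(0)
counts = [0] * 10
trials = 20000
for _ in range(trials):
    for x in reservoir_sample(range(10), 3, rng):
        counts[x] += 1
```

Each item should appear with empirical frequency near \(k/N = 3/10\).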
Markov inequality and Chebyshev's inequality
all included in ``counting and sampling''
Importance Sampling
intuitions: sample the more important data (long-tail distributions), or sample the higher-variance data (to cut down variance).
Estimating the missing mass
Good-Turing frequency estimation scheme.
MCMC
Theorem (Fundamental Theorem of Markov Chains): if

The chain is irreducible: for every two states \(s_{i}, s_{j}\), it is possible to eventually get to state \(s_{j}\) if one starts in state \(s_{i}\).

The chain is aperiodic: for a pair of states \(s_{i}, s_{j}\), consider the set of times \(\{t_{1}, t_{2},\cdots \}\) consisting of all \(t\) for which \(\Pr[X_{t}=s_{j} \mid X_{0}=s_{i}]>0\). The chain is aperiodic if \(\gcd(\{t_{1}, t_{2},\cdots \})=1\).

then for any state \(s\), \(\Pr[X_{t} = s]\) converges to a unique stationary probability \(\pi(s)\), independent of the starting state.
other information will be included in ``Counting and Sampling''.
Discrete Fourier Transform and Convolution
As a mathematical tool, the Fourier transform has a variety of applications.
There are also different versions to define Fourier transform.
Matrix form
Let \(w_{n} = e^{\frac{2\pi i}{n}}\). Then \(M_{i,j} = w_{n}^{ij}\).
NB: \(M_{n}\) is 0-indexed.
To get the inverse of \(M\), first consider \(M^{2}\)
consider the \(k\)-th column of \(M\): \((e^{\frac{2\pi i \cdot 0}{n}}, e^{\frac{2\pi ik}{n}}, e^{\frac{2\pi i \cdot 2k}{n}}, \cdots, e^{\frac{2\pi i(n-1)k}{n}})^{T}\)
\(M\) is a symmetric matrix, so its \(p\)-th row is \((e^{\frac{2\pi i \cdot 0}{n}}, e^{\frac{2\pi ip}{n}}, e^{\frac{2\pi i \cdot 2p}{n}}, \cdots, e^{\frac{2\pi i(n-1)p}{n}})\)
compute the inner product \(\sum_{j=0}^{n-1} e^{\frac{2\pi i j(k+p)}{n}}\)
When \(e^{\frac{2\pi i (k+p)}{n}} \neq 1\), the geometric series sums to \(0\); when it equals \(1\), the sum is \(n\).
So \(M^{2}\) is \(n\) times a permutation matrix, analogous to \(n\mathbb{I}_{n}\)
Let the \(k\)-th column of \(M\) be \(M^{(k)}\).
We can also use the inverse representation \(\bar{w}_{n} = w_{n}^{-1}\): replacing \(w_{n}\) by \(\bar{w}_{n}\) gives \(M^{-1} = \frac{1}{n}\bar{M}\).
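A numerical check (for \(n = 8\)) that the conjugate matrix gives the inverse: \(M\bar{M} = n\mathbb{I}\), so \(M^{-1} = \frac{1}{n}\bar{M}\):

```python
import cmath

n = 8
w = cmath.exp(2j * cmath.pi / n)           # w_n = e^{2*pi*i/n}
M = [[w ** (i * j) for j in range(n)] for i in range(n)]
Mbar = [[(w ** (i * j)).conjugate() for j in range(n)] for i in range(n)]

# M * conj(M) should be n times the identity
P = [[sum(M[i][k] * Mbar[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]
```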
Polynomial evaluation and interpolation
A matrix in that form can be viewed as evaluating a polynomial on a specific set of points (the \(n\)-th roots of unity).
Both the evaluation and interpolation can be done in \(O(n\log n)\) time.
Also the killer method is the convolution.
Change of basis
The discrete Fourier transform can be viewed as a way of basis decomposition like the continuous version of Fourier transform.
The Fast Fourier transform
The discrete Fourier transform can be done in \(O(n\log n)\) time, much faster than naive matrix-vector multiplication.
Intuitively, this is because \(M\) is well-structured.
Convolution
There are many ways to define convolution. One is using polynomial multiplication.
Definition: the convolution of two vectors \(v, w\) of respective lengths \(n, m\) is denoted \(v*w = w*v\) and is defined to be the vector of coefficients of the product of the polynomials associated to \(v\) and \(w\), \(P_{v}(x)P_{w}(x)\).
Fact: convolution can be computed via the Fourier transform: \(\widehat{v*w} = \hat{v} \odot \hat{w}\) (pointwise product).
NB: here we do not specify the dimension of the Fourier transform, but from the polynomial-multiplication view, the dimension must be at least \(n + m - 1\).
(The continuous version of Fourier transform usually has infinite dimensions.)
Also if \(q = v*w\), then
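The convolution fact can be checked with a naive \(O(n^{2})\) DFT: evaluating both polynomials at \(n + m - 1\) roots of unity and multiplying pointwise matches the DFT of the coefficient-wise convolution:

```python
import cmath

def dft(a, n):
    # evaluate the polynomial with coefficients a at the n-th roots of unity
    a = a + [0.0] * (n - len(a))
    return [sum(a[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def naive_conv(v, w):
    # coefficients of P_v(x) * P_w(x)
    q = [0.0] * (len(v) + len(w) - 1)
    for i, vi in enumerate(v):
        for j, wj in enumerate(w):
            q[i + j] += vi * wj
    return q

v, w = [1.0, 2.0, 3.0], [4.0, 5.0]
q = naive_conv(v, w)            # (1 + 2x + 3x^2)(4 + 5x)
m = len(v) + len(w) - 1         # enough evaluation points to determine q
lhs = dft(q, m)                 # DFT of the convolution
rhs = [a * b for a, b in zip(dft(v, m), dft(w, m))]  # pointwise product
```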
Another definition: any transformation of a vector that is linear and translation-invariant is a convolution.
Can be used in differential equations and physical force.(?)