Word Vectors

Word2vec: prediction function

1. Skip-gram

Given the current (center) word, predict the words in its context.

L(θ): the data likelihood, i.e., the likelihood of predicting the words in the context

m: size of the prediction window

wt: given center word

For each position t, take the product of the probabilities of predicting every context word wt+j inside the window, given the center word wt:

L(\theta)=∏_{t=1}^{T}∏_{-m≤j≤m,j≠0}P(w_{t+j}|w_t;\theta)

J(θ): objective function / loss function

J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}∑_{t=1}^{T}∑_{-m≤j≤m,j≠0}\log P(w_{t+j}|w_t;\theta)

Maximizing the product ⇒ minimizing the sum of negative log probabilities.

Softmax Function: maps arbitrary scores to a probability distribution

  • max: amplifies the probability of the largest xi
  • soft: still assigns some probability to smaller xi

softmax(x_i) = \frac{exp(x_i)}{∑^n_{j=1}exp(x_j)}
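A minimal numpy sketch of the softmax above; the function name and the example scores are illustrative:

```python
import numpy as np

def softmax(x):
    """Map a vector of scores to a probability distribution."""
    x = x - np.max(x)              # subtract max for numerical stability
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))             # larger scores get amplified, smaller ones still get some mass
```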

Prediction Function

We use two vectors per word w to make the math and optimization simpler.

  • vw: vector when w is a center word
  • uw: vector when w is a context word

c : center word

o : context word

P(o|c) = \frac{exp(u_o^Tv_c)}{∑_{w∈V}exp(u^T_wv_c)}

u_o^T v_c : similarity of o and c

The division normalizes over the whole vocabulary to give a probability distribution.
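A small sketch of this prediction function, assuming `U` and `V` are the context and center embedding matrices (the names and shapes are illustrative):

```python
import numpy as np

def p_context_given_center(o, c, U, V):
    """P(o|c): softmax over the dot products u_w^T v_c for all words w in the vocabulary."""
    scores = U @ V[c]                                # u_w^T v_c for every w, shape (|V|,)
    scores -= scores.max()                           # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()    # normalize over the vocabulary
    return probs[o]

vocab_size, dim = 1000, 50
U = np.random.randn(vocab_size, dim) * 0.01          # context vectors u_w
V = np.random.randn(vocab_size, dim) * 0.01          # center vectors v_c
print(p_context_given_center(o=42, c=7, U=U, V=V))
```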

Gradient Descent

\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)

After training with gradient descent, vw and uw become very similar, so their average is usually taken as the final word vector for w.

Problems:

  1. Computation is very expensive, because J(θ) involves the entire corpus
  2. Hard to escape saddle points / local optima

Solution: Stochastic gradient descent (SGD) 随机梯度下降

  • Update on small batches: within each batch, the "corpus" contains only the words inside the sampled windows (see the sketch below)
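A rough sketch of the SGD loop, assuming a hypothetical `grad_J` function that returns the gradient of the loss for one sampled window rather than the whole corpus:

```python
import numpy as np

def sgd(theta, windows, grad_J, lr=0.025, epochs=5):
    """Stochastic gradient descent: update theta using one window (mini-batch) at a time."""
    for _ in range(epochs):
        np.random.shuffle(windows)
        for window in windows:
            theta -= lr * grad_J(theta, window)   # theta_new = theta_old - alpha * gradient
    return theta
```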

2. SGNS: Skip-gram with Negative Sampling

J(\theta) = \frac{1}{T}∑^T_{t=1}J_t(\theta)

J_t(\theta) = \log \sigma(u_o^Tv_c) + ∑^k_{i=1}E_{j\sim P(w)}[\log\sigma(-u_j^Tv_c)]

\sigma(x) = \frac{1}{1+e^{-x}}

Use word probabilities to choose k negative samples:

  • P(w): word probabilities

  • k: number of negative samples

For each positive example (the center word paired with one word from its context), sample several negative examples (the center word paired with random other words) and train a binary logistic regression (see the sketch below).

This maximizes the probability of the center word co-occurring with its context words, and minimizes its probability of co-occurring with the randomly sampled words.
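A minimal sketch of the negative-sampling objective J_t(θ) for one (center, context) pair, assuming the k negative samples were already drawn from P(w); all names and sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_c, u_o, U_neg):
    """J_t = log sigma(u_o^T v_c) + sum_j log sigma(-u_j^T v_c) over k negative samples."""
    positive = np.log(sigmoid(u_o @ v_c))           # push the true context word up
    negative = np.log(sigmoid(-U_neg @ v_c)).sum()  # push the k random words down
    return positive + negative                       # maximized during training

dim, k = 50, 5
v_c, u_o = np.random.randn(dim), np.random.randn(dim)
U_neg = np.random.randn(k, dim)                      # rows u_j for the k negative samples
print(sgns_objective(v_c, u_o, U_neg))
```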

Questions:

  • [x] Why is only one positive sample chosen per center word? Maybe it is a heavy simplification; check the code
    • The code indeed uses only a single positive example

Language Modeling

Assigns probability to a piece of text:

P(x^{(1)},...,x^{(T)})=P(x^{(1)})*P(x^{(2)}|x^{(1)})*...*P(x^{(T)}|x^{(T-1)},...,x^{(1)}) =\prod^T_{t=1}P(x^{(t)}|x^{(t-1)},...,x^{(1)})

N-gram Language Models

4-gram Language Model / 3rd-order Markov Model

P(w_4|w_1w_2w_3) = \frac{count(w_1w_2w_3w_4)}{count(w_1w_2w_3)}

Sparsity Problems:

1. The numerator count is zero

  • smoothing: add a small count δ for every w ∈ V

2. The denominator count is zero

  • backoff: fall back from the 4-gram to a 3-gram, and further down to a bigram if needed (a toy sketch follows this list)
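A toy sketch of a 4-gram estimate with add-δ smoothing and a simple backoff, assuming `count` is a dict of n-gram counts (tuples of words) built from the corpus; the names are illustrative:

```python
def ngram_prob(w4, context, count, vocab_size, delta=0.1):
    """P(w4 | w1 w2 w3) = (count(w1 w2 w3 w4) + delta) / (count(w1 w2 w3) + delta * |V|)."""
    while context:                       # back off to a shorter context if this one was never seen
        if count.get(context, 0) > 0:
            num = count.get(context + (w4,), 0) + delta
            den = count[context] + delta * vocab_size
            return num / den
        context = context[1:]            # 4-gram -> 3-gram -> bigram ...
    return 1.0 / vocab_size              # fallback: uniform over the vocabulary, for simplicity
```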

Storage Problem

Need to store counts for all n-grams in the corpus

RNN: Recurrent Neural Networks


Advantages:

  1. Can process any length input
  2. Computation for step t can (in theory) use information from many steps back
  3. Model size doesn't increase for longer input context
  4. Same weights applied on every timestep, so there is symmetry in how inputs are processed

Disadvantages:

  1. Recurrent computation is slow
  2. In practice, difficult to access information from many steps back

Train

Loss

Loss = J(\theta)=\frac{1}{T}\sum^T_{t=1}J^{(t)}(\theta)

SGD can be used: e.g., compute the loss J(θ) for one sentence (or a batch of sentences) at a time, as in the sketch below
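A compact sketch of the RNN language-model loss for one sentence; the weight names (`E`, `W_h`, `W_e`, `U`, `b1`, `b2`) and shapes are illustrative:

```python
import numpy as np

def rnn_lm_loss(tokens, E, W_h, W_e, U, b1, b2):
    """Average cross-entropy J(theta) = 1/T * sum_t J^(t) over one sentence of token ids."""
    h = np.zeros(W_h.shape[0])
    loss = 0.0
    for t in range(len(tokens) - 1):
        h = np.tanh(W_h @ h + W_e @ E[tokens[t]] + b1)   # recurrent hidden state update
        logits = U @ h + b2                               # scores over the vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        loss += -np.log(probs[tokens[t + 1]])             # J^(t): -log prob of the next word
    return loss / (len(tokens) - 1)
```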

Backpropagation

\frac{\rm d J^{(t)}}{\rm d W_h} = \sum^{t}_{i=1}\frac{\rm d J^{(t)}}{\rm d W_h}\Big|_{(i)}

The gradient with respect to a repeated weight is the sum of the gradient each time it appears.

Vanishing and Exploding Gradients

Vanishing gradients:

  • Caused by some derivatives in the chain rule being very small, which makes the overall gradient vanish
  • As a result, the model cannot learn dependencies on distant context

Exploding gradients:

  • The gradient is too large, so the update overshoots the point of convergence

Generating

  1. Give a beginning word/token to the RNN
  2. The RNN generates text by repeated sampling


Evaluate

The standard evaluation metric for language models is perplexity

  • lower is better

perplexity = \prod^T_{t=1}\left(\frac{1}{P_{LM}(x^{(t+1)}|x^{(t)},...,x^{(1)})}\right)^{1/T}

=\prod^T_{t=1}\left(\frac{1}{\hat{y}^{(t)}_{x_{t+1}}}\right)^{1/T} =\exp(J(\theta))

If perplexity = 7, the uncertainty of the prediction is equivalent to:

  • rolling a 7-sided die and hoping to land on the single correct answer (see the sketch below)
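Since perplexity = exp(J(θ)), it can be computed directly from the average cross-entropy loss; a quick sketch with illustrative numbers:

```python
import numpy as np

# per-timestep negative log-probabilities -log P_LM(x^(t+1) | ...), illustrative values
nll = np.array([1.2, 2.5, 0.9, 3.1, 1.7])

avg_loss = nll.mean()          # J(theta)
perplexity = np.exp(avg_loss)  # geometric mean of 1 / P_LM over the sequence
print(perplexity)              # e.g. ~7 means uncertainty like a 7-sided die
```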

Importance

  1. Benchmark task: measure our progress on understanding language
  2. Subcomponent of many NLP tasks, especially involving generating text or estimating the probability of text

Multi-layer RNNs

  1. RNNs are already deep in one dimension (over many timesteps)
  2. Stacking multiple RNNs lets the model learn higher-level features (sentence structure, sentiment polarity)
  3. Multi-layer RNNs are also called stacked RNNs

LSTM: Long Short-Term Memory RNNs

A solution to the vanishing gradient problem

  1. On each timestep t, there is a hidden state h(t) and a cell state c(t)

    • Both are vectors of length n

    • The cell stores long-term information

    • The LSTM can read, erase, and write information from the cell (like RAM)

  2. The selection of which information is erased/written/read is controlled by three corresponding gates

    • The gates are also vectors of length n
    • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
    • Gates are dynamic: value is computed based on the current context


Key: new information passing through is accumulated in the cell additively, rather than multiplied in.

summary:

  1. The LSTM makes it easier for the RNN to preserve information over many timesteps
  2. The LSTM doesn't guarantee that there is no vanishing/exploding gradient
  3. An LSTM only sees information from earlier timesteps, so a separate backward LSTM can be stacked on top (a bidirectional LSTM)

NMT: Neural Machine Translation (the seq2seq model)


Train

The encoder and decoder are optimized as a single system; backpropagation operates end-to-end.


Greedy Decoding

In decoding, the target sentence is generated by taking the argmax at each step.

Problem:

  1. The locally best choice at each step is not necessarily part of the best overall translation
  2. There is no way to undo decisions

Fix: Beam search decoding

On each step of the decoder, keep track of the k most probable partial translations (called hypotheses)

  • k is the beam size (in practice, around 5 to 10)

  • Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search

Example:

Beam size = k = 2

score(y_1,....,y_t)=\sum^t_{i=1}\log P_{LM}(y_i|y_1,....,y_{i-1},x)


For each of the k hypotheses, find the top k next words and calculate their scores (see the sketch below).
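A simplified beam-search sketch, assuming a hypothetical `log_prob_next(prefix)` that returns log P_LM(y | prefix, x) for every word in the vocabulary; it also applies the length normalization discussed further down:

```python
import numpy as np

def beam_search(log_prob_next, start_token, end_token, k=2, max_len=20):
    """Keep the k highest-scoring partial hypotheses at each step."""
    beams = [([start_token], 0.0)]                 # (hypothesis, cumulative log-prob score)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = log_prob_next(prefix)      # scores for every possible next word
            for w in np.argsort(log_probs)[-k:]:   # top-k next words for this hypothesis
                candidates.append((prefix + [w], score + log_probs[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == end_token:
                completed.append((prefix, score / len(prefix)))   # length-normalized score
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(completed, key=lambda c: c[1]) if completed else beams[0]
```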

Stop Criterion

In greedy decoding, stop when the model produces an <END> token.

In beam search decoding, different hypotheses may produce the <END> token on different timesteps.

  • When a hypothesis produces an <END> token, it is complete: set it aside and continue exploring the other hypotheses via beam search

Stop when:

  1. We reach timestep T (a pre-defined cutoff), or
  2. We have at least n completed hypotheses (a pre-defined cutoff)

Finishing up

Problem:

Each hypothesis y1, …, yt on the list has a score, but longer hypotheses have lower scores (they sum more negative log probabilities)

Fix:

Normalize by length

\frac{1}{t}\sum^t_{i=1}\log P_{LM}(y_i|y_1,....,y_{i-1},x)=score_h/length_h

Evaluate: BLEU (Bilingual Evaluation Understudy)

BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on (a toy sketch follows the list):

  1. n-gram precision (usually for 1-, 2-, 3- and 4-grams)

  2. Plus a penalty for too-short system translations
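A toy sketch of modified n-gram precision, the core of BLEU, for a single sentence pair; real BLEU clips counts against multiple references, combines several n-gram orders, and adds the brevity penalty:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference (with count clipping)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(ngram_precision(cand, ref, 2))   # bigram precision
```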

BLEU is useful but imperfect

  • There are many valid ways to translate a sentence, so a good translation can get a poor BLEU score simply because it has low n-gram overlap with the human translation

Difficulties

  1. OOV (out-of-vocabulary) words
  2. Domain mismatch between train and test data
  3. Maintaining context over long text
  4. Low-resource language pairs
  5. Failures to accurately capture sentence meaning
  6. Pronoun(or zero pronoun) resolution errors (Chinese)
  7. Morphological agreement errors (form of words and phrases)

Self-Attention and Transformers

Self-Attention

Problems with RNNs

  1. Linear interaction distance: O(sequence length)
  2. Lack of parallelizability

Fix: Word window

  • Give each embedding a word window over which to compute its context, removing the dependence on sequential timesteps
  • With a word window size of 5, the first layer covers a context of length 5 and the second layer covers a context of length 9
  • But when the sequence is very long, distant context is still lost


Improvement: Self-Attention

  • In self-attention, the queries qi, keys ki, values vi are drawn from the same source
  • Self-attention operation is as follows:

e_{ij}=q_i^Tk_j

\alpha_{ij}=\frac{exp(e_{ij})}{\sum_{j'}exp(e_{ij'})}

output_i = \sum_j\alpha_{ij}v_j
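A direct numpy sketch of the three equations above, operating on a matrix of token embeddings; the shapes and projection matrices are illustrative:

```python
import numpy as np

def self_attention(X, Q, K, V):
    """output_i = sum_j alpha_ij * v_j, with alpha = softmax over e_ij = q_i^T k_j."""
    queries, keys, values = X @ Q, X @ K, X @ V   # queries, keys, values all come from the same X
    e = queries @ keys.T                          # e_ij = q_i^T k_j, shape (T, T)
    e = e - e.max(axis=-1, keepdims=True)
    alpha = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)   # softmax over j
    return alpha @ values                         # weighted sum of values

T, d = 6, 8
X = np.random.randn(T, d)
Q, K, V = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Q, K, V).shape)           # (T, d)
```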


Advantages (compared with a full-length word window):

Dynamic connectivity

  1. The weights of a fully connected layer are learned iteratively during training; they encode which neurons should be attended to, but different sentences have different structures and need attention on different positions
  2. Self-attention scores are dot products of queries and keys, so they depend on the actual text

Interaction

  1. In a fully connected layer, the parts are connected independently: there is no interaction with the current query, nor with the other keys

Barriers and Solutions for Self-Attention as a building block

1. Doesn't have an inherent notion of order

⓵ Position representation vectors through sinusoids

Pros:

  • Periodicity indicates that maybe “absolute position” isn’t as important
  • Maybe it can extrapolate to longer sequences as the period restarts

Cons:

  • Not learnable

⓶ Position representation vectors learned from scratch: learn a matrix

Pros:

  • Flexibility: each position gets to be learned to fit the data

Cons:

  • Definitely can't extrapolate to indices outside 1, …, T

2. No nonlinearities for deep learning magic

Just apply the same feedforward network to each self-attention output

3. Need to ensure we don't “look at the future” when predicting

Mask the future in self-attention by setting the attention scores:

e_{ij} = \begin{cases} q_i^Tk_j, & j<i \\ -\infty, & j≥i \end{cases}
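A small sketch of applying this mask to a score matrix, assuming `e` has shape (T, T) with e[i, j] = q_i^T k_j:

```python
import numpy as np

def mask_future(e):
    """Keep e[i, j] only for j < i; set entries with j >= i to -inf so no position sees the future.
    (Many implementations also allow j == i, letting each position attend to itself.)"""
    T = e.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool))   # True where j >= i (upper triangle + diagonal)
    return np.where(mask, -np.inf, e)

e = np.random.randn(4, 4)                          # e[i, j] = q_i^T k_j, illustrative values
print(mask_future(e))                              # masked entries get zero weight after softmax
```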

Transformers (what’s more )

Key-Query-Value Attention

x_1, …, x_T are input vectors to the Transformer encoder, x_i ∈ R^d

  • k_i = Kx_i, K ∈ R^{d×d}
  • q_i = Qx_i, Q ∈ R^{d×d}
  • v_i = Vx_i, V ∈ R^{d×d}

Calculate:

Let X = [x_1; … ; x_T] ∈ R^{T×d}

output = softmax(XQ(XK)^T)\times XV


Multi-headed attention

look in multiple places in the sentence at once

Let \quad Q_l,K_l,V_l ∈ R^{d\times\frac{d}{h}}

h is the number of attention heads, and l ranges from 1 to h

output_l = softmax(XQ_l(XK_l)^T)\times XV_l, \quad output_l∈R^{d/h}

output = Y[output_1,..., output_h], \quad Y∈R^{d\times d}
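A sketch of multi-headed attention in the matrix form above, splitting d into h heads of size d/h; the shapes and the per-head projection lists are illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)

def multi_head_attention(X, Qs, Ks, Vs, Y):
    """output = Y [output_1, ..., output_h], each head attending with its own Q_l, K_l, V_l."""
    heads = []
    for Q_l, K_l, V_l in zip(Qs, Ks, Vs):
        scores = (X @ Q_l) @ (X @ K_l).T            # (T, T) attention scores for this head
        heads.append(softmax(scores) @ (X @ V_l))   # (T, d/h) per-head output
    return np.concatenate(heads, axis=-1) @ Y       # combine heads, project back to (T, d)

T, d, h = 6, 8, 2
X = np.random.randn(T, d)
Qs = [np.random.randn(d, d // h) for _ in range(h)]
Ks = [np.random.randn(d, d // h) for _ in range(h)]
Vs = [np.random.randn(d, d // h) for _ in range(h)]
Y = np.random.randn(d, d)
print(multi_head_attention(X, Qs, Ks, Vs, Y).shape)   # (T, d)
```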


This does not increase the actual amount of computation.

Training Tricks:

Residual Connections

Help models train better

  • Instead of X(i) = Layer(X(i-1)), where i indexes the layer,
  • let X(i) = X(i-1) + Layer(X(i-1))

Layer Normalization

Help models train faster


Scaled Dot Product

When the vector dimensionality is large, dot products easily become very large, which makes the softmax peaky and shrinks the gradients for the remaining entries.

output_l = softmax\left(\frac{XQ_l(XK_l)^T}{\sqrt{d/h}}\right)\times XV_l, \quad output_l∈R^{d/h}

Cross-attention

The keys and values are drawn from the encoder (like a memory)

The queries are drawn from the decoder

  • Let H = [h_1; … ; h_T] ∈ R^{T×d} (encoder outputs)
  • Let Z = [z_1; … ; z_T] ∈ R^{T×d} (decoder inputs)

k_i = Kh_i , v_i = Vh_i , q_i = Qz_i

output = softmax(ZQ(HK)^T)\times HV