RNN

1) sequence data
- Sequence data is data whose elements are tied together by an order; each element depends on the ones that came before it.
- A plain MLP or CNN expects a fixed-size input and has no notion of timesteps, so it cannot naturally handle variable-length, ordered sequence data.
2) advantages
- Can process any length input
- Computation for step t can (in theory) use information from many steps back
- Model size doesn't increase for longer input
- Same weights applied on every timestep, so there is symmetry in how inputs are processed.
3) disadvantages
- Recurrent computation is slow: step t cannot be computed before step t-1, so there is no parallelism across time.
- In practice, it is difficult to access information from many steps back.
4) types of RNN
- one-to-one, one-to-many (e.g. image captioning), many-to-one (e.g. sentiment classification), many-to-many (e.g. machine translation, per-frame video classification)

Structure of RNN
1) Overview
- A basic feed-forward network only passes information from the input to the hidden layer and then to the output. In a recurrent network, the hidden layer receives information both from the current input and from the previous timestep's hidden layer.
2) Calculating output & gradient
- forward
W_{xh}: input -> hidden
W_{hh}: hidden -> hidden (recurrent)
W_{hy}: hidden -> output
h_t = \phi_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
h_t = \phi_h([W_{xh} \; W_{hh}] [x_t ; h_{t-1}])
y_t = \phi_y(W_{hy} h_t + b_y)
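
A minimal NumPy sketch of the forward recurrence above. The sizes (3-d input, 4-d hidden, 2-d output), the choice of tanh for \phi_h, and the identity for \phi_y are illustrative assumptions, not fixed by these notes.

import numpy as np

# Illustrative sizes (assumptions): x_t in R^3, h_t in R^4, y_t in R^2
D_in, D_h, D_out = 3, 4, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(D_h, D_in))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(D_h, D_h))    # hidden -> hidden (recurrent)
W_hy = rng.normal(scale=0.1, size=(D_out, D_h))  # hidden -> output
b_h, b_y = np.zeros(D_h), np.zeros(D_out)

def rnn_forward(xs, h0):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y
    h, hs, ys = h0, [], []
    for x_t in xs:                        # the same weights are reused at every timestep
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
        ys.append(W_hy @ h + b_y)         # phi_y taken as the identity here
    return hs, ys

xs = [rng.normal(size=D_in) for _ in range(5)]    # a length-5 input sequence
hs, ys = rnn_forward(xs, np.zeros(D_h))
print(ys[-1])
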
- backward
L = \sum_{t=1}^{T} L_t
\frac{\partial L_t}{\partial W_{hh}} = \frac{\partial L_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \left( \sum_{k=1}^{t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_{hh}} \right)
\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
- The same recurrent weight appears in every factor \partial h_i / \partial h_{i-1}, so it is multiplied in t - k times.
- |w| < 1 : vanishing gradient
- |w| > 1 : exploding gradient
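
A toy numeric check of the two bullets above: treating each factor \partial h_i / \partial h_{i-1} as a single scalar w, the product over t - k = 50 steps either collapses toward zero or blows up. The values 0.9 and 1.1 and the 50-step horizon are arbitrary illustrations.

# Scalar caricature of the Jacobian product prod_{i=k+1}^{t} dh_i/dh_{i-1}
t_minus_k = 50
for w in (0.9, 1.1):
    print(w, "->", w ** t_minus_k)  # 0.9**50 is about 5e-3 (vanishing), 1.1**50 is about 117 (exploding)
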
LSTM

1) Overview
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}
- LSTM is introduced to overcome the gradient vanishing problem.
- Its basic component is the cell state.
- Four gates (i, f, o, g) control the cell state: they decide whether information is written to, kept in, and read from the cell.
2) forget gate
- whether to erase the cell
- The sigmoid output lies in (0, 1): a value close to 1 preserves the corresponding cell content entirely, while a value close to 0 erases it.

3) input gate
- whether to write to the cell
- i_t : a sigmoid layer decides which values will be updated.
- \tilde{C}_t : a tanh layer builds a candidate vector from the same inputs (h_{t-1}, x_t) as i_t.
- The contribution to the cell is the element-wise product i_t \odot \tilde{C}_t.

4) cell state update
- how much to write to the cell
- C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
- multiplying by f_t erases (part of) the old cell state; adding the input-gate term i_t \odot \tilde{C}_t writes the new candidate information.

5) output gate
- how much of the cell to reveal
- o_t : a sigmoid layer decides which information to pass on.
- The cell state is passed through tanh, so the revealed values lie in (-1, 1).
- h_t = o_t \odot \tanh(C_t): the element-wise product with the sigmoid output exposes only the information we want to keep.
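
Putting sections 2)-5) together, here is a hedged NumPy sketch of one LSTM step built around the compact gate equation from the overview. The sizes, the random initialization, and the single stacked weight matrix W (holding the i, f, o, g blocks) are assumptions for illustration only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One step: (i, f, o, g) = (sigmoid, sigmoid, sigmoid, tanh) applied to W [h_{t-1}; x_t] + b
    D_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # stacked pre-activations, shape (4*D_h,)
    i = sigmoid(z[0*D_h:1*D_h])                 # input gate: whether to write to the cell
    f = sigmoid(z[1*D_h:2*D_h])                 # forget gate: whether to erase the cell
    o = sigmoid(z[2*D_h:3*D_h])                 # output gate: how much of the cell to reveal
    g = np.tanh(z[3*D_h:4*D_h])                 # candidate values to write
    c_t = f * c_prev + i * g                    # cell state update: erase with f, add gated candidate
    h_t = o * np.tanh(c_t)                      # reveal a squashed view of the cell
    return h_t, c_t

# Illustrative sizes (assumptions): 3-d input, 4-d hidden/cell state
D_in, D_h = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * D_h, D_h + D_in))
b = np.zeros(4 * D_h)
h, c = np.zeros(D_h), np.zeros(D_h)
for x_t in rng.normal(size=(5, D_in)):          # run 5 timesteps
    h, c = lstm_step(x_t, h, c, W, b)
print(h)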

6) Backpropagation

- The LSTM architecture makes it easier for the RNN to preserve information over many timesteps (it does not guarantee that gradients never vanish or explode).
- By contrast, it is harder for a vanilla RNN to learn recurrent weights that preserve information in the hidden state.
- Along the cell state, the gradient is multiplied by the forget gate's vector of activations rather than by a fixed weight matrix, so the network can control gradient flow through suitable parameter updates.
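
To make the last bullet concrete, a tiny sketch (an illustration under assumed values, not the full backward pass) of how the gradient flowing from C_t back to C_{t-1} is scaled element-wise by the forget gate instead of being multiplied by a weight matrix.

import numpy as np

# From C_t = f_t * C_{t-1} + i_t * g_t, the local derivative dC_t/dC_{t-1} is simply f_t (element-wise).
rng = np.random.default_rng(0)
d_c_t = rng.normal(size=4)                   # incoming gradient w.r.t. C_t (illustrative values)
f_t = np.array([0.99, 0.9, 0.5, 0.01])       # forget-gate activations chosen by the network
d_c_prev = f_t * d_c_t                       # no matrix multiply by W, unlike the vanilla-RNN hidden state
print(d_c_prev)                              # entries with f_t near 1 keep their gradient almost intact
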
GRU

1) Overview
- A simplified version of the LSTM.
- No separate cell state, only a hidden state.
- The forget gate and input gate are combined into a single update gate.
- A reset gate is added.
2) forget & input gate
- Use same forget gate as LSTM.
- c_t and h_t = h_t
- z_t controls forget and input gate. -> if t-1 is memorized, t is erased.
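
A hedged NumPy sketch of one GRU step matching the description above. The separate weight matrices Wz, Wr, Wh, the sizes, and the (1 - z_t) convention follow the common formulation (e.g. in the colah post cited below) and are assumptions here rather than something fixed by these notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    # One GRU step: no cell state, only h_t; z_t trades off keeping h_{t-1} vs. writing the candidate.
    xh = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ xh + bz)                      # update gate (forget + input combined)
    r = sigmoid(Wr @ xh + br)                      # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand         # what is kept from t-1 is not overwritten at t

# Illustrative sizes (assumptions): 3-d input, 4-d hidden
D_in, D_h = 3, 4
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(D_h, D_h + D_in)) for _ in range(3))
bz = br = bh = np.zeros(D_h)
h = np.zeros(D_h)
for x_t in rng.normal(size=(5, D_in)):             # run 5 timesteps
    h = gru_step(x_t, h, Wz, Wr, Wh, bz, br, bh)
print(h)
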
ref.
heung-bae-lee.github.io/2020/01/12/deep_learning_08/
머신러닝 교과서 with 파이썬 (길벗)
cs231n.stanford.edu/slides/2020/lecture_10.pdf
colah.github.io/posts/2015-08-Understanding-LSTMs/
ratsgo.github.io/natural%20language%20processing/2017/03/09/rnnlstm/