Last time, we learned how to compute loss. So how do we find the parameters that minimize that loss? We can do so by taking the derivative of the (convex) loss function, i.e., its gradient.

 

✔︎ Reducing Loss: An Iterative Approach

 

An iterative approach to training a model

 

STEP 1

The "model" takes one or more features as input and returns one prediction (y′) as output.

 

 

For linear regression problems, the starting values aren't important. We could pick random values, but we'll just take the following trivial values instead:

  • b = 0
  • w1 = 0

 

Suppose that the first feature value is 10. Plugging that feature value into the prediction function yields:

y′ = 0 + 0 ⋅ 10 = 0
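For concreteness, this is what that prediction function might look like in Python (a minimal sketch of my own; `predict`, `b`, and `w1` simply mirror the names above, not the course's code):

```python
def predict(x, b=0.0, w1=0.0):
    """Linear model: y' = b + w1 * x."""
    return b + w1 * x

print(predict(10))  # 0 + 0 * 10 = 0.0
```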

 

STEP 2

The "Compute Loss" part of the diagram is the loss function that the model will use. Suppose we use the squared loss function. The loss function takes in two input values:

  • y′: The model's prediction for features x
  • y: The correct label corresponding to features x.
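As a rough sketch (assuming the squared loss (y − y′)² from the previous lesson; the function name `squared_loss` is my own):

```python
def squared_loss(y, y_pred):
    """Squared loss for a single example: (y - y')**2."""
    return (y - y_pred) ** 2

# Using the prediction y' = 0 from STEP 1 and a hypothetical label y = 5:
print(squared_loss(5, 0.0))  # 25.0
```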

 

STEP 3

At last, we've reached the "Compute parameter updates" part of the diagram.

It is here that the machine learning system examines the value of the loss function and generates new values for b and w1. The learning continues iterating until the algorithm discovers the model parameters with the lowest possible loss. When that happens, we say that the model has converged.
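Putting the three steps together, the loop might look roughly like this (a self-contained sketch of my own, not the course's code; it uses the gradient updates explained in the next section):

```python
def train(xs, ys, learning_rate=0.01, max_iters=10_000, tolerance=1e-6):
    b, w1 = 0.0, 0.0                                   # STEP 1: trivial starting values
    for _ in range(max_iters):
        preds = [b + w1 * x for x in xs]               # model predictions y'
        # STEP 2: mean squared loss over the data set
        loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)
        # STEP 3: compute parameter updates (gradients of the loss w.r.t. b and w1)
        grad_b = sum(-2 * (y - p) for y, p in zip(ys, preds)) / len(xs)
        grad_w1 = sum(-2 * (y - p) * x for x, y, p in zip(xs, ys, preds)) / len(xs)
        b -= learning_rate * grad_b
        w1 -= learning_rate * grad_w1
        if abs(grad_b) < tolerance and abs(grad_w1) < tolerance:
            break                                      # gradients ~ 0: the model has converged
    return b, w1, loss

print(train([1, 2, 3, 4], [3, 5, 7, 9]))  # should approach b = 1, w1 = 2
```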

 

 

✔︎ Gradient Descent

 

Regression problems yield convex loss vs. weight plots.

 

A convex function has exactly one minimum, and the slope there is 0.

However, computing the gradient for every possible value of w is inefficient. This is where gradient descent comes in.

 

A starting point for gradient descent.

 

First, pick a starting point. Where exactly you start doesn't matter much. In the figure, the starting point is a weight value slightly greater than 0.

The gradient descent algorithm then calculates the gradient of the loss curve at the starting point.

When there are multiple weights (w), the gradient is a vector of partial derivatives with respect to the weights.

 

Note that a gradient is a vector, so it has both of the following characteristics:

  • a direction
  • a magnitude

 

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

 

Gradient descent relies on negative gradients.

 

To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point as shown in the following figure:

 

A gradient step moves us to the next point on the loss curve.

 

The gradient descent then repeats this process, edging ever closer to the minimum.
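In code, a single gradient step is just the line below (an illustrative sketch; the 1-D loss (w − 3)² and its gradient are my own toy example, not the course's):

```python
def gradient_step(w, grad, learning_rate):
    """Move in the direction of the negative gradient."""
    return w - learning_rate * grad

# Toy example: loss(w) = (w - 3)**2, so d(loss)/dw = 2 * (w - 3).
w = 0.0
for _ in range(25):
    w = gradient_step(w, 2 * (w - 3), learning_rate=0.1)
print(w)  # edges ever closer to the minimum at w = 3
```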

 

 

✔︎ Learning Rate

Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.

ex) gradient magnitude: 2.5, learning rate: 0.01

⇒ The gradient descent algorithm will pick the next point 0.025 away from the previous point.

 

Hyperparameters are the knobs that programmers tweak in machine learning algorithms. 

 

Learning rate

 

If the learning rate is too small, finding the minimum takes far too long; conversely, if the learning rate is too large, the step may overshoot the minimum entirely.

There is a Goldilocks learning rate for every problem, and it depends on how flat the loss function is.

- If the loss function is steep, keep the learning rate small; if it is flat, a larger learning rate is safe.

- Equivalently, if the gradient near the starting point is small (close to 0), a larger learning rate can compensate for the small step size; if the magnitude of the gradient is large, keep the learning rate small.

 

In practice, finding a "perfect" (or near-perfect) learning rate is not essential for successful model training. The goal is to find a learning rate large enough that gradient descent converges efficiently, but not so large that it never converges.
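A quick numeric illustration of this trade-off (my own toy example on loss(w) = w², not from the course):

```python
def run(learning_rate, steps=20, w=5.0):
    for _ in range(steps):
        w -= learning_rate * 2 * w   # gradient of w**2 is 2w
    return w

print(run(0.001))  # too small: after 20 steps w has barely moved from 5
print(run(0.4))    # reasonable: w is already very close to the minimum at 0
print(run(1.1))    # too large: w overshoots and grows with every step
```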

 

- The ideal learning rate in one dimension is 1/f″(x) (the inverse of the second derivative of f(x) at x).

- The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial derivatives).
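A quick sanity check on the 1/f″(x) claim (my own worked example, not from the course): for a quadratic loss f(x) = a(x − c)², the second derivative is 2a, and one gradient step with learning rate 1/(2a) lands exactly on the minimum.

```python
a, c = 3.0, 7.0                 # f(x) = a * (x - c)**2, minimum at x = c
x = 0.0                         # arbitrary starting point
grad = 2 * a * (x - c)          # f'(x)
ideal_lr = 1 / (2 * a)          # 1 / f''(x)
print(x - ideal_lr * grad)      # 7.0 -- reaches the minimum in a single step
```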

 

 

✔︎ Stochastic Gradient Descent (SGD)

In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.

As the batch gets larger, each iteration takes longer to compute. The chance of redundant data appearing in the batch also increases.

However, if we sample examples from the data set instead of using all of it, we can save a great deal of computation.

 

** stochastic : The one example comprising each batch is chosen at random.

 

full-batch
  • The batch is the entire data set.
  • A very large batch may cause even a single iteration to take a very long time to compute.

SGD (stochastic gradient descent)
  • Uses only a single example (a batch size of 1) per iteration.
  • Very noisy.

mini-batch SGD
  • Typically between 10 and 1,000 examples, chosen at random.
  • A compromise between full-batch iteration and SGD.
  • Reduces the amount of noise in SGD but is still more efficient than full-batch.
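A minimal mini-batch SGD sketch for the single-feature linear model above (my own illustration with NumPy; the batch size, learning rate, and function name are arbitrary choices, not the course's code):

```python
import numpy as np

def minibatch_sgd(xs, ys, batch_size=32, learning_rate=0.01, epochs=50):
    """xs, ys: NumPy arrays of features and labels."""
    b, w1 = 0.0, 0.0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(len(xs))            # batches are chosen at random
        for start in range(0, len(xs), batch_size):
            idx = order[start:start + batch_size]
            x, y = xs[idx], ys[idx]
            pred = b + w1 * x
            # gradients of the mean squared loss over this batch only
            grad_b = (-2 * (y - pred)).mean()
            grad_w1 = (-2 * (y - pred) * x).mean()
            b -= learning_rate * grad_b
            w1 -= learning_rate * grad_w1
    return b, w1

# batch_size=len(xs) would be full-batch; batch_size=1 would be plain SGD.
```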

 

To simplify the explanation, we focused on gradient descent for a single feature. Rest assured that gradient descent also works on feature sets that contain multiple features (e.g., neural nets).

 

Convex
  • Just one minimum
  • Weights can start anywhere

Non-convex
  • More than one minimum
  • Strong dependency on initial values

 

 

 

🔑 Key Takeaways

  • The model builds a prediction function → infer a prediction → Compute Loss measures the loss with a loss function → based on the prediction and the loss, Compute Parameter Updates revises the model's parameters.
  • For a convex loss function, the minimum of the loss is the point where its gradient is 0.
  • The gradient is a vector, so it has two properties: direction and magnitude.
  • Once a learning rate is set, the gradient descent algorithm moves to the next point by 'gradient magnitude x learning rate'.
  • Know the difference between full-batch, SGD, and mini-batch SGD.

 

 

🏷 developers.google.com/machine-learning/crash-course/reducing-loss/video-lecture

 

 
