Linear regression¶
Linear regression is a linear approach to modeling the relationship between a continuous scalar response (or dependent variable) and one or more explanatory variables (or independent variables) (Ref.: Wikipedia).
Put simply, linear regression is about finding the single straight line that best describes the given data.
I will explain linear regression using a housing price prediction example.
Univariate linear regression¶
Univariate linear regression is linear regression with one variable; in other words, it finds a line using a single explanatory variable.
The figure below describes housing price prediction with univariate linear regression.
Process¶
A hypothesis function is a prediction model whose parameters are learned by minimizing a cost function over the training set.
Here is an example of a training set, along with the notation used for it:
\(m = \text{The number of training examples (samples)}\)
\(x\text{'s} = \text{Input variables (features)}\)
\(y\text{'s} = \text{Output variables (target variables)}\)
\((x, y) = \text{One training example}\)
\((x_i, y_i) = \text{The } i^{th} \text{ training example}\)
In conclusion, the goal is to use the given data to find the parameters \(\theta_0\) and \(\theta_1\) of the line \(h_{\theta}(x) = \theta_{0} + \theta_{1}x\).
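As a small illustrative sketch (the parameter values below are made up for illustration, not fitted), the univariate hypothesis can be written in Python as:

```python
# Univariate hypothesis: h(x) = theta0 + theta1 * x
def hypothesis(theta0, theta1, x):
    """Predict the target (e.g., house price) from one feature (e.g., size)."""
    return theta0 + theta1 * x

# Hypothetical parameter values, for illustration only
theta0, theta1 = 50.0, 0.2
print(hypothesis(theta0, theta1, 2104))  # predicted price for a house of size 2,104
```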
Cost function¶
To find the best parameters, we should minimize the cost over the training examples. So, we need a cost function that measures the loss between predictions and true answers.
Source: Towards Data Science
Idea:
Choose \(\theta_0, \theta_1\) so that \(h_{\theta}(x)\) is close to \(y\) for the training samples
Example:
One common choice of cost function is the Mean Squared Error (MSE, or L2 loss):
\(J(\theta_0, \theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^{m} \left(h_{\theta}(x_i) - y_i\right)^2\)
The goal is then to minimize this squared error function.
Besides MSE, there are many other cost functions (Link).
In the end, our goal is to find \(\theta_0,\ \theta_1\) that minimize this MSE. One direct way to obtain the optimal parameters is to solve the closed-form normal equation \(\theta = (X^T X)^{-1} X^T y\). However, as the data grows, its time complexity grows as \(O(n^3)\), which becomes inefficient. Gradient descent is a way to solve this problem.
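Here is a minimal NumPy sketch of the MSE cost described above (the \(1/2m\) scaling follows the common convention; the data points are illustrative):

```python
import numpy as np

def mse_cost(theta0, theta1, x, y):
    """Mean squared error J(theta0, theta1) over m training examples."""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy housing data (sizes and prices), for illustration only
x = np.array([2104.0, 1416.0, 1534.0])
y = np.array([460.0, 232.0, 315.0])
print(mse_cost(0.0, 0.2, x, y))
```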
Gradient descent¶
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point (Ref.: Wikipedia).
In other words, gradient descent updates the weights by subtracting the gradient of the error function (scaled by a learning rate) from the current weights.
This is how the weights are updated using gradient descent over the whole training set:
\(\displaystyle w := w - \eta\,\nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_{i}(w)\)
We will now go through all of this step by step.
Idea:
Given some cost function \(J(\theta_0, \theta_1)\)
Find \(\displaystyle\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)\)
Process:
Start with some \(\theta_0, \theta_1\)
Keep changing \(\theta_0, \theta_1\) to reduce \(J(\theta_0, \theta_1)\) until we hopefully end up at a minimum
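Putting the idea and the process together, here is a minimal NumPy sketch of batch gradient descent for the univariate case (the learning rate, iteration count, and toy data are arbitrary choices for illustration):

```python
import numpy as np

def gradient_descent(x, y, eta=0.1, n_iters=2000):
    """Fit theta0, theta1 by repeatedly stepping against the gradient of J."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0                  # start with some theta0, theta1
    for _ in range(n_iters):
        error = (theta0 + theta1 * x) - y      # h_theta(x_i) - y_i for every i
        grad0 = error.sum() / m                # dJ/dtheta0
        grad1 = (error * x).sum() / m          # dJ/dtheta1
        theta0 -= eta * grad0                  # update both parameters
        theta1 -= eta * grad1                  # simultaneously
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])             # generated from y = 1 + 2x
print(gradient_descent(x, y))                  # should approach (1.0, 2.0)
```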
Types¶
Batch gradient descent
Each step of gradient descent uses the entire training set.
Stochastic gradient descent (SGD)
Each step of gradient descent uses a single training example, or a small random subset of the training set called a mini-batch, as sketched below.
Others (Link)
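As a rough sketch of the difference (the toy data and hyperparameters are illustrative, not prescriptive), a mini-batch SGD step computes the gradient on a random subset instead of the full training set:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(theta, X, y, eta=0.1, batch_size=2):
    """One mini-batch SGD step; X carries a leading column of ones for the bias."""
    idx = rng.choice(len(y), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch_size              # gradient on the batch only
    return theta - eta * grad

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
theta = np.zeros(2)
for _ in range(3000):
    theta = sgd_step(theta, X, y)
print(theta)  # a (possibly noisy) estimate near [1.0, 2.0]
```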
Algorithm¶
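For linear regression with the MSE cost, gradient descent reduces to the following standard update rule (stated here for reference; \(\eta\) is the learning rate, and both parameters are updated simultaneously):
\(\text{repeat until convergence:} \\ \theta_{0} := \theta_{0} - \eta\,\frac{1}{m} \displaystyle\sum_{i=1}^{m} \left(h_{\theta}(x_{i}) - y_{i}\right) \\ \theta_{1} := \theta_{1} - \eta\,\frac{1}{m} \displaystyle\sum_{i=1}^{m} \left(h_{\theta}(x_{i}) - y_{i}\right)x_{i}\)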
Linear equation movement¶
How fast we descend along the cost function is decided by the learning rate \(\eta\).
Also, we do not need to decrease the learning rate over time, because the gradient itself becomes smaller as we approach a minimum, so the update steps naturally shrink.
While being trained, the model can get stuck in a local minimum (note that for linear regression with the MSE cost, \(J\) is convex, so gradient descent does not suffer from this in practice):
Multivariate linear regression¶
Multivariate linear regression is linear regression with multiple variables.
Features and hypothesis function¶
Univariate linear regression:
Features

| Size | Price |
| --- | --- |
| 2,104 | 460 |
| 1,416 | 232 |
| 1,534 | 315 |
| … | … |
Hypothesis function
\(h_{\theta}(x) = \theta_{0} + \theta_{1}x\)
Multiple linear regression:
Features

| Size | Price | # of rooms | # of floors | Age |
| --- | --- | --- | --- | --- |
| 2,104 | 460 | 5 | 1 | 45 |
| 1,416 | 232 | 3 | 2 | 40 |
| 1,534 | 315 | 3 | 2 | 30 |
| … | … | … | … | … |
Hypothesis function
\(h_{\theta}(x) = \displaystyle\sum_{i=0}^{n} \theta_{i}x_{i},\ \text{where}\ \theta_{i}\ \text{are the weights and}\ x_{0}=1\)
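A minimal NumPy sketch of this hypothesis, reusing the features from the table above (the weight values are illustrative placeholders, not fitted parameters):

```python
import numpy as np

# Feature matrix from the table above: size, # of rooms, # of floors, age
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
], dtype=float)

X = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend x0 = 1 for the bias term
theta = np.array([80.0, 0.1, 25.0, -5.0, -1.0])   # illustrative weights only

predictions = X @ theta                           # h_theta(x) = sum_i theta_i * x_i
print(predictions)
```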
Gradient descent for multiple variables¶
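The update rule carries over directly from the univariate case. For each weight \(\theta_{j}\) (a standard result, shown here for reference):
\(\theta_{j} := \theta_{j} - \eta\,\frac{1}{m} \displaystyle\sum_{i=1}^{m} \left(h_{\theta}(x_{i}) - y_{i}\right)x_{i,j}, \quad j = 0, 1, \ldots, n\)
where \(x_{i,j}\) is the \(j\)-th feature of the \(i\)-th example and \(x_{i,0} = 1\); all weights are updated simultaneously.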
Feature scaling¶
Features usually have different scales, so we need to bring them all onto a similar scale:
Before: features on very different scales make the cost contours elongated, so gradient descent oscillates and converges slowly.
After: features on a similar scale make the contours closer to circles, so gradient descent converges faster.
Types¶
Mean normalization
\(x_{i\_mean} = \frac{x_{i} - average(x_{i})}{range(x_{i})}, (-1 \leq x_{i\_mean} \leq 1)\)
Min-max normalization (rescaling)
\(x_{i\_minmax} = \frac{x_{i} - min(x_{i})}{range(x_{i})}, (0 \leq x_{i\_minmax} \leq 1)\)
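A short NumPy sketch of both scalings applied to the size feature from the earlier table (the feature choice is illustrative):

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0])        # the "size" feature

# Mean normalization: values fall roughly within [-1, 1]
x_mean_norm = (x - x.mean()) / (x.max() - x.min())

# Min-max normalization: values fall exactly within [0, 1]
x_min_max = (x - x.min()) / (x.max() - x.min())

print(x_mean_norm, x_min_max)
```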
Normal equation¶
An alternative method to obtain the weights in closed form
No iteration needed
Method:
\(\text{For } n \text{ features and } m \text{ training examples, set every partial derivative to zero:} \\ \frac{\partial J}{\partial \theta_{j}} = \displaystyle\sum_{i=1}^{m} \left(\theta_{0}x_{i,0} + \theta_{1}x_{i,1} + \cdots + \theta_{n}x_{i,n} - y_{i}\right)x_{i,j} = 0, \quad j = 0, \ldots, n \\ \Rightarrow \theta = (X^{T}X)^{-1}X^{T}Y\)
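A direct NumPy sketch of the normal equation (a pseudo-inverse is used for numerical safety; the data is a noise-free toy set for illustration):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # with x0 = 1 column
y = np.array([3.0, 5.0, 7.0, 9.0])                              # y = 1 + 2x

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # recovers [1.0, 2.0] for this noise-free data
```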
Gradient descent vs. Normal equation¶
| Gradient descent | Normal equation |
| --- | --- |
| Need to choose a learning rate | No need to choose a learning rate |
| Many iterations | No iteration |
| Relatively little computation, even for large \(n\) | A lot of computation (inverting \(X^{T}X\) costs \(O(n^{3})\)) |
Summary¶
Linear regression is a regression analysis method that builds a regression model with a cost function and gradient descent using a training set
Multivariate linear regression can be performed in the same way as univariate linear regression
There are two methods for multivariate linear regression
Gradient descent
Normal equation
Each method has its own benefits