As I’m taking the Models and algorithms in NLP-application course, I am taking notes here on some classical machine learning models. In addition, I try to build the models without importing sklearn packages.
A brief background
I believe linear regression is a widely known quantitative method in many fields, especially economics. It assigns weights to the variables and then adds a bias term. The formula:

$$\hat{y} = Ax + b$$

where $x$ is the input feature vector, $A$ is the weight vector, and $b$ is the bias.
However, this method can’t really handle a classification task, because the range of $Ax + b$ is $(-\infty, +\infty)$, while a binary label calls for an output in $(0, 1)$. The Sigmoid function maps the linear score into that interval:

$$\hat{y} = \sigma(Ax + b), \quad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}}$$

The next part discusses how to apply this approach to a binary-classification task.
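To make this concrete, here is a small NumPy sketch of the model (my own illustration, not course code; the names `sigmoid` and `predict_proba` are mine):

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(A, x, b):
    # Linear score Ax + b, then Sigmoid to read it as P(y = 1 | x).
    return sigmoid(np.dot(A, x) + b)

# Example: a 2-feature input.
A, b = np.array([0.5, -0.3]), 0.1
x = np.array([1.0, 2.0])
print(predict_proba(A, x, b))  # a probability between 0 and 1
```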
Define a loss function
So far we have defined the function to fit the data. The question is: how do we measure whether the function fits the data well or not? A loss function is what we need. Before defining the loss function, we should consider one more thing.
The probability of a label should be tied to the predicted result. For a binary label it can be written as

$$P(y \mid x) = \hat{y}^{\,y}\,(1 - \hat{y})^{\,1 - y}$$

When $y = 1$, the probability is $\hat{y}$; when $y = 0$, it is $1 - \hat{y}$.
If we take the log on both sides, it turns into:

$$\log P(y \mid x) = y \log \hat{y} + (1 - y)\log(1 - \hat{y})$$

if $y = 1$, $\log P = \log \hat{y}$;

if $y = 0$, $\log P = \log(1 - \hat{y})$.

However, in the above formula, when $y = 1$, a better prediction makes $\log \hat{y}$ larger, so fitting the data means maximizing the log-probability, while a loss function is something we minimize. Negating it gives the binary cross-entropy loss:

$$L = -\bigl[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,\bigr]$$
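Written as code, the loss is a direct transcription of the formula (a sketch of mine; the `eps` clipping to avoid `log(0)` is a practical addition, not part of the derivation):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # L = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep log() finite
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```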
Calculate gradients
Now that there is a loss function, we need to update the weights, so we need to calculate the gradients with respect to $A$ and $b$. Writing $z = Ax + b$ and applying the chain rule:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial A}$$
The first and last terms are pretty easy to compute:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}, \qquad \frac{\partial z}{\partial A} = x$$

Let’s look at the Sigmoid function. Its derivative has a convenient closed form:

$$\frac{\partial \hat{y}}{\partial z} = \sigma(z)\bigl(1 - \sigma(z)\bigr) = \hat{y}\,(1 - \hat{y})$$
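As a quick sanity check on that derivative identity, a finite-difference comparison (a tiny sketch of my own, not part of the course notes) agrees with the closed form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))               # sigma(z) * (1 - sigma(z))
print(numeric, analytic)  # both approximately 0.2217
```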
Thus, multiplying the three pieces together, the terms cancel and the equation can be concluded as:

$$\frac{\partial L}{\partial A} = (\hat{y} - y)\,x$$
The same holds for the bias gradient, since $\partial z / \partial b = 1$:

$$\frac{\partial L}{\partial b} = \hat{y} - y$$
Then the weights will be updated through gradient descent with learning rate $\eta$:

$$A \leftarrow A - \eta\,\frac{\partial L}{\partial A}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$$
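Here is how a single update step could look in NumPy (a sketch under my own naming: `lr` stands for $\eta$, and this is per-sample SGD rather than a batch update):

```python
import numpy as np

def sgd_step(A, b, x, y, lr=0.1):
    # Forward pass: y_hat = sigmoid(Ax + b)
    y_hat = 1.0 / (1.0 + np.exp(-(np.dot(A, x) + b)))
    # Gradients derived above: dL/dA = (y_hat - y) * x, dL/db = y_hat - y
    error = y_hat - y
    # Gradient-descent update with learning rate lr
    return A - lr * error * x, b - lr * error
```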
More analysis
The original linear model is very useful in regression and in classification (with Sigmoid). There is still a concern about the formula: nothing limits the values of the weights, so they could become extremely large or small. That carries a risk of overfitting and, unfortunately, leads to very volatile results. Suppose the feature values are always large in the training dataset; the model then may not work well when it meets small values.
One solution is to add a penalty to the loss function. The penalty can be the L1 norm of the weights (Lasso) or the squared L2 norm (Ridge), scaled by a coefficient $\lambda$:

$$L_{reg} = L + \lambda \sum_i |A_i| \qquad \text{or} \qquad L_{reg} = L + \lambda \sum_i A_i^2$$
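With the L2 penalty, the only change to the update step is an extra $2\lambda A$ term in the weight gradient (again a sketch of my own; `lam` stands for $\lambda$, and the bias is conventionally left unpenalized):

```python
import numpy as np

def sgd_step_l2(A, b, x, y, lr=0.1, lam=0.01):
    # Same forward pass and error as before.
    y_hat = 1.0 / (1.0 + np.exp(-(np.dot(A, x) + b)))
    error = y_hat - y
    # The penalty lam * sum(A_i^2) contributes 2 * lam * A to dL/dA.
    return A - lr * (error * x + 2.0 * lam * A), b - lr * error
```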
Code
Coming soon.
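Until then, here is a minimal end-to-end sketch of my own that strings the pieces above together on toy data (illustrative only, not a reference implementation; all function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.1, epochs=500, lam=0.0):
    # X: (n_samples, n_features); y: (n_samples,) with labels in {0, 1}.
    n, d = X.shape
    A, b = np.zeros(d), 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ A + b)                   # forward pass, full batch
        error = y_hat - y                            # (y_hat - y) per sample
        A -= lr * (X.T @ error / n + 2.0 * lam * A)  # averaged gradient + L2
        b -= lr * error.mean()
    return A, b

# Toy usage: points above the line x1 + x2 = 1 are labeled 1.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)
A, b = train(X, y)
accuracy = ((sigmoid(X @ A + b) > 0.5) == y).mean()
print("train accuracy:", accuracy)
```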