Another machine learning class.
We wrapped up linear regression by discussing error functions. I hadn’t really focused on these before, even though I’ve built some crude ‘by-hand’ error-minimizing functions at work.
The error function is the sum of the squared error terms in a regression. The squaring is important because it heavily penalizes large individual deviations.
For simple linear regression, this is pretty easy because there are only two parameters to fit (an intercept and a slope). When you start packing in the variables (or exponents of variables), minimizing this error function gets tough.
We then take this error function and multiply it by 1/(2m), where m is the sample size, for some reason. That reason is to be discussed later.
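As a sketch of what that error (cost) function looks like in code — assuming a simple two-parameter hypothesis h(x) = theta0 + theta1·x; the function and variable names here are mine, not from the lecture:

```python
# Sketch of the squared-error cost function for simple linear regression.
# Names (theta0, theta1, squared_error_cost) are my own.

def squared_error_cost(theta0, theta1, xs, ys):
    """J = 1/(2m) * sum((h(x) - y)^2), where h(x) = theta0 + theta1*x."""
    m = len(xs)
    residuals = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals) / (2 * m)

# A perfect fit (here y = 2x + 1, so theta0=1, theta1=2) gives zero cost:
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
print(squared_error_cost(1.0, 2.0, xs, ys))  # → 0.0
```

Note how a single big residual dominates the sum once it’s squared — that’s the penalizing effect described above.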
That’s where the gradient descent algorithm comes in, whose purpose is to find a local minimum of the error function. It does this in a clever way: by repeatedly subtracting the slope (the partial derivative) from each of the parameter terms. As you approach the minimum, the slopes get smaller and smaller, and you wind up at zero slope. Cool!
This step is dampened (or amplified, I suppose) by another term, so you take bigger or smaller steps ‘down the slope’. This dampening term is called the ‘learning rate’. I find that terminology a bit confusing.
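A minimal sketch of that update loop for the two-parameter case — the learning rate value, step count, and names are my own choices, not from the class:

```python
def gradient_descent(xs, ys, alpha=0.1, steps=1000):
    """Fit h(x) = theta0 + theta1*x by repeatedly stepping down the slope.

    alpha is the learning rate: it scales how big a step we take."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        # Slope (partial derivative) of the cost with respect to each parameter.
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Step *against* the slope; both updates use the old parameter values.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Fitting points on y = 2x + 1 should recover roughly (1.0, 2.0):
print(gradient_descent([0, 1, 2, 3], [1, 3, 5, 7]))
```

If alpha is too large the steps overshoot and the loop diverges; too small and it crawls — which is why the learning rate gets so much attention.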
One key is to make sure you apply the slope to each term simultaneously, remembering that the slope in each ‘direction’ is going to be a bit different. If you visualize the whole process as a three-dimensional park with hills and valleys (I can’t find the picture he used in the notes), it’s easier to see that the slope with respect to each variable will be a bit different. So the equations for calculating each slope will be a bit different.
And there’s finally a discussion about resource use. We’re told that gradient descent is faster than the closed-form linear algebra solution at large dataset sizes — the closed-form route involves a matrix inversion, which gets expensive as the number of features grows.
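For reference, here’s a sketch of that closed-form solution in the two-parameter case, where the normal equation collapses to the familiar least-squares formulas and no explicit matrix inversion is needed (names mine):

```python
def closed_form_fit(xs, ys):
    """Closed-form least-squares fit for h(x) = theta0 + theta1*x.

    With one feature this is just two formulas; with n features the
    general normal equation requires inverting an (n+1)x(n+1) matrix,
    which is the step that gets expensive as n grows."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

print(closed_form_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # → (1.0, 2.0)
```

Unlike gradient descent, this lands on the exact minimum in one shot — no learning rate, no iterations — which is why the comparison is about scaling, not accuracy.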
That’s interesting and I wonder how sensitive such a conclusion is to language type. One thing I’ve learned is that speed of calculation is actually a multi-dimensional problem. Dataset size is one issue, of course, but so is language compilation/interpretation time. If you implemented the closed-form solution in C but did the gradient descent in some slower language like VBA or something, is the answer so clear? Maybe we could derive a cost function for compile times and figure it out!