# Stanford Machine Learning Notes

Wow, I’m really loving this class. Lecture 4 slides.

[I should probably point out to regular readers that these notes aren’t really fit for mass consumption. I’m not going to bother even trying to build a complete understanding of each of the concepts so they’re really just for personal use.

That being said, they’re on here and if you’re interested in seeing what I spend just about all my spare time doing, read on!]

Ok, today we covered multivariate regression and we’re venturing into some virgin territory for me now. It’s pretty awesome stuff.

So we started out with the insight that we could express a multivariate regression as a matrix multiplication: the transposed vector of coefficients times the vector of inputs. What a mouthful. Believe me, it’s simpler than it sounds.

The idea is that you have a set of values (one slope value for each independent variable) and a set of inputs (the independent variables), and matrix multiplication just gives us a clean way of grouping them and then mashing them all together at once. This is clearly a programming optimization. If you did it by hand, it wouldn’t really be any easier.
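To make the "mashing together" concrete for myself, here is a minimal sketch in plain Python. The names `predict`, `theta` and `x` are my own, the coefficient values are made up, and I'm assuming the usual convention that the first input is a constant 1 so the intercept fits in the same vector.

```python
def predict(theta, x):
    """Return theta-transpose times x for one example.

    Assumes x[0] == 1.0 so that theta[0] acts as the intercept.
    """
    return sum(t * xi for t, xi in zip(theta, x))


# Made-up coefficients: intercept, price per sq ft, price change per year of age
theta = [50.0, 0.1, -2.0]
# One house: constant 1, 2000 sq ft, 10 years old
x = [1.0, 2000.0, 10.0]

price = predict(theta, x)
print(price)
```

With a whole table of houses you would stack the rows into a matrix and do the same multiply-and-sum for every row in one operation, which is where the matrix notation pays off.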

The second idea is to express the error (cost) function and the gradient descent update in matrix form as well. I’m barely holding on at this point, actually, and am looking forward to my first actual exercise.
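For my own reference, this is how I believe the matrix form looks. The symbols are my assumptions rather than anything quoted from the slides: $X$ is the matrix with one row per training example, $y$ the vector of observed values, $\theta$ the vector of coefficients, $m$ the number of examples and $\alpha$ the learning rate.

$$
J(\theta) = \frac{1}{2m}\,(X\theta - y)^{T}(X\theta - y)
$$

$$
\theta := \theta - \frac{\alpha}{m}\,X^{T}(X\theta - y)
$$

The first line is the squared error averaged over the training set; the second is one gradient descent step applied to every coefficient at once.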

Feature scaling and mean normalization are next. These are neat little tricks to help gradient descent converge faster. The idea is that if you have two features, sq footage and age of a house for example, which take on values of massively different magnitudes, applying the same step size (the alpha term) to every slope will really frig up the algorithm’s progress.

So let’s say the slope of the sq. footage term is 350 and age is 5. If you apply a 0.01 modification to adjust the algorithm, you’ll barely move the age term. If you apply 0.5, you’ll be blazing away on the sq footage.
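Here is a small sketch of the scaling trick, assuming the variant that subtracts the mean and divides by the range (dividing by the standard deviation is another common choice). The function name `mean_normalize` and the sample numbers are made up for illustration.

```python
def mean_normalize(values):
    """Shift values to zero mean and divide by their range.

    Assumes the values are not all identical (the range must be non-zero).
    """
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]


# Made-up features with very different magnitudes
sq_footage = [1000.0, 1500.0, 2000.0, 2500.0]
age = [5.0, 10.0, 30.0, 35.0]

scaled_sq_footage = mean_normalize(sq_footage)
scaled_age = mean_normalize(age)
print(scaled_sq_footage)
print(scaled_age)
```

After scaling, both features land in roughly the same -0.5 to 0.5 band, so a single alpha moves both slopes at a comparable pace.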

There’s some talk about graphing the value of the error function against the number of iterations so you can see your progress. I like visuals, so I’m on board.
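A rough sketch of how I'd collect the numbers for that graph: run gradient descent and record the error after every step. Everything here (the function names, the tiny data set, alpha = 0.5, 200 iterations) is my own assumption for illustration, and the resulting `history` list is what you would hand to whatever plotting tool you like.

```python
def predict(theta, row):
    return sum(t * xi for t, xi in zip(theta, row))


def cost(theta, X, y):
    """Half the mean squared error over the training set."""
    m = len(y)
    return sum((predict(theta, row) - yi) ** 2 for row, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent that also records the cost after each step."""
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    history = [cost(theta, X, y)]
    for _ in range(iterations):
        errors = [predict(theta, row) - yi for row, yi in zip(X, y)]
        gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(n)]
        # Update every coefficient simultaneously
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
        history.append(cost(theta, X, y))
    return theta, history


# Made-up data following y = 2 + 2x; the first column is the constant 1
X = [[1.0, -0.5], [1.0, 0.0], [1.0, 0.5]]
y = [1.0, 2.0, 3.0]

theta, history = gradient_descent(X, y, alpha=0.5, iterations=200)
print(theta)
print(history[0], history[-1])
```

If the recorded values keep dropping, alpha is working; if they bounce around or climb, alpha is probably too large.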

There’s also a neat discussion about how to use the variables supplied to build your own variables. Using length and depth to compute sq footage, for example. Also arbitrarily raising some variables to some power: price of a house being related to the sq footage and negatively related to the sq of the sq footage. This is a nice way of introducing non-linearities.
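As a quick sketch of building your own variables, assuming a hypothetical `build_features` helper and made-up lot dimensions:

```python
def build_features(length, depth):
    """Turn raw lot dimensions into a feature row for the regression.

    Returns the constant 1 (for the intercept), the area, and the area squared.
    """
    area = length * depth
    return [1.0, area, area ** 2]


features = build_features(20.0, 30.0)
print(features)
```

The regression is still linear in its coefficients; only the inputs have been bent. Note that the squared term is enormous compared to the area itself, so this is exactly where the scaling trick matters.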

We closed with a discussion of a closed-form solution for some of these problems. I finally lost my grip and will need to spend more time learning about the ‘normal’ equation, which involves transposing the matrix of training inputs, multiplying it by the original matrix, inverting the result, and then multiplying by the transposed inputs and the vector of training outputs.

I totally get that there are tradeoffs to this approach versus the iterative gradient descent solution, though. Specifically, the expensive part is inverting that matrix, which has one row and one column per feature. Once you start inverting 10,000 x 10,000 matrices, it takes quite some time. I wonder if the inversion function in Octave is just an iterative function itself?
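To help it sink in, here is a minimal pure-Python sketch of the closed-form solution, theta = (XᵀX)⁻¹ Xᵀy. One deliberate swap: instead of forming the inverse explicitly, it solves the equivalent linear system with a small Gauss-Jordan elimination. All the helper names and the data are my own, and it assumes XᵀX is invertible.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting.

    Assumes A is square and non-singular.
    """
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Made-up data following y = 1 + 2x; the first column is the constant 1
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

theta = normal_equation(X, y)
print(theta)
```

No alpha and no iterations, which is the appeal. The elimination step is where the work grows quickly as the number of features goes up.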

Back to the drawing board to deepen my understanding…