Stanford Course Notes – Not Unreasonable

Machine Learning Course Notes

Posted on October 24, 2011 by David Wright

Still at it. I am loving this course.

Today we got to logistic regression, which is a fancy term for a regression that is used to predict binary outcomes.

Binary outcomes are super-risky evaluations because while math doesn’t like discrete data, humans love it. Think about medical evaluations: you’re either ‘sick’ or ‘not sick’ in your own mind, but according to mathematized science, you have a particular combination of abnormal scores on a blood test, etc. These combine to produce a binary evaluation, “sick”, but that’s only because we need to cross a decision boundary to take action (begin treatment).

Logistic regression tackles this in a few ways. First, it lets you set where you think your decision boundary is going to be, when evaluated against a series of inputs (blood cell count, let’s say) and set an overall threshold for the evaluation. Let’s say that you assign a certain number of points to each input: 50 points per 100 red blood cells, -20 points if you work out every week, + 10 points for every cigarette you smoke. Then we say, if this person has more than 750 points, we declare them sick.
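Here is a rough sketch of the points idea in Python (not the Octave the course uses). The weights and the 750 threshold are the made-up numbers from the paragraph above, and every function name is hypothetical. Real logistic regression learns the weights from data and passes the score through a sigmoid to get a probability, so the `scale` parameter below is purely an assumption to make the illustration work.

```python
import math

# Made-up weights from the example above (points per unit of each input).
POINTS_PER_100_RED_CELLS = 50
POINTS_IF_WEEKLY_WORKOUT = -20
POINTS_PER_CIGARETTE = 10
THRESHOLD = 750  # the decision boundary: above this we say "sick"


def score(red_cells, works_out_weekly, cigarettes):
    """Add up the points for one person."""
    total = POINTS_PER_100_RED_CELLS * (red_cells / 100)
    if works_out_weekly:
        total += POINTS_IF_WEEKLY_WORKOUT
    total += POINTS_PER_CIGARETTE * cigarettes
    return total


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def probability_sick(points, scale=100.0):
    """Turn distance from the threshold into a probability.

    `scale` is an arbitrary assumption controlling how sharp the
    transition is; it does not come from the course.
    """
    return sigmoid((points - THRESHOLD) / scale)


def is_sick(points):
    """Apply the decision boundary."""
    return points > THRESHOLD
```

A score sitting exactly on the threshold maps to a probability of 0.5, which is the usual way the cut-off and the sigmoid are tied together.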

Now this point system isn’t perfect: there will be people we should have labeled sick with 300 points and people who are actually fine at 1000 points. Logistic regression gets around this by imposing a non-linear cost for being wrong. When fitting the curve (and figuring out that 750 level), the algorithm is penalized more heavily for misses at 1000 points than at 500.
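The non-linear cost is usually the log loss. A minimal sketch, assuming the model outputs a probability `p` of being sick and the true label `y` is 1 (sick) or 0 (fine); the function name is hypothetical.

```python
import math


def log_loss(p, y):
    """Cost of predicting probability p when the true label is y (0 or 1)."""
    if y == 1:
        return -math.log(p)
    return -math.log(1.0 - p)


# A confident miss costs far more than a hesitant one.
hesitant_miss = log_loss(0.4, 1)    # said 40% sick, person was sick
confident_miss = log_loss(0.01, 1)  # said 1% sick, person was sick
```

The cost grows without bound as a wrong prediction gets more confident, which is the "penalized more heavily" part.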

Error in logistic regression is ALWAYS non-zero.

Machine Learning Course Notes – Bittersweet

Posted on October 22, 2011 by David Wright

Finished this week’s exercises in a 5-hour marathon starting at 4:30am this morning. Today’s meta-lesson: implementation is way harder than reading slides and ‘kinda getting it’. My god is it hard to actually write a program that uses even what appear to be simple concepts.

So there are three tracks for this course: first is the spectator track (my term), where you just do the basic assignments (enough to be dangerous and spew plausible-sounding BS).

There’s the ‘advanced’ track, which I’ve chosen, which asks you to do some actual programming assignments (this morning’s marathon). Within the advanced track there are ‘extra credit’ assignments, which ask you to implement even more of the course material in Octave (a programming language). I haven’t gotten to the extra credit stuff. More on this later.

The final track is the ‘real’ track, where you pay real money, show up to class and all the rest. I read a discussion thread on the course website that speculates that my ‘advanced’ track covers about 40%-50% of the real course material. The real course is about 1.5x as long (3 months instead of 2), so let’s say we’re about 60%-75% of the pace of a real university course.

I’m starting to think it was a mistake to take two of these courses. I just don’t have enough time to learn everything I want to learn. I want to do the extra credit stuff, because what’s the point of reading the slides on stuff if you don’t REALLY get it? And my first crack at the extra credit stuff shows that I don’t REALLY get it.

And there are all these dudes (yes, all dudes) carpet-bombing the discussion boards who obviously REALLY get this stuff, while I only kinda get it. How many times in University did I wish I were smarter? That I wish I had really learned the background material in high school like I should have and I could have picked this up quicker?

Anyway, I’m done complaining and it’s just too time-costly for me to learn more of this right now, so I won’t. I wish it were different but that’s just too bad for me, isn’t it.

Machine Learning Course Notes

Posted on October 21, 2011 by David Wright

Not much to report or record at the moment, as all of the lectures and exercises this week have the goal only of teaching us the Octave programming language, which is an open source version of Matlab.

I’m constantly impressed by the power of expressing spreadsheets as matrices and vectors and thinking of analytical operations as expressed by linear algebra. Lots of new things to think about here, but gotta work through the programming exercises first.

Stanford Machine Learning Notes

Posted on October 18, 2011 by David Wright

Wow, I’m really loving this class. Lecture 4 slides.

[I should probably point out to regular readers that these notes aren’t really fit for mass consumption. I’m not going to bother even trying to build a complete understanding of each of the concepts so they’re really just for personal use.

That being said, they’re on here and if you’re interested in seeing what I spend just about all my spare time doing, read on!]

Ok, today we covered multi-variate regression and we’re venturing into some virgin territory for me, now. It’s pretty awesome stuff.

So we started out with the insight that we could express a multivariate regression as a transposed matrix multiplication. What a mouthful. Believe me, it’s simpler than it sounds.

The idea is that you have a set of values (slope values for the dependent variable) and a set of inputs (independent variables) and matrix multiplication just gives us a clean way of grouping them and then mashing them all together at once. This is clearly a programming optimization. If you did it by hand, it wouldn’t really be any easier.
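The "mashing together" can be sketched in a few lines of Python. The numbers and names below are hypothetical, chosen only to show the mechanics: each slope is multiplied by its input and the results are added up.

```python
def predict(theta, x):
    """Compute theta-transpose times x: multiply pairwise and add up."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical slopes: intercept, price per sq ft, price change per year of age.
theta = [50000.0, 100.0, -500.0]
# The leading 1.0 pairs with the intercept term.
house = [1.0, 2000.0, 10.0]

price = predict(theta, house)  # 50000 + 100 * 2000 - 500 * 10
```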

The second idea is to express the error function of the gradient descent algorithm as a matrix. I’m barely holding on at this point, actually, and am looking forward to my first actual exercise.

Feature and mean scaling are next. These are neat little tricks to optimize the program. The idea is that if you have two features, sq footage and age of a house for example, which take on values of massively different magnitudes, your application of a uniform transformation of the slopes (the alpha term) will really frig up the algorithm’s progress.

So let’s say the slope of the sq. footage term is 350 and age is 5. If you apply a 0.01 modification to adjust the algorithm, you’ll barely move the age term. If you apply 0.5, you’ll be blazing away on the sq footage.
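One common form of the scaling trick is mean normalization: subtract the mean and divide by the range, so every feature ends up on roughly the same scale. A minimal sketch with made-up house data; the function name is hypothetical, and dividing by the standard deviation instead of the range is an equally common choice.

```python
def mean_normalize(values):
    """Rescale so the values are centred on 0 and span a range of 1."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        # Every value is identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


# Made-up data on very different scales.
sq_footage = [1000.0, 1500.0, 2000.0, 3500.0]
age = [5.0, 10.0, 20.0, 45.0]

scaled_sq_footage = mean_normalize(sq_footage)
scaled_age = mean_normalize(age)
```

After scaling, a single alpha moves both features at a comparable pace.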

There’s some talk about graphing the error term of the error function so you can see your progress. I like visuals, so I’m on board.

There’s also a neat discussion about how to use the variables supplied to build your own variables. Using length and depth to compute sq footage, for example. Also arbitrarily raising some variables to some power: price of a house being related to the sq footage and negatively related to the sq of the sq footage. This is a nice way of introducing non-linearities.
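Building your own variables might look something like this. It is a sketch only: the function name and the choice of exactly these three features are assumptions for illustration.

```python
def build_features(length, depth):
    """Turn raw lot dimensions into [1, area, area squared]."""
    area = length * depth
    # The squared term lets a linear model bend: a negative slope on it
    # makes price rise with area but flatten out for very large houses.
    return [1.0, area, area ** 2]


features = build_features(40.0, 50.0)
```

The regression is still linear in the slopes; only the inputs have been transformed.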

We closed with a discussion of a closed-form solution for some of these problems. I finally lost my grip and will need to spend more time learning about the ‘normal equation’, which involves transposing the matrix of inputs, multiplying the transpose by the original matrix, inverting the result, and then multiplying by the transpose and the vector of training outputs.

I totally get that there are tradeoffs to this approach versus the iterative gradient descent solution, though. Specifically, the expensive part is inverting that matrix. Once you start inverting 10,000 x 10,000 matrices, it takes quite some time. I wonder if the inverse function in Octave is just an iterative function itself?
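For reference, the closed-form solution is usually written theta = (X^T X)^-1 X^T y, and the step that gets slow as the matrix grows is the inversion. Here is a minimal sketch for the one-feature case, where X^T X is only 2x2 and can be inverted by hand. The function name and data are hypothetical.

```python
def fit_line_closed_form(xs, ys):
    """Solve theta = (X^T X)^-1 X^T y for a design matrix with rows [1, x]."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # X^T X is the 2x2 matrix [[n, sum_x], [sum_x, sum_xx]].
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("X^T X is singular; need at least two distinct x values")

    # Multiply the 2x2 inverse by X^T y = [sum_y, sum_xy].
    intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
    slope = (n * sum_xy - sum_x * sum_y) / det
    return intercept, slope


# Points that lie exactly on y = 2x + 1.
intercept, slope = fit_line_closed_form([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

No learning rate and no iterations, but with thousands of features the matrix to invert is thousands by thousands.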

Back to the drawing board to deepen my understanding…

Down Memory Lane: Stanford Machine Learning Course Notes

Posted on October 14, 2011 (updated October 15, 2011) by David Wright

These courses are serving up distinct reminders of why I’ve always done poorly at school: I’m lazy and sloppy. Very lazy and very sloppy. And my god do schools punish you for these personality traits.

The DB course is teaching me about my laziness. I’ve learned to call my brand of laziness “programmer’s laziness”. I would rather spend a bajillion hours building something that prevents me from doing 5 hours of work, as long as I can satisfy two conditions:

1. I find a way of engineering the task in a way that interests me (this is easier than it sounds: lots of things interest me)
2. Nobody tells me to do it this way

Usually the ratio isn’t a bajillion : 5. Usually I save a bit of time doing it because it would probably take me longer to use the conventional method. I suppose it’s not really laziness, as in an aversion to work; rather, it’s an extreme aversion to doing things in a manner I don’t enjoy/choose.

My second problem is that I’m sloppy. This one KILLS me in math-related courses. Now, my brand of sloppiness doesn’t really manifest itself in the workplace because the one-shot-and-done testing environment doesn’t really exist in real life.

Real math and real problem solving happen in an iterative, collaborative and failure-laden environment. I normally get so excited about solving a problem that I stop concentrating on stuff. I can go back later, realize I’ve been screwing it up and crunch away harder than I possibly could on the first pass. Computers take care of the arithmetic and, presto, the product improves. This makes me a TERRIBLE test-taker.

And I’m turning out some TERRIBLE test results right now. Ick.

Stanford DB-class notes

Posted on October 13, 2011 by David Wright

Well, I’m still in the course despite my grumblings. I’m determined to not screw this up.

I have a history with classes like this. In my second year of University, I took a finance course and COMPLETELY effed it up. Like, completely.

It was a tactical error, actually. I focused on the concepts and didn’t drill the equations. I’m still pissed off about that, 10 years later (holy *#$@, TEN years?!).

Anyway, this is clearly a course that looks to teach rote learning and I want to redeem myself. So I’m drilling* Relational Algebra this morning and XML data structures this weekend. Making the 7-hour drive back up to the in-laws, so maybe I can find a way for my wife to quiz me.

Ha!

*As an aside, I forgot how much I prefer to write on the right-hand page of a spiral-bound notebook, rather than the left. It just always feels cleaner. I am right-handed, of course.

Irritated Rant – Stanford DB-Class Course Notes

Posted on October 5, 2011 by David Wright

Got an old-fashioned ass-whooping this morning on my XML quiz. I despise this kind of test, though, and have always been terrible at them.

Let’s take it from the top: XML is a standard for machines to read data and so is an excellent example of something humans are crap at. To write valid XML in one swoop on a test, for example, you need to memorize a variety of rules.

Such as: when there’s an <xs:sequence> opening tag for a subelement, the actual elements need to appear IN ORDER.

Or this one: “A valid document needs to have unique values across ID attributes. An IDREF attribute can refer to any existing ID attribute value.”
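The ID rule quoted above is exactly the kind of check a machine does well. A rough sketch using Python's standard `xml.etree.ElementTree`: in real XML the ID and IDREF types are declared in the DTD or schema, which ElementTree does not read, so the attribute names `id` and `ref` here are assumptions, as are the function name and the sample document.

```python
import xml.etree.ElementTree as ET


def check_ids(xml_text, id_attr="id", ref_attr="ref"):
    """Return a list of problems with ID uniqueness and dangling references."""
    root = ET.fromstring(xml_text)
    problems = []
    seen = set()
    # First pass: every ID value must be unique across the document.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is None:
            continue
        if value in seen:
            problems.append("duplicate ID: " + value)
        seen.add(value)
    # Second pass: every reference must point at an ID that exists.
    for elem in root.iter():
        value = elem.get(ref_attr)
        if value is not None and value not in seen:
            problems.append("reference to unknown ID: " + value)
    return problems


doc = """<library>
  <book id="b1"/>
  <book id="b1"/>
  <loan ref="b2"/>
</library>"""

problems = check_ids(doc)
```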

Who the #$%# cares? When you’re actually implementing XML, you are probably using some kind of developing environment that either makes these kinds of errors difficult to make or very easy to identify and fix quickly. Why are we teaching people to do something that COMPUTERS ARE A BAJILLION TIMES BETTER AT?!?

Why not make me take a test on grinding coffee beans or loading ink into a ballpoint pen? ‘Cause, you know, these are things that are important for an office to function as well. No? The division of labor means that we pay others to do these tasks for us? Well, howdy-effing-do.

Now, there MAY be a valid argument that goes like this: memorizing all this garbage really plants an understanding of what XML is good at in my head. Sometimes sequencing elements is really important for a database.

Bollocks. These things are tested because they’re easy to test*. Period.

I’m considering dropping this class.

*and by that I mean easy for machines to grade. Well, I don’t want to learn how to be an effing machine. That’s what I buy machines for.

-=-

Update: I got quite a lot of pushback on the discussion forum for posting something similar to this. Hard to say whether it’s my aggressive and off-putting personality or whether my views actually have no merit.

I was (implicitly) called lazy and one guy said that “somebody has to build the validation tools”.

Well, I certainly am lazy, but mostly I’m just a douchebag. Anyway, here is my response:

Unpopular sentiment, it seems. Maybe it’s just my unpleasant tone. Let me try again.

I’m not sure I understand the first reply, but I like a lot of the second. Building an XML validating tool is a much more creative and effective way of learning what is and is not valid XML than the given assignment. I’d rather spend 5 hours doing that than 30 minutes rote-memorizing tag syntax.

Is it so wrong to expect more of a university course than this?

How about testing me on these questions: 1. When should XML be used vs some other standard? (I think this is what the first response is getting at). 2. What are the limit cases for XML use and why might it break down? 3. What are some examples of instances when XML was used and it failed, or was successful?

I remain disappointed. Am I really so alone on this?

What Goes With Oatmeal? Today It’s Linear Algebra

Posted on October 4, 2011 by David Wright

Linear Algebra… doesn’t it sound so impressive?

When I was in my last year of high school, we had three options for math courses: calculus, statistics (called finite for some reason) and linear algebra. Honest to god, I skipped LA because it sounded so daunting (and wasn’t a strict prerequisite for any university programs I applied to).

So often intimidating jargon masks very simple procedures and concepts.

Well, I’m learning LA over breakfast today because matrix multiplication is the fastest way of comparing linear regression functions’ effectiveness (that’s what we’re hinting at, anyway). Matrix multiplication is actually so simple I’m not even going to bother with notes.
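For the record, here is a sketch of the comparison trick in plain Python: put one house per row and one candidate regression line per column, and a single matrix multiplication gives every prediction for every candidate. All numbers and names are made up for illustration.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x p), both lists of rows."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# One row per house: a 1 for the intercept, then the size in sq ft.
houses = [[1.0, 1000.0], [1.0, 2000.0]]
# One column per candidate line: intercepts on the first row, slopes on the second.
candidates = [[0.0, 50000.0], [100.0, 75.0]]

# predictions[i][j] is candidate j's price for house i.
predictions = matmul(houses, candidates)
```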

What’s interesting to me is why it’s useful in this context. Quite simply, it’s useful because somebody (somebodies) spent a bunch of time building super-fast matrix multiplication functionality in every imaginable programming language.

Now, I don’t know why people have designed super-optimal implementations of matrix multiplication, but it’s a pretty awesome public good. Did they do this before Machine Learning made selecting from among various linear regression algorithms a problem to solve?

Realistically, it was probably a bunch of kids looking to do an awesome PhD dissertation: why not build a super-optimized matrix multiplication library?

Learning by solving problems. That’s what it’s all about. Hat Tip to Alan Kay.

Machine Learning Notes

Posted on October 2, 2011 by David Wright

Another machine learning class.

We wrapped up linear regression by discussing work with error functions. I hadn’t really focused on this before, even though I’ve built some crude ‘by-hand’ error-minimizing functions at work.

The error function is the sum of the squared values of the error terms in a regression. The squaring part of this is important because it really penalizes big individual deviations.

For simple linear regression, this is pretty easy because there are only going to be two function terms (x and y). When you start packing in the variables (or exponents of variables), minimizing this error function gets tough.

We then take this error function and multiply it by 1/(2 * the sample size) for some reason. This some reason is to be discussed later.
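Written out as code, the error function for a simple line looks roughly like this (a sketch; the names are hypothetical). The usual explanation for the 1/(2 * sample size) is that the 1/m turns the sum into an average and the 1/2 cancels the 2 that appears when you differentiate the square.

```python
def cost(theta0, theta1, xs, ys):
    """Sum of squared errors for the line theta0 + theta1 * x, times 1/(2m)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y
        total += error ** 2
    return total / (2 * m)


# A perfect fit costs nothing; a bad one is punished by the squaring.
perfect = cost(1.0, 2.0, [1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
bad = cost(0.0, 0.0, [1.0, 2.0], [1.0, 3.0])  # (1 + 9) / 4
```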

That’s where the gradient descent algorithm comes in, whose purpose is to find a local minimum in the error function. It does this in a clever way: by just subtracting the value of the slope from each of the parameter terms. Eventually, you get smaller and smaller slopes and wind up at zero slope. Cool!

This is dampened (or amplified, I suppose) by another term to take bigger or smaller steps ‘down the slope’. This dampening term is called the ‘learning rate’. I find that terminology a bit confusing.

One key is to make sure you apply the slope to each term simultaneously, remembering that the slope in each ‘direction’ is going to be a bit different. If you visualize the whole process as a three-dimensional park with hills and valleys (I can’t find the picture he used in the notes) it’s easier to see that the slope with respect to each variable will be a bit different. So the equations for calculating that slope will be a bit different.
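A minimal sketch of the whole loop for a simple line, assuming the squared-error cost with the 1/(2m) factor. The function name, default learning rate and step count are arbitrary choices for this toy data, not values from the course. Note that both slopes are computed from the current parameters before either parameter is changed, which is the simultaneous update.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    """Fit theta0 + theta1 * x by repeatedly stepping down the slope."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Slope of the cost in each direction, both computed from the
        # current parameters before either one is changed.
        slope0 = sum(errors) / m
        slope1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: step against the slope, scaled by alpha.
        theta0 = theta0 - alpha * slope0
        theta1 = theta1 - alpha * slope1
    return theta0, theta1


# Points that lie exactly on y = 2x + 1.
theta0, theta1 = gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

With a learning rate that is too large the steps overshoot and the parameters blow up instead of settling, so alpha usually needs some trial and error.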

And there’s finally a discussion about resource use. We’re told that gradient descent is faster than the closed-form linear algebra solution at large dataset sizes.

That’s interesting and I wonder how sensitive such a conclusion is to language type. One thing I’ve learned is that speed of calculation is actually a multi-dimensional problem. Dataset size is one issue, of course, but so is language compilation/interpretation time. If you implemented the closed-form solution in C but did the gradient descent in some slower language like VBA or something, is the answer so clear? Maybe we could derive a cost function for compile times and figure it out!

Database Course Notes and Observations

Posted on October 1, 2011 by David Wright

Discussion on XML:

There are two ways of organizing an XML document: using something called a DTD and something called a Schema. One is somewhat more strict than the other (Schema) and one is easier to deal with (DTD).

The point to me is that there is SOME way of automatically validating a document. The most common ‘dirty data’ problems I’ve come across are with databases that are not validated properly against some kind of template. In the world of validation, I separate problems into human-type errors and computer-type errors.

Human-type errors are things like miskeyed addresses, duplicating entries and writing in a premium value for the policy limit, for example. A computer-type error would be something like repeating the column heading as a value in every single field.

The point is that Computer-type errors are really easy for humans to spot and Human-type errors are easy for computers to spot. That’s why automatic validation is so powerful: if you can combine a computer validation process with a human dataset, you’re probably going to get something that works. And vice versa.

From a user-experience standpoint, a good validation process can be maddening (I have to CAPITALIZE the first letter every single time?!) because most of it is perceived as fiddly formatting busywork. Luckily, increases in computing power have ushered in the dawn of the autofill era and this bug has become a feature (we’ll help you type!).

Anyway, I’m supposed to learn that a DTD is a bit messier because the IDs aren’t typed and there is no control for sets of keys, blah blah blah, but these problems aren’t there when using a Schema/XSD. The XSD document can be daunting so my ‘homework’ is to download a schema for XML and play around with it.

How boring. I feel like I’m in “school” now, a feeling I loathe.

Speaking of which, relational algebra is a ridiculous topic. I don’t care if it’s the “underpinnings” of query languages. I hardly needed to learn C to build Python scripts, even though C is its “underpinnings”. If I had some deep problem with SQL I’d be happy to mess around with relational algebra to figure it out. But until then, keep me in the dark, please.