What Goes With Oatmeal? Today It’s Linear Algebra

Linear Algebra… doesn’t it sound so impressive?

When I was in my last year of high school, we had three options for math courses: calculus, statistics (called finite for some reason) and linear algebra. Honest to god, I skipped LA because it sounded so daunting (and wasn’t a strict prerequisite for any university programs I applied to).

So often intimidating jargon masks very simple procedures and concepts.

Well, I’m learning LA over breakfast today because matrix multiplication is the fastest way of comparing linear regression functions’ effectiveness (that’s what we’re hinting at, anyway). Matrix multiplication is actually so simple I’m not even going to bother with notes.

What’s interesting to me is why it’s useful in this context. Quite simply, it’s useful because somebody (somebodies) spent a bunch of time building super-fast matrix multiplication functionality in every imaginable programming language.
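Matrix multiplication really does fit in one line. Here’s a minimal NumPy sketch of why it’s handy in this context: stack candidate parameter vectors as columns, and one multiply evaluates several regression functions on every sample at once (the names X and Theta and all the numbers are my own, purely illustrative):

```python
import numpy as np

# Hypothetical design matrix: 4 samples, with a column of ones for the
# intercept. Names and numbers are mine, not from any course material.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0]])

# Two candidate parameter vectors, stacked as columns. One matrix
# multiply evaluates both regression functions on every sample at once.
Theta = np.array([[0.5, 1.0],
                  [2.0, 1.5]])

predictions = X @ Theta   # shape (4, 2): one column per candidate
print(predictions)
```

Comparing ten thousand candidates is the same one line, which is where those super-fast libraries earn their keep.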

Now, I don’t know why people have designed super-optimal implementations of matrix multiplication, but it’s a pretty awesome public good. Did they do this before Machine Learning made selecting from among various linear regression algorithms a problem to solve?

Realistically, it was probably a bunch of kids looking to do an awesome PhD dissertation: why not build a super-optimized matrix multiplication library?

Learning by solving problems. That’s what it’s all about. Hat Tip to Alan Kay.

A Teaching Moment

To my everlasting surprise, somebody made it far enough through some of my course notes to understand what on earth I was going on about.

I was forwarded a link to a real-life implementation of XML. Actual examples are always nice for thinking through the implications of the theory.

But be forewarned, ye hapless Web denizens, this is a discussion not fit for all. Formatting reports for transferring retirement-related employee data among federal agencies. Has quite the ring to it, non?

Here’s the question: why and how do people use these tools?

The purpose of all this nonsense is to get machine readable data into the mothership system. Surely they’re choking on the FedEx bills and warehouses of paper files. It’s the friggen 21st century after all.

XML does give you machine readable data. And it has this other benefit: it doesn’t really matter how you create it. Each government agency could format a report out of a sophisticated relational database or pay a legion of underemployed construction workers to handcode a text file. Either works as long as the format checks out.
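That point about it not mattering how the XML gets made can be shown in a few lines (the employee tags here are hypothetical, not from any real federal format): a handcoded string and a programmatically built document come out byte-for-byte identical.

```python
import xml.etree.ElementTree as ET

# Route 1: a 'handcoded' text file, typed by our legion of
# underemployed construction workers. Tag names are hypothetical.
handcoded = '<employee><id>42</id><name>Jane Doe</name></employee>'

# Route 2: the same record generated from a structured system.
root = ET.Element("employee")
ET.SubElement(root, "id").text = "42"
ET.SubElement(root, "name").text = "Jane Doe"
generated = ET.tostring(root, encoding="unicode")

# The consumer can't tell the difference: only the format matters.
print(handcoded == generated)
```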

So XML just plugs into your existing system (even if it’s a system of handwritten forms and carbon copies). Database systems are not quite so forgiving. You need a “new system”, in the most horrible, time/cost draining meaning of the term.

In this case, I’d speculate that the XML format is considered an early first step. It’s hardly feasible to lay the redundant paper-form jockeys off any time soon. Unions will make sure of that. But having a continuous corporate structure holds you back, too.

In more lightly-regulated process-heavy industries, most companies were either acquired or driven out of business before the haggard survivors finally completed their metamorphosis, which is actually never really complete. Google ‘COBOL programming language’ for a taste of the eternal duel against legacy software. And paper files?! Machines barely even read that crap. Try finding (with your computer!) any reliable data collected before 2000 (i.e. the dawn of machine history). Oh, you found some? Well, hide your grandkids, ’cause that shit was INPUTTED BY HAND!

Anyway, back to Uncle Sam’s pension files. The endgame is obvious: direct API links between the central system and every payroll/HR system in each office. This eliminates costs (jobs) and will improve accuracy. Good stuff.

Until then we’re still building XML files and presumably emailing them around. I can hardly be critical here as I’ve only just started to see the emergence of API links between insurers and reinsurers. No XML schemas, though, because they’re using a type-controlled relational database. Fancy way of saying they keep the data clean at the entry point: pretty hard to soil those databases. As it should be.

To my novice eye the system impresses. Flicking through the documentation suggests they might want to cool off on the initialisms and structured prose as it reads a bit like an engineering manual from the 60s. But engineers they probably are (and targeting an engineering audience to boot), so I’m probably being unfair.

Bless ’em.

The Allure is Near… The Pain Is Far

Imagine everyone you knew was telling you they wanted you to be the most powerful person in the world. And everyone you knew was certain you’d be successful in your bid. Even the weakest ego, so supercharged, would eventually overpower better judgment and the strongest of iron wills.

Enter Chris Christie, who for months has, probably wisely, said that running for president is not for him. It’s probably the most insanely stressful thing a person could ever undertake; probably worse if you win. And if you have any skeletons in that closet…

And yet: Chris Christie Performing “Due Diligence” On 2012 Run

Well, that phrase wins the prize for ridiculous euphemism of the week. Translation: “I’m raising money”. He’s so in. There’s no way the yes-men and money-men and other boosters don’t get their way now.

I shed a tear for my friends in New Jersey. By all accounts he’s a good governor and he’s lost to them now. This campaign will either destroy him or forever thrust him onto the national stage, win or lose.

Machine Learning Notes

Another machine learning class.

We wrapped up linear regression by discussing error functions. I hadn’t really focused on these before, even though I’ve built some crude ‘by-hand’ error-minimizing functions at work.

The error functions are the sum of the squared values of the error terms in a regression. The squaring part of this is important because it really penalizes big individual deviations.

For simple linear regression, this is pretty easy because there are only two parameters to fit (an intercept and a slope). When you start packing in the variables (or exponents of variables), minimizing this error function gets tough.

We then take this error function and multiply it by 1/(2 × the sample size), for reasons to be discussed later.
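Putting the last few paragraphs together, the error (cost) function looks like this in a quick sketch (variable names and toy numbers are mine, not from the course):

```python
import numpy as np

def cost(theta, X, y):
    """Sum of squared errors, scaled by 1/(2m) as described above.
    theta: parameters; X: design matrix (ones column first); y: targets."""
    m = len(y)
    errors = X @ theta - y            # one error term per sample
    return (errors ** 2).sum() / (2 * m)

# Toy data (mine): y = 2x fits exactly, so the first cost is zero.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(cost(np.array([0.0, 2.0]), X, y))   # perfect fit: cost 0
print(cost(np.array([0.0, 1.0]), X, y))   # worse fit: higher cost
```

Note how the squaring does the penalizing: the third sample’s error of −3 contributes 9 to the sum, triple its raw size.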

That’s where the gradient descent algorithm comes in, whose purpose is to find a local minimum of the error function. It does this in a clever way: by subtracting the value of the slope from each of the parameter terms. Eventually, you get smaller and smaller slopes and wind up at zero slope. Cool!

This is dampened (or amplified, I suppose) by another term to take bigger or smaller steps ‘down the slope’. This dampening term is called the ‘learning rate’. I find that terminology a bit confusing.

One key is to make sure you apply the slope to each term simultaneously, remembering that the slope in each ‘direction’ is going to be a bit different. If you visualize the whole process as a three-dimensional park with hills and valleys (I can’t find the picture he used in the notes) it’s easier to see that the slope with respect to each variable will be a bit different. So the equations for calculating that slope will be a bit different too.
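A minimal sketch of the whole procedure, assuming the standard batch gradient descent update (function names and toy data are my own):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, steps=1000):
    """Batch gradient descent for linear regression.
    alpha is the 'learning rate' from the notes; names are mine."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(steps):
        # The slope in each 'direction', computed from the current theta...
        gradient = X.T @ (X @ theta - y) / m
        # ...then applied to every parameter term simultaneously.
        theta = theta - alpha * gradient
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])        # generated from y = 1 + 2x
print(gradient_descent(X, y))        # walks toward [1, 2]
```

The simultaneity matters because the gradient is computed once from the current theta; updating one parameter first and then recomputing would be a different (and wrong) algorithm.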

And there’s finally a discussion about resource use. We’re told that gradient descent is faster than the closed-form linear algebra solution at large dataset sizes.

That’s interesting and I wonder how sensitive such a conclusion is to language type. One thing I’ve learned is that speed of calculation is actually a multi-dimensional problem. Dataset size is one issue, of course, but so is language compilation/interpretation time. If you implemented the closed-form solution in C but did the gradient descent in some slower language like VBA or something, is the answer so clear? Maybe we could derive a cost function for compile times and figure it out!
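For comparison, the closed-form linear-algebra solution mentioned above is (I assume) the normal equation, which sketches as follows (toy data mine):

```python
import numpy as np

# Normal equation: theta = (X^T X)^{-1} X^T y. One shot, no learning
# rate to tune -- but solving that system scales roughly cubically in
# the number of features, which is the usual argument for why gradient
# descent wins on very large problems.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])        # generated from y = 1 + 2x

theta = np.linalg.solve(X.T @ X, X.T @ y)   # solve, rather than invert
print(theta)                                 # close to [1, 2]
```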

Database Course Notes and Observations

Discussion on XML:

There are two ways of organizing an XML document: using something called a DTD and something called a Schema. One is somewhat more strict than the other (Schema) and one is easier to deal with (DTD).

The point to me is that there is SOME way of automatically validating a document. The most common ‘dirty data’ problems I’ve come across are with databases that are not validated properly against some kind of template. In the world of validation, I separate problems into human-type errors and computer-type errors.

Human-type errors are things like miskeyed addresses, duplicating entries and writing in a premium value for the policy limit, for example. A computer-type error would be something like repeating the column heading as a value in every single field.

The point is that Computer-type errors are really easy for humans to spot and Human-type errors are easy for computers to spot. That’s why automatic validation is so powerful: if you can combine a computer validation process with a human dataset, you’re probably going to get something that works. And vice versa.
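That human/computer split can be sketched with a toy validator. The field names and rules below are hypothetical, not from any real schema, but they show a computer catching exactly the human-type slips described above:

```python
import re

def validate(records):
    """Flag human-type errors: duplicates, miskeyed fields, and a
    premium written in where the policy limit should be."""
    errors = []
    seen_ids = set()
    for i, rec in enumerate(records):
        if rec["policy_id"] in seen_ids:
            errors.append((i, "duplicate policy_id"))
        seen_ids.add(rec["policy_id"])
        if not re.fullmatch(r"\d{5}", rec["zip"]):
            errors.append((i, "malformed zip"))
        if rec["premium"] > rec["limit"]:
            errors.append((i, "premium exceeds limit -- swapped fields?"))
    return errors

# Hypothetical data: the second record has all three slips at once.
records = [
    {"policy_id": "A1", "zip": "10001", "premium": 500.0, "limit": 100000.0},
    {"policy_id": "A1", "zip": "1000", "premium": 100000.0, "limit": 500.0},
]
print(validate(records))
```

A human eyeballing those rows would miss the duplicate ID for hours; the machine flags it instantly. And the reverse check (a human spotting a column heading repeated in every field) is just as cheap.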

From a user-experience standpoint, a good validation process can be maddening (I have to CAPITALIZE the first letter every single time?!) because most of it is perceived as fiddly formatting busywork. Luckily, increases in computing power have ushered in the dawn of the autofill era and this bug has become a feature (we’ll help you type!).

Anyway, I’m supposed to learn that a DTD is a bit messier because the IDs aren’t typed and there is no control for sets of keys, blah blah blah, but these problems aren’t there when using a Schema/XSD. The XSD document can be daunting, so my ‘homework’ is to download a schema for XML and play around with it.

How boring. I feel like I’m in “school” now, a feeling I loathe.

Speaking of which, relational algebra is a ridiculous topic. I don’t care if it’s the “underpinnings” of query languages. I hardly needed to learn C to build Python scripts, even though C is its “underpinnings”. If I had some deep problem with SQL I’d be happy to mess around with relational algebra to figure it out. But until then, keep me in the dark, please.

When a Business Model Dies

Spare a thought for poor Kodak, a company surely in the throes of death. Here is the key paragraph:

Intellectual property licensing and lawsuits have largely funded Kodak’s cash needs but stalled earlier this year, prompting Kodak to decide to sell 1,100 of its digital patents.

Sell your brain and what are you supposed to think with?

What do they do at Kodak? Is it just a bunch of lawyers trolling around for people to sue? This is a company that is at least two paradigms in the dust.

“First there were Film Cameras and all was good. Then there were digital cameras and we got scared but at least that made sense. But phone cameras? How are we supposed to compete with THAT?!”

No normal family is going to spend anything like the kind of money people used to spend on cameras of any kind.

Kodak, Yahoo!, HP, the list goes on.

It’s sad when familiar brands die, sure. But die they must.

More On Fire

Horace Dediu has an interesting and long discussion on the Kindle. Ultimately he shares a view I notice I didn’t put in my notes, but with which I completely agree:

Fire will not have the opportunity to disrupt the iPad or tablets in general. Amazon sees the hardware and software of a device as a commodity and the content and its distribution as valuable. This assumes that the device is “good enough” and will not require deep re-architecting or that new input methods can be easily absorbed. In short, they see the tablet as at the end of its evolutionary path. Apple sees the exact opposite.

Amazon On Fire

Kindle fire, eh. Pretty neat.

A few observations:

  1. As many others have noted in many ways: 10 years ago how many leaps of faith would it take to name the three biggest innovators in tablets/smartphones: Apple, Amazon and Google? Shit like that don’t happen in Soviet Russia.
  2. It’s also interesting that each of these players has its fingers in lots of different pots. Any one-trick ponies left in mainstream consumer electronics? They call this convergence, I hear.
  3. Stripping the luxury features and going cheap isn’t exactly a hard strategy to figure out. Executing it is tough, though, and Amazon’s new browser appears to be the cog that makes it work: speed without speed, storage without storage.
  4. But it’s not like that’s a new idea, either. Web developers make very careful decisions about what content gets rendered and processed locally and remotely and this balance changes with advances in technology. There have always been workarounds for slower connections.
  5. Still, Amazon crushes it with its ready-to-go access to all those resources in its cloud. Apple will get there, too, eventually. See #2.
  6. And it’s cheap. Cheap is good. Never underestimate cheap.