What I Learned Today (SQLite vs MySQL)

It’s a law of programming that early optimization is a very bad idea. Why fill your kitchen with gadgets when you don’t know how to cook?

On the other hand, what if I’m using an oxy-acetylene torch to cut onions? Shouldn’t I go to the market to see if knives are for sale?

So I’ve looked into replacing SQLite with MySQL. Maybe it’s faster, maybe it’s better; all the web programming books teach PHP and MySQL, for chrissakes, and I’m using neither. So today I tried downloading MySQL and learned a couple of things:

1. Wow, was that installation hard work. This means ANOTHER interplanetary programming detour and I’m sick of detours. I want to get to work.

2. MySQL’s Python-for-Windows implementation REALLY sucks.

Discouraged, I thought I’d read around a bit. At stackoverflow I found a great post that led to an even better podcast with the inventor of SQLite.

The upshot appears to be this: if you don’t need lots of simultaneous WRITING to your database, SQLite is probably better. Another heuristic: if your website fits on one computer (!?), SQLite is probably better. Yet another: if you get fewer than 100,000 hits a day (!!!), SQLite is the winner.

The big downside to SQLite is that it only allows one user to write data into it at a time: the database “locks”. The big upside? Just about everything else, I think.
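The lock is easy to see from Python’s built-in sqlite3 module. This is just a toy demonstration (the file and table are made up): one connection takes the write lock, and a second writer is refused until the first commits.

```python
import os
import sqlite3
import tempfile

# Toy demonstration of SQLite's single-writer lock (invented file and table).
path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None lets us issue BEGIN/COMMIT ourselves;
# timeout=0 makes a blocked writer fail immediately instead of waiting.
a = sqlite3.connect(path, timeout=0, isolation_level=None)
b = sqlite3.connect(path, timeout=0, isolation_level=None)

a.execute("CREATE TABLE hits (n INTEGER)")

a.execute("BEGIN IMMEDIATE")              # connection a takes the write lock
a.execute("INSERT INTO hits VALUES (1)")

try:
    b.execute("BEGIN IMMEDIATE")          # connection b tries to write too...
    locked = False
except sqlite3.OperationalError:          # ..."database is locked"
    locked = True

a.execute("COMMIT")                       # a releases the lock
b.execute("INSERT INTO hits VALUES (2)")  # now b can write
```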

Anyway, that little bit of research makes this whole question absolutely ridiculous for my purposes, so I’m sticking with SQLite.

Programming Detour: Backups

The weekend project has me downloading a bunch of weather forecast data and working on some processes for analyzing it. I’ve been trying to never delete any raw data, because I never know when I’ll need to redo something I screwed up.

This habit has saved my can a few times.

Somewhat recently, I learned SQLite so I could stop storing everything on the hard drive in flat files, which is probably ridiculously inefficient for the computer and is definitely a pain in the ass to monitor.

But here’s the thing with SQL: it’s a tricky and brutally unforgiving language. You can ACTUALLY delete things by mistake and never see them again. And last week that’s what happened to ALL of my backup data. I still don’t know what on earth I did.
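I still don’t know which statement did it, but the classic version of this mistake is a DELETE with the WHERE clause forgotten. An invented illustration, not what I actually ran:

```python
import sqlite3

# Invented example of the classic SQL footgun: DELETE with no WHERE clause
# wipes the whole table -- no confirmation, no undo.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE backups (day TEXT, payload TEXT)")
db.executemany("INSERT INTO backups VALUES (?, ?)",
               [("mon", "a"), ("tue", "b"), ("wed", "c")])

# Meant to type: DELETE FROM backups WHERE day = 'mon'
db.execute("DELETE FROM backups")
remaining = db.execute("SELECT COUNT(*) FROM backups").fetchone()[0]
```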

Luckily it was only the backup data (I still had the main database), and even more luckily I’d backed up the backup a week earlier when I left town on a trip.

In the end, I lost a few days’ worth of backup info forever. The upshot is that I quickly moved backup management to the top of the to-do list. The thought of losing all this work gives me the willies. I’d probably have a nervous breakdown.

So, here’s my ‘backup plan’:

I recently bought this nifty router that can host an external hard drive on an ftp server. I just plug it in and presto, it’s accessible online from anywhere. Perfect.

But network programming is finicky. Connections drop for reasons I cannot fathom, which means the old drag-and-drop method fails whenever the interweb gods frown upon me.

So I spent all day today figuring out a way to cycle through my directory, compare every file to the existing backups on the ftp site, and upload the new ones. If the process gets interrupted, it just waits until the connection comes back. No restarts.

Cycling through the installation-related junk slows me down, but brute-force backups are good for peace of mind. I know it’s all going to be there.
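In case the shape of that routine is interesting, here’s a stripped-down sketch using Python’s standard ftplib. The host, credentials, and directory names are placeholders, and real code wants a retry loop around the whole thing to handle those dropped connections:

```python
import ftplib
import os

def files_to_upload(local_names, remote_names):
    """The comparison step: local files that aren't on the backup yet."""
    return sorted(set(local_names) - set(remote_names))

def backup(local_dir, host, user, password, remote_dir="backup"):
    """One pass of the backup; safe to call again after any interruption."""
    ftp = ftplib.FTP(host)
    ftp.login(user, password)
    ftp.cwd(remote_dir)
    remote = ftp.nlst()                       # names already backed up
    for name in files_to_upload(os.listdir(local_dir), remote):
        with open(os.path.join(local_dir, name), "rb") as f:
            ftp.storbinary("STOR " + name, f)
    ftp.quit()
```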

This is the latest in a long series of programming detours I’ve had to make. It’s amazing how little time I spend doing the real analysis that is the ENTIRE POINT of this project. If I knew how hard this whole thing was going to be, I’d have been a bit less cavalier in committing myself to it.

Two months and I’m not even remotely close to a working prototype.

My target is the last weekend in July, now. We shall see.

Today’s Programming Idea

I’m a gigantic fan of several blogs.

And these blogs have big archives that, realistically, I’ll never go through, even though I know (and I do know) they have lots of cool posts I would find interesting.

So here’s my idea: why not build a little tool that lets you specify a few blogs and, at the push of a button, summons an archived blog post from one at random? I figure there are three possible outcomes:

1. You find a post from the archives you never read.

2. You find a post from the archives you read and remember.

3. You find a post from the archives you read but don’t remember.

Now, I know I like these blogs so I’m likely to enjoy the posts right away. The cool part, though, is that I’m also likely to feel little remorse at refreshing away a post I’m not too keen on and getting another.

I like this idea.

From an implementation standpoint, I don’t much feel like building and maintaining an archive of all the blogs I like. So my solution is to do this by date. Here’s the method:

Pick a date to start: let’s say Jan 1, 2000.

Pick a random day between today and that date.

Test whether there is a post from that blog on that day. If not, pick another random day. If there is more than one, index the posts from that day and pick one of THEM at random.

This will bias towards older dates, sure, but that’s the idea.
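The method above sketches out pretty directly in Python. Here `posts_on` is a stand-in for whatever actually checks a blog’s archive for a given day (that part depends on the blog’s archive URLs):

```python
import datetime
import random

def random_day(start, end=None):
    """A uniformly random date between start and end (today by default)."""
    end = end or datetime.date.today()
    span = (end - start).days
    return start + datetime.timedelta(days=random.randint(0, span))

def random_post(posts_on, start=datetime.date(2000, 1, 1), tries=1000):
    """Keep drawing random days until one has a post.

    posts_on(day) -> list of posts published that day (possibly empty).
    """
    for _ in range(tries):
        posts = posts_on(random_day(start))
        if posts:
            return random.choice(posts)  # more than one that day: pick one
    return None
```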

I bet that I would enjoy flicking through a bunch of old posts even more than I would enjoy reading the ‘cutting edge’ posts from my favorite bloggers. Current events don’t tend to interest me much; it’s the thought processes and insights that I love.

So I’m going to build this.

How to Improve

This caught my eye this morning: “10 ways to improve your programming skills”

Since I’m learning how to program (has it been two months!?), and I want to get better at it, I should try to follow some of this advice. Here’s the list:

1. Learn a new programming language

Um. It’s all new!

2. Read a good, challenging programming book

Ok, bit advanced.

3. Join an open source project

Yeah, right.

4. Solve programming puzzles

Possible candidate here. Sounds like a lot of work, though.

5. Program

Got enough of that to do!

6. Read and study code

Ugh.. no time.

7. Hang out at programming sites and read blogs

No “hanging out”, but I read.

8. Write about coding

Hmm….

9. Learn low-level programming

Nope.

10. Don’t rush to StackOverflow. Think!

Meh.

-=-=-=-=-

So maybe writing about programming is the low-hanging fruit here.

Ok, so here’s what I did today at work.

There’s this company called AM Best, an insurance-specialist rating agency: like S&P but much narrower in focus. I go there periodically for financial information on our clients and markets and for the industry in general.

Anyway, I noticed that there is a press release archive going back to 2000. The thought struck me that it would be neat to have a database of all these press releases to crunch and see if there are any patterns in the rating actions taken on companies.

THEN it would be neat to link these rating actions to stock prices, to see whether the ratings actually, um, you know, work.

For instance: how good a predictor are they of default? Is there an immutable ‘snowball effect’ where a rated entity just keeps getting downgraded until it fails or merges with someone else?

So this project has been bubbling around in my head for a few weeks and this morning I finally had enough spare time in which to implement it.

I’ve put together a scraping routine (busily ‘scraping’ as I type) that is pulling down all 10,000+ press releases and dropping them into a database.

I considered doing all the actual data mining today, too, but that’s going to take a bit too much time. I’m happy with just sitting on the data for now.
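The routine itself is nothing fancy. A simplified sketch of its shape (the URL list is a placeholder; AM Best’s real archive pages obviously need their own parsing to produce it):

```python
import sqlite3
import urllib.request

def init_db(path):
    """One table of raw press releases, keyed by URL so reruns are safe."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS releases
                  (url TEXT PRIMARY KEY, raw TEXT)""")
    return db

def store(db, url, raw):
    # INSERT OR IGNORE makes the scrape restartable: seen URLs are skipped.
    db.execute("INSERT OR IGNORE INTO releases VALUES (?, ?)", (url, raw))
    db.commit()

def scrape(db, urls):
    """Fetch each press-release page and drop the raw HTML into the database."""
    for url in urls:
        raw = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        store(db, url, raw)
```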

My next objectives:

1. Parse the text to figure out what the various categories of press releases are.

I know there are downgrades and upgrades of companies, but what about actions against subsidiaries only? What about debt ratings? Most of this crap is useless to me.

2. Figure out a system for identifying companies that matter.

There are going to be a ton of mergers, defaults, spin-offs and goodness knows what else going on that I’ll need to work out. That will be tricky.

3. Isolate the rating actions associated with corporates and build a more ordered database of actions over time.

This is the ‘real work’, obviously. How ironic that it’s going to be by far the easiest step once everything’s organized. Regular expressions, baby. Cinch.

4. Figure out how money has been made or lost in this process.

I want to link these names to stock symbols and see if there is any perceived contagion (by the market) and, even more importantly, whether there is any ACTUAL contagion. I suspect not.