The Time Smear

Soon after the advent of ticking clocks, scientists observed that the time told by them (and now, much more accurate clocks), and the time told by the Earth’s position were rarely exactly the same. It turns out that being on a revolving imperfect sphere floating in space, being reshaped by earthquakes and volcanic eruptions, and being dragged around by gravitational forces makes your rotation somewhat irregular. Who knew?

That’s from the official Google blog.

Now that’s all well and good, but eventually we started building systems that all need to talk to each other and agree on the time. If our arbitrary system of timekeeping can’t exactly match the (changing) benchmark of the Earth’s position in space, what are we to do? In other words:

Very large-scale distributed systems, like ours, demand that time be well-synchronized and expect that time always moves forwards. Computers traditionally accommodate leap seconds by setting their clock backwards by one second at the very end of the day. But this “repeated” second can be a problem. For example, what happens to write operations that happen during that second? Does email that comes in during that second get stored correctly?

Well, Google does something called a Time Smear:

The solution we came up with came to be known as the “leap smear.” We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day. All of our servers were then able to continue as normal with the new year, blissfully unaware that a leap second had just occurred.

Cool! More discussion on the topic from HN here.
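
Just to make the mechanics concrete, here is a toy Python sketch of a linear smear. This is my own back-of-the-envelope version, not Google’s actual code, and the leap date and 24-hour window are made up for illustration:

    # A toy linear leap smear: my own sketch, not Google's implementation.
    # Instead of stepping the clock back a full second at midnight, the extra
    # second is absorbed gradually over a window before the leap, so the
    # smeared clock never runs backwards.
    from datetime import datetime, timedelta, timezone

    LEAP_AT = datetime(2012, 7, 1, tzinfo=timezone.utc)  # hypothetical leap boundary
    WINDOW = timedelta(hours=24)                         # made-up smear window

    def smear_seconds(now):
        """Fraction of the extra second already absorbed at time `now`."""
        start = LEAP_AT - WINDOW
        if now <= start:
            return 0.0
        if now >= LEAP_AT:
            return 1.0
        return (now - start) / WINDOW  # grows linearly from 0.0 to 1.0

    def smeared_time(now):
        # The smeared clock runs slightly slow all day; by midnight it already
        # agrees with post-leap UTC and no backwards step is needed.
        return now - timedelta(seconds=smear_seconds(now))

    for hours_before in (24, 12, 1, 0):
        t = LEAP_AT - timedelta(hours=hours_before)
        print(t.isoformat(), "->", smeared_time(t).isoformat())

The whole trick, per the quote above, is that each individual adjustment is tiny and time only ever moves forward.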

Got Crap To Do?

Another common hack I use is to hire people on oDesk and other freelancing sites for various tasks. There is no programmatic way to shift my Amazon Wishlist to my local library. I could spend hours figuring it out, do it manually, or just pay someone $1 in the Philippines to do it for me.

That’s from Steve Coast’s setup. The Setup is one of my favorite blogs.

Helluva interesting idea. Never occurred to me to get cheapo programmers to automate personal gruntwork (here’s his story about it). I have a fairly large backlog of ideas here. I shall investigate.

Michael Bloomberg, Javascript Jockey

Here’s Jeff Atwood with some, perhaps needed, pushback on the whole “everyone should learn to code” thing. The final straw for him was Mayor Bloomberg’s recent tweet:

Jeff asks: if everyone needs to code, how would coding make the Mayor better at his job?

Most jobs don’t need coding today; that’s a fact. But here are some other arguments:

  • Substantially all productivity improvements in most industries are coming from coding.
  • There are people out there who would be excellent programmers today if they were exposed to programming at a young enough age.
  • The world needs more programmers solving programming problems.

My love affair with coding as a macro phenomenon isn’t about supporting today’s patterns of production; it’s about supporting the rate of change of those patterns.

Michael Bloomberg, who owns a software company fercrissakes, should support this movement.

How R Is Used

I don’t use R regularly, though I’m somewhat familiar with it. My work is 90% Excel (the lingua franca of my world) and 10% Python, which I just plain like.

Yet here is a paper evaluating R’s design and how it is *actually* used. Neat.

We assembled a body of over 3.9 million lines of R code. This corpus is intended to be representative of real-world R usage, but also to help understand the performance impacts of different language features. We classified programs in 5 groups. The Bioconductor project open-source repository collects 515 Bioinformatics-related R packages.

The Shootout benchmarks are simple programs from the Computer Language Benchmark Game implemented in many languages that can be used to get a performance baseline. Some R users donated their code; these programs are grouped under the Miscellaneous category. The fourth and largest group of programs was retrieved from the R package archive on CRAN.

Some excerpts of the results:

We used the Shootout benchmarks to compare the performance of C, Python and R. Results appear in Fig. 7. On those benchmarks, R is on average 501 times slower than C and 43 times slower than Python. Benchmarks where R performs better, like regex-dna (only 1.6 times slower than C), are usually cases where R delegates most of its work to C functions.

…Not only is R slow, but it also consumes significant amounts of memory. Unlike C, where data can be stack allocated, all user data in R must be heap allocated and garbage collected.

…One of the key claims made repeatedly by R users is that they are more productive with R than with traditional languages. While we have no direct evidence, we will point out that, as shown by Fig. 10, R programs are about 40% smaller than C code. Python is even more compact on those shootout benchmarks, at least in part, because many of the shootout problems are not easily expressed in R. We do not have any statistical analysis code written in Python and R, so a more meaningful comparison is difficult. Fig. 11 shows the breakdown between code written in R and code in Fortran or C in 100 Bioconductor packages. On average, there is over twice as much R code. This is significant as package developers are surely savvy enough to write native code, and understand the performance penalty of R, yet they would still rather write code in R.

…Parameters. The R function declaration syntax is expressive and this expressivity is widely used. In 99% of the calls, at most 3 arguments are passed, while the percentage of calls with up to 7 arguments is 99.74% (see Fig. 12).

…Laziness. Lazy evaluation is a distinctive feature of R that has the potential for reducing unnecessary work performed by a computation. Our corpus, however, does not bear this out. Fig. 14(a) shows the rate of promise evaluation across all of our data sets.
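
R’s promises can be a bit exotic if you’ve never run into lazy evaluation. Here is a toy Python version of the idea the paper is measuring (my own illustration, not how R actually implements promises): the argument is wrapped in a thunk and only evaluated if and when the callee touches it.

    # A toy "promise": my own sketch, loosely mirroring the concept the paper
    # measures, not R's actual machinery. The wrapped expression is only
    # evaluated the first time someone forces it, and the result is cached.
    class Promise:
        def __init__(self, thunk):
            self._thunk = thunk
            self._forced = False
            self._value = None

        def force(self):
            if not self._forced:
                self._value = self._thunk()  # evaluate once...
                self._forced = True          # ...and remember the result
            return self._value

    def expensive():
        print("evaluating the expensive argument")
        return 42

    def f(x, use_it):
        # If the callee never forces the promise, the work never happens;
        # that avoided work is the potential benefit the paper looks for.
        return x.force() if use_it else 0

    print(f(Promise(expensive), use_it=False))  # 0, and no evaluation
    print(f(Promise(expensive), use_it=True))   # evaluates, then 42

The excerpt’s point, as I read it, is that in real R code most promises get forced anyway, so the laziness machinery rarely pays for itself.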

And the upshot:

The R user community roughly breaks down into three groups. The largest groups are the end users. For them, R is mostly used interactively and R scripts tend to be short sequences of calls to prepackaged statistical and graphical routines. This group is mostly unaware of the semantics of R, they will, for instance, not know that arguments are passed by copy or that there is an object system (or two)…

One of the reasons for the success of R is that it caters to the needs of the first group, end users. Many of its features are geared towards speeding up interactive data analysis. The syntax is intended to be concise.
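
That earlier point about R only looking fast when it hands the work to C rings true in my Python world too. Here is a quick, hedged timing sketch of the same effect (my own toy comparison, not from the paper; the absolute numbers will vary wildly by machine): a pure Python loop versus numpy’s C-backed routine doing the same sum.

    # My own toy benchmark, not from the paper: the "delegate to C" effect the
    # authors see in R's regex-dna shows up in Python when the inner loop runs
    # inside numpy's C code instead of the interpreter.
    import timeit
    import numpy as np

    data = list(range(1000000))
    arr = np.array(data)

    pure_python = timeit.timeit(lambda: sum(data), number=10)
    delegated = timeit.timeit(lambda: arr.sum(), number=10)

    print("pure Python sum: %.3fs for 10 runs" % pure_python)
    print("numpy (C) sum:   %.3fs for 10 runs" % delegated)
    print("speedup:         %.0fx" % (pure_python / delegated))

Same interpreter, same data; the only difference is who runs the loop.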

Via LtU and here is an interesting discussion on this related video.

Data Science

The Netflix competition will probably go down as the event that gave birth to the Data Science Era. Like all iconic events, there was absolutely nothing groundbreaking or new about it; it was just the first time a few trends came together in a public way: large-scale data, a public call for solutions, a prominent, relatively recent startup disrupting an ‘evil empire’ kind of industry. And a bunch of money.

And the winner’s solution was never used:

If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.

To me it makes the whole thing an even better story: a cautionary tale about the difference between academic indulgence and commercial needs.

Perfect is often the enemy of good.

How They Started

Fun little discussion on how people got into programming.

I started by writing simple VBA macros to automate Excel at work. A buddy that sat nearby studied engineering in school and showed me some of the basics. From then on it was hitting the ‘record’ button and googling what popped up!

My programming has always been focused on data management and analysis. I learned SQL in the Stanford DB course and Matlab/Octave in the Stanford ML class. I took up Python to build a simple weather webapp, which I never finished, then bolted on scipy and numpy after reading some Kaggle winners’ submissions. Now I use Python all the time. The webapp project also introduced me to django (which I rejected – too painful to learn), some ftp syncing automation, sqlite, and a bunch of rudimentary php and javascript/jquery.

I’m messing around with C now, but I can’t see a great reason for ditching Python yet. I still don’t really know what an Object is (wtf is __init__?) though I’d love to learn about it eventually.
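
For anyone in the same boat, here is roughly what I have pieced together so far (a minimal sketch, and I am happy to be corrected): a class is a blueprint for objects, and __init__ is just the setup code Python runs when you create one.

    # A minimal sketch of what __init__ does: it is the setup code that runs
    # whenever you create a new object from a class.
    class Dataset(object):
        def __init__(self, name, rows):
            # "self" is the particular object being created; these lines just
            # stash the constructor arguments on it for the methods to use.
            self.name = name
            self.rows = rows

        def summary(self):
            return "%s: %d rows" % (self.name, len(self.rows))

    d = Dataset("weather", [72, 68, 75])  # creating d is what runs __init__
    print(d.summary())                    # -> weather: 3 rows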

Revolution’s Achilles Heel

Pete Warden didn’t ask us to square this circle, but he should have. Both quotes from his blog.

Quote 1:

Our tech community chooses its high-flyers from people who have enough money and confidence to spend significant amounts of time on unpaid work. Isn’t this likely to exclude a lot of people too?

…I look around at careers that require similar skills, like actuaries, and they include a lot more women and minorities. I desperately need more good people on my team, and the statistics tell me that as a community we’re failing to attract or keep a lot of the potential candidates.

Appreciate the shoutout to actuaries and all, but isn’t the simple solution to encourage more education in this field?

Quote 2 comes from the comments to his first post:

I’m a female who majored in computer science but then did not use my degree after graduating (I do editing work now). While I was great with things like red-black trees and k-maps, I would have trouble sometimes with implementations because it was assumed going into the field that you already had a background in it. I did not, beyond a general knowledge of computers. 

I was uncomfortable asking about unix commands (just use “man”! – but how do I interpret it?) or admitting I wasn’t sure how to get my compiler running. If you hadn’t been coding since middle school, you were behind. I picked up enough to graduate with honors, but still never felt like I knew “enough” to be qualified to work as a “true” programmer. 

How is this possible? Even people with degrees in the field can’t code? And this isn’t the first time I’ve come across a story of Comp Sci graduates who couldn’t program.

Actuaries aren’t the best comparison because so much of Actuarial Science builds on pre-existing math knowledge and adds insurance and finance training. Coding is more fundamental. I’d say an actuary is to a .NET (or whatever) programmer what a generalized ‘math geek’ is to a ‘programmer’.

There’s only one way to learn to code, and it’s not the easy way. Like math, or any other language for that matter, you’ve got to sit down and crank away, learning from your mistakes; few could call themselves mathematicians three years after picking up their first calculators.

Of course, you don’t need to master the coding equivalent of calculus to be useful any more than you need to take integrals to do your taxes.  But right now the whole programming ecosystem is starved of talent. Pete needs ninjas and everyone else needs front end web devs.

That means every kid in the world should figure out whether they like programming or not in a middle school classroom.

Famous Programmers Hating on Speaking

I recently read an essay by Paul Graham on how much he likes writing better than speaking:

Having good ideas is most of writing well. If you know what you’re talking about, you can say it in the plainest words and you’ll be perceived as having a good style. With speaking it’s the opposite: having good ideas is an alarmingly small component of being a good speaker.

…Audiences like to be flattered; they like jokes; they like to be swept off their feet by a vigorous stream of words. As you decrease the intelligence of the audience, being a good speaker is increasingly a matter of being a good bullshitter. That’s true in writing too of course, but the descent is steeper with talks.

And this from a quick profile of Linus Torvalds:

In fact, Linux’s creator doesn’t really even like to talk about technology. He’d rather write. “I think it’s so much easier to be very precise in what you write and give code examples and stuff like that,” he says. “I actually think it’s very annoying to talk technology face-to-face. You can’t write down the code.”

Something about Graham’s view doesn’t sit well with me. Let’s say you have a great idea. A complicated, great idea. Which medium maximizes the audience’s reception: great writing or great speaking?

I’d say it is undoubtedly great speaking. By a country mile.

Which one maximizes the size of the audience? Writing, of course.

Some might say that to learn complicated ideas you need to just stop and think sometimes. But I’d say that each time you need to do that you’ve found a spot where the writer/speaker utterly failed. They lost you. They skimmed over a key point or lost empathy with their audience. Or maybe you just got distracted.

Speaking is a more powerful tool than writing. A richer medium with which to educate, entertain or simply distract dummies.

Permission to do Evil

Update (2009-12-09): Via @miraglia, here’s a hilarious excerpt from Doug’s talk, “The JSON Saga”, in which he gives some background on why he added this clause to the license and how often people ask him to remove it:

When I put the reference implementation onto the website, I needed to put a software license on it. I looked up all the licenses that are available, and there were a lot of them. I decided the one I liked the best was the MIT license, which was a notice that you would put on your source, and it would say: “you’re allowed to use this for any purpose you want, just leave the notice in the source, and don’t sue me.” I love that license, it’s really good.

But this was late in 2002, we’d just started the War On Terror, and we were going after the evil-doers with the President, and the Vice-President, and I felt like I need to do my part.

[laughter]

So I added one more line to my license, which was: “The Software shall be used for Good, not Evil.” I thought I’d done my job. About once a year I’ll get a letter from a crank who says: “I should have a right to use it for evil!”

[laughter]

“I’m not going to use it until you change your license!” Or they’ll write to me and say: “How do I know if it’s evil or not? I don’t think it’s evil, but someone else might think it’s evil, so I’m not going to use it.” Great, it’s working. My license works, I’m stopping the evil doers!

Audience member: If you ask for a separate license, can you use it for evil?

Douglas: That’s an interesting point. Also about once a year, I get a letter from a lawyer, every year a different lawyer, at a company–I don’t want to embarrass the company by saying their name, so I’ll just say their initials–IBM…

[laughter]

…saying that they want to use something I wrote. Because I put this on everything I write, now. They want to use something that I wrote in something that they wrote, and they were pretty sure they weren’t going to use it for evil, but they couldn’t say for sure about their customers. So could I give them a special license for that?

Of course. So I wrote back–this happened literally two weeks ago–“I give permission for IBM, its customers, partners, and minions, to use JSLint for evil.”

[laughter and applause]

And the attorney wrote back and said: “Thanks very much, Douglas!”

You can see the full video of the talk at YUI Theater (the excerpt above is from 39:45).

More here, via this, via HN.