data – Not Unreasonable

When Data Cannot Do Insurance

Posted on February 19, 2013 by David Wright

Here is David Brooks on what Data can’t do:

Data struggles with the social. Your brain is pretty bad at math (quick, what’s the square root of 437), but it’s excellent at social cognition. People are really good at mirroring each other’s emotional states, at detecting uncooperative behavior and at assigning value to things through emotion.

Computer-driven data analysis, on the other hand, excels at measuring the quantity of social interactions but not the quality. Network scientists can map your interactions with the six co-workers you see during 76 percent of your days, but they can’t capture your devotion to the childhood friends you see twice a year, let alone Dante’s love for Beatrice, whom he met twice.

In insurance we care about scale (the law of large numbers) and not getting f*@#ed over (avoiding moral hazard). Data has definitely helped where moral hazard is somewhat easily guarded against: such as in homeowners or auto liability insurance.

And these are the largest insurance markets on earth. Consumers have no doubt benefited, either through lower insurance premiums or (more likely) through a far more generous tort system and subsidy for people to build their homes in flood plains and on fault lines and hurricane tracks.

For more complicated lines of insurance, we still need really expensive underwriters to decide who is worthy of trust. Data has a long way to go there.

Will Google Write Catastrophe Insurance?

Posted on September 24, 2012 (updated September 25, 2012) by David Wright

Catastrophe insurance is the sexy part of my industry: lots of data and “analytics” and in tune with the information age. It’s also alternated between the most and least profitable line of business in the business.

Here’s what you need to write the stuff:

1) A really good map of where buildings are.
2) Some knowledge of what those buildings are made of and, just as importantly, what they’re worth.
3) An idea of the susceptibility of each region to natural catastrophes.

In my experience, people in the insurance business put a bit too much emphasis on #3. A cursory understanding of it is easy to get, but a deep understanding is currently beyond any intelligence yet discovered. The reality is that all of the science in the underwriting is in #1 and #2: where are the buildings and what are they worth?

What if Google just suddenly realizes it can probably do this better than anyone else?

“We already have what we call ‘view codes’ for 6 million businesses and 20 million addresses, where we know exactly what we’re looking at,” McClendon continued. “We’re able to use logo matching and find out where are the Kentucky Fried Chicken signs … We’re able to identify and make a semantic understanding of all the pixels we’ve acquired. That’s fundamental to what we do.”

More here.

I like imagining an even more tantalizing project: open source cat underwriting. OpenStreetMap does most of what Google does, except for free.

Will some actuary use this public data to check an industry-changer into GitHub one day? Might Index Funds (capitalizing this automated underwriting platform) and governments (subsidizing coastal homeowners) one day split all catastrophe insurance between them?

THUMP [PG’s head into the sand]

Posted on September 12, 2012 by David Wright

There’s this old joke that I really like:

One night a police officer sees an economist looking around a park bench near a light.
“What happened?” asks the police officer.
“I lost my keys but I’m having a really hard time finding them,” replies the economist.
“Here, let me help,” and they look for the keys awhile.
After not getting anywhere, the police officer asks, “Where did you drop them?”
“Oh,” replies the economist, “way over there,” and he gestures vaguely towards a nearby park, drenched in darkness.
“Well, then why on earth are we looking here?” asks the police officer.
“Because this is where the light is.”

A powerful lesson. Sometimes we are so desperate for an answer we look for it in a very unlikely place and try to extrapolate back to the thing we want. Sometimes this works, but it can be devilishly hard. And it can also be stupidly useless.

Meanwhile, the one thing you can measure is dangerously misleading. The one thing we can track precisely is how well the startups in each batch do at fundraising after Demo Day. But we know that’s the wrong metric. There’s no correlation between the percentage of startups that raise money and the metric that does matter financially, whether that batch of startups contains a big winner or not.

…I don’t know what fraction of them currently raise more after Demo Day. I deliberately avoid calculating that number, because if you start measuring something you start optimizing it, and I know it’s the wrong thing to optimize.

That’s the inestimable Paul Graham. Perhaps economists should spend more time thinking about what they should and should not be measuring.

In a related discussion he says this:

The counter-intuitive nature of startup investing is a big part of what makes it so interesting to me. In most aspects of life, we are trained to avoid risk and only pursue “good ideas” (e.g. try to be a lawyer, not a rock star). With startups, I get to focus on things that are probably bad ideas, but possibly great ideas. It’s not for everyone, but for those of us who love chasing dreams, it can be a great adventure.

And we also get this interesting tidbit:

thaumaturgy: Off-topic, but something I’ve been chewing on lately: what’s it like to have your every written (or spoken!) word analyzed by a bunch of people? Esp. people that you end up having some form of contact with. It seems like it would be difficult to just have a public conversation about a topic. Do you think about that much when you write?

PG: It’s pretty grim. I think that’s one of the reasons I write fewer essays now. After I wrote this one, I had to go back and armor it by pre-empting anything I could imagine anyone willfully misunderstanding to use as a weapon in comment threads. The whole of footnote 1 is such armor for example. I essentially anticipated all the “No, what I said was” type comments I’d have had to make on HN and just included them in the essay. It’s a uniquely bad combination to both write essays and run a forum. It’s like having comments enabled on your blog whether you want them or not.

Big Data To Cure Cancer? Matter of Time

Posted on July 16, 2012 by David Wright

I almost can’t believe this is happening. Incredibly exciting. Get used to these kinds of projects.

In 2007, Ian Clements was given a year to live. He was diagnosed with terminal metastatic bladder cancer. Ian began charting, quantifying, and recording as much of his life as possible in an effort to learn which lifestyle behaviors have the greatest impact on his cancer.

Ian has fought his disease successfully for five years, and now he asks the Kaggle community to look at his data to see what significant correlations and connections we can find. We at Kaggle are humbled by his efforts and want to help Ian share his data with the wider world by hosting it on our website.

This is an exercise in collaborative data exploration rather than a standard Kaggle competition. The ideal result would be a model suggesting which lifestyle behaviors may have the greatest effect on Ian’s health, but any insights into his dataset are welcome. While we understand it may not be possible to extrapolate insights from this dataset to the overall population, it will nevertheless be very helpful for Ian in generating hypotheses and suggesting different behaviors. We hope that you will find it interesting to take a look and see what you can find.

Dear youths of the world: GET INTO THIS FIELD.

Links on Data

Posted on May 20, 2012 (updated May 21, 2012) by David Wright

CalculatedRisk rounds up some links on how data collection can come under political fire, which is, of course, terrifying. He also tells this story:

The Depression led to an effort to enhance and expand data collection on employment, and I was hoping the housing bubble and bust would lead to a similar effort to collect better housing related data. From the BLS history:

[T]he growing crisis [the Depression], spurred action on improving employment statistics. In July [1930], Congress enacted a bill sponsored by Senator Wagner directing the Bureau to “collect, collate, report, and publish at least once each month full and complete statistics of the volume of and changes in employment.” Additional appropriations were provided.

In the early stages of the Depression, policymakers were flying blind. But at least they recognized the need for better data, and took action. All business people know that when there is a problem, a key first step is to measure the problem. That is why I’ve been a strong supporter of trying to improve data collection on the number of households, vacant housing units, foreclosures and more.

New data series are close to useless until they build up some history, and if we had more data on what happened in the Great Depression we might not be scratching our heads as much today. Here’s an example of a chart that tells some kind of story but really doesn’t have enough history to teach us much of use:

(The chart annoys me in that clearly these two datasets have radically different statistical properties: they don’t belong on the same scale, or probably not even the same chart.)

So Government datasets are excellent because they’re (mostly) impartial and consistently measured: I’d rather have a consistently flawed dataset that I can correct than one whose basis changes unpredictably throughout.

But it’s painful to audit data collection and analysis policies, which is why it took so long for economists to figure out that the way the government measures productivity changes due to offshoring is garbage. Michael Mandel blew the top off of this recently and taught us all a lesson.

But governments aren’t the only game in town. There are countless surveys of this and that group (architects, real estate agents, industrial producers, etc etc), which are ok, but big data is hopefully changing that, too. MIT’s Billion Prices Project is a ‘simple’ web scrape but is potentially a vastly better measure of inflation in the cost of goods. Check out their charts.

Hopefully data won’t be a bottleneck to knowledge some day.

addendum: Michael Mandel reports a huge revision in the domestically produced computers figures:

There are four important points here.

1) A big chunk of those computer shipments were supposedly going into domestic nonresidential investment. Post-revision, either the U.S. investment drought was deeper than we thought, or imports of computers were a lot bigger (see the recent PPI piece on Hidden Toll: Imports and Job Loss Since 2007).

2) The U.S. shift from the production of tangibles to the production of intangibles (think the App Economy) has been even sharper and more pronounced than we realized.

3) Budget cutbacks for economic statistics, such as the House Republicans are proposing, would increase the odds of big revisions like this one.

4) Bad data leads to bad policy mistakes, especially at times of turmoil. We need more funding for economics statistics, rather than less.

The Next Revolution Approaches

Posted on March 27, 2012 by David Wright

Bold headline, non?

Well, have a rummage through these links and you tell me if this is a big deal.

First The Economist gives us a story about patenting and medicine (may be gated). The bottom line here is that natural laws cannot be patented, though there are some loopholes…

For example, a genetic mutation can identify patients who are susceptible to a given disease or treatment. The mutation is a natural occurrence, as is the reaction to the drug. But the invention comes in connecting the dots between these elements.

Which aren’t as big as everyone thought…

Stephen Breyer, writing the court’s opinion, affirmed that Prometheus’s patents claimed a natural law and would restrict further innovation. Administering thiopurines, observing the body’s reaction and offering dosing advice did not add up to a patentable process. “Einstein could not patent his celebrated law that E=mc²”, wrote Mr Breyer. Nor could Einstein have patented the observation by “simply telling linear accelerator operators to refer to the law to determine how much energy an amount of mass has produced.”

The biotechnology industry did not expect the ruling. It is now in a minor panic. Personalised medicine inevitably includes the application of natural laws. It is unclear which applications may be patented.

The Economist doesn’t seem to come down hard on either side of this debate, even though it has been mildly skeptical of patents in the past.

Patents are tricky buggers. Have a listen to a patent skeptic, Alex Tabarrok, talk to Russ Roberts about them. Most patents and trademarks (especially) lie somewhere between trivially stupid and economically radioactive. The one kind of patent that seems to promote innovation? The kind on drugs.

Clearly we need to find a way to research these natural processes. And now that the results may well be in the public domain (I’m not a lawyer but I realize that that probably isn’t strictly, or perhaps remotely, true – just grant it to me for a minute), who’s going to pay for the data collection, analysis, etc?

Well, let’s start with the data: the same Alex Tabarrok from that excellent Econtalk interview linked above points us to a fascinating study (abstract and writeup) where a doctor did this to himself (from the writeup):

Snyder provided about 20 blood samples (about once every two months while healthy, and more frequently during periods of illness) for analysis over the course of the study. Each was analyzed with a variety of assays for tens of thousands of biological variables, generating a staggering amount of information.

…The researchers call the unprecedented analysis, which relies on collecting and analyzing billions of individual bits of data, an integrative Personal “Omics” Profile, or iPOP.

…To generate Snyder’s iPOP, he first had his complete genome sequenced at a level of accuracy that has not been achieved previously. Then, with each sample, the researchers took dozens of molecular snapshots, using a variety of different techniques, of thousands of variables and then compared them over time. The composite result was a dynamic picture of how his body responded to illness and disease — and it was a number of molecular cues that led to the discovery of his diabetes.

Ok, so a battery of tests can give us BIG DATA on our bodies just at the dawn of the age of our ability to swallow it.

Let’s pretend I know what I’m talking about and imagine the possibility of Kickstarter projects for accumulating giant biometric databases and Kaggle competitions to work out what they mean.

Now there’s a charity I’d donate to!

Science (?) And My Insurance BS Test

Posted on January 12, 2012 by David Wright

Richard Feynman defines science as the study of nature and engineering as the study of things we make. I like that logic and it makes the idea of an insurance company hiring a Chief Science Officer faintly ridiculous. Science today means ‘using tools that scientists use’.

Anyway, I have a test for the degree to which an article on insurance is BS or not. It’s the Climate Change Test. If the article or interviewee mentions climate change as a problem they want to think about in connection with insurance rates, they’re probably full of it.

My point is that big politicized science questions have no place at an underwriter’s desk: identifying claims trends is fine, but don’t dress the discussion up in some topic du jour just to pretend to be talking about something ‘people care about’. That’s pure, irritating status affiliation.

Well guess what:

MB: For the present we’ll be organized such that the operational analytics will continue to reside in the business units. On one end of a continuum is the traditional loss modeling; on the other end we’ll be responding to things like climate change in partnership with institutions such as the RAND Corporation. On a scale of one to ten, the familiar operational analytics may be a “one” and collaboration with RAND might be a 10. The sweet spot for the office is probably between four and 10. I envision that the science team will support the businesses in questions that have been asked but not addressed because of immediate burning issues or haven’t been asked in the most cohesive way.

Jim Lynch is puzzled about whether this is an actuarial role or not. It sure is. In most companies, C-suite folks all have ridiculously busy jobs so can’t focus on data mining and statistical analysis. But most companies don’t employ hundreds of highly trained statisticians to think about these problems every day. AIG does.

Anyway, what’s his strategy? Go fancy:

MB: Commercial and personal property insurance is largely about low-frequency, high-severity risk. The industry has tried with limited success is to model that risk through traditional analytic techniques. However, there remains a huge amount of volatility associated with an insurance company’s finances. We hope to explore ways of thinking about risk questions differently, approaching them from a different angle while leveraging relevant data. It’s more than a matter of using traditional and even non-traditional statistical analysis; it’s about bringing game theory, possibly real options theory and more broadly about reshaping the approach fundamentally to gain new insight into how to manage claims and better understand low-frequency, high-value events.

He’s been an internal consultant in insurance for 10 years. I’ll be surprised if he can come up with ways of out-analyzing the teams of actuaries AIG employs.

*Bad writing award for this line from his CV:

Creating and leading the team challenged to inculcate science driven decision making into an organization that has achieved great success by making heuristic decisions on the backbone of its sales force.

When Borrowing Dominates

Posted on January 6, 2012 by David Wright

David Merkel has another thought-provoking post. I like when people put real data up. Here is his chart:

So the stock market is being driven by inflation expectations. Hm… Let me put on my Monday Morning QB hat and think about this.

I think this is normal behavior in bad times. If things are healthy, the boom should rise for far longer than inflation expectations coming out of a recession. You need healthy demand to fuel inflation and the series really never had a chance to decouple from the early 2000s slowdown.

My second point is that the series really link up in 2007 when subprime hits the headlines and investors probably start seriously thinking about worst case scenarios. Worst case scenario here meant a debt-deflation spiral.

Deflation KILLS borrowers by making the real burden go up. This makes the borrowers less likely to pay back their loans, which means, ironically, that the banks lose big time.
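To put rough numbers on that, here is a minimal sketch with made-up figures. It assumes a debt fixed in nominal dollars and an income that simply moves with the price level; the function names and the 5% deflation rate are illustrative, not taken from any dataset.

```python
def real_debt_burden(nominal_debt, price_change, years):
    """Debt restated in year-0 purchasing power after `years` of a
    constant annual price change (negative means deflation)."""
    price_level = (1.0 + price_change) ** years
    return nominal_debt / price_level


def debt_to_income(nominal_debt, income, price_change, years):
    """Debt-to-income ratio when income tracks prices but the debt
    stays fixed in nominal terms (an assumption of this sketch)."""
    return nominal_debt / (income * (1.0 + price_change) ** years)


# Hypothetical borrower: 100,000 of debt against 50,000 of income.
for rate in (0.02, 0.0, -0.05):
    burden = real_debt_burden(100_000, rate, years=3)
    ratio = debt_to_income(100_000, 50_000, rate, years=3)
    print(f"price change {rate:+.0%}: real debt {burden:,.0f}, debt/income {ratio:.2f}")
```

With inflation the same nominal debt shrinks relative to income; with deflation it grows, even though the borrower has not borrowed another cent.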

And the market tells us what happens when the banks lose big time:

Higgs and Stats

Posted on December 20, 2011 by David Wright

Every time there is some science news, I always hold my breath until SWAB comments. And on this issue Ethan Siegel does not disappoint. I highly recommend reading him if you’re interested in great science writing.

Anyway, I’ve been pretty confused about a lot of the statistics around the evidence of the Higgs Boson. I’ll set this up, first, though. Here’s Ethan:

Back in 1976, there were only four quarks that had been discovered, but suspicions were incredibly strong that there were actually six. (There are, in fact, six.) If you look at the above graph, the dotted line represents the expected background, while the solid line represents the signal published here from a E288 Collaboration’s famous Fermilab experiment. Looking at it, you would very likely suspect that you’re seeing a new particle right at that 6.0 GeV peak, where there ought to be no background. Statistically, you can analyze the data yourself and find that you’d be 98% likely to have found a new particle, rather than have a fluke. In fact, the particle was named (the Upsilon), but when they looked to confirm its existence… nothing!

In other words, it was a statistical fluke, now known as the Oops-Leon (after Leon Lederman, one of the collaboration’s leaders). The real Upsilon was found the next year, and you shouldn’t feel too bad for Leon; he was awarded the Nobel Prize in 1988.

But the lesson was learned. It takes a 99.99995% certainty in order to call something a discovery these days.

5 sigmas?! WTF?! That’s humongous. That says to me that they’re either using the wrong distribution or the number of observations is immensely higher than any dataset I’ve ever seen. Considering these are probably the most competent statisticians on earth, I have to assume the latter, but… seriously?! FIVE standard deviations?
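A quick way to sanity-check how a quoted certainty maps to standard deviations is to invert the normal CDF. The sketch below assumes a plain Gaussian and shows both the one-sided and two-sided conventions, since the quote does not say which is meant; the function names are my own. For 99.99995% it comes out at roughly five standard deviations either way.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def sigmas_two_sided(confidence):
    """Number of standard deviations such that `confidence` of a
    normal distribution lies within +/- that many sigmas."""
    tail = (1.0 - confidence) / 2.0
    return STANDARD_NORMAL.inv_cdf(1.0 - tail)


def sigmas_one_sided(confidence):
    """Number of standard deviations such that `confidence` of a
    normal distribution lies below that many sigmas."""
    return STANDARD_NORMAL.inv_cdf(confidence)


def confidence_two_sided(sigmas):
    """Inverse of sigmas_two_sided: mass within +/- `sigmas`."""
    return 1.0 - 2.0 * STANDARD_NORMAL.cdf(-sigmas)


for level in (0.98, 0.9999995):
    print(level, round(sigmas_one_sided(level), 2), round(sigmas_two_sided(level), 2))
```

The same functions put the 98% figure from the Oops-Leon story at a little over two standard deviations, which gives a feel for how much stricter the modern threshold is.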

I’d love to see the data.

Monte Carlo Simulation Implemented ENTIRELY in Excel (no VBA macros)

Posted on November 8, 2011 by David Wright

When I moved to NY from our Toronto office I was a bit hamstrung by needing to log into a remote server to use the stochastic software we tend to run our simulations with. Not that it didn’t work, but it was slow and a bit of a pain to use.

So the first thing I did was try to implement my own Monte Carlo simulation program. I wanted to be able to send it to people wholesale so I forced myself to use VBA.

It worked, but it was really really slow. I hate slow.

So my latest iteration is a relatively simple model written entirely in Excel formulas, basically only using the RAND() function, which has been greatly improved by MS in Office 2003.

The attached file simulates Poisson random numbers with the famous Knuth algorithm (a lovely bit of math, by the way).
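For anyone who hasn’t met it, the Knuth algorithm can be sketched in a few lines. This is a Python illustration of the idea, not the spreadsheet itself; the function name `knuth_poisson` and the `rng` parameter are made up for the example.

```python
import math
import random


def knuth_poisson(lam, rng=random):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Keep multiplying uniform(0, 1) draws together until the running
    product falls to exp(-lam) or below; the number of extra draws
    needed is the Poisson count.
    """
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


# Example: simulate claim counts with an expected frequency of 3 per year.
generator = random.Random(2011)  # seeded only so the run is repeatable
claims = [knuth_poisson(3.0, generator) for _ in range(10)]
print(claims)
```

The method only ever needs uniform random numbers, which is presumably why it suits a RAND()-only spreadsheet. The catch is that the expected number of draws grows with the mean, and `exp(-lam)` underflows to zero for very large means, so it is best kept to modest frequencies.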

Next up will be installing normal, lognormal and beta distributions. With those, I can probably simulate just about anything I need to, really. I’m going to limit myself to Excel’s native (C) implementations of all of these functions, which are a ba-jillion times faster than anything you can handcode in VBA.
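The usual route for these is inverse transform sampling: feed a uniform random number into the inverse CDF, which in Excel terms is the pattern `=NORMINV(RAND(), mu, sigma)` (with `LOGINV` and `BETAINV` as the lognormal and beta counterparts). Here is a rough Python sketch of the same idea, assuming that is the approach taken; the function names are hypothetical, and beta is left out because the Python standard library has no inverse beta CDF.

```python
import math
import random
from statistics import NormalDist


def sample_normal(mu, sigma, rng=random):
    """Inverse transform sampling: push a uniform draw through the
    inverse normal CDF, like =NORMINV(RAND(), mu, sigma)."""
    u = rng.random()
    while u == 0.0:  # inv_cdf needs 0 < u < 1, and random() can return 0.0
        u = rng.random()
    return NormalDist(mu, sigma).inv_cdf(u)


def sample_lognormal(mu, sigma, rng=random):
    """Lognormal draw: exponentiate a normal draw, where mu and sigma
    describe the underlying normal (as LOGINV's parameters do)."""
    return math.exp(sample_normal(mu, sigma, rng))


# Example: five claim severities with a median of exp(10), roughly 22,000.
generator = random.Random(8)  # seeded only so the run is repeatable
severities = [sample_lognormal(10.0, 1.0, generator) for _ in range(5)]
print([round(s) for s in severities])
```

Pairing a Poisson count with lognormal severities like this is the standard frequency-severity setup, so these few building blocks already cover a lot of ground.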

So why reinvent the wheel?