GPT-4 Fails Final Actuarial Exam

I gave GPT-4[1] the 2019 version of CAS Exam 9. This is the last exam in the Casualty Actuarial Society progression to a fellowship (FCAS), the highest designation available to Property & Casualty actuaries. You can see GPT’s answers here, the grading rubric here and GPT’s grades summarized here. You can see a YouTube video of me walking through some of this content here: https://youtu.be/dMvjku-4hUY.

Summary

GPT-4 failed pretty miserably[2] (19.75 / 52.5, about half the passing score of 38.5), but I think the score could improve with some better prompt engineering and the right plug-ins. This raises two questions:

  1. Can GPT-4 get all the way to a pass? 
  2. If it passes, would that mean its capabilities match an FCAS? 

I’d bet GPT-4 could pass some old exams now and with some clever hacking maybe it could get through this one, too. However, the exams are a moving target and they’ve been evolving away from GPT’s strengths for a generation. I think GPT will accelerate that evolution, strengthening the profession’s value proposition. 

Let’s dive in!

Context

These exams are hard, people[3]. The pass rate on the 2019 Exam 9 was 56% (338/601) and I like to point out that the 601 candidates who sat for this exam had mostly passed almost a dozen other exams with *similar pass rates*. In a world of grade inflation and decadence, most actuaries I speak to agree the exams have gotten harder over time. These are among the ultimate standardized tests that push candidates on analytical depth, domain knowledge and technical skill under a stressful time constraint. For goodness’ sake, grading the exams is a brutal exercise that itself demands all these skills!
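The headline numbers quoted so far can be checked with a few lines of arithmetic. This is only a sketch: the score, maximum, pass mark and candidate counts come from the post, while the "ten exams, each passed independently on the first attempt" figure is an illustrative assumption added here (real candidates can retake exams, so it overstates the difficulty of eventually finishing).

```python
# Figures quoted in the post.
gpt_score = 19.75
max_score = 52.5
pass_mark = 38.5
passed, sat = 338, 601

share_of_max = gpt_score / max_score    # fraction of available points earned
share_of_pass = gpt_score / pass_mark   # fraction of the passing score earned
pass_rate = passed / sat                # 2019 Exam 9 pass rate

# Illustrative assumption (not from the post): ten exams, each passed
# independently at the same rate, every one on the first attempt.
n_exams = 10
first_try_all = pass_rate ** n_exams

print(f"Share of available points: {share_of_max:.1%}")
print(f"Share of passing score:    {share_of_pass:.1%}")
print(f"Exam 9 pass rate:          {pass_rate:.1%}")
print(f"Ten straight first-try passes (assumed independent): {first_try_all:.2%}")
```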

My process was to use a pretty simple prompt, starting each question by saying “this is an actuarial exam question” then pasting the question. Each question got a fresh instance of GPT-4. All answers were generated between March 23rd and 27th, 2023. As I was going through the answers I realized that I had made some mistakes in transcribing the questions, so I regenerated those answers[4]. For a couple of questions GPT refused to do calculations, so I experimented with more prompts like “you are an actuarial student” etc. (see notes), but I was unable to get it to show me the calculations, so I kept all the answers from the prompt above. I’m sure there are ways of engineering better prompts that would generate dramatically better answers for some questions. But even though there is certainly low-hanging fruit available to improve GPT’s score, I also think its successes and failures contain clues to some hard boundaries on its performance.
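The process above was done by hand in the chat window, but its shape can be sketched in code. Everything here is hypothetical scaffolding: `build_prompt`, `run_exam` and the `ask_model` callable are names invented for illustration, and `ask_model` stands in for whatever actually sends one prompt to a brand-new session. No real chat API is shown.

```python
from typing import Callable, Dict

# The one-line preamble described in the post.
PREFIX = "this is an actuarial exam question"


def build_prompt(question_text: str) -> str:
    """Prepend the preamble to the pasted question text."""
    return f"{PREFIX}\n\n{question_text.strip()}"


def run_exam(questions: Dict[str, str],
             ask_model: Callable[[str], str]) -> Dict[str, str]:
    """Ask each question independently; no history is shared between calls."""
    answers = {}
    for number, text in questions.items():
        # ask_model is a hypothetical stand-in for "open a fresh session,
        # paste the prompt, copy the first answer".
        answers[number] = ask_model(build_prompt(text))
    return answers


def placeholder_model(prompt: str) -> str:
    """Dummy model used only so the sketch runs end to end."""
    return f"(answer to a {len(prompt)}-character prompt)"


sample_questions = {"1": "Placeholder text for question 1.",
                    "2": "Placeholder text for question 2."}
for number, answer in run_exam(sample_questions, placeholder_model).items():
    print(number, answer)
```

Keeping the model behind a plain callable makes the "fresh instance per question" rule explicit: nothing from one question can leak into the next.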

Performance Analysis

For analysis I’ll focus on questions 1, 6, 7, 18 and 19. 

Starting with problems 6 and 18, which were the two times GPT got full marks. These two problems were very well defined, fairly straightforward formulations of well known analytical models. These are the bread and butter of ‘easy’ exam questions since they don’t really challenge the candidate on comprehension. You have to memorize a technique and notice the problem requires it and work it without error. There are probably lots of examples online for these techniques and they are mostly worthless as analytical tools for a practicing actuary I think. If GPT embarrasses the profession into killing these questions, good riddance.

The next class of problem is one that GPT didn’t get correct but I think the gap to cross is small. Examples here are problems 1 and 7. In both these situations GPT got confused about some complicated modeling process that was presented in a weird way. It missed the “trick”. I think prompt engineering and some kind of plug-in[5] that gives GPT a formal model structure to “plug and chug” and then leave it to interpret results will quickly improve its scores on these. Even humans don’t necessarily deeply understand the models we use and GPT interestingly often attempted to build a model from first principles when it didn’t recognize some obscure actuarial terminology. It really didn’t work. Which brings me to the last category.

I think the failure on 19 is most instructive of the limits of GPT. Here we have a toy model for calculating the capital requirements of an insurance company and GPT completely whiffed. This model is coherent in the sense that it captures underlying concepts of what a leverage model should cover but the figures and mathematical structure are ridiculous in their simplicity. It is recognizable only to someone who “speaks math” and understands what models are “trying to do”. Novel, ad hoc, toy models are tools of an effective actuary who is distilling a process to capture basic ideas and communicate with colleagues. Mostly these models don’t exist in a formal sense, aren’t studied on the internet and are reflections of the quality of judgment of a very good (or bad) actuary. Building and manipulating these models is creative analytical work at its best. I don’t see how GPT figures these out without practicing as an actuary for a decade. 

Implications for CAS Exams?

I predict the CAS exams will (and should!) continue to get ‘harder’ in the coming years, partly in response to GPT, which frankly is exposing the weakest parts of the exam system for what they are: wasteful memorization. But partly this has been the direction of the exam committee for a generation already and I think this strategy has been deeply vindicated by GPT. Many of our cousin professions are asking deeper existential questions than actuaries need to. It’s hard to see what a lawyer does after we integrate even today’s version of GPT into the economy, much less whatever on earth we have in 2030. For actuaries there is a pretty clear path to continued relevance. My favorite parts of the exams were when a question made me examine my own knowledge and experience and integrate it with the syllabus material to produce something new, which aligns with the differentiating skill of the best actuaries. GPT won’t do that for a long while.

So let’s keep focusing there!

Questions and Answers

See Google doc

Notes

I gave GPT-4 the 2019 version of CAS Exam 9 which you can see here (including grading rubric). 

  • I started a new prompt window for each numbered question. The only prompt I used was to say “this is an actuarial exam” then pasted the text below
  • For some answers (Q1), GPT refused to perform calculations. After noticing this I tried a variety of different prompts like: “answer this as an actuarial student” “answer this with a perfect answer and perform all calculations” but it never gave me any calculations. So I kept the answers below since they’re equivalent to the answers I saw when experimenting with different prompts
  • I noticed that GPT has some very favorite techniques for analyzing problems, such as VaR and the Sharpe ratio, and it pulled those in frequently. I think it lacks an ability to differentiate among more nuanced analytical techniques for highly adjacent problem spaces.
  • GPT performs much better in essay type questions that use words instead of analytical techniques to find numerical answers to problems
  • Question 4: There is an idiosyncratic definition of market price of risk in the syllabus which was misinterpreted by GPT. This is a bit of a trick the CAS pulled to write an exam question, I think.
  • In question 5 the syllabus departs from Investopedia’s analysis of CAPM vs APT. Why?
  • On Question 6 it got all the calculations right. I made an error and regenerated the response, which was really, really wacky. Then I discovered another error, so I regenerated again and got it right. How do the papers capture this kind of variability?
  • In Q7, “tricks” in presenting problems really mess up its capabilities. Exams (and life!) are full of problem types that do not present themselves in a way that can be simulated on a test. Very complex interest rate term structures or scrambled problem information that requires a highly generalized model of the operation of a market totally mess it up.
  • In a sense it feels like to the extent our specified models are accurate, GPT can navigate problem spaces. 
  • In Q8, GPT totally missed the shortcut formulas and did not understand the concept of an insurance book that renews
  • In Q9 the questions were pretty rote. The yield curve implications of prepayment risk were lost on GPT here, which is kind of a layer deep on the standard terminology. I would bet that this could be improved with prompt engineering.
  • On question 14 it shows a solid understanding of the relationship between IRR, surplus and investment income even though the calculation mechanics are wonky. It can interpret results, it seems, which is pretty neat.
  • On 15 there is a volatile mix of hallucination and real answers. Problems with the time value of money
  • On 16, there are some conceptual mistakes in figuring out timings of cash flows and more complex discounting than straight PV
  • For 17, it lacks an understanding of the cash flow timing of investment income and loss payments. There’s a kind of real world knowledge that actuaries have there that GPT doesn’t.
  • In 18, this question was right in the wheelhouse and GPT nailed it.
  • For 19, this was an ad hoc arbitrary function for leverage and GPT both misunderstood the point of the function and how to use it to calculate capital.

1. What is GPT-4? The best way to find out is to ask GPT! Go to chat.openai.com, register for free and ask GPT what it is and how to use it. I’m not kidding! At the time of writing GPT-4 is the latest model and only available to paying subscribers. For many ‘normal’ questions GPT-3.5 is about 95% as effective. 

2. How did I grade this? The exam has an examiner’s report attached that has sample answers and a grading rubric for each question. I followed that. There were definitely judgment calls to make on many questions, usually where GPT did something unusual but reasonable and I awarded part marks. From my experience comparing my self-graded practice exams to the real thing I’ve always been a bit generous!

3. I myself never sat for Exams 8 or 9 though I’ve read the 9 syllabus a few times to implement some of the ideas at work and later did a whole podcast series on the text that will likely supplant a big portion of the material. See here!

4. In doing this I noticed something many others see: that GPT generates radically different answers if you regenerate the response. Very minor changes in content sometimes resulted in completely new problem solving strategies, sometimes doing weird stuff. GPT is an example of a model occasionally dismissively called a stochastic parrot. It repeats things it has ‘seen’ before. Since it has seen almost everything textual it needs to make choices and those choices can change with each click of “Regenerate Response”. In this exercise I kept the first valid choice so it is plausible that just by hitting regenerate response you could land on a better set of answers if you knew how to identify them in advance. But how might you know when to stop regenerating responses to a question you don’t know the answer to? Figure that out and you’ve a good strategy for improving GPT’s score!

5. Plug-ins are a very important recent addition to GPT where it can access other models that allow it to use external software to take actions (book flights, hotels, etc), search the Internet or basically anything else. (I fear that this footnote will age very poorly!)
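One possible answer to footnote 4’s question of when to stop regenerating is a simple agreement rule: keep regenerating until the same final answer has come back a set number of times, or give up after a fixed number of tries. To be clear, this is not what was done for the post (the first valid answer was kept), and agreement is not the same thing as correctness. The function name, the `generate` callable and the thresholds below are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def regenerate_until_agreement(generate: Callable[[], str],
                               agree: int = 3,
                               max_tries: int = 10) -> Tuple[Optional[str], int]:
    """Regenerate until one normalized answer has appeared `agree` times.

    Returns (answer, attempts_used), or (None, max_tries) if no answer
    reached the agreement threshold.
    """
    counts: Counter = Counter()
    for attempt in range(1, max_tries + 1):
        # Light normalization so "12 " and "12" count as the same answer.
        answer = generate().strip().lower()
        counts[answer] += 1
        if counts[answer] >= agree:
            return answer, attempt
    return None, max_tries


# Canned responses standing in for repeated clicks of "Regenerate Response".
canned = iter(["12", "15", "12 ", "12", "9"])
print(regenerate_until_agreement(lambda: next(canned), agree=3, max_tries=5))
```

A rule like this only works for questions with a short final answer that can be compared across runs; essay-style answers would need a different comparison.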

Death Spirals and Other Selection Problems with Amy Finkelstein

youtube: https://youtu.be/nvVlNSolE3s

podcast audio link: https://www.buzzsprout.com/126848/12145818

Why did I do this episode?

Selection problems are probably the most surprising and challenging feature of managing an insurance portfolio. I was frankly very surprised to see it get this kind of high profile treatment in a popular book! How could I not do a show on it!?

What did I learn?

I was very surprised at the degree to which we can actually have private information about our own risk in many areas.

What was my favorite part?

The stories about selection problems in the book give fantastic, vivid examples of this problem for learning about how difficult insurance problems really are.

Pricing Insurance Risk With Steve Mildenhall and John Major

youtube: https://youtu.be/ZQHpMVH7d9s

episode page: https://www.buzzsprout.com/126848/11998732

Why did I do this show?

This is important work and needs to be celebrated and I honestly can’t imagine any other media property than the Not Unreasonable Podcast that could give it proper treatment. Steve and John have achieved something tremendous here in capturing the latest theory and practice of pricing insurance.

What did I learn?

Ho, boy, a great many things in my studying for this. During the show I got the chance to really solidify my understanding of the concept of allocating margin vs allocating capital. Like all brilliant ideas it’s absurdly simple once you get it. For bonus marks I’ll go with Steve explaining Freddy Delbaen’s remarkable result for how to break a TVaR tie.

What was my favorite part?

There has been some incredible work by incredible researchers that will be used by actuaries for generations to come and we deliberately tried to work in the names of as many of these giants as we could. Insurance, oddly, has too little of a sense of its own intellectual pantheon. It exists! We should better celebrate the intellectual giants that support our work.

Some papers referenced:

Freddy Delbaen!
https://scholar.google.com/citations?user=mVF1X_UAAAAJ&hl=en&oi=ao

The paper is *Coherent Risk Measures*
https://link.springer.com/book/9788876423055

As distinct from *Coherent Measures of Risk* which is his top cited paper (and also a good one of course!)

clips:

https://www.linkedin.com/posts/david-wright-73661214_stephen-j-mildenhall-and-john-major-set-about-activity-7018206925800538112-FyXI

https://www.linkedin.com/posts/david-wright-73661214_what-is-reinsurance-worth-should-i-buy-it-activity-7018539536934674432-6PR_

What I Learned In 2022 (and the books I read!)

Three things I learned in 2022:

  1. Florida is not weird; insurance is just really hard to get right. For one, it is a community of migrants who don’t totally trust each other yet and, two, it’s massively exposed to hurricanes. Natural disasters put a lot of pressure on the social fabric of any society and Florida’s social fabric is pretty thin! See my episodes with Gary Mormino and Dave DeMott on Florida. For 2023: A hypothesis here is that all disaster-prone regions/societies will have dysfunctional insurance industries. Is that true?
  2. Insurance is an expression of our value system. If you’re an insurer in Florida, for example, you have to be very careful to choose which subculture you want to insure as each will deploy a different set of values in their relationship with an insurer. I wrote an essay describing how it is the job of the insurance underwriter to use moral judgment to assess an insured’s values. See episodes with Howard Kunreuther, Joe Edelman and Jen Brady for discussions of values and insurance and this clip with Stan McChrystal nails it!
  3. Your social theory only works if you can sell insurance with it. Joe Henrich taught me that we don’t pick our culture, we copy the culture of the prestigious. A lot of people have a problem with a big implication of this: it takes generations for a culture to change. If our treatment of risk is dictated by our culture and culture changes slowly, does that imply that we can’t do anything about newly identified risks? It does! My conversation with Joe Edelman explores some possibilities of another way and I’ve listened to a ton of interviews by Jim Rutt and Jordan Peterson studying this. It feels like something will be discovered here but not yet. Today, political compulsion is the only way to influence behavior reliably and no other theory of motivation is worth a damn. For 2023: What other theories can I test?

My top episodes this year in terms of downloads were:

  1. Tyler Cowen on Talent
  2. Robin Hanson on Distant Futures and Aliens
  3. Howard Kunreuther on Behavioral Economics of Risk

Someone once asked me to list out the materials I use to prepare for my podcast. This year I studied about 40 books (some are re-reads, very few cover to cover), probably twice that many academic papers, probably 10x that many podcasts and interviews and a huge amount of time just thinking about what on earth I believe about all this stuff. Below are the books on my shelf and in my kindle this year (plus one library book I took back!).

  • Aspiration by Agnes Callard
  • Democracy and Decision by Brennan and Lomasky
  • Nudge by Thaler and Sunstein
  • Moral Economies of Corruption by Steven Pierce
  • Emerging Perspectives on Judgment and Decision Making
  • Thinking, Fast and Slow by Kahneman
  • Team of Teams by McChrystal
  • Leaders by McChrystal 
  • Risk: A User’s Guide by McChrystal
  • My Share of the Task by McChrystal
  • Bureaucracy by James Wilson
  • Insurance and Behavioral Economics by Kunreuther
  • Mastering Catastrophic Risk by Kunreuther
  • The Complacent Class by Tyler Cowen
  • Stubborn Attachments by Tyler Cowen
  • Talent by Tyler Cowen
  • Creative Destruction by Tyler Cowen
  • What Price Fame by Tyler Cowen
  • Big Business by Tyler Cowen
  • The Art of Not Being Governed by James Scott
  • The Moral Economy of the Peasant by James Scott
  • The Sociology of Philosophies by Randall Collins
  • Why we Fight by Chris Blattman
  • Diminished Democracy by Skocpol
  • Human Agency and Language by Charles Taylor
  • The Righteous Mind by Jonathan Haidt
  • Self Efficacy by Bandura
  • Self Determination Theory by Ryan and Deci
  • Pricing Insurance Risk by Mildenhall and Major
  • Utilitarianism and Beyond by Amartya Sen
  • The Secret of Our Success by Joe Henrich
  • Seeing Like a State by James Scott
  • The Limits of Organization by Ken Arrow
  • Modern Warfare by Roger Trinquier
  • Risk Society by Ulrich Beck
  • The Idea of a Social Science by Peter Winch
  • Why Democracies Need Science by Collins and Evans
  • Artifictional Intelligence by Harry Collins
  • The WEIRDest people in the world by Joe Henrich
  • The moral foundations of politics by Ian Shapiro
  • Rethinking Expertise by Collins and Evans
  • Land of Sunshine, State of Dreams by Gary Mormino
  • Dreams of a New Century by Gary Mormino
  • Age of Em by Robin Hanson
  • Elephant in the Brain by Robin Hanson

Addendum: I pulled my notes docs from a few of my interviews that listed out academic papers I took notes on. The two sources are the citations from the books above and me applying Cowen’s second law. Mostly these are from deep dives into literatures on Meaning and Motivation. I haven’t captured quite everything here because I don’t always take notes on all the papers. Google Scholar is a ridiculous treasure trove but not all literature is good!

  • Understanding a Primitive Society – Peter Winch
  • The Rapacious Hardscrapple Frontier – Robin Hanson
  • Testing the Automation Revolution Hypothesis – Keller Scholl, Robin Hanson
  • Post-conflict Recovery in Africa: The Micro Level – Blattman
  • Civil War – Blattman
  • Gang Rule – Blattman
  • ENGINEERING INFORMAL INSTITUTIONS – Blattman
  • Forscher, P. S., Lai, C. K., Axt, J. R., Ebersole, C. R., Herman, M., Devine, P. G., & Nosek, B. A. (2019). A meta-analysis of procedures to change implicit measures. Journal of Personality and Social Psychology,
  • Consensus-based guidance for conducting and reporting multi-analyst studies
  • Mapping the moral domain. -Nosek
  • Understanding and Using the Implicit Association Test: I. An Improved Scoring Algorithm – Nosek, Greenwald, Banaji
  • Michie, S., van Stralen, M.M. & West, R. The behaviour change wheel: A new method for characterising and designing behaviour change interventions
  • EVERYONE DESIRES THE GOOD: SOCRATES’ PROTREPTIC THEORY OF DESIRE – Agnes Callard
  • Capability and Well Being – Amartya Sen
  • Equality of What – Amartya Sen
  • Rights and Agency – Amartya Sen
  • Rational Fools – Amartya Sen
  • Putting together morality and well-being – Ruth Chang
  • What is Human Agency – Charles Taylor
  • Utilitarianism and Beyond – Amartya Sen and Bernard Williams
  • The Centered Self – David Velleman
  • Systematic review of meaning in life instruments – Monika Branstatter
  • Beyond the Search for Meaning: A Contemporary Science of the Experience of Meaning in Life – King, Heintzelman, Ward
  • Three Forms of Meaning and the Management of Complexity – Jordan Peterson
  • Meaning and Belonging in a Charismatic Congregation: An Investigation into Sources of Neo-Pentecostal Success – Douglas B. McGaw
  • Finding “meaning” in psychology: a lay theories approach to self-regulation, social perception, and social development – Molden, D. C., & Dweck, C. S.
  • Life is Pretty Meaningful – Heintzelman
  • The three meanings of meaning in life: Distinguishing coherence, purpose, and significance – Frank Martela & Michael F. Steger
  • Routines and Meaning in Life – Heintzelman & King
  • Encounters with objective coherence and the experience of meaning in life – Samantha J. Heintzelman, Jason Trent and Laura A. King
  • (The Feeling of) Meaning-as-Information – Heintzelman & King
  • Motivation to learn: an overview of contemporary theories – David A Cook & Anthony R Artino Jr
  • Motivating the academic mind: High-level construal of academic goals enhances goal meaningfulness, motivation, and self concordance -William E. Davis, Nicholas J. Kelley, Jinhyung Kim,  David Tang, Joshua A. Hicks
  • Motivation for accepting parent values – Ariel Knafo,  Avi Assor
  • Grit: Perseverance and Passion for Long-Term Goals – Duckworth, Angela L., et al
  • Grit, basic needs satisfaction, and subjective well-being -Jin, B., & Kim, J. 
  • Facilitating Internalization: The Self-Determination Theory Perspective – Edward L. Deci, Haleh Eghrari, Brian C. Patrick, Dean R. Leone
  • Regulatory fit: A meta-analytic synthesis – Tamar Avnet
  • Happy Soldiers are Highest Performers – PB Lester

Mark Friedlander on Florida’s Insurance Overhaul

youtube: https://youtu.be/8Bfkrbiazxw

episode page: https://www.buzzsprout.com/126848/11907447

Why did I do this show?

Florida has passed legislation and I wanted an excuse to dig through why it’s so important for insurance!

What did I learn?

That some reinsurers still aren’t happy!

What was my favorite part?

Speculating on the eventual political consequences for insurance.

Gary Mormino on the Social History of Florida

youtube: https://youtu.be/WT0iS-sDa54

episode page: https://www.buzzsprout.com/126848/11848870

Do you think Florida is weird? Most everyone does. Why? Gary is the man to answer this question. Gary is Professor Emeritus of the University of South Florida and has dedicated his career to studying the social history of Florida.

Here is Gary on Wikipedia
Here is Gary on Amazon

Quote of the show: “Do crazy people immigrate to Florida or do perfectly normal people come here, and then be a little goofy and go crazy.”

What is the most unusual social characteristic of Florida? 0:00
What are some of the most distinctive features of Florida? 9:37
Florida’s “Florida Man” reputation. 15:49
California and Florida are neck and neck in population density growth in the last 100 years. 24:51
Florida is running out of options for reinsuring barrier islands. 35:55
What it costs to live on the coast in Florida. 40:18
How is Florida a Ponzi State? 42:28
What’s the real alternative? 46:55
What are the similarities and differences between Florida and other states in terms of immigration? 53:49
How the Cuban vote has been a solid republican vote since 1961

Why did I do this show?

Gary is ridiculously underrated. Florida is a massive deal and its reputation for weirdness screams out for explanation. There is an insurance question here of course but come on, why don’t more people know Gary Mormino!

What did I learn?

As I expected Florida can probably be explained using a kind of factor analysis of its sociological parts. It is old, its immigrant politics are relatively dominated by communist refugees so skews right, it is “California on the cheap”. We can dig into each of these things and understand that they are all perfectly normal but their combination yields a fascinating place.

What was my favorite part?

This section is sometimes hard for me to write but I’ve got an embarrassment of riches in this episode. The winner HAS to be Gary describing the giant senior community that launches presidential campaigns where everyone drives golf carts instead of cars.

Dave DeMott’s Stories About Florida Insurance

episode: https://www.buzzsprout.com/126848/11840226-dave-demott-s-stories-about-florida-insurance

youtube: https://youtu.be/G7iGuLLZNwk

Dave DeMott is President-Elect of The Florida Surplus Lines Association, Chair of the Legislative committee and sits on the national Wholesale & Specialty Insurance Association committee.
Most importantly for today, Dave DeMott is a real, legit, on-the-ground insurance practitioner in Florida. He gets into the real details and war stories about insurance claims in Florida. 

Dave’s introduction to the Florida insurance market. 0:00
What is a lodestar fee multiplier? 5:23
The problem with AOB 12:29
What work is being done to curb predatory behavior by carriers? 19:03
Water damage and leaky roofs  23:15
What’s distinct about Florida? 29:28
Lobbyists have their golden opportunity.

Why did I do this show?

I have been studying the Florida insurance market and Florida culture generally trying to figure out why Florida is weird. Dave DeMott is a real on-the-ground practitioner with real war stories and we got some in this episode!

What did I learn?

I learned what a one-way attorney’s fee is! If the attorney settles for $1 more in claim payment than the carrier offered, they get their entire fee paid, plus a multiplier. Holy cow!
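To see why that rule is so lopsided, here is a minimal sketch of the mechanics exactly as described above. It is a simplification for illustration, not the actual Florida statute or how a court sets fees: the function name is made up, every dollar amount is hypothetical, and "plus a multiplier" is read here as the base fee scaled by the multiplier.

```python
def one_way_fee_award(carrier_offer: float,
                      recovery: float,
                      base_fee: float,
                      multiplier: float) -> float:
    """Fee the carrier pays the policyholder's attorney under the simplified rule.

    If the recovery beats the carrier's offer by any amount, the carrier pays
    the attorney's whole fee scaled by the multiplier; otherwise it pays none.
    The rule is "one-way" because the carrier never collects fees in return.
    """
    if recovery > carrier_offer:
        return base_fee * multiplier
    return 0.0


# Hypothetical numbers: a $10,000 offer beaten by a single dollar.
offer = 10_000.00
recovery = 10_001.00
award = one_way_fee_award(offer, recovery, base_fee=20_000.00, multiplier=2.0)

print(f"Extra claim payment won: ${recovery - offer:,.2f}")
print(f"Attorney fee paid by carrier: ${award:,.2f}")
```

With these made-up inputs the fee dwarfs the extra dollar of claim payment, which is the point of the story.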

What was my favorite part?

Here’s my favorite quote: “The very first time you hear about the loss, you get the you get the notice of loss to your claims team. There’s a notice of loss written by the insurance agent, there’s an assigned AOB, a public adjusters contract and a notice of intent to sue all together at once at first loss.” Insurers get sued at the very moment they learn of the claim!!

clips: https://www.linkedin.com/posts/david-wright-73661214_next-week-the-florida-legislature-sits-for-activity-7007081290201997313-yZYr

Joe Edelman on Designing Meaningful Things

youtube: https://youtu.be/Sjennrn5LNA

episode page: https://www.buzzsprout.com/126848/11585772

More on Joe Edelman

Why did I do this show?

I am looking for a way that we might be able to design a new insurance system. By “insurance system” I mean a way for us to make decisions about what kinds of risks we want to face and what we want to ignore. The amount of information we need to process to make these decisions is too much for an individual to handle so we make them socially. Social decision making is another way of saying “politics” and “regulation”. Does Joe open the door to a better way of regulation?

What did I learn?

In preparing for this and in doing this interview I learned how incredibly powerful it is to have your values identified and named and discussed. There really is something to this!

What was my favorite part?

Hearing about what the most common values are in people. I had thought that there was a lot of similarity and I think that’s right. “contributing” “showing up”. Interesting!

Mark Friedlander on Problems with Insurance in Florida

youtube: https://youtu.be/g63n9Kgq4CY

episode page: https://www.buzzsprout.com/126848/11582094

Why did I do this show?

I’m trying to learn about what the problem is with Florida insurance!

What did I learn?

From Mark I learned some details about fraud rings and how the problem really extends beyond property insurance.

What was my favorite part?

Learning about how property developers are dying to get back into Florida!

Joe Petrelli on Trouble with Insurance in Florida

episode link: https://www.buzzsprout.com/126848/11547382

youtube: https://youtu.be/Cd9JpHsjDr8

Why did I do this show?

Insurance companies operate based on trust. Probably the single most important gatekeepers of trust of insurance companies are rating agencies. Some will find that surprising! Rating agencies are a dominant force in insurance especially in supremely dysfunctional markets like homeowners insurance in Florida where Joe’s company, Demotech, is the dominant rating agency. Florida is a mess and then that mess got hit with the largest hurricane in its long history of hurricanes!

What did I learn?

That litigation reform is very, very hard!

What was my favorite part?

I learn again and again, entrepreneurs are pretty similar everywhere, even those in the weird business of rating agencies!

Clips: