I gave GPT-4¹ the 2019 version of CAS Exam 9. This is the last exam in the Casualty Actuarial Society progression to a fellowship (FCAS), the highest designation available to Property & Casualty actuaries. You can see GPT’s answers here, the grading rubric here, and GPT’s grades summarized here. You can also see a YouTube video of me walking through some of this content here: https://youtu.be/dMvjku-4hUY.
Summary
GPT-4 failed pretty miserably² (19.75/52.5, about half the passing score of 38.5), but I think the score could improve with some better prompt engineering and the right plug-ins. This raises two questions:
- Can GPT-4 get all the way to a pass?
- If it passes, would that mean its capabilities match an FCAS?
I’d bet GPT-4 could pass some old exams now, and with some clever hacking maybe it could get through this one, too. However, the exams are a moving target, and they’ve been evolving away from GPT’s strengths for a generation. I think GPT will accelerate that evolution, strengthening the profession’s value proposition.
Let’s dive in!
Context
These exams are hard, people³. The pass rate on the 2019 Exam 9 was 56% (338/601), and I like to point out that the 601 candidates who sat for this exam had mostly passed almost a dozen other exams with *similar pass rates*. In a world of grade inflation and decadence, most actuaries I speak to agree the exams have gotten harder over time. These are among the ultimate standardized tests, pushing candidates on analytical depth, domain knowledge and technical skill under a stressful time constraint. For goodness’ sake, grading the exams is a brutal exercise that itself demands all these skills!
My process was to use a pretty simple prompt, starting each question by saying “this is an actuarial exam question” and then pasting the question. Each question got a fresh instance of GPT-4. All answers were generated between March 23rd and 27th, 2023. As I was going through the answers I realized I had made some mistakes in transcribing the questions, so I regenerated those answers⁴. For a couple of questions GPT refused to do calculations, so I experimented with more prompts like “you are an actuarial student” (see notes), but I was unable to get it to show me the calculations, so I kept all the answers from the prompt above. I’m sure there are ways of engineering better prompts that would generate dramatically better answers for some questions. But even though there is certainly low-hanging fruit available to improve GPT’s score, I also think its successes and failures contain clues to some hard boundaries on its performance.
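I did all of this by hand in the ChatGPT web interface, but for anyone who wants to reproduce the per-question, fresh-instance workflow programmatically, here is a rough sketch against the OpenAI Python library as it existed in spring 2023. The questions.txt file and its delimiter are hypothetical stand-ins for however you store the transcribed questions.

```python
# A rough sketch of the manual workflow, scripted: one fresh conversation
# per exam question, each prefixed with the same simple prompt.
# Uses the spring-2023 openai Python interface (v0.27-era); the API key is
# read from the OPENAI_API_KEY environment variable.
import openai

PROMPT_PREFIX = "This is an actuarial exam question.\n\n"

def answer_question(question_text: str) -> str:
    """Send one exam question to GPT-4 in a brand-new conversation."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[  # a fresh message list = a fresh instance per question
            {"role": "user", "content": PROMPT_PREFIX + question_text},
        ],
    )
    return response.choices[0].message["content"]

if __name__ == "__main__":
    # questions.txt is a hypothetical file: one transcribed question per
    # block, separated by a line containing only "---".
    with open("questions.txt") as f:
        questions = f.read().split("\n---\n")
    for number, question in enumerate(questions, start=1):
        print(f"=== Question {number} ===")
        print(answer_question(question))
```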
Performance Analysis
For analysis I’ll focus on questions 1, 6, 7, 18 and 19.
Let’s start with problems 6 and 18, the two times GPT got full marks. These two problems were very well defined, fairly straightforward formulations of well-known analytical models. They are the bread and butter of ‘easy’ exam questions, since they don’t really challenge the candidate on comprehension: you have to memorize a technique, notice the problem requires it, and work it without error. There are probably lots of worked examples of these techniques online, and I think they are mostly worthless as analytical tools for a practicing actuary. If GPT embarrasses the profession into killing these questions, good riddance.
The next class of problem is one that GPT didn’t get correct, but where I think the gap to cross is small. Examples here are problems 1 and 7. In both cases GPT got confused by a complicated modeling process that was presented in a weird way. It missed the “trick”. I think prompt engineering and some kind of plug-in⁵ that gives GPT a formal model structure to “plug and chug”, leaving it to interpret the results, will quickly improve its scores on these; a rough sketch of what I mean is below. Even humans don’t necessarily deeply understand the models we use, and interestingly, when GPT didn’t recognize some obscure actuarial terminology it often attempted to build a model from first principles. It really didn’t work.
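To make the “plug and chug” idea a bit more concrete, here is a minimal sketch under my own assumptions: the formal model lives in ordinary deterministic code, and GPT is handed only the computed result to interpret. The bond example is purely illustrative and is not one of the exam’s models.

```python
# A minimal sketch of the "plug and chug" idea: the deterministic model does
# the arithmetic, and GPT is only asked to interpret the number it produces.
# The bond below is purely illustrative, not an exam problem.

def present_value(cash_flows, annual_rate):
    """Discount (time_in_years, amount) cash flows at a flat annual rate."""
    return sum(amount / (1 + annual_rate) ** t for t, amount in cash_flows)

# The model handles the calculation GPT can't be trusted with...
pv = present_value([(1, 100.0), (2, 100.0), (3, 1100.0)], annual_rate=0.05)

# ...and only the result goes back to GPT for interpretation, via the same
# "this is an actuarial exam question" prompt used elsewhere in this post.
interpretation_prompt = (
    "This is an actuarial exam question. A 3-year bond pays 100 at t=1, "
    f"100 at t=2 and 1,100 at t=3; its present value at a 5% flat rate is {pv:,.2f}. "
    "Briefly interpret this result."
)
print(interpretation_prompt)
```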
Which brings me to the last category. I think the failure on 19 is the most instructive of GPT’s limits. Here we have a toy model for calculating the capital requirements of an insurance company, and GPT completely whiffed. The model is coherent in the sense that it captures the underlying concepts a leverage model should cover, but its figures and mathematical structure are ridiculous in their simplicity. It is recognizable only to someone who “speaks math” and understands what models are “trying to do”. Novel, ad hoc, toy models are the tools of an effective actuary who is distilling a process to capture basic ideas and communicate with colleagues. Mostly these models don’t exist in a formal sense, aren’t studied on the internet, and are reflections of the quality of judgment of a very good (or bad) actuary. Building and manipulating them is creative analytical work at its best. I don’t see how GPT figures these out without practicing as an actuary for a decade.
Implications for CAS Exams?
I predict the CAS exams will (and should!) continue to get ‘harder’ in the coming years, partly in response to GPT, which frankly is exposing the weakest parts of the exam system for what they are: wasteful memorization. But this has also been the direction of the exam committee for a generation already, and I think that strategy has been deeply vindicated by GPT. Many of our cousin professions are asking deeper existential questions than actuaries need to. It’s hard to see what a lawyer does after we integrate even today’s version of GPT into the economy, much less whatever on earth we have in 2030. For actuaries there is a pretty clear path to continued relevance. My favorite parts of the exams were when a question made me examine my own knowledge and experience and integrate it with the syllabus material to produce something new, which aligns with the differentiating skill of the best actuaries. GPT won’t do that for a long while.
So let’s keep focusing there!
Questions and Answers
Notes
I gave GPT-4 the 2019 version of CAS Exam 9, which you can see here (including the grading rubric).
- I started a new prompt window for each numbered question. The only prompt I used was to say “this is an actuarial exam” and then paste the text below.
- For some answers (Q1), GPT refused to perform calculations. After noticing this I tried a variety of different prompts, like “answer this as an actuarial student” and “answer this with a perfect answer and perform all calculations”, but it never gave me any calculations. So I kept the answers below, since they’re equivalent to the answers I saw when experimenting with different prompts.
- I noticed that GPT has some very favorite techniques for analyzing problems, like VaR and the Sharpe ratio, and pulls them in frequently. I think it lacks the ability to differentiate among more nuanced analytical techniques for highly adjacent problem spaces.
- GPT performs much better on essay-type questions, which use words, than on questions that require analytical techniques to find numerical answers.
- Question 4: There is an idiosyncratic definition of the market price of risk in the syllabus, which GPT misinterpreted. I think this is a bit of a trick the CAS pulled to write an exam question.
- In question 5 the syllabus departs from Investopedia’s analysis of CAPM vs. APT. Why?
- On question 6 it got all the calculations right. I made a transcription error and regenerated the response, which was really, really wacky… then I discovered another error, so I regenerated again and got it right. How do the papers capture this kind of variability?
- In Q7, “tricks” in presenting problems really mess up its capabilities. Exams (and life!) are full of problem types that do not present themselves in a way that can be simulated on a test. Very complex interest rate term structures, or scrambled problem information that requires a highly generalized model of how a market operates, totally throw it off.
- In a sense, it feels like GPT can navigate problem spaces to the extent that our specified models are accurate.
- In Q8, GPT totally missed the shortcut formulas and did not understand the concept of an insurance book that renews.
- In Q9 the questions were pretty rote. The yield-curve implications of prepayment risk were lost on GPT here, which is a layer deeper than the standard terminology. I would bet this could be improved with prompt engineering.
- On question 14, GPT shows a solid understanding of the relationship between IRR, surplus, and investment income, even though the calculation mechanics are wonky. It can interpret results, it seems, which is pretty neat.
- On 15 there is a volatile mix of hallucination and real answers, plus problems with the time value of money.
- On 16, there are some conceptual mistakes in figuring out the timing of cash flows and in discounting that is more complex than a straight PV calculation.
- For 17, it lacks an understanding of the cash flow timing of investment income and loss payments. There’s a kind of real-world knowledge actuaries have here that GPT doesn’t.
- Question 18 was right in GPT’s wheelhouse and it nailed it.
- For 19, this was an ad hoc, arbitrary function for leverage, and GPT misunderstood both the point of the function and how to use it to calculate capital.
1. What is GPT-4? The best way to find out is to ask GPT! Go to chat.openai.com, register for free and ask GPT what it is and how to use it. I’m not kidding! At the time of writing GPT-4 is the latest model and only available to paying subscribers. For many ‘normal’ questions GPT-3.5 is about 95% as effective.
2. How did I grade this? The exam has an examiner’s report attached that has sample answers and a grading rubric for each question. I followed that. There were definitely judgment calls to make on many questions, usually where GPT did something unusual but reasonable and I awarded part marks. From my experience comparing my self-graded practice exams to the real thing I’ve always been a bit generous!
3. I myself never sat for Exams 8 or 9, though I’ve read the 9 syllabus a few times to implement some of its ideas at work, and I later did a whole podcast series on the text that will likely supplant a big portion of the material. See here!
4. In doing this I noticed something many others have seen: GPT generates radically different answers if you regenerate the response. Very minor changes in content sometimes resulted in completely new problem-solving strategies, sometimes doing weird stuff. GPT is an example of a model occasionally, dismissively, called a stochastic parrot: it repeats things it has ‘seen’ before. Since it has seen almost everything textual, it needs to make choices, and those choices can change with each click of “Regenerate Response”. In this exercise I kept the first valid choice, so it is plausible that just by hitting regenerate you could land on a better set of answers, if you knew how to identify them in advance. But how might you know when to stop regenerating responses to a question you don’t know the answer to? Figure that out and you’ve got a good strategy for improving GPT’s score!
5. Plug-ins are a very important recent addition to GPT: they let it use external software to take actions (book flights, hotels, etc.), search the Internet, or do basically anything else. (I fear that this footnote will age very poorly!)