…Python and other scripting languages have long been used to clean up and prepare data. In both cases the question arises: is this an IT job or not?
The problem with insurance data is that it is often inputted into a system that isn’t built to help insurers manage risk, but rather to pass financial audit.
Questions like: “did you measure your income, cash in, cash out and claims liabilities correctly?” matter.
Questions like: “how much money can I lose in scenario x?” don’t matter.
It’s bizarre to think that an industry entirely concerned with risk management doesn’t introduce systems to manage risk.
The core issue is related to the quote above: systems are an “IT thing”, not a core competency of an insurance professional/risk manager. They’re tantalizingly close, though, and getting closer.
I’ve found myself desperately building skills once reserved for IT people. They’re effing useful.
The rest of this post describes my latest project (to help me think through the process).
Insurance management systems are really accounting systems with fields added in to record some extra policy data. Typically, the only field audited thoroughly is the premium field. For one thing, it’s the easiest one to audit because you have an independent data source (actual cash received) to check it against.
So we get these listings which have very accurate premium transaction numbers (hopefully those data aren’t scrubbed using a DIFFERENT system) and try to answer these kinds of questions:
- How big are the limits offered by this company for different coverages?
- What are the distributions of these limits?
- What limits does this company offer to a single insured?
- How many insureds does this company insure?
- How many separate policies does this company write?
- How many coverages does this company write per policy?
- Can we link all the claims (separate database) to the policies?
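Once the listing is loaded, most of the questions above reduce to counting distinct values and grouping rows. A minimal sketch of that kind of tabulation, using plain Python and made-up column names and rows (a real listing would come from the admin system's export, and its field names will differ):

```python
from collections import defaultdict

# Hypothetical listing rows; the column names here are assumptions,
# not what any real policy admin system exports.
listing = [
    {"insured": "Acme Ltd", "policy": "P001", "coverage": "Property",  "limit": 1_000_000},
    {"insured": "Acme Ltd", "policy": "P001", "coverage": "Liability", "limit": 2_000_000},
    {"insured": "Acme Ltd", "policy": "P002", "coverage": "Property",  "limit": 500_000},
    {"insured": "Beta Inc", "policy": "P003", "coverage": "Property",  "limit": 750_000},
]

# How many insureds? How many separate policies?
insureds = {row["insured"] for row in listing}
policies = {row["policy"] for row in listing}

# How many coverages per policy?
coverages_per_policy = defaultdict(set)
for row in listing:
    coverages_per_policy[row["policy"]].add(row["coverage"])

# What total limit is offered to a single insured?
limit_per_insured = defaultdict(int)
for row in listing:
    limit_per_insured[row["insured"]] += row["limit"]
```

Every one of these counts depends on "insured" and "policy" being recorded consistently, which is exactly where mistyped names wreck things.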
A very important step is to establish what is an insured, what is a policy and what is a transaction. A mentor of mine drilled me with the mantra: “count everything once and ONLY once”.
So, one project I’m working on is to build a database analysis tool that fixes mistyped insured names.
The key concept is Levenshtein Distance.
The idea behind LD is to measure the number of single-character edits (insertions, deletions, substitutions) needed to turn one word into another. Its most common use is weeding out garbage in search-engine terms.
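The distance itself is a short dynamic-programming routine. A minimal pure-Python sketch (libraries like `python-Levenshtein` offer faster C implementations, but the logic is just this):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # delete ca
                cur[j - 1] + 1,               # insert cb
                prev[j - 1] + (ca != cb),     # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3: substitute k→s, substitute e→i, insert g.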
In my case, because these insured names are sometimes made up words or strange spellings of words (names of people or businesses), I want to run a LD analysis against the listing itself and tell the program to ask me when it thinks it’s found a mistake.
So what I want is a routine that builds a dictionary of all the separate words in the listing and tells me which words would get wiped out to build a more ‘efficient’ policy listing name.
I need to make sure I know how the original listing was structured so I can put it back together again, of course.
So here is the process:
- Build a database of each unique word in the file
- Discard one letter words (‘a’ and such)
- Arrange the words in alphabetical order
- Depending on how the words are arranged, compare each word to the previous one.
- … this is as far as I’ve gotten.
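The steps so far can be sketched like this, with the alphabetical sort doing the work of putting likely misspellings next to each other. The sample names and the edit-distance threshold of 2 are assumptions for illustration; a real run would pull words from the actual listing and flag each suspect pair for a human to confirm:

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical insured names standing in for the real listing.
names = ["Smith Hardware", "Smiht Hardware", "A B Jones", "Jones Bakery"]

# 1. Build a set of each unique word in the file
words = {w.lower() for name in names for w in re.findall(r"[A-Za-z]+", name)}

# 2. Discard one-letter words ('a' and such)
words = {w for w in words if len(w) > 1}

# 3. Arrange the words in alphabetical order
ordered = sorted(words)

# 4. Compare each word to the previous one; small distances are
#    candidate typos to ask a human about (threshold is a guess)
suspects = [(p, w) for p, w in zip(ordered, ordered[1:])
            if levenshtein(p, w) <= 2]
```

Here the sorted word list is `['bakery', 'hardware', 'jones', 'smiht', 'smith']`, and the only adjacent pair within distance 2 is `('smiht', 'smith')`, which is exactly the kind of pair I'd want the program to ask me about. Comparing only alphabetical neighbours is cheap but will miss typos in the first letter; an all-pairs comparison would catch those at the cost of quadratic work.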