NPV Modeling for Direct Mail Insurance
C. Olivia Rud, VP of Analytic Services, DirectCom, West Chester, PA
ABSTRACT
2) Risk - indices in matrix of gender by marital status by
age group based on actuarial analysis.
Acquisition modeling for direct mail insurance has
3) Product Profitability - present value of product specific
the unique challenge of targeting responsive
5-year profit measure usually provided by product
manager.
customers while minimizing risk. In addition, factors
4) Marketing Expense - cost of package, mailing &
such as the probability of a paid premium and
processing (approval, fulfillment).
product profitability all effect the net present value
of a prospect. This presentation will detail the steps
to develop a net present value, which can used to
optimize the selection of prospects. Topics covered
THE DATA COLLECTION
include how to structure and prepare data for
analysis to model construction and validation.
A previous campaign mail tape is overlaid with response
Additional directions for overlaying a risk
and paid sale results. Since these campaigns are mailed
segmentation model and product profitability are
quarterly, a 6-month-old campaign is used to insure mature
described. A final discussion will cover criteria for
results.
final name selection based on company objectives.
The present value of the 5-year product profitability is
determined to be $553. This includes a built in attrition and
INTRODUCTION
cancellation rate.
Increasing competition and slimmer profit margins in the
The marketing expense which includes the mail piece and
direct marketing industry have fueled the demand for data
postage is $.78.
storage, data access and tools to analyze or ‘mine’ data.
While data warehousing has stepped in to provide storage
The risk matrix (see Appendix A) represents indices, which
and access, data mining has expanded to provide a plethora
adjust the overall profitability based actuarial analysis. It
of tools for improving marketing efficiency.
shows that women tend to live longer than men, married
people live longer than singles and course, one of the
This paper details a series of steps in the data mining
strongest predictors of death is old age.
process, which takes raw data and produces a net present
value (NPV). The first step describes the process used to
To predict the performance of future insurance promotions,
extract and sample the data. The second step uses
data is selected from a previous campaign consisting of
elementary data analysis to examine the data integrity and
966,856 offers. To reduce the amount of data for analysis
determine methods for data clean up. The third step defines
and maintain the most powerful information, a sample is
the process to build a predictive model. This includes
th
created using all of the ‘Paid Sales’ and 1/25 of the
defining the objective function, variable preparation and the
remaining records. This includes non-responders and non-
statistical methodology for developing the model. The next
paying responders. The following code creates the sample
step overlays some financial measures to calculate the
dataset:
NPV. Finally, diagnostic tables and graphs demonstrate
how the NPV can be used to improve the efficiency of the
DATA A B;
selection process for a life insurance acquisition campaign.
SET LIB.DATA;
An epilogue will describe the ease with which all of these
IF PREMIUM > 0 THEN OUTPUT A;
steps can be performed using the SAS® Enterprise Miner
ELSE OUTPUT B;
data mining software.
DATA LIB.SAMPDATA;
SET A B (WHERE=(RANUNI(5555) < .04));
OBJECTIVE FUNCTION
SAMP_WGT = 25;
RUN;
The overall objective is to measure Net Present Value
(NPV) of future profits over a 5-year period. If we can
This code is putting into the sample dataset, all customers
predict which prospects will be profitable, we can target our
th
who paid a premium and a 1/25
random sample of the
solicitations only to those prospects and reduce our mail
balance of accounts. It also creates a weight variable called
expense. NPV consists of four major components:
SAMP_WGT with a value of 25.
1) Paid Sale - probability calculated by a model. Individual
must respond, be approved by risk and pay their first
premium.
The following table displays the sample characteristics:
and hence may have received fewer pieces of direct mail.
This will often lead to better response rates.
Campaign
Weight
Sample
The following code is used to replace missing values:
Non Resp/Non Pd Resp
929,075
37,163
25
Responders/Paid
37,781
37,781
1
IF INCOME = . THEN INC_MISS = 1;
Total
966,856
74,944
ELSE INC_MISS = 0;
The non-responders and non-paid responders are grouped
IF INCOME = ‘.’ THEN INCOME = 49;
together since our target is paid responders. This gives us
a manageable sample size of 74,944.
MODEL DEVELOPMENT
THE DATA CLEAN-UP
The first component of the NPV, the probability of a paid
To check data quality, a simple data mining procedure like
sale, is based on a binary outcome, which is easily modeled
PROC UNIVARIATE can provide a great deal of
using logistic regression. Logistic regression uses
information. In addition to other details, it calculates three
continuous values to predict the odds of an event
measures of central tendency: mean, median and mode. It
happening. The log of the odds is a linear function of the
also calculates measures of spread such as the variance
predictors. The equation is similar to the one used in linear
and standard deviation and it displays quantile measures
regression with the exception of the use of a log
and extreme values. It is good practice to do a univariate
transformation to the independent variable. The equation is
analysis of all continuous variables under consideration.
as follows:
The following code will perform a univariate analysis on the
log(p/(1-p)) = B0 + B1X1 + B2X2 + …… + BnXn
variable income:
PROC UNIVARIATE DATA=LIB.DATA;
Variable Preparation - Dependent
VAR INCOME;
RUN;
To define the dependent variable, create the variable
PAIDSALE defined as follows:
The output is displayed in Appendix B. The variable
INCOME is in units of $1000. N represents the sample
IF PREMIUM > 0 THEN PAIDSALE = 1;
size of 74,944. The mean value of 291.4656 is suspicious.
ELSE PAIDSALE = 0;
With further scrutiny, we see that the highest value for
INCOME is 2000. It is probably a data entry error and
should be deleted.
Variable Preparation - Independent: Categorical
The two values representing the number of values greater
Categorical variables need to be coded with numeric values
than zero and the number of values not equal to zero are the
for use in the model. Because logistic regression reads all
same at 74,914. This implies that 30 records have missing
independent variables as continuous, categorical variables
values for income. We choose to replace the missing value
need to be coded into n-1 binary (0/1) variables, where n is
with the mean. First, we must delete the observation with
the total number of categories.
the incorrect value for income and rerun the univariate
analysis.
The following example deals with four geographic regions:
north, south, midwest, west. The following code creates
The results from the corrected data produce more
three new variables:
reasonable results (see Appendix C). With the outlier
deleted, the mean is in a reasonable range at a value of 49.
IF REGION = ‘EAST’ THEN EAST = 1;
This value is used to replace the missing values for income.
ELSE EAST = 0;
Some analysts prefer to use the median to replace missing
IF REGION = ‘MIDWEST’ THEN MIDWEST = 1;
values. Even further accuracy can be obtained using
ELSE MIDWEST = 0;
cluster analysis to calculate cluster means. This technique
IF REGION = ‘SOUTH’ THEN SOUTH = 1;
is beyond the scope of this paper.
ELSE SOUTH = 0;
Because a missing value can be indicative of other factors,
If the value for REGION is ‘WEST’, then the values for the
it is advisable to create a binary variable, which equals 1 if
three named variables will all have a value of 0.
the value is missing and 0 otherwise. For example, income
is routinely overlaid from an outside source. Missing values
often indicate that a name didn’t match the outside data
source. This can imply that the name is on fewer databases
Variable Preparation - Independent: Continuous
evaluates all of the variables in relation to the dependent
variable while considering interactions among the
Since, logistic regression looks for a linear relationship
independent or predictor variables. It begins by measuring
between the independent variables and the log of the odds
the significance of all the variables and then removing one at
of the dependent variable, transformations can be used to
a time until only the significant variables remain. A
make the independent variables more linear. Examples of
reasonable significance level is the default value of .05. If
transformations include the square, cube, square root, cube
too many variables end up in the final model, the signifiance
root, and the log.
level can be lowered to .01, .001, or .0001.
Some complex methods have been developed to determine
The sample weight must be included in the model code to
the most suitable transformations. However, with the
recreate the original population dynamics. If you eliminate
increased computer speed, a simpler method is as follows:
the weight, the model will still produce correct ranking-
create a list of common/favorite transformations; create new
ordering but the actual estimates for the probability of a
variables using every transformation for each continuous
‘paid-sale’ will be incorrect. Since our NPV model uses
variable; perform a logistic regression using all forms of
actual estimates, we will include the weights.
each continuous variable against the dependent variable.
This allows the model to select which form or forms fit best.
The following code is used to build the final model.
Occasionally, more than one transformation is significant.
After each continuous variable has been processed through
PROC LOGISTIC LIB.MODEL:
this method, select the one or two most significant forms for
WEIGHT SMP_WGT;
the final model. The following code demonstrates this
MODEL PAIDSALE = AGE_MISS AGE_CUBE EAST
technique for the variable AGE:
MIDWEST SOUTH INCOME INC_MISS LOG_INC
MARRIED SINGLE POPDENS MAIL_ORD//
PROC LOGISTIC LIB.DATA:
SELECTION=BACKWARD;
WEIGHT SMP_WGT;
RUN;
MODEL PAIDSALE = AGE AGE_MISS AGE_SQR
AGE_CUBE AGE_LOG / SELECTION=STEPWISE;
The resulting model has 7 predictors. (See Appendix E) The
RUN;
parameter estimate is multiplied times the value of the
variable to create the final probability. The strength of the
The logistic model output (see Appendix D) shows two
predictive power is distributed like a chi-square so we look
forms of AGE to be significant in combination: AGE_MISS
to that distribution for significance. The higher the chi-
and AGE_CUBE. These forms will be introduced into the
square, the lower the probability of the event occurring
final model.
randomly (pr > chi-square). The strongest predictor is the
variable MAIL_ORD. This has a value of 1 if the individual
has a record of a previous mail order purchase. This may
imply that the person is comfortable making purchases
Partition Data
through the mail and is therefore a good mail-order
insurance prospect.
The data are partitioned into two datasets, one for model
development and one for validation. This is accomplished
The following equation shows how the probability is
by randomly splitting the data in half using the following
calculated, once the parameter estimates have been
SAS® code:
calculated:
DATA LIB.MODEL LIB.VALID;
prob = exp(B0 + B1X1 + B2X2 + …… + BnXn)
SET LIB.DATA;
(1+ exp(B0 + B1X1 + B2X2 + …… + BnXn))
IF RANUNI(0) < .5 THEN OUTPUT LIB.MODEL;
ELSE OUTPUT LIB.VALID;
This creates the final score, which can be evaluated using a
RUN;
gains table (see Appendix F). Sorting the dataset by the
score and dividing it into 10 groups of equal volume creates
If the model performs well on the model data and not as well
the gains table.
on the validation data, the model may be over-fitting the
data. This happens when the model memorizes the data
The validation dataset is also scored and evaluated in a
and fits the models to unique characteristics of that
gains table (see Appendix G).
particular data. A good, robust model will score with
comparable performance on both the model and validation
Both of these tables show strong rank ordering. This can
datasets.
be seen by the gradual decrease in predicted and actual
probability of ‘Paid Sale’ from the top decile to the bottom
As a result of the variable preparation, a set of ‘candidate’
decile. The validation data shows similar results, which
variables has been selected for the final model. The next
indicates a robust model. To get a sense of the ‘lift’ created
step is to choose the model options. The backward
by the model, a gains chart is a powerful visual tool (see
selection process is favored by some modelers because it
Appendix H). The Y-axis represents the % of ‘Paid Sales’
captured by each model. The X-axis represents the % of
help marketers in understanding their markets. In addition, it
the total population mailed. Without the model, if you mail
provides powerful tools for improving efficiencies, which can
50% of the file, you get 50% of the potential ‘Paid Sales’. If
have a huge impact on the bottom line.
you use the model and mail the same percentage, you
capture over 97% of the ‘Paid Sales’. This means that at
50% of the file, the model provides a ‘lift’ of 94% {(97-
50)/50}.
Epilogue
SAS® has developed a tool called the SAS® Enterprise
Financial Assessment
Miner, which automates much of the process we just
completed. Using icons and flow charts, the data is
To get the final NPV we use the formula:
selected, sampled, partitioned, cleaned, transformed,
NPV = Pr(Paid Sale) * Risk * Product Profitability -
modeled, validated, scored, and displayed in gains tables
Marketing Expense
and gains charts. In addition, it has many other features for
scrutinizing, segmenting and modeling data. Plan to attend
At this point, we apply the risk matrix and product
the presentation and get a quick overview of this powerful
profitability value we discussed earlier. The financial
tool.
assessment shows the models ability to select the most
profitable customers based on (See Appendix H). Notice
how the risk index is lower for the most responsive
customers. This is common in direct response and
References
demonstrates ‘adverse selection’. In other words, the
riskier prospects are often the most responsive.
Hosmer, DW., Jr. and Lemeshow, S. (1989), Applied
Logistic Regression, New York: John Wiley & Sons, Inc.
At some point in the process, a decision is made to mail a
percent of the file. In this case, you could consider the fact
SAS Institute Inc. (1989) SAS/Stat User’s Guide, Vol.2,
that in decile 7, the NPV becomes negative and limit your
Version 6, Fourth Edition, Cary NC: SAS Institute Inc.
selection to deciles 1 through 6. Another decision criteria
could be that you need to be above a certain ‘hurdle rate’ to
cover fixed expenses. In this case, you might look at the
cumulative NPV to be above a certain amount such as $30.
AUTHOR CONTACT
Decisions are often made considering a combination of
criteria.
C. Olivia Rud
DirectCom
The final evaluation of your efforts may be measured in a
1554 Paoli Pike #286
couple of ways. You could determine the goal to mail fewer
West Chester, PA 19380
pieces and capture the same NPV. If we mail the entire file
with random selection, we would capture $13,915,946 in
Voice: (610) 918-3801
NPV. This has a mail cost of $754,155. By mailing 5
Fax: (610) 429-5252
deciles using the model, we would capture $14,042,255 in
Internet: Olivia Rud@aol.com
NPV with a mail cost of only $377,074. In other words, with
the model we could capture slightly more NPV and cut our
marketing cost in half!
SAS is a registered trademark or trademark of SAS Institute
Inc. in the USA and other countries. ® indicates USA
Or, we can compare similar mail volumes and increase
registration.
NPV. With random selection at 50% of the file, we would
capture $6,957,973 in NPV. Modeled, the NPV would climb
to $14,042,255. This is a lift of over 100% ((14042255-
6957973)/ 6957973 = 1.018).
Conclusion
Through a series of well designed steps, we have
demonstrated the power of Data Mining. It clearly serves to
APPENDIX A
MALE
FEM
ALE
M
S
D
W
M
S
D
W
< 40
1.22
1.15
1.18
1.10
1.36
1.29
1.21
1.17
40-49
1.12
1.01
1.08
1.02
1.25
1.18
1.13
1.09
50-59
0.98
0.92
0.90
0.85
1.13
1.08
1.10
1.01
60+
0.85
0.74
0.80
0.79
1.03
0.98
0.93
0.88
APPENDIX B
Univariate Analysis
Variable=INCOME
Moments
Quantiles
Extremes
Low
High
N
74,944
100% Max 2000
6
74
Mean
291.4656
75% Q3 57
13
75
Std Dev
43.4356
50% Med 47
28
77
Num ^= 0
74,914
25% Q1 41
30
130
Num > 0
74,914
0% Min 6
32 2000
APPENDIX C
Univariate Analysis
Variable=INCOME
Moments
Quantiles
Extremes
Low
High
N
74,944
100% Max 130
6
73
Mean
49
75% Q3 56
13
74
Std Dev
6.32946
50% Med 46.5
28
75
Num ^= 0
74,913
25% Q1 38.5
30
77
Num > 0
74,913
0% Min 6
32 130
APPENDIX D
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 10.1594 27.1690 0.1398 0.7085 . .
AGE 1 -23.2172 0.3284 0.0057 0.9358 -4.287240 0.000
AGE_MISS 1 -3.8671 1.7783 4.7290 0.0297 -0.997359 .
AGE_CUBE 1 0.0033 1.3594 5.9005 0.0411 0.851626
AGE_LOG 1 1.9442 0.2658 0.0633 0.8013 0.936637
AGE_SQR 1 0.8499 0.7291 1.5507 0.2130 0.672450
APPENDIX E
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 -2.5744 0.0169 0.1398 0.0001 . .
AGE_CUBE 1 -0.0166 0.0059 0.0057 0.0049 -0.030639 0.000
MIDWEST 1 0.0263 0.0063 4.7290 0.0001 0.043238 1.027
LOG_INC 1 0.0620 0.0085 5.9005 0.0001 0.081625
1.064
INC_MISS 1 0.0291 0.0105 0.0633 0.0055 0.038147
1.030
MARRIED 1 0.0353 0.0081 1.5507 0.0001 0.046115
1.036
POPDENS 1 -0.2117 0.0057 0.0633 0.0001 -0.263967
0.809
MAIL_ORD 1 0.0634 0.0062 7.5507 0.0001 0.079093
1.065
Association of Predicted Probabilities and Observed Response
Concordant = 57.1%
Somers’ D = 0.209
Discordant = 36.2%
Gamma = 0.224
Tied = 6.6%
Tau-a = 0.030
(7977226992 pairs)
c = 0.604
APPENDIX F
Model Data
NUMBER OF
PREDICTED %
ACTUAL % OF
NUMBER OF
CUM ACTUAL %
DECILE
ACCOUNTS
OF PAID SALES
PAID SALES
PAID SALES
OF PAID SALES
1
48,342
11.47%
11.36%
5,492
11.36%
2
48,342
8.46%
8.63%
4,172
9.99%
3
48,342
4.93%
5.03%
2.429
8.34%
4
48,342
2.14%
1.94%
935
6.74%
5
48,342
0.94%
0.95%
459
5.58%
6
48,342
0.25%
0.28%
133
4.70%
7
48,342
0.11%
0.11%
51
4.04%
8
48,342
0.08%
0.08%
39
3.54%
9
48,342
0.00%
0.00%
2
3.15%
10
48,342
0.00%
0.00%
1
2.84%
APPENDIX G
Validation Data
NUMBER OF
PREDICTED %
ACTUAL % OF
NUMBER OF
CUM ACTUAL %
DECILE
ACCOUNTS
OF PAID SALES
PAID SALES
PAID SALES
OF PAID SALES
1
48,342
10.35%
10.12%
4,891
10.12%
2
48,342
8.44%
8.16%
3,945
9.14%
3
48,342
5.32%
5.76%
2.783
8.01%
4
48,342
2.16%
2.38%
1,151
6.60%
5
48,342
1.03%
1.07%
519
5.50%
6
48,342
0.48%
0.56%
269
4.67%
7
48,342
0.31%
0.23%
112
4.04%
8
48,342
0.06%
0.05%
25
3.54%
9
48,342
0.01%
0.01%
5
3.15%
10
48,342
0.00%
0.00%
1
2.83%
APPENDIX H
Financial Analysis
NUMBER OF
PREDICTED %
RISK
PRODUCT
AVERAGE
CUM AVERAGE
SUM CUM
DECILE
ACCOUNTS
OF PAID SALES
INDEX
PROFITABILITY
NPV
NPV
NPV
1
96,685
10.35%
0.94
$553
$58.27
$58.27
$5,633,985
2
96,686
8.44%
0.99
$553
$46.47
$52.37
$10,126,713
3
96,686
5.32%
0.98
$553
$26.45
$43.73
$12,684,175
4
96,685
2.16%
0.96
$553
$9.49
$35.17
$13,602,084
5
96,686
1.03%
1.01
$553
$4.55
$29.05
$14,042,255
6
96,685
0.48%
1.00
$553
$0.74
$24.33
$14,114,007
7
96,686
0.31%
1.03
$553
($0.18)
$20.83
$14,096,406
8
96,685
0.06%
0.99
$553
($0.34)
$18.18
$14,063,329
9
96,685
0.01%
1.06
$553
($0.76)
$16.08
$13,990,047
10
96,686
0.00%
1.10
$553
($0.77)
$14.39
$13,915,946
Add New Comment
Showing 1 comment