Ressler 0
Crash the Crash Test Dummy
An Analysis of Individual Factors in Fatal Car Crashes
ESE 302
May 7, 2004
Alexandra Ressler
Table of Contents
Introduction
  Crash’s Friends
  Data Selection
  Logistic Regression
  Assumptions
  A Brief Summary of Findings
Logistic Regression of All Data
Logistic Regression of Driver Data
Conclusions
  Crash’s Friends Revisited
  Questions Raised for Further Study
Appendix A: Data Selection Criteria
Appendix B: Histograms for All Data
Appendix C: Problems with Airbags
Introduction
Crash the crash test dummy is vitally interested in which individual factors have the
greatest effect on his chance of survival in a fatal car crash. Car crashes are the leading cause of
accidental death in the United States, and in 2002, there was a car crash fatality every 12 minutes
and a disabling injury every 14 seconds. In that year, “motor vehicle crashes were the leading
cause of death for people ages 1 to 33”1.
Leading Causes of Unintentional Injury Deaths
United States, 2002 (see footnote 1)

Cause                                                            Deaths
Motor Vehicle                                                    44,000
Poisoning                                                        15,700
Falls                                                            14,500
Suffocation by Inhalation or Ingestion of Food or Other Object    4,200
Drowning                                                          3,000

Crash simulates a lot of fatal car crashes, but he doesn’t get to pick the circumstances of
each crash, such as the causes, environmental conditions, and vehicle characteristics. He does
know that due to a limited testing budget, he’ll only be tested in the most common types of
private automobiles, including cars, utility vehicles, vans, and pickup trucks. He also knows that
due to limited testing equipment, he only has to worry about impacts from the front, left side,
right side, and rear. Given this information, Crash would like to know whether he can improve
his predicted chance of survival by emulating any of his friends, each of whom personifies a
particular individual trait.
1 National Safety Council, < http://www.nsc.org/library/report_injury_usa.htm>
Crash’s Friends
Seatbelt Sid- Sid always uses a restraining system,
whether it’s a lap belt, a shoulder belt, or a child safety
seat (in the case of Sid Jr.). He’s always telling Crash to
buckle up- is he being sanctimonious, or does he have a
point?2
Crash Jr.- Okay, he can’t drive, but just because he’s not
behind the wheel doesn’t mean Crash Jr.’s out of harm’s
way. Or does it? And will he be any safer when he’s as
old as Crash and has his own license? In the meantime,
he’d love to ride shotgun (as it’s the cool thing to do), but
whether Crash should let him depends on Backseat Bob.3
Crashella- Does Crashella give new meaning to the term
“femme fatale”? Or is she better off than her bulkier
brother? And who’s a safer driver, anyway?4
Driver Dan- He’s cool, he’s hot, and he’s driving, thank
you. Driver Dan loves the freedom of the road and the
wheel at his fingertips- but would he be safer if he let his
girlfriend Crashella drive?5
Backseat Bob- Forget backseat driver, Bob’s a backseat
passenger! He won’t touch the passenger seat, and he
doesn’t think anyone else should either- especially
Crash Jr. Should Crash listen to Bob and put Crash Jr.’s
reputation on the critical list, or would that be
overprotective parenting at its worst?6
Airbag Al- Al’s a bit of an airhead, but he claims there’s
less to damage that way. Should Crash listen to his
bubbleheaded philosophy, or is Al just a windbag?7
2 http://www.bertonex19.de/Technik_/Crashtest/TC1999dummy.jpg
3 http://webs.lanset.com/aeolusaero/images/Crash%20test%20dummy%20baby.jpg
4 http://www.level7.zeves.ru/art/crash_dummies.jpg
Data Selection
Crash took the data for his analysis from the Fatality Analysis Reporting System’s
(FARS) 2002 Case Listings8. His data set included information for 56,833 individuals involved
(but not necessarily killed or injured) in fatal car crashes in 2002. These individuals represented
35,783 vehicles and 25,765 fatal crashes9. In general, Crash took the following data for each
individual, converted it to binary data (with the exception of Age and the four-category selection
variables Impact Point-Principal and Body Type), and coded it as follows:

Variable                    Code  Meaning
Impact Point-Principal      1     Front
                            2     Left
                            3     Right
                            4     Rear
Age                               Not sorted; left as a continuous numerical variable.
Air Bag                     1     Airbag Deployed
                            0     No Airbag Deployed
Injury Severity             0     FATAL
                            1     NOT
Person Type                 0     Driver
                            1     Passenger
Restraint System-Use        0     No Restraint System-Use
                            1     Restraint System Used
Seating Position            0     Front Seat
                            1     Second Seat
Gender                      0     Male
                            1     Female
Body Type (refers to the    1     Automobiles
vehicle body type)          2     Utility Vehicles
                            3     Vans
                            4     Pickup Trucks

All independent variables are entered as continuous (numeric) variables, while the dependent
variable Injury Severity is nominal. Impact Point-Principal and Body Type are not actually used in
the regression analysis, but serve as important selection criteria. The remaining variables used in
the regression analysis, apart from Age, are Bernoulli (0/1) variables, to simplify the analysis10.
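The coding listed above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `encode_person`, the field names, and the layout of the input record are hypothetical, and the text labels are the ones used in this report, not necessarily the raw FARS codes.

```python
# Illustrative sketch of the 0/1 coding described above.
# The record layout and all names here are assumptions, not part of FARS.

CODES = {
    "air_bag": {"No Airbag Deployed": 0, "Airbag Deployed": 1},
    "injury_severity": {"FATAL": 0, "NOT": 1},
    "person_type": {"Driver": 0, "Passenger": 1},
    "restraint_use": {"No Restraint System-Use": 0, "Restraint System Used": 1},
    "seating_position": {"Front Seat": 0, "Second Seat": 1},
    "gender": {"Male": 0, "Female": 1},
}


def encode_person(record):
    """Return the numeric coding for one individual."""
    encoded = {"age": float(record["age"])}  # Age is left continuous
    for field, mapping in CODES.items():
        encoded[field] = mapping[record[field]]  # KeyError on unknown labels
    return encoded


# A made-up individual, for illustration only.
example = {
    "age": 34,
    "air_bag": "Airbag Deployed",
    "injury_severity": "NOT",
    "person_type": "Driver",
    "restraint_use": "Restraint System Used",
    "seating_position": "Front Seat",
    "gender": "Female",
}
print(encode_person(example))
```

Raising an error on an unknown label, rather than guessing a code, mirrors the way individuals who fall outside the chosen categories are excluded from the data set.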
5 http://www.aidanbell.com/pics/thumbs/Crash%20Test%20Dummy.jpg
6 http://www.7er.com/modelle/e32/images/e32_crashtest_dummy.jpg
7 http://www.n-tv.de/images/200207/3053071_VW_CrashtestDummy.jpg
8 Fatality Analysis Reporting System’s Web-Based Encyclopedia, < http://www-fars.nhtsa.dot.gov/queryReport.cfm?stateid=0&year=2002>
9 For the exact criteria used to select individuals, please see Appendix A.
10 And because crash test dummies like dummy variables. Were these variables not Bernoulli and treated as
nominal, the logistic regressions would calculate an estimate for every possible outcome of the nominal variable; for
example, each seating position would be treated as a separate variable. N.B. The independent variables must be
continuous.
11 Information from “Notes on Logistic Regression” by Tony E. Smith and the JMPIN 4 online manual.
Logistic Regression11
As the dependent variable Injury Severity is nominal, Crash cannot use multiple regression,
and consequently cannot determine prediction intervals, r-squared values, or variance inflation
factors. Instead, Crash uses logistic regression, which is designed for Bernoulli dependent
variables and predicts the probability of an outcome rather than the outcome itself. Under
logistic regression, the parameter estimate for each independent variable is called the
maximum-likelihood estimate (as opposed to the least-squares estimate for multiple regression).
As a set, the maximum-likelihood estimates are such that the given observed values are most likely
to occur.
As an example of how the estimates work, let’s say the estimate for Age is -0.02. Each estimate
is a change in the log odds of being a fatality, so for each additional year of age, the
individual’s predicted odds of being a fatality are multiplied by e^(-0.02), or about 0.98, a
decrease of roughly two percent. In the case of a binary variable, if Restraint System-Use has an
estimate of -0.78, then the use of a restraint system multiplies the individual’s predicted odds
of being a fatality by e^(-0.78), or about 0.46. That is a decrease of roughly 54% in the odds, not
a decrease of 78% in the probability.
The χ² value for a maximum-likelihood estimate is its equivalent of a least-squares estimate’s
F value, and equals the square of its standardized value (the estimate divided by its standard
error) under the null hypothesis. The greater the χ² value, the more significant the variable.
The p-value for each estimate is reported as Prob>ChiSq, the probability of getting a χ² value
at least this large if the null hypothesis (estimate=0) were true; the lower the p-value, the more
significant the variable. Other tests for variable significance are the Wald Tests for Effects,
which test each variable as a whole; the related likelihood-ratio tests run the regression with
and without the variable and compare the results to determine significance. Crash uses χ² values
to compare the relative significance of independent variables, as the Wald Tests for Effects
produce the same relative significances among independent variables in his regressions.
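Because a logistic estimate acts on the log odds rather than directly on the probability, it can help to see the conversion spelled out. The sketch below uses the illustrative estimates from the text (-0.02 for Age and -0.78 for Restraint System-Use); the baseline probability of 0.50 and the function names are assumptions chosen only for the example.

```python
import math


def odds_multiplier(estimate):
    """Factor by which the odds change for a one-unit increase."""
    return math.exp(estimate)


def shifted_probability(baseline_prob, estimate):
    """Probability after adding `estimate` to the baseline log odds."""
    log_odds = math.log(baseline_prob / (1.0 - baseline_prob))
    return 1.0 / (1.0 + math.exp(-(log_odds + estimate)))


# Illustrative estimates from the text; 0.50 is an assumed starting point.
for name, estimate in [("Age (per year)", -0.02), ("Restraint System-Use", -0.78)]:
    factor = odds_multiplier(estimate)
    new_prob = shifted_probability(0.50, estimate)
    print(f"{name}: odds x {factor:.3f}, probability 0.500 -> {new_prob:.3f}")
```

The change in probability depends on the starting point, which is why an estimate is better read as an odds multiplier than as a fixed percentage change in probability.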
It is important to note that as n grows, the standard errors of the parameter estimates shrink
toward zero, so the χ² values grow roughly in proportion to n and become less relevant in the
absolute sense. In other
words, “the scope of Chi-square statistics is limited when n becomes very large, the smallest
departure from the target becoming statistically significant”12. In regressions with very large
sample sizes, χ² values are appropriate to compare the relative significance among variables and
regressions, but the values themselves do not adhere to the normal absolute standards. For
example, a value of 4 may reflect a “reasonably good” effect for a sample with n=100, but
only a practically negligible effect for a sample with n=10,000.
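The effect of sample size can be illustrated with a short calculation. The sketch assumes a single degree of freedom and a standard error that shrinks in proportion to 1/sqrt(n); the effect size of 0.05 and the per-observation spread of 1.0 are made-up values, and the function names are hypothetical.

```python
import math


def chi2_pvalue_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def wald_chi2(estimate, spread, n):
    """Wald chi-square when the standard error is spread / sqrt(n)."""
    std_error = spread / math.sqrt(n)
    return (estimate / std_error) ** 2


# The same small effect, measured on samples of increasing size.
for n in (100, 10_000, 56_833):
    chi2 = wald_chi2(estimate=0.05, spread=1.0, n=n)
    print(f"n={n:>6}: chi2={chi2:8.2f}, p={chi2_pvalue_1df(chi2):.4g}")
```

Under these assumptions the statistic grows in direct proportion to n, so an effect too small to register in a sample of 100 becomes overwhelmingly significant in a sample the size of Crash’s.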
To examine goodness of fit, Crash uses two metrics: the ChiSquare from the Whole Model
Test, which compares the regression model to a model with all parameters but the intercepts
removed, and the success rate. The success rate is calculated by rounding each individual’s
predicted chance of being a fatality to predict whether or not he or she was a fatality, then
comparing said prediction to the individual’s actual injury severity. The success rate is the
percentage of accurate predictions. Since the goal of these regressions is to find the mix of
independent variables that allows the most accurate predictions, and since the ChiSquare is
inflated by the sample size, success rate is the more useful metric.
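The success rate calculation can be sketched as follows. The predicted probabilities and outcomes are made up for illustration, the function name is hypothetical, and treating a predicted chance of exactly one half as a predicted fatality is an assumption, since the text does not say how ties are rounded.

```python
def success_rate(predicted_probs, actual_fatal):
    """Share of individuals whose rounded prediction matches the outcome."""
    if len(predicted_probs) != len(actual_fatal):
        raise ValueError("inputs must have the same length")
    hits = 0
    for prob, fatal in zip(predicted_probs, actual_fatal):
        predicted_fatal = prob >= 0.5  # assumed tie rule: 0.5 counts as fatal
        if predicted_fatal == bool(fatal):
            hits += 1
    return hits / len(predicted_probs)


# Made-up predictions for five individuals (True = fatality).
probs = [0.81, 0.35, 0.62, 0.10, 0.48]
actual = [True, False, False, False, True]
print(f"success rate: {success_rate(probs, actual):.0%}")
```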
12 <http://www.stat.auckland.ac.nz/~iase/publications/3/3269.pdf>
Assumptions
• Crash assumes that his independent variables are the most significant individual factors
of those included in the FARS case listings. It is possible that he’s neglecting more
significant but less intuitive factors (e.g., perhaps he should consider his friend Drake the
Drunk or Donny the Designated Driver, though making alcohol level a selection criterion
would severely limit his sample size).
• Crash assumes that he hasn’t inadvertently excluded any variable-specific categories that
are significant within his chosen variables13.
• Crash assumes that his individuals are independent within his chosen variables. This is a
faulty assumption for many reasons, e.g. when two or more people come from the same
car, they cannot have independent seating positions and at most one can be the driver.
• Crash assumes that his variables are independent within each individual (i.e., no
multicollinearities). This is a very faulty assumption, as Person Type dictates Seating Position
(the driver must be in the front) and Airbag availability is also limited to the front.
Furthermore, Person Type influences Age (almost no drivers are younger than 16). This
flaw will be addressed within the regressions.
• The Gauss-Markov assumptions of linearity, independence, and homoscedasticity are not
used for logistic regression. Though logistic regression “does not have the requirements
of the independent variables to be normally distributed, linearly related, nor equal
variance within each group (Tabachnick and Fidell, 1996, p575)”, it requires large
sample groups14.
13 For example, Person Type 3. For more information, please see Appendix A.
14 http://www.kmentor.com/socio-tech-info/archives/000480.html
A Brief Summary of Findings
Using logistic regressions, Crash finds that he can accurately predict whether a person
was a fatality about 70% of the time. As long as he includes Restraint System as an independent
variable, this is true whether he looks at all individuals, just drivers, or just passengers. This
figure, while moderately disappointing compared to other logistic regressions (where the mid-
80’s is considered “reasonably good” and 90% is considered “quite respectable”15) is nonetheless
impressive when taken in context. The factors that affect a person’s survival in a fatal car crash
are not limited to individual variables, but also extend to the causes of the crash, environmental
conditions, and vehicle characteristics. If one of the individuals in the sample drove a car off
a hundred-foot cliff, the data would register whether the driver used a restraint system, but not
that the driver had almost no chance of survival either way. The logistic regression models based
on individual variables are limited in
their prediction accuracy because they ignore significant external factors that affect chance of
survival. When viewed in this light, Crash’s 70% success rate is respectable.
15 Based on the “Analysis of Changing Religious Perspectives” report and “Notes on Logistic Regression” by Tony
E. Smith on the class website.
Logistic Regression of All Data
Crash first examines the entire data set. His first regression includes all variables.
Nominal Logistic Fit for Injury Severity SORTED

Whole Model Test
Model       -LogLikelihood  DF  ChiSquare  Prob>ChiSq
Difference        4885.923   6   9771.846      0.0000
Full             33264.685
Reduced          38150.608

RSquare (U)                  0.1281
Observations (or Sum Wgts)   56833
Converged by Gradient

Lack Of Fit
Source         DF  -LogLikelihood  ChiSquare  Prob>ChiSq
Lack Of Fit  1739        1399.472   2798.944      <.0001
Saturated    1745       31865.213
Fitted          6       33264.685

Parameter Estimates
Term                            Estimate     Std Error  ChiSquare  Prob>ChiSq
Intercept                      -0.0798436    0.0249661      10.23      0.0014
Restraint System-Use SORTED    -1.5615478    0.020127      6019.4      0.0000
Age                             0.02082281   0.0004871     1827.1      0.0000
Gender SORTED                   0.10701923   0.0200556      28.47      <.0001
Person Type SORTED             -0.4761275    0.0233042     417.42      <.0001
Seating Position SORTED 1v2    -0.4809617    0.0340468     199.56      <.0001
Airbag SORTED                   0.14609065   0.021654       45.52      <.0001
For log odds of FATAL/NOT

Effect Wald Tests
Source                         Nparm  DF  Wald ChiSquare  Prob>ChiSq
Restraint System-Use SORTED        1   1      6019.39665      0.0000
Age                                1   1      1827.09479      0.0000
Gender SORTED                      1   1       28.474345      0.0000
Person Type SORTED                 1   1      417.424313      0.0000
Seating Position SORTED 1v2        1   1      199.558036      0.0000
Airbag SORTED                      1   1      45.5165284      0.0000

The statistics report has several notable features. First of all, the Whole Model Test has an
astronomical ChiSquare value of 9,771.846. Rather than indicating a ludicrously good fit, the
order of magnitude suggests that the unusually monstrous sample size has inflated the χ²
statistics. Consequently, all of Crash’s logistic regressions will have enormous ChiSquare values,
which therefore cannot be used to determine absolute significance, but are nonetheless helpful in
determining relative significance.
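Several figures in the report can be cross-checked from its own entries. The sketch below copies the values from the JMP output above and assumes the usual definitions: the Whole Model ChiSquare as twice the difference in -LogLikelihood, RSquare (U) as that difference divided by the Reduced value, and each parameter’s ChiSquare as its squared standardized estimate. The variable names are hypothetical, and small mismatches with the report come from rounding in the printed standard errors.

```python
import math

# Values copied from the JMP report above.
full_nll = 33264.685     # -LogLikelihood, Full model
reduced_nll = 38150.608  # -LogLikelihood, Reduced model

whole_model_chi2 = 2.0 * (reduced_nll - full_nll)
rsquare_u = (reduced_nll - full_nll) / reduced_nll
print(f"Whole Model ChiSquare: {whole_model_chi2:.3f}")
print(f"RSquare (U): {rsquare_u:.4f}")

# (estimate, standard error) pairs from the Parameter Estimates table.
estimates = {
    "Restraint System-Use": (-1.5615478, 0.020127),
    "Age": (0.02082281, 0.0004871),
    "Gender": (0.10701923, 0.0200556),
    "Person Type": (-0.4761275, 0.0233042),
    "Seating Position": (-0.4809617, 0.0340468),
    "Airbag": (0.14609065, 0.021654),
}

wald = {}
for term, (estimate, std_error) in estimates.items():
    wald[term] = (estimate / std_error) ** 2  # squared standardized estimate
    odds_factor = math.exp(estimate)          # multiplier on the odds of FATAL
    print(f"{term:22s} chi2 = {wald[term]:8.1f}, odds x {odds_factor:.3f}")
```

Read this way, the Restraint System-Use estimate corresponds to multiplying the odds of being a fatality by roughly 0.21, while the positive Airbag estimate corresponds to a multiplier slightly above 1.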
Having navigated around this first pothole, Crash hits another bump in the road when he
notes that Airbag has a positive estimate, meaning that, ceteris paribus, airbag deployment
decreases one’s predicted chances of survival. Since this seems very counterintuitive, Crash
suspects the multicollinearity among Person Type, Airbag, and Seating Position mentioned earlier.
By fitting Person Type and Airbag by Seating Position, Crash realizes that all drivers sit in the
front and that no one in the second row of seats has an airbag, meaning these three variables are
inescapably collinear.
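The kind of check described here amounts to a cross-tabulation, which can be sketched with a handful of made-up records. The records below are not FARS data; they simply follow the 0/1 coding from the Data Selection section and are arranged to show the pattern Crash finds, with empty cells for second-seat drivers and second-seat airbags.

```python
from collections import Counter

# Made-up records: (person_type, seating_position, airbag), using the
# 0/1 coding from the Data Selection section.
records = [
    (0, 0, 1),  # driver, front seat, airbag deployed
    (0, 0, 0),  # driver, front seat, no airbag deployed
    (1, 0, 1),  # passenger, front seat, airbag deployed
    (1, 1, 0),  # passenger, second seat, no airbag deployed
    (1, 1, 0),  # passenger, second seat, no airbag deployed
    (0, 0, 1),  # driver, front seat, airbag deployed
]

# Count each combination of seating position with the other two variables.
person_by_seat = Counter((seat, person) for person, seat, _ in records)
airbag_by_seat = Counter((seat, airbag) for _, seat, airbag in records)

for seat, label in ((0, "Front Seat"), (1, "Second Seat")):
    print(label)
    print("  drivers:", person_by_seat[(seat, 0)],
          "passengers:", person_by_seat[(seat, 1)])
    print("  airbag deployed:", airbag_by_seat[(seat, 1)],
          "not deployed:", airbag_by_seat[(seat, 0)])
```

An empty cell in such a table means one variable fully determines another for that group, which is what makes the three variables collinear.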