Predicting patients length-of-stay in hospitals using historical data
Robert Stöhr-Botar
IT department, Technologiecampus KAHO Sint Lieven, Belgium
Abstract
71 million individuals require hospitalisation in the United States of America every year.
Studies show that over 30 billion dollars were spent on unnecessary hospitalisation in 2006.
The situation in Europe is even worse because over-hospitalisation is socially accepted. The
scope of this thesis is to investigate methods that enable predicting a patient's length of
stay at a hospital during the upcoming year. If these predictions are accurate enough, they can
help identify hazardous cases. This allows health care providers to treat patients before
emergencies occur. This thesis participates in an international competition conducted by HPN
(http://www.heritagehealthprize.com), from which the historical data has been obtained. After
processing this data and organizing it in a database, we use the programming language R to
implement a selection of relevant and often used machine-learning algorithms. We make a
comparative study of these algorithms and investigate how the final result can be improved by
combining the best algorithms.
1 Introduction

Information extraction from data sets is a hot topic, as it enables us to identify hidden
patterns and make predictions for the future based on historical data. The Heritage Provider
Network [10], a medical insurance company, chose to host a competition in order to find an
efficient method for predicting the number of days a given patient will need to be hospitalized
in the next year. The aim, predicting a patient's length of stay accurately, is a typical
supervised machine learning problem. Entries are judged on the accuracy of their predictions,
carried to six decimal places. The Heritage Health Prize competition ranks the competitors
using the following metric:

\[
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ \log(p_i + 1) - \log(a_i + 1) \right]^2} \qquad (1)
\]

where p_i is the predicted and a_i the actual number of days in hospital for member i, and n is
the number of members. This metric, the root mean squared logarithmic error (RMSLE), is an
extension of the well-known root mean squared error. The scope of this article was the
optimization of the predicted values for the class DaysInHospital, and thus achieving a top
position in the hosted competition. The second goal was to examine whether this score could be
improved by combining the individual solution sets. The individual solution sets were generated
using supervised machine learning techniques. Building a good model usually involves a lot of
parameter fine-tuning before good results are achieved; we investigate whether this can be
partially avoided by generating different models using different predictor sets and blending
their predicted solution sets.
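
The metric of equation (1) can be sketched in a few lines. The experiments in this article were carried out in R; Python is used here purely for illustration, and the function name `rmsle` as well as the sample values are assumptions, not part of the original work.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    # log1p(x) computes log(x + 1), matching the terms of equation (1).
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up example: three members whose stays are predicted exactly.
actual_days = [0, 2, 15]
predicted_days = [0.0, 2.0, 15.0]
print(rmsle(predicted_days, actual_days))  # a perfect prediction scores 0.0
```

Because the error is taken on the logarithm of the stay, missing a long stay by one day is penalised less than missing a short stay by one day.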

Outline. First we analyse the provided data sets. These data sets were accumulated from actual
claims, so they need to be cleaned up. Once a uniform dataset has been created, we examine the
predictor set and generate extra predictors where needed. A comprehensive study of the available
supervised learning algorithms was made, and three techniques were selected. After obtaining the
solution sets from these algorithms, the effect of model ensembling was investigated.

2 Experiments

2.1 Data gathering

The necessary data was provided to the competitors by the Heritage Provider Network in three
consecutive releases. Release three supersedes the previous ones and was therefore chosen. This
release contains six tables, each covering particular information about the patient. Instances
can be uniquely identified by the patient's member id and the year in which the claim was made.
Table 1 shows the instance distribution. By combining the instances from year one and year two,
the training set is enlarged with a minimal loss of information.

Modelling set    Entries
Year one           76037
Year two           71435
Target             70942
Combined          147472

Table 1: Instance distribution

2.2 Data manipulation

The provided data suffers from the problems all real-life data exhibits, such as missing and
erroneous instance fields. The available data went through four stages, ultimately forming one
uniform dataset of instances:

* Discretisation
* Data coherence
* Feature construction
* Feature selection

Different subsets of features were constructed: one with the Random Forest (RF) wrapper method,
and four others by splitting the complete feature set into four parts with similar information
and recombining these. Together with the complete feature set, this led to six usable feature
sets. Using different feature sets increases variance, which benefits model ensembling:
different machine learning algorithms can perform better individually on the same set of data by
using diverse feature sets [1]. These feature subsets are listed in table 2.

Group    Content                        Number of features
AV       Complete predictor set                        141
BS       RF-optimized predictor set                     13
c1       g1 + g2                                        99
c2       g1 + g2 + g3                                  109
c3       g1 + g3 + g4                                   52
c4       g2 + g3 + g4                                  125

Table 2: Predictor working sets
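
The recombination of the four feature groups into the working sets c1 to c4 of table 2 can be sketched as follows. This is only an illustration: the function name `build_feature_sets` and the feature names inside the groups are hypothetical, since the text does not list the contents of g1 to g4.

```python
def build_feature_sets(groups, recipes):
    """Recombine named feature groups into working sets.

    The order of the features is preserved and duplicates are dropped.
    """
    feature_sets = {}
    for set_name, group_names in recipes.items():
        combined = []
        for group_name in group_names:
            for feature in groups[group_name]:
                if feature not in combined:
                    combined.append(feature)
        feature_sets[set_name] = combined
    return feature_sets


# Hypothetical group contents; the real g1..g4 are not given in the text.
groups = {
    "g1": ["age", "sex"],
    "g2": ["claims_y1", "claims_y2"],
    "g3": ["drug_count"],
    "g4": ["lab_count"],
}
# The recipes follow the "Content" column of table 2.
recipes = {
    "c1": ["g1", "g2"],
    "c2": ["g1", "g2", "g3"],
    "c3": ["g1", "g3", "g4"],
    "c4": ["g2", "g3", "g4"],
}
feature_sets = build_feature_sets(groups, recipes)
print(feature_sets["c3"])  # ['age', 'sex', 'drug_count', 'lab_count']
```

Each working set can then be used to train a separate model, which is what provides the variety that the ensembling step relies on.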

2.3 Predictive modeling

Different machine learning algorithms were used; these had to meet the following criteria:

* Able to generate a continuous prediction
* No correlation with the other selected algorithms
* Able to deal with the distinct distribution of the class DaysInHospital
* Acceptable computational cost

A comprehensive empirical comparison of supervised learning algorithms was made in 2006 by
Caruana and Niculescu-Mizil [3]. With the results of this study and the previously mentioned
criteria in mind, the backward propagation (backpropagation), random forest and AdaBoost
algorithms were selected.

The first algorithm, backpropagation [8], is a neural network. Six solution sets were generated
using the available predictor subsets. Before generating these solutions, the structure of the
neural net needs to be defined, e.g. the number of hidden layers and the number of neurons in
these layers. One hidden layer is enough to model a non-linear relation between input and
output [4]. There is no mathematical framework available that determines the ideal number of
hidden neurons. A few guidelines do exist, however [7], and the number of hidden neurons was set
to one third of the number of input neurons for each model.

Random forest (RF) was selected as the second algorithm, and can be seen as an ensemble of
decision trees [9]. This algorithm is an extension of bagging [2], which adds extra variance to
the generated models. At each node in a decision tree, a different subset of the available
predictors is selected to choose from. The random forest algorithm generates a number of these
decision trees, a process that can run in parallel, and averages their predicted outcomes,
giving a weighted final prediction value. Random forest is not affected by redundant or
irrelevant features, so feature selection is not an issue. Considering this, three solution sets
were generated.

The last algorithm, Generalized Boosted Machines (GBM), is an extension of the AdaBoost
algorithm [5]. Through gradient boosting the algorithm is able to predict continuous values, in
contrast to the AdaBoost algorithm, which is limited to classification [6]. Boosting is a
meta-algorithm that combines the results predicted by iteratively generated weak learners to
achieve an accurate prediction. In each iteration the predicted value is recalculated, and
instances that were mislabelled in the previous iteration gain a bigger weight in the instance
set. This increases the chance that these instances will be correctly labelled in the upcoming
iteration. The final solution is a weighted average of the individual values predicted by the
weak learners. A total of six solution sets were generated with the Generalized Boosted Machines
algorithm.

2.4 Model ensembling

By making use of the uniform instance set, the six different predictor sets and the three
selected machine learning algorithms, a total of fifteen solution sets were generated. Model
ensembling relies on the assumption that different machine learning algorithms can achieve good
scores individually on parts of the instance set within the complete instance set. By combining
these results, an overall better solution set is created. An important prerequisite is a level
of variance in the different solution sets. This can be realised by using different machine
learning algorithms, different predictor subsets, different parameter settings and different
training sets.
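
The text does not state how the variance between solution sets was judged. One common way to gauge it, shown here only as an assumption and not as the author's method, is the Pearson correlation between two prediction vectors: a value close to 1 suggests that two solution sets carry almost the same information, so averaging them is unlikely to gain much. The function name and the sample predictions are made up.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Made-up predictions for four members from two hypothetical models.
model_a = [0.1, 0.4, 0.2, 1.5]
model_b = [0.2, 0.8, 0.4, 3.0]  # exactly twice model_a, so it adds no new information
print(round(pearson_correlation(model_a, model_b), 6))  # 1.0
```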

The solution sets were combined through a weighted average technique in which every solution
set receives the same weight:

For i = 1, ..., count(MemberId):

\[
\mathrm{MemberId}_i = \frac{1}{N} \sum_{j=1}^{N} \mathrm{DaysInHospital}_{ij} \qquad (2)
\]

where N is the number of available solution sets and DaysInHospital_ij is the prediction of
solution set j for member i.

3 Results

The individual RMSLE scores of the machine learning algorithms can be found in table 3. A total
of fifteen solution sets was created, of which three were discarded because they achieved a
worse RMSLE score than the optimized constant value calculated for the competition leaderboard,
0.486459. The remaining solution sets were combined in different ways using equation (2); the
results can be found in table 4.

Solution set    RMSLE score
NN_AV              0.465928
NN_BS              0.489334
NN_C1              0.497286
NN_C2              0.497286
NN_C3              0.506054
NN_C4              0.468368
RF_AV              0.467816
RF_NS              0.471435
RF_BS              0.468368
GBM_AV             0.462672
GBM_BS             0.465419
GBM_C1             0.464147
GBM_C2             0.463414
GBM_C3             0.468777
GBM_C4             0.462398

Table 3: Individual RMSLE score

Solution set E1 is a combination of the six GBM models; E2 adds the three RF models to the
ensemble. E3 consists of GBM_C4, RF_AV and NN_AV, the best scoring models of each machine
learning algorithm used. E4 uses all twelve selected solution sets, and lastly E5 is a mixture
of all six GBM models with the best scoring RF and NN models.

Model name    RMSLE score
E1               0.462160
E2               0.462175
E3               0.463682
E4               0.461919
E5               0.461838

Table 4: Ensemble models RMSLE score

4 Discussion

The backpropagation algorithm shows great variance in its individual RMSLE scores. This was to
be expected, as this algorithm relies heavily on the available predictor input nodes: small
changes to this input have a great effect. Compared to the leaderboard, the best scoring neural
network solution set would achieve a top five hundred position among 1115 competitors. Our
second algorithm, random forest, shows less variance in the results and scored a top three
hundred position. GBM achieves the best average RMSLE score and supplies the best individual
model, GBM_C4. This model scores better than eighty per cent of the competitors.

Model ensembling greatly improves the RMSLE scores: all but one of the generated ensemble
solution sets scored better than any individual solution set. The best score here was achieved
by solution set E5, which earned position eighty-three on the leaderboard.
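
Equation (2) amounts to an unweighted mean, taken member by member, over the selected solution sets. The sketch below illustrates this with made-up predictions and scores the result with the RMSLE of equation (1); the function names `blend` and `rmsle` and all sample values are assumptions for illustration, and the article's own implementation was written in R.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, as in equation (1)."""
    total = sum((math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))


def blend(solution_sets):
    """Average N solution sets member by member, as in equation (2)."""
    if not solution_sets:
        raise ValueError("at least one solution set is required")
    member_count = len(solution_sets[0])
    if any(len(s) != member_count for s in solution_sets):
        raise ValueError("all solution sets must cover the same members")
    # zip(*...) groups the predictions that belong to the same member.
    return [sum(values) / len(solution_sets) for values in zip(*solution_sets)]


# Made-up data: one model overestimates, the other underestimates.
actual = [1.0, 3.0, 0.0]
too_high = [2.0, 4.0, 1.0]
too_low = [0.0, 2.0, 0.0]

blended = blend([too_high, too_low])
print(blended)  # [1.0, 3.0, 0.5]
print(rmsle(blended, actual) < min(rmsle(too_high, actual), rmsle(too_low, actual)))  # True
```

In this toy case the errors of the two models point in opposite directions, so the average lands closer to the truth than either model alone. Real solution sets only benefit to the extent that their errors differ.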

A possible explanation lies in the selection of models for this solution set: GBM was the best
scoring algorithm and its weight is seventy-five per cent in this solution set, as only the best
RF and NN models were added. Figure 1 shows the position of E5 compared to the other
competitors.

Figure 1: Leaderboard position, solution set E5

5 Conclusion and further work

A total of 1115 teams of competitors from all over the world entered the Heritage Health Prize
competition. Position eighty-three was achieved with solution set E5, scoring better than 93 per
cent of the competitors. Model ensembling proved to be a viable technique to improve the
individual model scores, and will probably be a necessity if one wants to win machine learning
competitions. Through model ensembling the time-consuming process of parameter fine-tuning can
be partially avoided. A disadvantage of the ensembling technique used is that the ideal mix
needs to be determined through trial and error. Blending techniques that assign a weight to each
individual solution set according to the RMSLE score obtained by predictions made for the
training set can avoid the trial and error process.

Further improvements are still possible. Another machine learning algorithm, such as support
vector machines, could be added to the blend. Generating more solution sets could further
improve this result.

6 Acknowledgements

The author would like to thank the Kaggle online community for their help and contribution on
various aspects. Furthermore, the milestone winner papers written by Willem Mestrom and the team
Market Makers were a great source [11]. The author further thanks ing. Hannes Catrysse for
reviewing this article.

References

[1] A.L. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245-271, 1997.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[3] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161-168. ACM, 2006.
[4] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303-314, 1989.
[5] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: International Workshop then Conference, pages 148-156. Morgan Kaufmann Publishers, Inc., 1996.
[6] J.H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.
[7] J. Heaton. Introduction to Neural Networks with Java. Heaton Research Inc, 2005.
[8] R. Hecht-Nielsen. Theory of the backpropagation neural network. In International Joint Conference on Neural Networks (IJCNN), pages 593-605. IEEE, 1989.
[9] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002.
[10] Heritage Provider Network. http://www.heritageprovidernetwork.com/.
[11] Milestone prize papers. http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1/.