
Introduction:

Data mining, also known as knowledge discovery in databases, has been used
extensively in many industries to improve customer satisfaction and to increase
product safety and usability (Durairaj & Ranjani, 2013). In healthcare, data
mining has proven effective in areas such as predictive medicine, customer
relationship management, detection of fraud and abuse, healthcare administration
and measuring the effectiveness of certain treatments. Big data in healthcare is
used for reducing cost overhead, treating diseases, improving earnings,
predicting epidemics and enhancing the quality of human life by preventing
deaths (Boytcheva, Angelova, Tcharaktchiev, & Angelov, 2011). The main data
mining tasks are association rule mining, pattern discovery, classification and
prediction, and clustering (Ilayaraja & Meyyappan, 2013). Many researchers have
proposed different algorithms to diagnose chronic diseases such as diabetes
mellitus (Concaro, Sacchi, Cerra, & Bellazzi, 2009) (Shetty, 2250) (Shukla &
Arora, 2016) and cancer (Khare & Gupta, 2016) (Dagliati et al., 2017). Early
prediction of diseases using data mining techniques is an important and
challenging task. The temporal data mining process is concerned with the
algorithms by which temporal patterns are generated and enumerated from
temporal data.

For successful diagnosis and prognosis we need a set of temporal healthcare
data, including lab tests and the treatments or medications given to patients,
which is not easily available. We therefore need to collect or generate that
data from a healthcare institute. From that data set we need to find frequent
sequences of treatments or symptoms in past data to improve the accuracy of
disease diagnosis or prognosis.

The main objective is to design a classifier that successfully classifies a
sequential pattern of treatments for disease diagnosis or prognosis. Many
classifiers, such as CAR, CMAR (Wenmin Li, Jiawei Han, & Jian Pei, n.d.), CPAR
and WCAR (Soni & Vyas, 2014), have already been proposed. To find the specific
sequence of treatments given to patients for successful diagnosis or prognosis
of a disease, we need to convert the time series data into sequence pattern data
using preprocessing tasks that include discretization or normalization of
patterns. We need to generate the patterns of treatments and assign priority to
the most desired sequences. After generating the sequence patterns, we need to
design a sequence mining algorithm that uses association rule mining to find the
correlation between the different treatments given to patients for diagnosis or
prognosis of disease. Some of the sequence pattern
mining algorithms already proposed are GSP (Srikant
& Agrawal, 1996), SPADE (Zaki,
2001)  and
et al., 2004). We therefore need to propose a sequence mining classifier that
successfully classifies the disease class and achieves higher accuracy than
previously proposed classifiers. Based on the output of the classification
model, medical doctors can make better decisions on the treatment to be applied
to the patient.
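
The discretization step mentioned above can be illustrated with a minimal sketch. The function name, the three-symbol alphabet and the glucose thresholds are assumptions made only for illustration; they are not taken from any of the cited works.

```python
def discretize(series, low, high):
    """Map each numeric reading to 'L' (below low), 'N' (in range) or 'H' (above high)."""
    symbols = []
    for value in series:
        if value < low:
            symbols.append("L")
        elif value > high:
            symbols.append("H")
        else:
            symbols.append("N")
    return "".join(symbols)

# Hypothetical fasting glucose readings (mg/dL) for one patient over five visits.
glucose = [85, 132, 150, 98, 65]
pattern = discretize(glucose, low=70, high=125)
print(pattern)  # NHHNL
```

The resulting symbol string is the kind of sequence pattern data on which a sequence mining algorithm could then operate.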


Brief review of the work already done in the field:

(Kayaer & Yildirim, 2003)
investigated the general regression neural network (GRNN) for diagnosing
diabetes from the Pima Indian diabetes data set. GRNN approximates any arbitrary
function between input and output vectors, drawing the function estimate
directly from the training data, and does not require an iterative training
process. It uses the kernel regression method to produce the most probable value
of the dependent variable, which minimizes the mean squared error. GRNN is a
method for estimating the joint probability density function of the dependent
and independent variables given only a training data set. Three neural network
structures, the multilayer perceptron, radial basis function and GRNN, were
applied to the Pima Indians Diabetes data set, and the results show that GRNN
can be an option for classifying medical data.
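
The kernel regression idea behind GRNN can be sketched as a kernel-weighted average of the training targets. The following is a minimal illustration, assuming a Gaussian kernel with a single smoothing parameter `sigma`; the function name and the toy data are hypothetical and do not reproduce the cited study.

```python
import math

def grnn_predict(train_x, train_y, query, sigma=1.0):
    """Return the Gaussian-kernel-weighted average of the training targets."""
    weights = []
    for x in train_x:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-sq_dist / (2 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # All kernels underflowed; fall back to the plain mean of the targets.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Toy one-dimensional training set (illustrative values only).
train_x = [(0.0,), (1.0,), (2.0,)]
train_y = [0.0, 1.0, 0.0]
pred = grnn_predict(train_x, train_y, (1.0,), sigma=0.5)
```

No iterative training takes place: the training samples are stored and used directly at prediction time, and a smaller `sigma` makes the estimate follow the nearest samples more closely.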


(Exarchos, Papaloukas, Fotiadis, & Michalis, 2006)
used association rule based classification to detect ischemic beats in
long-duration ECGs. Their methodology was implemented in three stages: ECG
feature extraction; feature discretization; and rule generation and beat
classification. All stages of the methodology were executed automatically.
Equal-depth binning and a customized classification tree algorithm (CT-disc)
were tested for discretization, whereas the CBA, CMAR, CPAR and Apriori-TFPC
algorithms were tested for classification using association rules. Their
methodology gave high accuracy, combined with a justification for the
classification decisions.
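
Equal-depth (equal-frequency) binning can be sketched as follows. This is a minimal illustration in which the function name and the feature values are assumptions; it is not the implementation used in the cited study.

```python
def equal_depth_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

# Hypothetical ECG feature values for six beats.
feature = [0.1, 0.9, 0.4, 0.2, 0.8, 0.5]
print(equal_depth_bins(feature, 3))  # [0, 2, 1, 0, 2, 1]
```

Because bins are assigned by rank, equal values can fall into different bins in this simple version; a production discretizer would need a rule for ties.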


(Berlingerio, Bonchi, Giannotti, & Turini, 2007)
applied a Time-Annotated Sequences (TAS) mining approach to data concerning a
set of patients in the follow-up of a liver transplantation. Their aim was to
assess the effectiveness of extracorporeal photopheresis (ECP) as a treatment to
prevent rejection in solid organ transplantation. A set of biochemical variables
is recorded for every patient at different points in time after the
transplantation. The extracted TAS patterns show the values of interleukins and
other clinical parameters at specific dates, from which the physician can assess
the effectiveness of the ECP treatment.


(Dua, Singh, & Thompson, 2009)
proposed a technique for classifying mammograms using a weighted association
rule based classifier in which all images are preprocessed to reveal regions of
interest. Texture components are extracted from segmented parts of the image and
discretized for rule discovery. Association rules are derived between the
texture components discovered from the image segments and applied for
classification based on their intra-class and inter-class dependencies. The
discovered rules are then applied to the classification of a commonly used
mammography dataset, and rigorous testing is performed to assess the rules'
effectiveness under different classification scenarios. Their results showed
that the technique worked well for such datasets, with accuracies of about 89%,
which is higher than the accuracy rates of other rule based classification
techniques.


(Tai & Chiu, 2009)
applied association rule mining (ARM) using the Apriori algorithm to examine the
complex network of ADHD comorbidity, and to assess the practicality of ARM in
comorbidity studies based on clinical databases. The support and confidence
values of the ARM outcomes were examined and compared with the comorbidity rates
and relative risk (RR) ratios of both groups for each diagnosis. The results of
applying ARM were that developmental delay (DD) appears as a significant node
between ADHD and anxiety disorder, mild mental retardation and


(Gharib, Nassar, Taha, & Abraham, 2010)
described the notion of temporal association rules in order to handle time
series by incorporating time expressions into association rules. Because
temporal databases are continually appended or updated, the discovered rules
need to be updated as well. The experimental results on synthetic and real
datasets show considerably better accuracy than the conventional approach of
mining the whole database. They employed the incremental procedure of the
Sliding-Window Filtering (SWF) algorithm, and the results are compared with the
Twain algorithm across different parameters such as run time, minimum support,
original database size and incremental database


(Boytcheva et al., 2011)
presented a technique for temporal event matrix representation and a learning
framework that yields complex latent event patterns or diabetes mellitus
complications. They discussed mining parallel episodes, tracking serial
extensions and learning partial orders. The main objective of their research is
to observe the comorbidity of diseases and their association with different


(Al Jarullah, 2011)
used a decision tree approach for the diagnosis of Type II diabetes from the
Pima Indian data set in a two-phase process. In the first phase, data
preprocessing, handling of missing values and numerical discretization are
performed; in the second phase, the Weka data mining tool is used to construct a
decision tree model for diabetes prediction. Numerical discretization was
applied to reduce the complexity of the problem and to achieve better accuracy
before learning. The J48 algorithm was used to construct the decision tree.


(Kaneiwa & Kudo, 2011)
proposed a technique for mining local patterns from sequences using rough set
theory. They described an algorithm for creating decision rules that take into
consideration local patterns for arriving at a particular decision. To apply
rough set theory to sequential data, the local pattern size was specified so
that a set of sequences could be transformed into a sequential information
system. The proposed algorithm generates sequential decision rules according to
the size of subsequences, varying the size from 2 to a maximum number so as to
check different granularities of sequential data.


(Shouman, Turner, & Stocker, 2012)
proposed a research model that uses single data mining techniques along with
hybrid data mining techniques for the treatment of heart disease. They apply
single and hybrid techniques to a heart disease diagnosis benchmark dataset to
establish a baseline accuracy for each technique in identifying heart disease
patients. They then apply the same techniques to a heart disease treatment
dataset to examine whether single or hybrid techniques can achieve results in
identifying appropriate treatments comparable to those achieved in diagnosis.


(Sharma, 2013)
gave statistical figures on the percentages of proportional mortality in India
from communicable and non-communicable diseases (NCDs). She stated that the
majority of NCD deaths in India were attributed to cardiovascular diseases,
cancers, chronic respiratory diseases and diabetes. She also presented graphs
describing the percentage distribution of deaths due to NCDs in India, listed
NCD risk factors and determinants, and concluded that there is a need to
emphasize health promotion and preventive measures to reduce exposure to risk
factors.


(Songthung & Sripanidkulchai, 2016)
applied a classification technique to a diabetes dataset from 12 hospitals in
Thailand collected during 2011-2012, covering females aged 15 years or older.
They used RapidMiner Studio 7.0 with naïve Bayes and CHAID decision tree
classifiers to predict high risk for diabetes mellitus, and then compared those
results with a hand-computed mechanism for calculating diabetes risk. They
proposed a framework for diabetes prediction whose phases are data acquisition,
feature preprocessing, experiments and evaluation metrics. They used data from
11 hospitals as the training dataset and 1 hospital as the testing dataset, and
classified individuals into two classes, high risk and low risk. They used
10-fold cross validation for training and then applied the model to the testing
set. They concluded that naïve Bayes provides better results for predicting
diabetes risk.


(Khare & Gupta, 2016)
presented an exploration of association rule mining for analyzing the heart
disease dataset from the UCI repository and finding the factors that affect
heart disease. The proposed approach had two phases, a training phase and a
testing phase. They used the Apriori association rule mining algorithm to
generate the rules. Two types of rules were explored: attribute => class, for
causative factors of heart disease, and attribute => attribute, for associative
relations among the attributes of heart disease. The rules were generated with a
10% minimum support threshold and an 85% minimum confidence threshold. The aim
was to study only the factors that contribute to heart disease or that are
significant in its diagnosis.
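
The support and confidence thresholds mentioned above can be made concrete with a small sketch. The records and attribute names below are invented for illustration and are not drawn from the UCI heart disease dataset.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0

# Hypothetical patient records, each a set of attribute values.
records = [
    {"high_bp", "smoker", "disease"},
    {"high_bp", "disease"},
    {"smoker"},
    {"high_bp", "smoker", "disease"},
    {"high_bp"},
]
s = support(records, {"high_bp", "disease"})       # 3 of 5 records
c = confidence(records, {"high_bp"}, {"disease"})  # 3 of the 4 high_bp records
keep_rule = s >= 0.10 and c >= 0.85
```

In this toy data the rule high_bp => disease passes the 10% support threshold but not the 85% confidence threshold, so it would be discarded.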


(Shukla & Arora, 2016)
utilized a random forest as a base learner together with a separate data mining
method, scaled conjugate gradient (SCG), to distinguish patients with diabetes
mellitus using diabetes risk variables. After computing the accuracy,
sensitivity and specificity on the Pima Indians data set, the random forest
achieved an accuracy of 92.96%. They also found that the prediction could be
improved by increasing the number of folds in k-fold cross validation, but that
doing so increases the error and makes the accuracy vary, so this was tried out
but not adopted. They compared the performance of random forest and SCG and
concluded that random forest gave the better classification results.
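
Accuracy, sensitivity and specificity are all derived from the confusion matrix of a binary classifier. The sketch below shows the standard definitions; the label vectors are made up for illustration and do not come from the Pima Indians data set.

```python
def confusion_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for 0/1 labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }

# Hypothetical ground truth and predictions for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = confusion_metrics(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy matters in medical data, where the classes are often imbalanced and a missed diagnosis is costlier than a false alarm.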


(Srivastava, Kumar, & Mangla, 2016)
analyzed a diabetes data set using Hive and R and generated graphs for analysis
of the dataset. They used regression for predictive analysis of diabetes
treatment. They applied different classification algorithms to the dataset and
concluded that C4.5 is the best classification algorithm. They used the Gini
index to measure discrimination among the data and also used the k-nearest
neighbor technique to build a prediction model for diagnosis


(Sharmila & Vetha Manickam, 2016)
proposed a MapReduce framework using Apache Hadoop to implement the k-means
clustering technique, dividing the Pima Indian Diabetes data set into 2
clusters: cluster1 for diabetic and cluster2 for non-diabetic. The evaluation
results clearly demonstrate that parallelism with a Hadoop cluster is the most
flexible and most efficient approach for large-scale data mining as the data set
size increases.


(Heydari, Teimouri, Heshmati, & Alavinia, 2016)
compared five classification algorithms, namely support vector machine,
artificial neural network, decision tree, nearest neighbor and Bayesian network,
in an attempt to find the best algorithm for diagnosing type 2 diabetes. The
accuracy rates of these algorithms, computed using the Weka open source tool,
were 97.44%, 81.19%, 95.03%, 90.85% and 91.60% respectively. The results show
that the usefulness of these classification techniques depends on the
application and on the nature and complexity of the dataset used. The dataset
used here covers 2536 cases screened for type 2 diabetes in the city of Tabriz,
Iran. They concluded that it is not possible to declare that one classification
technique will always work best, and that the opinion of medical experts is
always required for the best possible results.


Contribution in the field of proposed work:

(Soni & Vyas, 2014)
proposed a generalized framework for sequence mining of a healthcare database
consisting of temporal tuples. Each tuple holds basic patient information, i.e.
Patient_ID, age and gender, a series of sequences representing the treatments
given to the patient, and a class label indicating whether the patient was
cured. The steps of their method were as follows: representation and modeling,
in which the sequences are represented in numeric form so that the Apriori
algorithm can be applied; similarity measurement using Euclidean distance;
weighting of attributes using maximum likelihood estimation; association rule
mining to generate class association rules; and prediction of the sequence of
treatments using high confidence values. Class association rules (CAR) whose
consequent is the positive class label represent the frequent sequences of
successful treatments given to patients.


(Wright, Wright, McCoy, & Sittig, 2015)
used the CSPADE algorithm to extract sequential patterns of diabetes medication
prescriptions at both the drug class and generic drug levels of granularity,
exploring temporal relationships between medications and producing rules that
forecast which diabetes medication is prescribed next for a patient. The
preprocessing step transformed the horizontally laid out dataset into vertical
id-lists, one per item, consisting of all the sequence-ids and transaction-times
at which the item was found. This allows sequential patterns to be found by
intersecting id-lists and minimizes the number of database scans required. In
the generated IF-THEN rules, the antecedent contains the patient's prior
medication history and the consequent contains the next drug; the rules were
ranked by support and the top 5 were used for testing. To predict a patient's
next drug, they built test data from the patient's history. If no drug was
predicted for a particular sequence, they shortened the antecedent from the
beginning, that is, removed the first medication in the sequence, and generated
the rules again with 3 suggestions for the next drug in the consequent. They
used 10-fold cross validation to assess the rules. The reported limitations of
the method are that the dataset contains some patients who started other
medication prior to the study, which is not included; that the dataset covers 3
years, while progression of the disease could take longer; and that the dataset
contains only claims data in which all patients were insured, so the results
might not be relevant to uninsured populations.
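
The horizontal-to-vertical transformation described above can be illustrated with a simplified sketch. It builds one id-list per item and counts the sequences in which one item precedes another; it is not the CSPADE implementation used in the study, and the patient identifiers, times and prescriptions are invented.

```python
from collections import defaultdict

def to_id_lists(sequences):
    """Turn {sid: [(time, item), ...]} into vertical {item: [(sid, time), ...]}."""
    id_lists = defaultdict(list)
    for sid, events in sequences.items():
        for time, item in events:
            id_lists[item].append((sid, time))
    return id_lists

def support_of_pair(id_lists, first, second):
    """Count sequences in which `first` occurs strictly before `second`."""
    earliest = {}
    for sid, time in id_lists[first]:
        earliest[sid] = min(time, earliest.get(sid, time))
    return len({sid for sid, time in id_lists[second]
                if sid in earliest and time > earliest[sid]})

# Hypothetical prescription histories keyed by patient id.
prescriptions = {
    "p1": [(1, "metformin"), (5, "sulfonylurea")],
    "p2": [(2, "metformin"), (9, "insulin")],
    "p3": [(3, "sulfonylurea"), (4, "metformin")],
}
lists = to_id_lists(prescriptions)
n = support_of_pair(lists, "metformin", "sulfonylurea")
```

Once the id-lists exist, the support of a two-item pattern is obtained by joining two lists on the sequence id, without rescanning the original database.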


(Pazhanikumar & Arumugaperumal, 2016)
proposed a mining algorithm based on non-redundant closed weighted sequential
patterns with flexible time intervals for medical time series data. The weight
of each sequence is calculated from the time interval between its itemsets,
candidate sequences are then generated with flexible time intervals, and the
frequent sequential patterns are computed with the proposed support measure. The
frequent sequential patterns are then passed to a closure checking process,
which generates the closed sequential patterns with flexible time intervals.
They concluded that, in future, the methodology could be extended to produce
non-redundant closed weighted sequential rules.


(Malhotra, Navathe, Chau, Hadjipanayis, & Sun, 2016)
proposed sequential mining algorithms with two clinical constraints,
‘exact-order’ and ‘temporal overlap’, to extract treatment patterns as features
for predictive modeling. They applied both a logistic regression model and Cox
regression to model patient survival outcome. The goal of their study was to
predict effectively which patients survive for more than 12 months. To mine such
treatment plans they extended existing approaches such as GSP and SPADE with the
two new constraints. They developed a treatment advisor tool that recommends
treatments for a patient based on treatments given to patients with a similar
clinical and genomic profile.


(Tóth, Kósa, & Vathy-Fogarassy, 2017)
created a hierarchical data analysis technique to produce healthcare sequences
and sequential patterns from medical databases. Healthcare treatments are
considered as events, and SQL queries are applied to generate event sequences,
which are represented at different levels. Event sequences are created using a
hierarchical code system for treatments and a set of aggregation rules that
reflect professional medical practice. Their method supports comparison of the
specificity of treatments across healthcare institutes and investigation of
common or uncommon treatment sequences for given diseases. At the lowest level,
event sequences hold all treatments for the entire examination period, indicated
by their International Classification of Procedures in Medicine (ICPM) code.
When more than one event has the same timestamp, only the highest-priority event
is taken. Finally, they replaced multi-character codes with one-character codes
and dropped the timestamps from the event sequences to obtain a generalized care
sequence for each patient. From their analysis they concluded that the number of
patients treated shows only a very weak relationship with the frequency of the
different event sequences.
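
The reduction of timestamped events to a generalized care sequence can be sketched as below. The event codes, priorities and one-character symbols are hypothetical placeholders, not actual ICPM codes, and the original work performs these steps with SQL queries rather than Python.

```python
def care_sequence(events, priority, short_code):
    """Keep the highest-priority code per timestamp, then drop the timestamps
    and map each remaining code to its one-character symbol."""
    best = {}
    for ts, code in events:
        if ts not in best or priority[code] > priority[best[ts]]:
            best[ts] = code
    return "".join(short_code[best[ts]] for ts in sorted(best))

# Hypothetical events for one patient as (timestamp, treatment code) pairs.
events = [(1, "EXAM"), (1, "SURG"), (2, "LAB"), (3, "EXAM")]
priority = {"SURG": 3, "LAB": 2, "EXAM": 1}
short_code = {"SURG": "S", "LAB": "L", "EXAM": "E"}
seq = care_sequence(events, priority, short_code)
print(seq)  # SLE
```

Strings of this form can be compared and counted across patients to find common or uncommon treatment sequences.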


Methodology during the tenure of research work:

During the tenure of the research work the following steps will be carried out:

(i)      Collect data on patient tests and treatments from hospitals; if data
cannot be obtained from hospitals, datasets can be downloaded from the UCI
Machine Learning Repository or PhysioNet.

(ii)     Generate temporal sequence patterns from the temporal data set.

(iii)    Calculate the support measure and assign weights to the generated
sequences.

(iv)    Find the frequent sequences after applying sequence mining algorithms
like GSP,

(v)     Generate class association rules, using the class labels in the
sequences as consequents, to predict the presence of disease.

(vi)    Check the performance of the classifier on test data.
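
Steps (iv) and (v) can be illustrated with a deliberately simplified sketch. It counts only length-2 ordered subsequences instead of running a full sequence mining algorithm such as GSP, and it omits the sequence weights of step (iii); the function names, treatment symbols and labels are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def ordered_pairs(seq):
    """All distinct length-2 subsequences of seq, with order preserved."""
    return set(combinations(seq, 2))

def class_rules(data, min_support, min_confidence):
    """data is a list of (sequence, label). Returns {(pattern, label): confidence}."""
    pattern_count = Counter()
    joint_count = Counter()
    for seq, label in data:
        for pat in ordered_pairs(seq):
            pattern_count[pat] += 1
            joint_count[(pat, label)] += 1
    rules = {}
    for (pat, label), n in joint_count.items():
        conf = n / pattern_count[pat]
        if n / len(data) >= min_support and conf >= min_confidence:
            rules[(pat, label)] = conf
    return rules

# Hypothetical treatment sequences with class labels.
data = [
    (["A", "B", "C"], "diseased"),
    (["A", "C"], "diseased"),
    (["B", "A"], "healthy"),
    (["A", "B"], "healthy"),
]
rules = class_rules(data, min_support=0.5, min_confidence=1.0)
```

With these thresholds only the pattern A followed by C, with the consequent "diseased", survives. A classifier for step (vi) would apply the retained rules to unseen sequences and compare the predictions with the true labels.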


The proposed algorithm will be compared with other available predictive
sequence mining systems, since a technique may work well on one dataset but not
on others. The resulting rules and relationships will be used to predict the
likelihood of chronic diseases. A new algorithm to generate positive and
negative association rules is proposed to be developed. The figure below
depicts the methodology: