It is essential for any research undertaking to review previous studies in the area of investigation and to summarize the trends in research practice and the direction of the findings. In this chapter, literature conceptually relevant to this study, together with related work, is reviewed.
2.1 The Foundations of Data Mining
Data mining techniques are the result of a long process of research and product development (18). This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery (18). As noted by Berson et al. (19), data mining became ready for application in the business community with the support of three technologies that are now sufficiently mature: massive data collection, powerful multiprocessor computers, and data mining algorithms. Of these three, data mining algorithms embody techniques that have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods (19). According to Thearling (18), in the evolution from business data to business information (Table 1), each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining.
Evolutionary Step | Enabling Technologies | Characteristics
Data Collection (1960s) | Computers, tapes, disks | Retrospective, static data delivery
Data Access (1980s) | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels
Data Mining (2000+) or (Emerging Today) | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery
Table 1: Steps in the Evolution of Data Mining. Source: Thearling (18)
2.2 What is Data Mining?
The objective of data mining is to identify valid, novel, potentially useful, and understandable correlations and patterns in existing data (14). Finding useful patterns in data is known by different names in different communities, including data mining, knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing (12). The term “data mining” is primarily used by statisticians, database researchers, and the MIS and business communities (12). Data mining is an extension of traditional data analysis and statistical approaches in that it incorporates analytical techniques drawn from a range of disciplines including, but not limited to, numerical analysis, pattern matching, and areas of artificial intelligence such as machine learning, neural networks, and genetic algorithms (11). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction. Data mining can be performed on data represented in quantitative, textual, or multimedia forms. Data mining applications can use a variety of parameters to examine the data (20). Such parameters, as noted by Two Crows Corporation (20), include association (patterns where one event is connected to another event, such as purchasing a pen and purchasing paper); sequence or path analysis (patterns where one event leads to another event, such as the birth of a child and purchasing diapers); classification (identification of new patterns, such as coincidences between duct tape purchases and plastic sheeting purchases); clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences); and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities, such as the prediction that people who join an athletic club may take exercise classes).
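The association parameter described above can be illustrated with a minimal sketch in Python. The baskets below are made-up data, and the function name `pair_counts` is an assumption introduced only for this illustration; the sketch simply counts how often two items appear in the same transaction, which is the raw material for an association pattern such as the pen and paper example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (made-up data, for illustration only).
baskets = [
    {"pen", "paper"},
    {"pen", "paper", "ink"},
    {"pen", "stapler"},
    {"paper", "ink"},
    {"pen", "paper"},
]

def pair_counts(transactions):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair one canonical order, e.g. ("paper", "pen").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
pen_and_paper = counts[("paper", "pen")]
pen_total = sum(1 for basket in baskets if "pen" in basket)
share = pen_and_paper / pen_total

print(f"pen and paper together: {pen_and_paper} of {len(baskets)} baskets")
print(f"share of pen baskets that also contain paper: {share:.2f}")
```

On real data the same counts would normally be compared against thresholds before a pattern is reported; that step is left out of this sketch.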
While many data mining tasks follow a traditional, hypothesis-driven data analysis approach, it is commonplace to employ an opportunistic, data-driven approach that encourages the pattern detection algorithms to find useful trends, patterns, and relationships. In this regard, Daniel (13) wrote that, compared to other data analysis applications, data mining uses a discovery approach, in which algorithms can be used to examine several multidimensional data relationships simultaneously, identifying those that are unique or frequently represented. In contrast, many simpler analytical tools or statistical analysis software use a verification-based approach, where the user develops a hypothesis and then tests the data to prove or disprove it (13).
A number of advances in technology and business processes have contributed to a growing interest in data mining in both the public and private sectors. These changes include the growth of computer networks, which can be used to connect databases; the development of enhanced search-related techniques such as neural networks and advanced algorithms; the spread of the client/server computing model, allowing users to access centralized data resources from the desktop; and an increased ability to combine data from disparate sources into a single searchable source (20). In addition to these improved data management tools, the increased availability of information and the decreasing cost of storing it have also played a role. Over the past several years there has been a rapid increase in the volume of information collected and stored, with some observers suggesting that the quantity of the world’s data approximately doubles every year (20). At the same time, the cost of data storage has decreased significantly, from dollars per megabyte to pennies per megabyte. Similarly, computing power has continued to double every 18-24 months, while the relative cost of computing power has continued to decrease (20).
Reflecting this conceptualization of data mining, various definitions of, and synonyms for, data mining have emerged in recent years. Taken literally, data mining means mining or digging in data with the purpose of finding useful information or knowledge. In the more abstract and well-known definition of Moxon (21), data mining is “the non-trivial extraction of implicit, previously unknown, and potentially useful information from data”. Groth (22) mentions another interesting aspect of data mining, describing it as “a knowledge discovery process of extracting previously unknown, actionable information from very large datasets”. In the words of Witten and Frank (23): “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data, using pattern recognition technologies as well as statistical and mathematical techniques”. As pointed out by Witten and Frank (23), the idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Apart from these definitions, many researchers and practitioners treat data mining as a synonym for another popular term, Knowledge Discovery in Databases (KDD), while others view data mining as simply an essential step in the larger knowledge discovery process (12). In the latter view, the steps of the KDD process, in progressive order, are data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation (14).
2.3 Uses of Data Mining
Data mining tools can predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analysis offered by data mining moves far beyond the analysis of past events provided by the retrospective tools typical of decision support systems (18). Data mining tools can give answers to business questions that traditionally were time consuming to resolve (18). Today, data mining is primarily used by companies in sectors such as banking, insurance, medicine, retailing, and health care to reduce costs, enhance research, and increase sales. The purposes for which data mining is used are as varied as the companies themselves (12). According to David (12), the following are a few areas in which information industries use data mining to achieve a strategic benefit.
The insurance and banking industries use data mining applications to detect fraud and assist in risk assessment (e.g., credit scoring). Using customer data collected over several years, companies can develop models that predict whether a customer is a good credit risk, or whether an accident claim may be fraudulent and should be investigated more closely.
The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases.
Retailers can use information collected through affinity programs (e.g., shoppers’ club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions and of coupon offers, and to identify which products are often purchased together.
Companies such as telephone service providers and music clubs can use data mining to create a “churn analysis” to assess which customers are likely to remain as subscribers and which ones are likely to switch to a competitor.
Several factors have motivated the use of data mining applications in healthcare (24). Among the factors identified by Milley (24) are the following. First, the existence of medical insurance fraud and abuse has led many healthcare insurers to attempt to reduce their losses by using data mining tools to help them find and track offenders. Second, the huge amounts of data generated by healthcare transactions are too complex and voluminous to be processed and analyzed by traditional methods. Data mining can improve decision-making by discovering patterns and trends in large amounts of complex data. Such analysis has become increasingly essential as financial pressures have heightened the need for healthcare organizations to make decisions based on the analysis of clinical and financial data. Third, there is a growing realization that data mining can generate information that is very useful to all parties involved in the healthcare industry. For example, data mining applications can help healthcare insurers detect fraud and abuse, and can assist healthcare providers in making decisions, for example in customer relationship management. Data mining applications can also benefit healthcare providers (such as hospitals, clinics and physicians) and patients, for example by identifying effective treatments and best practices. Recently, data mining has been increasingly cited as an important tool for health care organizations. Accordingly, Eapen (25) observed that “data mining techniques are being increasingly implemented in healthcare sectors in order to improve work efficiency and enhance quality of decision making process”.
In addition, the use of data mining in healthcare, as again suggested by Eapen (25), enables comprehensive management of medical knowledge and its secure exchange between recipients and providers of healthcare services; the elimination of manual tasks such as extracting data from charts or filling in specialized questionnaires; the extraction of data directly from electronic records; the transfer of medical records over secure electronic systems, which can save lives and reduce the cost of health care; and the early detection of infectious diseases through advanced collection of data.
2.4 Data Mining and Statistics
The disciplines of statistics and data mining both aim to discover structure in data. Their aims overlap so much that some people regard data mining as a subset of statistics. That is not a realistic assessment, however, as data mining also makes use of ideas, tools, and methods from other areas, particularly database technology and machine learning, and is not heavily concerned with some areas in which statisticians are interested (26). Statistical procedures do, however, play a major role in data mining, particularly in the processes of developing and assessing models. Most learning algorithms use statistical tests when constructing rules or trees and also for correcting models that are overfitted. Statistical tests are also used to validate machine learning models and to evaluate machine learning algorithms (26). Some of the commonly used statistical analysis techniques, as described by Johnson and Wichern (27), are discussed below.
Descriptive and Visualization Techniques include simple descriptive statistics such as averages and measures of variation, counts and percentages, and cross-tabs and simple correlations.
Cluster Analysis seeks to organize information about variables so that relatively homogeneous groups, or “clusters,” can be formed.
Correlation Analysis measures the relationship between two variables. The resulting correlation coefficient shows whether changes in one variable are associated with changes in the other. When examining the correlation between two variables, the goal is to see whether a change in the independent variable is accompanied by a change in the dependent variable.
Discriminant Analysis is used to predict membership in two or more mutually exclusive groups from a set of predictors, when there is no natural ordering on the groups.
Discriminant analysis can be seen as the inverse of a one-way multivariate analysis of variance (MANOVA) in that the levels of the independent variable (or factor) for MANOVA become the categories of the dependent variable for discriminant analysis, and the dependent variables of the MANOVA become the predictors for discriminant analysis.
Factor Analysis is useful for understanding the underlying reasons for the correlations among a group of variables. The main applications of factor analytic techniques are to reduce the number of variables and to detect structure in the relationships among variables, that is, to classify variables. Therefore, factor analysis can be applied as a data reduction or structure detection method.
Regression Analysis is a statistical tool that uses the relation between two or more quantitative variables so that one variable (the dependent variable) can be predicted from the other(s) (the independent variables). Regression analysis comes in many forms, including simple linear, multiple linear, curvilinear, and multiple curvilinear regression models, as well as logistic regression, which is discussed next.
Logistic Regression is used when the response variable is a binary or qualitative outcome. It uses a maximum likelihood method, that is, it chooses the regression coefficients that maximize the probability of obtaining the observed results. The more common forms of logistic regression include simple, multiple, polytomous and Poisson logistic regression models.
2.5 Methodologies for Data Mining
As data mining has come of age, several methodologies have been developed, each with its own perspective (11). The broadly used methodologies in data mining, as described by Daniel (13), are KDD (Knowledge Discovery in Databases), CRISP-DM (Cross-Industry Standard Process for Data Mining), and SEMMA (Sample, Explore, Modify, Model, Assess).
Although KDD, SEMMA, and CRISP-DM are usually referred to as methodologies, they are also referred to as processes, in the sense that they consist of a particular course of action intended to achieve a result (13). The three are reviewed below.
2.5.1 KDD Methodology
The term knowledge discovery in databases, or KDD for short, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the “high-level” application of particular data mining techniques (20). Kantardzic (11) defined the KDD process as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Unlike other methodologies, the KDD process is categorized as an academic model, which is considered to be more applicable to academic research work. The approach to gaining knowledge from a set of data was separated by Fayyad (28) into individual steps. The separation into steps reflects the different tools used and the different outcomes needed at each stage. According to Fayyad (28) there are five KDD steps: Selection, Pre-processing, Transformation, Data Mining, and Interpretation/Evaluation. These five steps are passed through iteratively. Every step can be seen as a work-through phase. Such a phase requires the supervision of a user and can lead to multiple results. The best of these results is used for the next iteration; the others should be documented. The outline of the steps in Figure 1 is adequate for understanding the concepts required for the KDD process.
Figure 1: The KDD Process. Source: Fayyad (28)
In the following, the steps are briefly described.
STEP 1: SELECTION
In the selection step, the significant data is selected or created. From this point on, the KDD process operates on the gathered target data.
Only relevant information is selected, together with metadata or data that represents background knowledge, for instance in a case involving millions of data points.
STEP 2: PRE-PROCESSING
A good result from data mining depends on appropriate data preparation at the beginning. Important elements of the provided data have to be detected and filtered out. These matters are settled in the pre-processing phase. To detect knowledge effectively, the main task is to pre-process the data properly, not merely to apply data mining tools. The less noise the data contains, the higher the efficiency of data mining. As noted by Fayyad (28), elements of pre-processing include the cleaning of wrong data, the treatment of missing values, and the creation of new attributes.
STEP 3: TRANSFORMATION
Transformation of data in this step can be defined as decreasing the dimensionality of the data that is sent for data mining. There are often cases where the database holds a high number of attributes for a particular case. Reducing the dimensionality increases the efficiency of the data mining step with respect to accuracy and time. The transformation phase may result in a number of different data formats, since different data mining tools may require different formats. The data is also reduced, manually or automatically. The reduction can be made via lossless aggregation or via a lossy selection of only the most important elements. A representative selection can be used to draw conclusions about the entire data set.
STEP 4: DATA MINING
The data mining step is the major step in KDD. In this phase, the data mining task itself is carried out. This is when the cleaned and pre-processed data is sent into the intelligent algorithms for classification, clustering, similarity search within the data, and so on. Here the algorithms that are suitable for discovering patterns in the data are chosen.
Some algorithms provide better accuracy in terms of knowledge discovery than others. Thus, selecting the right algorithms can be crucial at this point.
STEP 5: INTERPRETATION/EVALUATION
The interpretation of the detected patterns reveals whether or not they are interesting, that is, whether they contain knowledge at all. This is why this step is also called evaluation. The task is to represent the result in an appropriate way so that it can be examined thoroughly. If the located pattern is not interesting, the cause has to be found, and it will probably be necessary to fall back on a previous step for another attempt. In this step, the mined data is presented to the end user in a human-viewable format. This involves data visualization, through which the user interprets and understands the knowledge discovered by the algorithms.
The KDD process is interactive and iterative, involving numerous steps with many decisions being made by the user. Additionally, the KDD process must be preceded by the development of an understanding of the application domain, the relevant prior knowledge and the goals of the end user. It must also be followed by knowledge consolidation, in which this knowledge is incorporated into the system (28).
2.5.2 CRISP-DM Methodology
CRISP-DM, which stands for CRoss-Industry Standard Process for Data Mining, has evolved to become the de facto industry standard. The CRISP-DM process was developed through the effort of a group of data mining companies, initially composed of DaimlerChrysler, SPSS and NCR, and is designed to provide a generic process model that can be specialized according to the needs of any particular company or industry (29). It consists of a cycle that comprises six stages, as illustrated in Figure 2:
Figure 2: Phases of the CRISP-DM Process Model. Source: Chapman et al. (29)
Phase 1: BUSINESS UNDERSTANDING
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
Phase 2: DATA UNDERSTANDING
The data understanding phase starts with an initial data collection and proceeds with activities to become familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Phase 3: DATA PREPARATION
The data preparation phase covers all activities to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
Phase 4: MODELING
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
Phase 5: EVALUATION
At this stage in the project, the model (or models) built appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model more thoroughly, and to review the steps executed to construct it, to be certain that it properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Phase 6: DEPLOYMENT
Creation of the model is generally not the end of the project.
Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the client can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the client, not the data analyst, who carries out the deployment steps. However, even if the analyst does not carry out the deployment effort, it is important for the client to understand up front what actions will need to be carried out to make use of the models created (29).
2.5.3 The SEMMA Analysis Cycle
The SAS Institute developed a data mining analysis cycle known by the acronym SEMMA. The acronym stands for the five steps of the analysis, namely Sample, Explore, Modify, Model, and Assess, and refers to the process of conducting a data mining project as shown in Figure 3. The SEMMA analysis cycle guides the analyst through the process of exploring the data using visual and statistical techniques, transforming data to uncover the most significant predictive variables, modeling the variables to predict outcomes, and assessing the model by testing it with new data.
Figure 3: The SEMMA Analysis Cycle. Source: SAS Enterprise (30)
Step 1: Sample
This step consists of sampling the data by extracting a portion of a large data set that is big enough to contain the significant information, yet small enough to manipulate quickly. This stage is regarded as optional.
Step 2: Explore
After sampling the data, the next step is to explore the data visually or numerically for trends or groupings. Exploration helps to refine the discovery process. Techniques such as factor analysis, correlation analysis and clustering are often used in the discovery process.
Step 3: Modify
This step consists of modifying the data by creating, selecting, and transforming the variables to focus the model selection process, or to modify the data for clarity or consistency.
Step 4: Model
This step consists of modeling the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.
Step 5: Assess
The last step is to assess the model to determine how well it performs. A common means of assessing a model is to set aside a portion of the data during the sampling stage. If the model is valid, it should work for both the reserved sample and for the sample that was used to develop the model (30).
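The Sample and Assess steps can be made concrete with a minimal sketch in Python. Everything in it is an assumption made for illustration: the synthetic population, the sample and holdout sizes, and the one-variable threshold model are not part of SEMMA or of any SAS tool. The sketch reserves part of the sample before modeling and then checks that the model performs similarly on the reserved portion and on the portion used to develop it.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of (x, label) records; label is usually 1 when x is large.
population = []
for _ in range(5000):
    x = rng.uniform(0.0, 10.0)
    label = 1 if x + rng.gauss(0.0, 1.0) > 5.0 else 0
    population.append((x, label))

# Sample: extract a manageable portion and reserve part of it for assessment.
sample = rng.sample(population, 500)
holdout, training = sample[:150], sample[150:]

# Explore: compare the mean of x within each class of the training portion.
def class_mean(records, label):
    values = [x for x, y in records if y == label]
    return sum(values) / len(values)

mean_0 = class_mean(training, 0)
mean_1 = class_mean(training, 1)

# Modify: omitted here, since the toy data needs no transformation.

# Model: a one-variable threshold placed midway between the two class means.
threshold = (mean_0 + mean_1) / 2.0

def predict(x):
    return 1 if x > threshold else 0

# Assess: the model should perform similarly on both portions.
def accuracy(records):
    hits = sum(1 for x, y in records if predict(x) == y)
    return hits / len(records)

train_accuracy = accuracy(training)
holdout_accuracy = accuracy(holdout)

print(f"threshold = {threshold:.2f}")
print(f"training accuracy = {train_accuracy:.3f}")
print(f"holdout accuracy  = {holdout_accuracy:.3f}")
```

A large gap between the two accuracies would suggest that the model has fitted peculiarities of the development sample rather than a pattern that holds in the data at large.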
2.6 Data Mining Tasks
There are different kinds of data mining functionalities (tasks) that can be used to extract knowledge from large databases. These can be classified into two main categories, “descriptive” and “predictive”, which are considered the high-level goals of data mining (11, 14). On the descriptive end of the spectrum, the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets. Clustering, summarisation, and visualisation of databases are the main applications of descriptive data mining. The usefulness of this approach is that it enables one to generalise the data set at multiple levels of abstraction, which facilitates the examination of the general behaviour of the data, something that is difficult to deduce directly from a large database. On the predictive end of the spectrum, the goal of data mining is to produce models that are capable of producing predictions when applied to unseen, future cases. Classification and regression are the most frequent types of predictive task applied in data mining. However, the relative importance of prediction and description for particular data mining applications can vary considerably (14). Generally, data mining can be used in many different ways. Some of the tasks most commonly found are the following (11, 14).
Classification: discovery of a predictive learning function that classifies a data item into one of several predefined classes.
Regression: discovery of a predictive learning function that maps a data item to a real-valued prediction variable.
Clustering: a common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data. In other words, clustering maps a data item into one of several categorical classes (or clusters) in which the classes must be determined from the data, unlike classification, in which the classes are predefined.
Clusters are defined by finding natural groupings of data items based on similarity metrics or probability density models.
Summarization: an additional descriptive task that involves methods for finding a compact description for a set (or subset) of data. A simple example would be the mean and standard deviation of all fields. More sophisticated functions involve summary rules, multivariate visualization techniques, and functional relationships between variables. Summarization functions are often used in interactive exploratory data analysis and automated report generation.
Dependency Modeling: describes significant dependencies among variables. Dependency models exist at two levels: structural and quantitative. The structural level of the model specifies (often in graphical form) which variables are locally dependent; the quantitative level specifies the strengths of the dependencies using some numerical scale.
Link analysis: determines relations between fields in the database (e.g., association rules to describe which items are commonly purchased with other items in grocery stores). The focus is on deriving multi-field correlations satisfying support and confidence thresholds.
Sequence analysis: models sequential patterns (e.g., in data with time dependence, such as time-series analysis). The goal is to model the states of the process generating the sequence or to extract and report deviations and trends over time.
For this research project, a classification task is to be carried out, since a model is to be built using the pre-classified data of past records of anaemia prevalence in women that are included in the 2011 EDHS survey.
2.6.1 Classification
Classification, as indicated before, is a task in data mining.
According to Han and Kamber (14), “classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown”. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). Along the same lines, Daniel (13) notes that, given a collection of records (the training set), each record contains a set of attributes, one of which is the class. Usually, in the classification process, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (14). In general, as described by Kantardzic (11), classification is the process of assigning a discrete label value (class) to an unlabeled record, and a classifier is a model (a result of classification) that predicts one attribute of a sample, its class, when the other attributes are given. In doing so, samples are divided into predefined groups. For example, a simple classification might group customer billing records into two specific classes: those who pay their bills within thirty days and those who take longer than thirty days to pay. Different classification methodologies are applied today in almost every discipline where the task of classification, because of the large amount of data, requires automation of the process. Examples of classification methods used as part of data mining applications include classifying trends in financial markets and identifying objects in large image databases (11). As Han and Kamber (14) wrote, classification is a two-step process: model construction and model usage. The former step is concerned with building a classification model by describing a set of predetermined classes using a training data set.
The training data set is a set of tuples where each tuple (sample) is assumed to belong to a predefined class, as determined by the class label attribute. The learned model is represented in the form of classification rules, decision trees, or mathematical formulae. The latter step involves using the model (classifier) that was built to predict or classify unknown objects based on the patterns observed in the training set. There are various classification methods. Popular classification techniques include the following (13, 14): decision tree based methods, rule-based methods (rule induction), neural networks, Bayesian networks, k-nearest neighbour, and support vector machines.
2.6.2 Selected Classification Techniques
Although the choice of techniques suitable for classification tasks seems to be strongly dependent on the application, decision trees and rule induction are among the data mining techniques applied in many real-world applications as a powerful solution to classification problems (18). This section, therefore, discusses the concepts and principles of the techniques identified for carrying out classification prediction.
2.6.2.1 Decision Trees
The decision tree is well known to be one of the effective classification techniques in several domains (18). It is a way of representing a series of rules that lead to a class or value. A decision tree, as defined by Han and Kamber (14), “is a flow-chart-like structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions”. In support of this, Kantardzic (11) states that a decision tree consists of nodes where attributes are tested. The outgoing branches of a node correspond to all the possible outcomes of the test at the node. The samples at a non-leaf node in the tree structure are thus partitioned along the branches, and each child node gets its corresponding subset of samples.
In general, it is a knowledge representation structure consisting of nodes and branches organized in the form of a tree, in which the topmost node is the root node and every internal non-leaf node is labeled with values of the attributes (11). Decision trees can easily be converted to classification rules. As pointed out by Witten and Frank (23), a decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable. The target variable is usually categorical, and the decision tree model is used either to calculate the probability that a given record belongs to each of the categories, or to classify the record by assigning it to the most likely class. Decision trees can also be used to estimate the value of a continuous variable, although there are other techniques more suitable to that task.
Decision trees are commonly used in data mining to examine data and induce the tree and its rules, which are then used to make predictions. A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically, each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification (23). The decision tree method encompasses a number of specific algorithms, including Chi-squared Automatic Interaction Detection (CHAID), Classification and Regression Trees (CART), C4.5 and C5.0 (20).
Decision trees, as a data mining technique, are very useful in the process of knowledge discovery in the medical field. In addition, the technique is very convenient since a decision tree is simple to understand, works with mixed data types, models non-linear functions, handles classification, and is supported by most of the readily available tools (31).
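The remark that a decision tree can easily be converted to classification rules can be illustrated with a minimal sketch. The tree below is a hand-built, hypothetical example (the attribute names `balance` and `new_customer`, the class labels, and the helper functions are assumptions made for illustration and do not come from any of the cited studies). Each root-to-leaf path is read off as one IF-THEN rule, so the resulting rules are mutually exclusive and cover every branch of the tree.

```python
# Hypothetical toy tree: an internal node tests one attribute,
# a leaf node holds a class label.
TREE = {
    "attribute": "balance",
    "branches": {
        "high": {
            "attribute": "new_customer",
            "branches": {
                "yes": {"label": "pays_late"},
                "no": {"label": "pays_on_time"},
            },
        },
        "low": {"label": "pays_on_time"},
    },
}


def classify(tree, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    node = tree
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]


def tree_to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN rule."""
    if "label" in node:
        antecedent = " AND ".join(f"{attr} = {value}" for attr, value in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, conditions + ((node["attribute"], value),)))
    return rules


if __name__ == "__main__":
    for rule in tree_to_rules(TREE):
        print(rule)
    print(classify(TREE, {"balance": "high", "new_customer": "yes"}))
```

In practice the tree would be induced from training data by an algorithm such as C4.5 rather than written by hand; the sketch only shows how the finished structure is read as rules.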
Although the decision tree is known to have advantages over numerous techniques, it is also known to have scalability and efficiency problems, such as a substantial decrease in performance and poor use of available system resources (18).
2.6.2.2 Rule Induction
Rule induction is one of the major forms of data mining and is perhaps the most commonly used data mining technique for classification tasks (18). It is also perhaps the form of data mining that most closely resembles the process that most people think about when they think about data mining, namely “mining” for gold through a vast database. The gold in this case would be a rule that is interesting, one that tells something about the database that we didn’t already know and probably weren’t able to explicitly articulate (aside from saying “show me things that are interesting”) (32).
Rule induction is the process of extracting useful “if-then” rules from data based on statistical significance. A rule-based system constructs a set of if-then rules. Knowledge is represented in the form: IF conditions THEN conclusion. The rules themselves consist of two parts. The left-hand side is called the antecedent and the right-hand side is called the consequent. The rule antecedent (the IF part) can consist of just one condition or of multiple conditions, which must all be true in order for the consequent to be true at the given accuracy, whereas the rule consequent (the THEN part) contains just a single condition rather than multiple conditions (32). The rules are used to find interesting patterns in the database, but they are also used at times for prediction. Two measures are important to understanding a rule:
Accuracy: refers to the probability that, if the antecedent is true, the consequent will also be true. High accuracy means that the rule is highly dependable.
Coverage: refers to the number of records in the database that the rule applies to.
High coverage means that the rule can be used very often and also that it is less likely to be a spurious artifact of the sampling technique or idiosyncrasies of the database. Table 2 displays the tradeoff between coverage and accuracy (18).
| | Accuracy Low | Accuracy High |
|---|---|---|
| Coverage High | Rule is rarely correct but can be used often. | Rule is often correct and can be used often. |
| Coverage Low | Rule is rarely correct and can be only rarely used. | Rule is often correct but can be only rarely used. |
Table 2: Rule coverage versus accuracy. Source: Thearling (18)
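The two measures in Table 2 can be made concrete with a small worked sketch. The records, attribute names, and the function `evaluate_rule` below are hypothetical and chosen only for illustration. Coverage is counted as the number of records whose attributes satisfy the antecedent, and accuracy as the fraction of those covered records for which the consequent also holds.

```python
# Hypothetical toy database of customer records (illustrative values only).
RECORDS = [
    {"balance": "high", "new_customer": True, "pays_late": True},
    {"balance": "high", "new_customer": True, "pays_late": True},
    {"balance": "high", "new_customer": False, "pays_late": True},
    {"balance": "high", "new_customer": False, "pays_late": False},
    {"balance": "low", "new_customer": True, "pays_late": False},
    {"balance": "low", "new_customer": False, "pays_late": False},
    {"balance": "low", "new_customer": False, "pays_late": False},
    {"balance": "low", "new_customer": True, "pays_late": True},
]


def evaluate_rule(records, antecedent, consequent):
    """Return (accuracy, coverage) of the rule IF antecedent THEN consequent.

    coverage: number of records for which the antecedent is true.
    accuracy: share of covered records for which the consequent is also true.
    """
    covered = [r for r in records if antecedent(r)]
    if not covered:
        return 0.0, 0
    correct = sum(1 for r in covered if consequent(r))
    return correct / len(covered), len(covered)


def pays_late(record):
    return record["pays_late"]


# Rule A: IF balance = high THEN pays_late
rule_a = evaluate_rule(RECORDS, lambda r: r["balance"] == "high", pays_late)

# Rule B: IF balance = high AND new_customer THEN pays_late
rule_b = evaluate_rule(
    RECORDS, lambda r: r["balance"] == "high" and r["new_customer"], pays_late
)

if __name__ == "__main__":
    print("Rule A (accuracy, coverage):", rule_a)
    print("Rule B (accuracy, coverage):", rule_b)
```

On this toy data, adding a second condition to the antecedent raises the accuracy of the rule but halves the number of records it covers, which is the tradeoff the table describes.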
Rule induction is also known to have an advantage over numerous techniques due to the output it produces. Some of the advantages noted by Langley and Simon (32) are that the rules are easy to interpret, easy to generate, and can rapidly classify new instances. Apart from these, an important quality of rule induction systems is that they can easily handle missing values and numeric attributes (32). However, one of the biggest problems with rule induction systems is the sometimes overwhelming number of rules that are produced, most of which have no practical value or interest. Some of the rules are so inaccurate that they cannot be used, some have so little coverage that, though they are interesting, they have little applicability, and finally many of the rules capture patterns and information that the user is already familiar with (18).
2.6.2.3 Decision Trees vs. Rule Induction
This section discusses the main differences and similarities between decision trees and rule induction. Decision trees also produce rules, but in a very different way than rule induction systems. The main difference between the rules that are produced by decision trees and rule induction systems, as pointed out by Thearling (18), is that “Decision trees produce rules that are mutually exclusive and collectively exhaustive with respect to the training database while rule induction systems produce rules that are not mutually exclusive and might be collectively exhaustive”. Thearling (18) noted that the reason for this difference is the way in which the two algorithms operate: rule induction seeks to go from the bottom up, collecting all possible patterns that are interesting and only later using those patterns for some prediction target. Decision trees, on the other hand, work from a prediction target downward in what is known as a “greedy” search.
Regarding their commonality, Thearling (18) again stated that “the one thing that decision trees and rule induction systems have in common is the fact that they both need to find ways to combine and simplify rules”. In a decision tree this can be as simple as recognizing that, if a lower split on a predictor is more constrained than a split on the same predictor further up in the tree, both do not need to be provided to the user but only the more restrictive one. Rules from rule induction systems, by contrast, are generally created by taking a simple high-level rule and adding new constraints to it until the coverage gets so small as to not be meaningful.
2.7 An Overview of Anaemia
2.7.1 Anaemia and Maternal Health
Anaemia is known to have detrimental health implications, particularly for mothers and young children. Women with severe anaemia can experience difficulty meeting oxygen-transport requirements near and at delivery, especially if significant hemorrhaging occurs. This may be an underlying cause of maternal death and prenatal and perinatal infant loss (33). In fact, unfavourable pregnancy outcomes have been reported to be more common in anaemic mothers than in non-anaemic mothers (1). The adverse effects of mild anaemia are less well documented than the effects of severe anaemia. However, in several studies, premature delivery, placental hypertrophy, and reduced excretion of estriol (a maternal hormone) have been observed to be more common in mildly anaemic mothers than in non-anaemic mothers (4).
2.7.2 Definition of Anaemia
Anaemia is a medical condition in which the red blood cell count or haemoglobin is less than normal. According to Sherman (3), anaemia is defined as a reduction in the number of circulating red blood cells, the haemoglobin concentration, or the volume of packed red cells (hematocrit) in the blood. Red blood cells (RBC) are the most numerous cells in the blood; approximately 20 to 30 trillion of them circulate in the blood of an adult.
They are required to transport oxygen to the tissues and organs of the body. Red blood cells contain haemoglobin, an iron-containing protein that transports oxygen to the tissues and carbon dioxide from the tissues (5). Generally, anaemia develops when the concentration of haemoglobin or of red blood cells in the blood falls below normal. Sherman (3) wrote that since haemoglobin normally carries oxygen from the lungs to the tissues, anaemia leads to a lack of oxygen in organs; and as all human cells depend on oxygen, varying degrees of anaemia can have a wide range of clinical consequences.
2.7.3 Main Causes of Anaemia
Commonly, anaemia is the final outcome of nutritional deficiency of iron, folate, vitamin B12, and some other nutrients (4). Many other causes of anaemia have also been identified. They include malaria, hemorrhage, infection, genetic disorders (hemoglobinopathies), parasite infestation (hookworm), chronic disease, and others. Nutritional deficiency, due primarily to a lack of bioavailable dietary iron, accounts for the majority of anaemia cases in the world (34). The contribution of other causes of anaemia depends on many factors, including the level of economic development, climate, the condition of health care, and the existence of anaemia control and prevention programs. In semi-developed and developed countries, iron deficiency is the main cause of anaemia among women and children. In approximately 50 to 80 percent of anaemia cases, iron deficiency is considered to be the main etiologic factor (5). In many countries, however, a number of other factors besides iron deficiency contribute to the burden of anaemia. Of particular importance in developing countries are malaria and intestinal parasites, especially hookworm infestation.
The level of contribution of these factors to the overall prevalence of anaemia depends on the magnitude of malaria epidemics, the existence of iron supplementation and fortification programs, and other conditions in each particular country. Recently, the role of the HIV epidemic as an important factor contributing to anaemia in countries of sub-Saharan Africa has been emphasized. It has been shown that HIV negatively affects the release of erythropoietin, a kidney hormone that stimulates the production of red blood cells. Perhaps because of this mechanism, HIV-infected women, even without clinical symptoms of opportunistic infections, are more likely to become anaemic than HIV-free women (1).
2.7.4 Health Consequences
Anaemia is an indicator of both poor nutrition and poor health. The consequences of anaemia for women include increased risk of low birth weight or prematurity, perinatal and neonatal mortality, inadequate iron stores for the newborn, increased risk of maternal morbidity and mortality, and lowered physical activity, mental concentration, and productivity (2). Women with even mild anaemia may experience fatigue and have reduced work capacity (7).
2.7.5 Classification of Anaemia as a Public Health Problem
The prevalence of anaemia as a public health problem is categorized as follows: < 5%, no public health problem; 5–19.9%, mild public health problem; 20–39.9%, moderate public health problem; ≥ 40%, severe public health problem (2).
2.3 Review of Research Papers
Although the researcher could not find specific studies that have applied data mining techniques to predict the status of anaemia in women of reproductive age from the EDHS data set, many studies of anaemia have been conducted using traditional statistical techniques. Moreover, the researcher found and reviewed various health-related studies that used data mining techniques.
In this section, therefore, some of the studies conducted by different researchers using both of the above approaches are discussed.
One of the research papers on the prevalence of anaemia is that of Samson and Fikre (15), entitled “Correlates of anaemia among women of reproductive age in Ethiopia: evidence from Ethiopian DHS 2005”. The objective of the study was to assess the correlates of anaemia among women of reproductive age in Ethiopia. A quantitative cross-sectional study was carried out based on the secondary data of the Ethiopian Demographic and Health Survey (EDHS) 2005. The investigators downloaded the EDHS 2005 data from the Measure DHS website in SPSS format. After data cleaning, data on a total of 5963 women of reproductive age were chosen for the study. In addition, information on a wide range of potential independent variables, such as socio-demographic, economic, dietary intake, nutritional status, micronutrient supplementation history, breastfeeding history, maternity services utilization, family planning use, and fertility history, was extracted. The data analysis was done using SPSS. Frequencies, percentages, means and standard deviations were used for the descriptive analysis. The independent-sample t-test and one-way Analysis of Variance (ANOVA) were applied to compare mean blood hemoglobin level across different categories of the independent variables. Binary logistic regression was employed to control for potential confounders and to explore the association between the dependent variable (anaemia status) and the aforementioned independent variables. After analyzing the data, the investigators arrived at the following findings: the prevalence of anaemia was 27.4% (95% CI: 26.3-28.5%). Rural residence, poor educational and economic status, age of 30-39 years and high parity were key factors predisposing women to anaemia.
Lactating women and those who gave birth in the month of the interview had a higher risk than their counterparts. Those not using contraceptives were more likely to develop anaemia than current contraceptive users. Utilizing maternity services and taking iron and vitamin A supplements during pregnancy and the postpartum period, respectively, did not have a significant effect in reducing the burden of anaemia. Samson and Fikre (15) end their study with the following recommendations: family planning and the economic and educational empowerment of women make a positive contribution to combating anaemia; a combination of nutrition, educational and livelihood promotion strategies should be instituted to enhance dietary diversity; and maternal nutrition interventions should be integrated more strongly into maternity services.
A cross-sectional community-based study with an analytic component was conducted by Haider (16) in nine of the 11 regions of Ethiopia to assess the magnitude of anaemia and of deficiencies of iron and folic acid, and to compare the factors responsible for anaemia among anaemic and non-anaemic women in order to identify the potential causes of anaemia in apparently healthy-looking Ethiopian women of childbearing age (15-49 years). The study revealed that one in every three women had anaemia and deficiency of folic acid while one in every two had iron deficiency, suggesting that deficiencies of both folic acid and iron constitute the major micronutrient deficiencies in Ethiopian women. The risk that anaemia poses to the health of women, ranging from impediment of daily activities to poor pregnancy outcomes, calls for effective public health measures, such as improved nutrient supplementation, health education, and timely treatment of illnesses.
Haider and Pobocik (17) also conducted the first large nutrition study of a representative sample of women aged 15 to 49 years in Ethiopia.
The association of anaemia with demographic and health variables was tested by chi-square, and a stepwise backward logistic regression model was applied to test the significant associations observed in the chi-square tests. The study indicated that intake of vegetables less than once a day and of meat less than once a week was common and was associated with increased anaemia, and that nutrition-related and chronic illnesses are the most common causes of anaemia. The researchers concluded that moderate nutritional anaemia in the form of iron deficiency anaemia is a problem in Ethiopia and that, therefore, improved supplementation for vulnerable groups is warranted to achieve the United Nations’ Millennium Development Goals.
Sanku et al. (35) conducted a study entitled “Prevalence of anaemia in women of reproductive age in Meghalaya: a logistic regression analysis”. The researchers’ aim was to determine the prevalence of anaemia among ever-married women of reproductive age from the state of Meghalaya, India, and to explore some factors commonly associated with anaemia. In the study, socioeconomic differentials are also presented to understand the prevalence of anaemia. The study population consisted of 3934 ever-married women of reproductive age (15-49 years) drawn from the third Indian National Family Health Survey of 2005-2006. The predictors of the prevalence of anaemia were explored using different background characteristics such as age, place of residence, nutritional status, number of children ever born, pregnancy status, educational achievement, and economic status. As a response variable, anaemia level was categorised as a dichotomous variable, and the predicted probabilities were worked out through a binary logistic regression model to assess the contribution of the predictors to anaemia.
The findings of the study revealed that all the predictors, except total children ever born, were statistically significant, and that, on the basis of haemoglobin concentration, 49.6% of the women were anaemic. They also observed that women in the age group 20-24 years were at high risk of anaemia. Finally, the researchers stated the following conclusions: pregnant, undernourished, and poorest women are at high risk of anaemia; urban women are also at high risk; however, more highly educated women are at low risk of anaemia. The habit of cigarette smoking or of using pan, bidi, gutka, etc. also increases the risk of anaemia.
Bentley and Griffiths (36) studied the burden of anaemia among women in India. They used the National Family Health Survey 1998/99, which provides nationally representative cross-sectional survey data on women’s haemoglobin status, body weight, and diet, as well as social, demographic and other household and individual level factors. The purpose of the study was to investigate the prevalence and determinants of anaemia among women in Andhra Pradesh, and to examine differences in anaemia related to social class, urban/rural location and nutritional status as measured by body mass index (BMI). They found that the prevalence of anaemia was high among all women, and that poor urban women had the highest rates and odds of being anaemic. They also reported that 52% of thin women, 50% of women with normal BMI, and 41% of overweight women were anaemic.
Another study found relevant for review is a paper by Sanap et al. (31) entitled “Classification of Anaemia Using Data Mining Techniques”. The study presented an analysis of the prediction and classification of anaemia in patients using data mining techniques. The aim of the study was to investigate the C4.5 decision tree algorithm and SMO (sequential minimal optimization) in WEKA, to determine which method is most suitable for predicting and classifying anaemia using complete blood count (CBC) data, and to generate a decision tree. The dataset, which contains 514 instances, was constructed from complete blood count tests performed on blood samples collected from 191 healthy individuals and 323 patients having anaemia disorders. The methodology used by the researchers consisted of data collection, data pre-processing and data cleaning, attribute selection, and feature extraction and classification. After experiments with the C4.5 decision tree algorithm and a support vector machine, implemented in the open-source WEKA software as J48 and SMO respectively, they found that the C4.5 decision tree algorithm classified anaemia with an accuracy of 99.42%, and the support vector machine with an accuracy of 88.13%. From these findings they observed that the C4.5 algorithm had the best performance, with the highest accuracy.
Biset (37), in his master’s thesis, used data mining techniques to predict low birth weight from Ethiopian Demographic and Health Survey data sets. The goal of the study was to predict low birth weight using the EDHS 2005 data set so as to build a model, using data mining techniques, that addresses the factors associated with low birth weight. He applied the CRISP-DM methodology, which contains six major phases: business understanding, data understanding, data preparation, model building, evaluation and deployment. The J48 decision tree classifier and the PART rule induction algorithm were selected for the experiments. The researcher compared the classification performance of the decision trees with and without tree pruning, and found that tree pruning can significantly improve a decision tree’s classification performance.
In general, the results of the study, as the researcher described them, were encouraging and can be used as a decision support aid for health practitioners. He further indicated that the rules extracted by both algorithms are very effective for the prediction of low birth weight. Finally, he observed from both algorithms that attributes such as antenatal visits during pregnancy, mother’s educational level, marital status, iodine content in salt, region, age of mother, birth order, wealth index and place of residence are the most determinant factors for predicting low birth weight.
Abel (38) also applied data mining techniques to design a predictive model for heart disease detection. The researcher’s aim was to design, from a Transthoracic Echocardiography Report dataset, a predictive model for heart disease detection that is capable of enhancing the reliability of heart disease diagnosis using echocardiography. The Knowledge Discovery in Databases (KDD) methodology, consisting of nine iterative and interactive steps, was adopted. He used a total of 7,339 records for experiments with a J48 decision tree classifier, a Naïve Bayes classifier and a neural network. Experimental results revealed that all the models built using the above three techniques have high classification accuracy and are generally comparable in predicting heart disease cases. However, a comparison based on True Positive Rate suggests that the J48 model performs slightly better in predicting heart disease, with a classification accuracy of 95.56%. Based on the findings, he concluded that data mining techniques can be used efficiently to model and predict heart disease cases, and that the outcome of the study can be used as an assistant tool by cardiologists to help them make more consistent diagnoses of heart disease.
Muluneh (39) conducted a study to investigate the potential applicability of data mining techniques in exploring the prevalence of diarrheal disease, using data collected from the diarrheal disease control and training centre of African sub Region II in Tikur Anbessa Hospital. He used the CRISP-DM (CRoss-Industry Standard Process for Data Mining) methodology. Two machine learning algorithms from the WEKA software, namely the J48 Decision Tree (DT) and Naïve Bayes (NB) classifiers, were implemented to classify diarrheal disease records on the basis of the values of the attributes ‘Treatment’ and ‘Type of Diarrheal’. Results of the experiments showed that the J48 DT classifier has better classification accuracy than the NB classifier. The researcher concluded that the study demonstrated that data mining techniques are valuable in supporting and scaling up the efficacy of health care service provision.
Elias (40) conducted a study on HIV status predictive modelling using data mining technology. The objective was to construct an HIV status predictive model in support of the scaling up of HIV testing in Addis Ababa. He used the CRISP-DM methodology for HIV status predictive modelling and for discovering association rules between HIV status and selected attributes. In the study, the J48 and ID3 algorithms were used to build and evaluate the models, while the Apriori algorithm was used to discover association rules. WEKA 3.6 was used as the data mining tool to implement the algorithms. The findings revealed that the pruned J48 classifier predicted HIV status with 81.8% accuracy, and that association rule mining also has potential in discovering the relationships between the selected attributes and HIV status.
Selam (41) used data mining technology to design a predictive model that can help predict the occurrence of measles outbreaks in Ethiopia.
She used the hybrid Cios KDP methodology, which consists of six steps: problem domain understanding, data understanding, data preparation, data mining, evaluation of the discovered knowledge, and use of the discovered knowledge. Naïve Bayes and decision tree data mining techniques were employed to build and test the models using a dataset of 15631 records. Experimental results of the study revealed that, among the implemented algorithms, the J48 algorithm showed better prediction accuracy than the Naïve Bayes classifier. From the results, she showed that it was possible to apply data mining techniques to measles surveillance data to build a model that predicts the occurrence of measles outbreaks in different Ethiopian regions.
Shegaw (42) applied data mining technology to predict child mortality patterns from community-based epidemiological datasets collected by the Butajira Rural Health Project (BRHP) epidemiological study. The methodology used by the researcher had three basic steps: data collection, data preparation, and model building and testing. The BrainMaker and See5 software packages were employed to build models using neural network and decision tree techniques respectively. He found that both techniques yield comparable results for misclassification rates. However, unlike the neural network models, the decision tree models provided simple and understandable rules that can be used by any health care professional to identify cases for which a rule is applicable. In fact, the accuracy obtained from the decision tree models also exceeded that of the neural networks. He also found that the best neural network models were obtained by modifying the default parameters of the program.
A study conducted by Teklu (43) investigated the application of data mining techniques to the Antiretroviral Treatment (ART) service with the purpose of identifying the determinant factors affecting the termination or continuation of the service. The study applied classification and association rule mining, using the J48 and Apriori algorithms respectively, to the records of 18740 ART patients. The methodology employed in the research work was CRISP-DM. Finally, the investigator demonstrated the applicability of data mining to ART by identifying the factors causing the continuation or termination of the service.
Tesfahun (44) also attempted to construct an adult mortality predictive model using data mining techniques so as to identify and improve adult health status using the BRHP open cohort database. He used the hybrid model that was developed for academic research. Decision tree and Naïve Bayes algorithms were employed to build the predictive model, using a sample dataset of 62,869 records of both living and deceased adults, through three experiments and six scenarios. The investigator found that, compared to Naïve Bayes, the J48 pruned decision tree performed better, with an accuracy of 97.2%, in developing classification rules that can be used for prediction. He further pointed out that, where there is no education in the family and the person lives in the rural highland or lowland, the probability of experiencing adult death was 98.4% and 97.4% respectively, given the concomitant attributes in the generated rule. Finally, he suggested that educational status plays a considerable role as a root cause of adult death, followed by outmigration, and that further comprehensive and extensive experimentation is needed to substantially describe the loss experiences of adult mortality in Ethiopia.
To the best of the researcher’s knowledge, no previous research has applied data mining techniques to predict the status of anaemia in Ethiopia using a survey or clinical dataset.
Hence, this research contributes by generating patterns, using data mining techniques, that can help health practitioners develop better strategies for preventing and treating anaemia.