Data Analysis and Decision Modelling Oz Assignments
The sinking of the Titanic is one of the most famous and memorable events in history. Of the many passengers travelling on the ship, some survived while others lost their lives. This paper analyses the survival rate of the Titanic passengers, identifies the factors that were most important in determining survival, and develops a predictive model that estimates a passenger's chance of survival from specific variables.
The paper uses data collected for the passengers, including age, sex, passenger class (pclass), number of siblings aboard (sibsp), number of parents/children aboard (parch), body identification number and passenger fare. The Rattle data mining tool is used to analyse the data with techniques such as correlation, decision trees and principal component analysis. These techniques help identify the variables most strongly associated with a higher survival rate. Predictive modelling is then carried out with a decision tree, so that the model can predict whether a particular passenger would have survived under given circumstances.
This section identifies the key elements that define the survival rate of the passengers. The table below provides summary statistics for the variables included in the analysis.
Parameter  Age  Sibsp  Parch  Body  Fare  Pclass
Min  0.33  0.0  0.0  1  0.0  1
1st quartile  21  0.0  0.0  80  7.89  2
Median  28  0.0  0.0  169  14.45  3
Mean  29.81  0.49  0.37  168  32.07  2.3
3rd quartile  38  1.00  0.0  261  30.50  3
Max  80  8.0  9.0  328  512  3
NA count  186  0.0  0.0  827  1  0.0
Table 1: Showing summary statistics for the key variables
As shown in Table 1, summary statistics are reported for six variables: age, sibsp, parch, body, fare and pclass. For age, 25% of the passengers were younger than 21 years, while 25% were older than 38 years. Age is missing for 186 passengers and body identification number for 827, so the statistics for these two variables cover only the passengers with recorded values. There were three passenger classes on the ship: 1, 2 and 3. The remaining variables are self-explanatory.
Count of Sr. no  Survived  
PCLASS  0  1  Grand Total 
1  123  200  323 
2  158  119  277 
3  528  181  709 
Grand Total  809  500  1309 
Table 2: Showing cross tab for survival rate of the passenger against the passenger class
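As shown in Table 2 and Figure 1, class 1 had the highest number of survivors (200), followed by class 3 (181) and class 2 (119). In terms of survival rate, however, the order is class 1 (200 of 323, about 62%), class 2 (119 of 277, about 43%) and class 3 (181 of 709, about 26%). Class 1 therefore had the highest probability of passenger survival.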
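The distinction between survivor counts and survival rates can be checked directly from the counts in Table 2. The following is a minimal sketch; the names `table2` and `survival_rates` are chosen here for illustration and are not part of the original analysis.

```python
# Counts copied from Table 2: passenger class -> (did not survive, survived)
table2 = {1: (123, 200), 2: (158, 119), 3: (528, 181)}


def survival_rates(counts):
    """Return {class: survived / class total} for each passenger class."""
    return {
        pclass: survived / (died + survived)
        for pclass, (died, survived) in counts.items()
    }


rates = survival_rates(table2)

# Classes ordered from highest to lowest survival rate
ranking = sorted(rates, key=rates.get, reverse=True)

for pclass in ranking:
    print(f"class {pclass}: {rates[pclass]:.1%}")
```

Ranking by rate rather than by raw count removes the effect of class 3 being by far the largest group.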
Figure 1: Showing cross tab for survival rate of passengers against passenger class
Figure 2: Showing principal component analysis for variables for Titanic sinking
Figure 2 and Table 3 show that five principal components together account for all of the variance in the data. PC1 explains 28.3% of the variance, PC2 explains 25.5%, PC3 explains 20.7%, PC4 explains 15.0% and PC5 explains the remaining 10.2%.
PC  PC1  PC2  PC3  PC4  PC5
SD  1.190  1.130  1.019  0.867  0.717
Proportion of variance  0.283  0.255  0.207  0.150  0.102
Cumulative proportion  0.283  0.538  0.746  0.897  1.000

Factor  PC1  PC2  PC3  PC4  PC5
Age  0.455  0.180  0.579  0.641  0.115
Sibsp  0.572  0.222  0.532  0.115  0.570
Parch  0.051  0.761  0.268  0.112  0.576
Fare  0.555  0.037  0.383  0.735  0.034
Body  0.392  0.579  0.401  0.143  0.571
Table 3: Showing variance in data for the five principal components
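The proportions of variance in Table 3 follow from the standard deviations in its SD row: each component's variance is its squared standard deviation, divided by the total variance. The sketch below reproduces that arithmetic; the function name is illustrative, and small differences from the table in the third decimal are due to the SD values being rounded.

```python
from itertools import accumulate

# Standard deviations of PC1..PC5, copied from the SD row of Table 3
sdev = [1.190, 1.130, 1.019, 0.867, 0.717]


def variance_proportions(sdev):
    """Share of total variance explained by each principal component."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    return [v / total for v in variances]


props = variance_proportions(sdev)
cumulative = list(accumulate(props))  # running total of the proportions

for i, (p, c) in enumerate(zip(props, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f} cumulative={c:.3f}")
```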
The five variables entering the principal component analysis, as listed in Table 3, are the age of the passenger, the number of siblings aboard, the number of parents/children aboard, the fare paid and the body identification number. Each principal component is a weighted combination of these variables, with the weights given in the lower part of Table 3. In decreasing order of variance explained, the components rank PC1, PC2, PC3, PC4 and PC5.
Figure 3: Showing correlation among the variables
Factor  Age  Sibsp  Parch  Fare  Survived  Body
Age  1.000  0.267  0.147  0.199  0.052  0.125
Sibsp  0.267  1.000  0.360  0.151  0.016  0.129
Parch  0.147  0.360  1.000  0.176  0.078  0.067
Fare  0.199  0.151  0.176  1.000  0.233  0.044
Survived  0.052  0.016  0.078  0.233  1.000  NA
Body  0.125  0.129  0.067  0.044  NA  1.000
Table 4: Showing correlation data among the variables
Table 4 and Figure 3 show that parch and sibsp have the strongest correlation among the variables (0.360), which suggests that passengers with siblings aboard were often also travelling with parents or children, that is, as a family. Fare and survival are also positively correlated (0.233), meaning that the higher the fare paid by a passenger, the higher the chance of survival. Age and sibsp are correlated with a magnitude of 0.267. Table 4 lists the values without signs, and this relationship is interpreted as negative: the more siblings a passenger had aboard, the younger the passenger tended to be.
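To make the interpretation of Table 4 concrete, the sketch below computes a Pearson correlation over the rows where both values are present and returns `None` when the coefficient is undefined. The data values are invented toy numbers, not the Titanic data, and the function is a plain reimplementation for illustration rather than Rattle's own routine. The explanation it offers for the NA entry between Survived and Body rests on an assumption: that a body identification number is recorded only for passengers who did not survive, which would make Survived constant on those rows.

```python
import math


def pearson(xs, ys):
    """Pearson r over pairs where both values are present.

    Returns None when fewer than two complete pairs exist or when
    either variable is constant, since r is undefined in those cases.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # zero variance in one variable
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Toy values invented for illustration only (not taken from the data set)
age = [4, 9, 30, 41, None, 25]
sibsp = [3, 4, 0, 1, 0, 1]
survived = [0, 0, 0, 1, 1, 0]
body = [12, 45, 101, None, None, 7]  # assumed present only for non-survivors

r_age_sibsp = pearson(age, sibsp)         # negative for these toy values
r_survived_body = pearson(survived, body)  # None: survived is constant here

print(r_age_sibsp, r_survived_body)
```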
Figure 4: Showing correlation cluster for the different variables
The correlation cluster in Figure 4 places body identification number and survival close together. Table 4, however, reports NA for this pair, so this grouping should not be read as a measured correlation. The next closest pair is parch and sibsp, which is consistent with the correlation table above.
To support decisions about which variables contribute most to passenger survival, a decision tree can be built, which identifies the variables of importance.
With the help of the Rattle data mining tool, a conditional decision tree can be plotted, as shown in the figure above. The tree shows the survival rate of the Titanic passengers based on several variables of importance: the fare paid by the passenger, the number of parents/children aboard, the number of siblings aboard and the port of embarkation. These variables are of prime importance for decisions about passenger survival, and fare, parch and sibsp also appear in the principal component analysis. The tree has one root node, several intermediate nodes and, finally, leaf nodes that show the chances of survival. Each leaf node represents a set of passengers with similar characteristics. In total, nine nodes are formed by conditional splits on the variables of importance (Kantardzic, 2003).
The root node splits on the fare paid by each passenger, at a threshold of 25.467. Node 2 contains the passengers who paid a fare of 25.467 or less, while node 5 contains the passengers who paid more than 25.467. The sub-nodes under nodes 2 and 5 inherit the same fare condition.
The second level of splits works as follows. Under node 2 the split is on embarkation, which has three levels: C, S and Q. Passengers who embarked at C or Q fall under node 3, while passengers who embarked at S fall under node 4. Under node 5 the split is on sibsp: passengers with sibsp greater than 2 fall under node 9, while passengers with sibsp less than or equal to 2 fall under node 6, which is split further on parch. Passengers with parch less than or equal to 1 fall under node 7, and passengers with parch greater than 1 fall under node 8. The survival rate can be predicted only from the five leaf nodes. The intermediate nodes serve to split the variable values, and no survival rate can be predicted from them.
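The splits described above can be written as a small rule function that maps a passenger to a leaf node. This is a sketch of the tree structure as described in the text, not output generated by Rattle; the function and parameter names are illustrative, and the survival rate attached to each leaf is not reproduced here.

```python
FARE_SPLIT = 25.467  # root-node threshold on fare, as stated in the text


def leaf_node(fare, embarked, sibsp, parch):
    """Return the leaf node number (3, 4, 7, 8 or 9) for one passenger."""
    if fare <= FARE_SPLIT:
        # Node 2: split on port of embarkation
        if embarked in ("C", "Q"):
            return 3
        if embarked == "S":
            return 4
        raise ValueError(f"unknown embarkation code: {embarked!r}")
    # Node 5: split on number of siblings aboard
    if sibsp > 2:
        return 9
    # Node 6: split on number of parents/children aboard
    return 7 if parch <= 1 else 8


# Example: fare above the threshold, one sibling, two parents/children
print(leaf_node(fare=50.0, embarked="S", sibsp=1, parch=2))  # prints 8
```

Writing the tree as nested conditions makes it clear that embarkation matters only for the lower-fare branch, while sibsp and parch matter only for the higher-fare branch.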
For each leaf node, the survival rate is shown on a 1 to 3 point scale, where a higher value indicates a higher probability of survival and vice versa. Among the five leaf nodes (3, 4, 7, 8 and 9), node 8 shows the highest survival rate, with a total of 15 passengers sharing these characteristics. The predictive model therefore indicates that passengers with fare > 25.467, sibsp <= 2 and parch > 1 had the highest survival rate of all passengers on the ship.
After node 8, node 7 shows the next highest survival rate, with characteristics fare > 25.467, sibsp <= 2 and parch <= 1. The 65 passengers in node 7 therefore had a lower survival rate than those in node 8, but a much higher one than the remaining passengers. Node 4 follows, with a lower survival rate than nodes 8 and 7 and characteristics fare <= 25.467 and embarkation at S. Node 4 represents a set of 100 passengers and still shows a comparatively high survival rate.
Nodes 9 and 3 contain 8 and 40 passengers respectively, and the survival rate for these passengers is very low. At each node, then, a decision is made that can be justified with the conditional decision tree, and the conditions for any particular node can be read from the four variables used in the tree.
Data quality is of immense importance for data warehouse architecture and determines the success of the overall architecture developed in the organization. It is therefore important to ensure that data quality meets the standards and requirements of the data warehouse architecture (Ye, 2003). Data quality enhances the efficiency of the data warehouse, avoids duplication of work and effort, saves cost and speeds up decisions for managers who use data warehouse reports in their day-to-day work.
The key characteristics that a quality data set must have in order to support the data warehouse architecture of an organization include usefulness, validation, believability, accessibility and interpretability.
The role of data quality in the data warehouse is visible in the benefits that quality data offers: no duplication of work, faster decisions, better understanding, no missing or garbage values and greater efficiency for the organization. Quality data avoids duplication of work because data captured once can be used to generate several reports and to support several interpretations (Ralf and Markus, 2011).
Further, with high-quality data it becomes easier for the decision maker to decide quickly, based on the better understanding gained from continuous data with no missing or garbage values. Efficiency enhancement is the major role that quality data plays, as implementing high-quality data in the data warehouse architecture saves time, effort and cost for the organization.
From the pivot table analysis of the sales figures by country and quarter, shown below, a graph can be plotted and the key trends identified. Overall, quarter 3 had the highest values for all three measures: store cost, store sales and unit sales. Canada departs from this general trend: its store cost and store sales peaked in quarter 2, while its unit sales were marginally higher in quarter 3 (12966 against 12885). Mexico had its highest store cost, store sales and unit sales in quarter 3, whereas the USA had its highest values in quarter 1.
Row Labels  Store Cost  Store Sales  Unit Sales
1998  432565.7289  1079147.47  509987
  Quarter 1  116512.6905  290873.18  137078
    Canada  9576.6446  23881.13  11160
    Mexico  47502.2264  118589.41  56133
    USA  59433.8195  148402.64  69785
  Quarter 2  115080.3318  287009.99  135745
    Canada  11072.1808  27685  12885
    Mexico  45683.9482  113830.59  54005
    USA  58324.2028  145494.4  68855
  Quarter 3  118322.14  295040.55  139412
    Canada  10915.5866  27176.3  12966
    Mexico  49267.9496  122706.05  57872
    USA  58138.6038  145158.2  68574
  Quarter 4  82650.5666  206223.75  97752
    Canada  7768.1585  19303.03  9146
    Mexico  30133.9206  75167.54  35904
    USA  44748.4875  111753.18  52702
Grand Total  432565.7289  1079147.47  509987
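The peak quarter for each country and measure can be read from the pivot table programmatically. The sketch below copies the country rows from the table above; the dictionary layout and the names `pivot`, `METRICS` and `peak_quarter` are illustrative choices, not part of the original spreadsheet.

```python
# quarter -> country -> (store cost, store sales, unit sales), from the pivot table
pivot = {
    1: {"Canada": (9576.6446, 23881.13, 11160),
        "Mexico": (47502.2264, 118589.41, 56133),
        "USA": (59433.8195, 148402.64, 69785)},
    2: {"Canada": (11072.1808, 27685.0, 12885),
        "Mexico": (45683.9482, 113830.59, 54005),
        "USA": (58324.2028, 145494.4, 68855)},
    3: {"Canada": (10915.5866, 27176.3, 12966),
        "Mexico": (49267.9496, 122706.05, 57872),
        "USA": (58138.6038, 145158.2, 68574)},
    4: {"Canada": (7768.1585, 19303.03, 9146),
        "Mexico": (30133.9206, 75167.54, 35904),
        "USA": (44748.4875, 111753.18, 52702)},
}

# Position of each measure inside the value tuples
METRICS = {"store_cost": 0, "store_sales": 1, "unit_sales": 2}


def peak_quarter(country, metric):
    """Return the quarter in which `metric` is highest for `country`."""
    idx = METRICS[metric]
    return max(pivot, key=lambda q: pivot[q][country][idx])


for country in ("Canada", "Mexico", "USA"):
    peaks = {m: peak_quarter(country, m) for m in METRICS}
    print(country, peaks)
```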
Unit Sales
Row Labels  Canada  Mexico  USA  Grand Total
Breakfast Foods  1453  6594  8502  16549
  Cereal  556  2585  3499  6640
  Pancake Mix  112  701  844  1657
  Pancakes  156  603  865  1624
  Waffles  629  2705  3294  6628
Grand Total  1453  6594  8502  16549
From the table above, the USA sells the most of every product, that is, cereal, pancake mix, pancakes and waffles, so overall consumption of all the products is highest in the USA compared with the other two countries. Across all countries, cereal is the most sold food category (6640 units), followed closely by waffles (6628), then pancake mix (1657) and pancakes (1624). The broad consumption pattern is similar across countries, with cereal and waffles dominating everywhere, but the ranking differs by country: waffles outsell cereal in Canada and Mexico, and pancakes outsell pancake mix in Canada and the USA.
Row Labels  Unit Sales  Store Sales
USA  186899  396294.93
  OR  60612  128598.5
  WA  126287  267696.43
Grand Total  186899  396294.93
From the table above, unit sales are higher in Washington (WA) than in Oregon (OR). Store sales follow the same pattern because the average price per unit is nearly the same in both states (about 2.12), so store sales are roughly proportional to unit sales. Unit sales in Washington are slightly more than double those in Oregon, and store sales are correspondingly slightly more than double as well.
The table below gives the key trend for beer and wine across the brands of both products. Unit sales are much higher for wine than for beer. Among the beer sub-categories, “Good” is the best-selling brand, followed by Pearl, Portsmouth, Walrus and Top Measure. A similar trend is observed for the wine sub-categories, where “Good” is the most sold brand, followed by Pearl, Portsmouth, Top Measure and Walrus.
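The proportionality between unit sales and store sales can be checked by dividing store sales by unit sales for each state. The sketch below does this with the figures from the table above; treating the resulting ratio as an average price per unit is an interpretation made here, and the variable names are illustrative.

```python
# state -> (unit sales, store sales), copied from the table above
state_sales = {"OR": (60612, 128598.5), "WA": (126287, 267696.43)}

# Average store sales per unit sold, read here as an average unit price
price_per_unit = {
    state: sales / units for state, (units, sales) in state_sales.items()
}

# How many times larger Washington is than Oregon on each measure
unit_ratio = state_sales["WA"][0] / state_sales["OR"][0]
sales_ratio = state_sales["WA"][1] / state_sales["OR"][1]

for state, price in price_per_unit.items():
    print(f"{state}: {price:.3f} per unit")
print(f"WA/OR unit sales: {unit_ratio:.2f}, store sales: {sales_ratio:.2f}")
```

Because the two per-unit values are almost identical, the ratio of store sales between the states closely tracks the ratio of unit sales.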
Row Labels  Unit Sales
Beer and Wine  13069
  Beer  3359
    Good  767
    Pearl  725
    Portsmouth  713
    Top Measure  546
    Walrus  608
  Wine  9710
    Good  2097
    Pearl  2028
    Portsmouth  1942
    Top Measure  1883
    Walrus  1760
Grand Total  13069
This study examines the survival chances of passengers on the Titanic and identifies significant variables such as age, parch, sibsp and fare. Using principal component analysis and these significant variables, a predictive model is developed that helps in understanding the survival chances of passengers on the Titanic.
The Rattle data mining tool is used for this purpose, and correlation coefficients between the different variables are determined to help in comprehending and analysing the survival rates of the passengers. The decision tree is built on the significant variables, and its different nodes correspond to different survival chances for the passengers on the Titanic.