ICT704 Non Relational Database Systems Assignments
Delivery in day(s): 4
This is a solution of rattle data mining assessment in which discuss identification of five key variables contributing most to survival rate of passenger, Role of quality of data in order to ensure proper data warehouse architecture, the titanic uses a decision tree model in rattle data mining tool.
Sinking of Titanic boat was one of the most famous phenomenons which have created one of the most memorable historic events. There were so many passengers travelling through Titanic boat and some of them could survive while other lost their life. In present paper we would try to analyze survival rate of the passengers travelling through Titanic boat, factors which were of immense importance in determining their survival and developing a predictive model in order to determine chances of survival for the passengers based on specific variables.
Present paper would make usage of data collected for passengers in form of their age, sex, pclass, number of siblings in boat, number of parents in boat, body identification number and passenger fair etc. Rattle data mining tool would be used in order to analyze the data with specific tools such as correlation, decision tree and principal component analysis etc. These tools would be helpful in order to find out the variable of importance for the passengers which are helpful to ensure higher survival rate for the passenger. Further predictive modeling would be done based on the decision tree so that a model can be developed in order to predict whether under given circumstance a particular passenger would have survived or not.
Identification of the five key components contributing to the survival rate of the passengers for Titanic can be identified with the help of principal component analysis which can be carried out with the collected data. Using principal component method with the rattle data mining tool along with correlation analysis below output can be obtained:
As shown in figure above that there are five components which have been identified as the principal components and contributing to the event of survival of passenger in Titanic boat. Through these five components found by principal component analysis method we can explain 100% of the variance present in the data. Variance explained by each of the principal components in the current phenomenon can be explained through usage of below table which describes the standard deviation and variance explained by each of the principal components.
PC |
PC1 |
PC2 |
PC3 |
PC4 |
PC5 |
SD |
1.190 |
1.130 |
1.019 |
0.867 |
0.717 |
Variance |
0.283 |
0.255 |
0.207 |
0.150 |
0.102 |
Cumm. Prop. |
0.283 |
0.538 |
0.746 |
0.897 |
1.000 |
Table 1: Showing the principal component analysis in terms of standard deviation and variance
As explained by above table that principal component 1 can be used in order to explain 28.3% of the variance in the data, principal component 2 for 25.5%, principal component 3 can explain 20.7%, principal component 4 can explain 15% of variability while principal component 5 is able to explain remaining 10.2% of the variance in the data. In order to identify each of the principal components used in the current analysis rotation for each of the principal component as given in the below can be considered:
Factor |
PC1 |
PC2 |
PC3 |
PC4 |
PC5 |
Age |
-0.455 |
0.180 |
-0.579 |
0.641 |
-0.115 |
Sibsp |
-0.572 |
0.222 |
0.532 |
0.115 |
0.570 |
Parch |
-0.051 |
0.761 |
0.268 |
-0.112 |
-0.576 |
Fare |
-0.555 |
-0.037 |
-0.383 |
-0.735 |
-0.034 |
Body |
0.392 |
0.579 |
-0.401 |
-0.143 |
0.571 |
Table 2: Showing rotation variable for principal component identified
As given in above table that there are five different principal components for which rotation with each of the relevant variables can be determined. From the above table Sibsp which is number of siblings/spouse aboard is the most important variable which determines the probability of sinking for a person. This principal component is followed by parch which is number of parents/children aboard, age, passenger fare and body identification number of the passenger. Justification for each of the principal component can be given as below:
The five components identified above have high degree of relevance among them as well which can be found out through correlation analysis as given below:
Factor |
Age |
Sibsp |
Parch |
Fare |
Survived |
Body |
Age |
1.000 |
-0.267 |
-0.147 |
0.199 |
-0.052 |
0.125 |
Sibsp |
-0.267 |
1.000 |
0.360 |
0.151 |
-0.016 |
-0.129 |
Parch |
-0.147 |
0.360 |
1.000 |
0.176 |
0.078 |
0.0670 |
Fare |
0.199 |
0.151 |
0.176 |
1.000 |
0.233 |
-0.044 |
Survived |
-0.052 |
-0.016 |
0.078 |
0.233 |
1.000 |
NA |
Body |
0.125 |
-0.129 |
0.067 |
-0.044 |
NA |
1.000 |
Table 3: showing the correlation coefficient for each of the correlated variable
As shown in the figure above that different variables showing the sinking phenomenon of Titanic boat have varying degree of correlation among them. Variables with strong color shows strong degree of correlation while variables with light color shows lesser degree of correlation among the variables. Positive correlation among the variables is shown by blue dots while negative correlation is represented through orange dots. Hence from the above figure it can be identified that there is high degree of positive correlation exists between number of parents/children living aboard and number of siblings living aboard. Similar degree of positive correlation exists between passenger fare and survival rate of the passengers.
Similar orange balls represent the negative correlation among the variables hence there is high degree of negative correlation between number of siblings and age of the passengers. Also negative correlation also exists among the variables named survival rate & age, body & fare and no of parents/children residing aboard and age. Among the other variables represented in the table very low degree of positive or negative correlation exists which is not of much statistical significance from the point of view of analysis.
In order to further investigate the correlation cluster among the presented variable below figure can be represented which shows the variable correlation cluster for the various variable present in the task.
From the above figure it can be clearly demonstrated that there is very high degree of relationship among the variable age and survival rate of the passenger while lowest degree of correlation exist among the variable name body identification number and survival rate of a passenger.
In order to predict the survival rate of passenger boarded in the Titanic several factors are of immense importance which would be used in order to develop the predictive model for the event. As evident from the decision tree below that there are mainly four variables which are helpful in determining the probability rate of a passenger to be surviving from the boat. These four variables are fare paid by the passenger, no of siblings boarded in the boat, no of parents/children boarded in the boat and port of embarkation.
The four variables which are playing the important role in determining the survival chances of a passenger through Titanic boat have sub category in them which are of particular importance. First variable named passenger fare which is the fare paid by each of the passenger have two sub categories which are more than 25.467 and less than or equal to 25.467. Based on these two fare values chances for the passenger survival would vary as passengers paying higher fare would be having higher degree of probability for survival against the passengers paying lower fare and having lower degree of survival chances in the Titanic (Weiss and Indurkhya, 1998).
Similarly variable named Sibsp which provides the analysis of number of siblings boarded in the titanic boat hence giving chances of survival for the passengers. There would be adverse relation between the variable named Sibsp and survival rate of the passenger. With passengers having higher number of siblings boarded in the boat there would be lesser chances for the survival of the passengers. While passengers having lesser number of siblings boarded in the boat there would be higher probability for the survival of such passengers. For the current analysis a level of 2 siblings have been taken, for siblings more than 2 there would be lower chances of survival while for lesser than 2 siblings boarding the boat there would be more chances of survival for the passengers.
Third variable of consideration for the current analysis would be number of parents/children boarding the boat. There would be inverse proportionate relationship between the variable named parch and survival rate of the passengers. Parch variable represent the number of parents or children boarding the boat. The level of classification for the current variable has been taken as 1, i.e. with higher number of parents/children boarding the boat there would be lesser chances of survival for the passengers (Theodoridis and Koutroumbas, 2009).
Fourth variable which is of significance in the current analysis is port of embarkation as different ports are having varying degree of survival for the passengers. Some ports are considered to be safe while others are being treated as more risky in comparison to other ports embarked. There are three ports which have been considered for analysis of the survival rate among the passengers and these ports are Cherbourg, Queenstown and Southampton. Port of embarkation Southampton has higher probability of survival for the passengers in comparison to port of embarkation as Cherbourg and Queensland.
Decision tree made in order to provide predictive modeling for the passengers of Titanic has lead to development of different leaf nodes which would decide the chances of survival for any passenger based on the importance factors. As evident from the above figure that there are basically five nodes formed in the analyses which have been given number as node 3, 4, 7, 8 and 9. Each of the nodes represented in the above figure gives varying degree of prediction regarding survival of a passenger in the boat. Each node has been formed as a result of the combination through variables which are important in order to determine the survival rate of the passengers.
For each of the node given in the analysis probability of passenger survival can be determined through underlying variables and below description gives the analysis for each node with prediction regarding survival of the passengers:
Node 8: Node 8 has the prediction of highest chances for the survival of the passengers through underlying variables forming the node. There would be 15 sample data which are qualifying the node 8 and would be treated as survived as per the analysis carried out. Node 8 represents the passengers who have paid higher than 25.467 of passenger fare, having less than 2 siblings and more than 1 parent/children boarding in the boat. Survival chances for these passengers would be highest among all the passengers as given through predictive model developed in the context.
Node 4: Node 4 is the second node with highest probability of passenger survival as per the given variables of combination in order to define the survival rate of the passengers. Node 4 represents a set of 100 passengers who have paid passenger fair less than 25.467 and having port of embarkation as Southampton.
Node 7: Node 7 represents a set of 65 passengers boarding the boat with specific characteristics such as passenger fare greater than 25.467, less than or equal to 2 siblings in the boat and less than 1 parent/children boarding the boat. After Node 8 and Node 4 passengers encircled in node 7 are having highest degree of survival chances in comparison to adjacent nodes.
Node 3 & 9: Node 3 and contains 40 and 8 passengers respectively with their specific characteristics. Node 3 contains the passengers with lesser than 25.467 fare and port of embarkation as C or Q. While node 9 is formed with the passengers with passenger fare more than 25.467 and greater than 2 siblings in the boat.
Data warehouse architecture is used in order to help decision makers to take timely and better decision. Aim of a proper data warehouse architecture system is to provide the right information for business at right place in right time with cost effective solution so that better decision making can take place. Quality of information is very importance for the data warehouse architecture as it would be helpful in order to make accurate decisions with faster pace in comparison to the data warehouse with lower quality information contained into it (Murphy, 2011). Quality of information is important due to the reason that information is the only source which is helpful in taking decision.
Quality of data warehouse architecture is important in order to determine five key variables to the data warehouse architecture which are accessibility, interpretability, usefulness, believability and validation. Accessibility of the data warehouse architecture signifies data source, data warehouse design and processes used in data warehouse. Hence in order to have data accessibility which is one of the prime feature for any data warehousing architecture data quality should be improved so that there can be proper data design managed in order to arrange the data (Miner et al, 2009).
Interpretability of the data signifies the design of data warehouse, models & language used in order to interpret data, processing of queries and data warehousing processes used in order to analyze the collected data. In order to develop data warehouse architecture which can provide high level of interpretability for the decision maker it is essential to have high quality of information inputted in the data warehouse architecture. Inputting high quality information would be essential in order to manage the models and languages used in the data warehouses (Teisseire et al, 2007).
Usefulness of the data warehouse architecture signifies that there are updated policies, data warehouses evolution, data sources, data warehouse design and processes in order to manage the usefulness of the data. For a data warehouse architecture to become useful for the people making use of it, quality of data should be ensured in such a way that warehouse architecture can be properly managed and decision making process can take place much faster and accurately.
Believability of the data warehouse architecture signifies the reliability of the data so that data results outputted through warehouse architecture can be easily believed upon and can be used for the analysis purpose. Finally quality of data can be used in order to ensure the validation of data which comes through validating the data contained in warehouse structure with desired set of data. Validation of data can be ensured through high quality of data only (Kumar et al, 2005).
Graph below shows the key trend observed in the each quarter for the store values in different countries and state provinces. Among all the quarters in year 1998 quarter 1 has proved better for all the countries i.e. whether its Canada, Mexico or USA and in terms of all the store parameters i.e. store cost, store sales and unit sales for the store. Quarter 4 has remained with the lowest value of all store parameters for all the three countries and their state provinces. Quarter 2 and quarter 3 has remained in similar trend for all the parameters for three countries.
Values |
|||
Row Labels |
Store Cost |
Store Sales |
Unit Sales |
1998 |
432565.7289 |
1079147.47 |
509987 |
Quarter 1 |
116512.6905 |
290873.18 |
137078 |
Canada |
9576.6446 |
23881.13 |
11160 |
Mexico |
47502.2264 |
118589.41 |
56133 |
USA |
59433.8195 |
148402.64 |
69785 |
Quarter 2 |
115080.3318 |
287009.99 |
135745 |
Canada |
11072.1808 |
27685 |
12885 |
Mexico |
45683.9482 |
113830.59 |
54005 |
USA |
58324.2028 |
145494.4 |
68855 |
Quarter 3 |
118322.14 |
295040.55 |
139412 |
Canada |
10915.5866 |
27176.3 |
12966 |
Mexico |
49267.9496 |
122706.05 |
57872 |
USA |
58138.6038 |
145158.2 |
68574 |
Quarter 4 |
82650.5666 |
206223.75 |
97752 |
Canada |
7768.1585 |
19303.03 |
9146 |
Mexico |
30133.9206 |
75167.54 |
35904 |
USA |
44748.4875 |
111753.18 |
52702 |
Grand Total |
432565.7289 |
1079147.47 |
509987 |
Table 4: Showing the store values for state/provinces for all quarters
Table below provides the product category and sub-category for the unit sales in various countries and state/provinces for breakfast. From the below graph it can be clearly determine that Cereals is having the highest sales followed by the Waffles, pancake mix and pancake. Among the three selected countries USA has highest consumption of breakfast food followed by Mexico and least consumption in there is Canada. Among each kind of sub category USA has the highest consumption for cereals followed by waffles, pancakes and pancakes mix. Among all the sub categories and countries cereals in USA is the sub product category which is highly consumed by the customers.
Unit Sales |
Column Labels |
|||
Row Labels |
Canada |
Mexico |
USA |
Grand Total |
Breakfast Foods |
1453 |
6594 |
8502 |
16549 |
Cereal |
556 |
2585 |
3499 |
6640 |
Pancake Mix |
112 |
701 |
844 |
1657 |
Pancakes |
156 |
603 |
865 |
1624 |
Waffles |
629 |
2705 |
3294 |
6628 |
Grand Total |
1453 |
6594 |
8502 |
16549 |
Analysis below represents the unit sales and store sales values for the two provinces in United States of America. As shown in the table below that for the two states i.e. Oregon & Washington, unit sales is more than doubled for the Washington in comparison to unit sales for Oregon. Similar pattern has been observed for the store sales values as well for which total sales value is more than doubled for Washington in comparison to Oregon. Hence it can be observed from the analysis that consumption for Washington sales is double in comparison to Oregon which might be due to higher population.
Row Labels |
Unit Sales |
Store Sales |
USA |
186899 |
396294.93 |
OR |
60612 |
128598.5 |
WA |
126287 |
267696.43 |
Grand Total |
186899 |
396294.93 |
Table below provides the statistical analysis for the two product categories i.e. wine and bear along with their sub product categories in terms of their unit sales. From the analysis below it can be determined that wine has higher unit sales in comparison to the beer and sales of wine is more than twice as compared to the unit sales of beer. Among the product category of beer good is the brand which is highest selling while top measure is the lowest selling brand in beer product category. For wine product category Good is the highest selling brand while Walrus is the lowest selling brand in the wine category.
Row Labels |
Unit Sales |
Beer and Wine |
13069 |
Beer |
3359 |
Good |
767 |
Pearl |
725 |
Portsmouth |
713 |
Top Measure |
546 |
Walrus |
608 |
Wine |
9710 |
Good |
2097 |
Pearl |
2028 |
Portsmouth |
1942 |
Top Measure |
1883 |
Walrus |
1760 |
Grand Total |
13069 |
Present research work has been carried out in order to analyze the survival rate of passengers boarded in Titanic with help of data collected. There were more than 2200 people present on the ship when it meets with a collision with an ice berg. In this disaster a total 1400 plus people died and only 700 people were saved. Data analysis was done to assess the probability of saving for a passenger who was aboard on titanic through data mining. There were many variables which impacted the survival probability of a passenger like their ticket category, age, sex, identification status etc. Data analysis done through rattle data mining tool has been used with specific data mining technique such as principal component analysis to determine which factors are of ultimate importance, correlation to find out value of correlation between two given variables and decision tree to develop predictive modeling for the passengers of Titanic. From the principal component analysis it was found out that five variables which are of ultimate importance are number of sibling boarded, no of parents boarded, passenger fair, embarkation.