CIS8008 Data Warehouse Assignment
The Titanic was a British passenger liner that sank in the North Atlantic Ocean on the night of 14 April 1912 and the early morning of 15 April 1912. She was one of the largest ships of her time and was on her maiden voyage, which started at Southampton and called at Cherbourg in France and Queenstown in Ireland. In this report we use the Rattle data mining tool to understand and analyse the survival rate of the passengers on the ship. The tool helps to identify the important variables and to understand the passengers' chances of survival. These significant variables are then used to develop a decision tree, that is, a predictive model of passenger survival. The important variables are age, number of siblings or spouses aboard (Sibsp), number of parents or children aboard (Parch), passenger fare and embarkation point.
Each of these variables is associated with a different chance of survival. In this report we consider each significant variable and its individual level of importance, which helps us to determine the different survival chances of the passengers on the ship. The quality of data and its significance are also discussed, which supports the analysis of sales for different territories and different products. That analysis is based on a reading of various graphs, together with an interpretation of the sales results and of the consumption patterns of different products and services in different territories.
2.0 Identifying Significant Variables & Its Analysis
The pie graph above shows the variance explained by each principal component, which helps in understanding the significant variables. The five principal components explain nearly 100 % of the variance in the data collected. Principal component 1 explains the most, nearly 28.3 % of the total variance. The same information is shown as a bar graph below for clarity:
The proportions for all principal components sum to nearly 100 %, so we can conclude that these five principal components represent nearly all of the variance in the data. The bar graph below also shows the standard deviation and the variance for each of the five principal components. The first two principal components explain nearly 54 % of the total variance in the passenger data, while the last principal component explains 10.2 %.
The five key variables are age, number of siblings or spouses aboard (Sibsp), number of parents or children aboard (Parch), the fare paid by the passenger boarding the Titanic and the body identification number. A rotation is computed for each of the five principal components and each of the five variables is analysed. To explain them, we take a closer look at each variable in turn.
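The relationship between the standard deviations and the variance shares shown in the bar graph can be sketched in a few lines of Python. This is only an illustration, not the Rattle output: the function name `variance_shares` and the standard deviations in `example_sdevs` are hypothetical values chosen for the example, not numbers read from the report's figure.

```python
def variance_shares(sdevs):
    """Convert component standard deviations into proportions of total variance."""
    variances = [s ** 2 for s in sdevs]   # variance is the squared standard deviation
    total = sum(variances)                # total variance across all components
    return [v / total for v in variances]


# Hypothetical standard deviations for five principal components (illustrative only).
example_sdevs = [1.19, 1.13, 1.02, 0.87, 0.71]

shares = variance_shares(example_sdevs)
for i, share in enumerate(shares, start=1):
    print(f"PC{i}: {share:.1%} of total variance")
```

Each share is the squared standard deviation of a component divided by the sum of the squared standard deviations, so the shares always add up to 100 %.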
Significant Variable 1 - Sibsp: This is the first significant variable and is associated with principal component 1, which alone explains nearly 28.3 % of the variation in the data. Sibsp is the number of siblings or spouses a passenger had aboard. As this number increases, the chances of survival decrease, so the variable is inversely related to survival: the more siblings or spouses aboard, the lower the survival rate.
Significant Variable 2 - Parch: This is the second significant variable and is associated with principal component 2, which explains nearly 25.5 % of the total variance in the data. Parch is the number of parents or children a passenger had aboard. Together, the first two significant variables explain nearly 54 % of the variation, that is, more than half of the variation in the passenger data. As the number of parents or children aboard increases, the chances of survival decrease.
Significant Variable 3 - Age: This is the third significant variable, which explains nearly 21 % of the total variation in the passenger data. The first three variables combined explain nearly 75 % of the total variation, which means that three quarters of the variation in the passenger data is explained by Sibsp, Parch and Age. Age is the age of the passenger on board the Titanic, and as age increases the chances of survival decrease.
Significant Variable 4 - Passenger Fare: This is the fourth significant variable and explains 15 % of the variation in the data. There were different passenger classes on board the Titanic, and the fare increased with the class. The higher the passenger fare, the higher the chances of survival. The first four variables explain nearly 90 % of the total variation in the data.
Significant Variable 5 - Port of Embarkation: This is the fifth significant variable and alone explains nearly 10 % of the total variation in the data. There were three ports of embarkation: Southampton, Cherbourg and Queenstown. Survival rates differ by port, and Cherbourg is found to have the highest survival rate, ahead of Southampton and Queenstown.
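The cumulative percentages quoted for the five variables can be checked with a short running sum. The sketch below uses the approximate proportions reported in this section (28.3, 25.5, about 21, 15 and 10.2 %); the dictionary name `reported` and the label `Embarked` are only labels chosen for the example.

```python
from itertools import accumulate

# Approximate proportions of variance (percent) as reported in this section.
reported = {
    "Sibsp": 28.3,
    "Parch": 25.5,
    "Age": 21.0,
    "Fare": 15.0,
    "Embarked": 10.2,
}

# Running total of explained variance, rounded to one decimal place.
cumulative = [round(total, 1) for total in accumulate(reported.values())]

for name, total in zip(reported, cumulative):
    print(f"up to and including {name}: {total} %")
```

The running totals are 28.3, 53.8, 74.8, 89.8 and 100.0 %, which matches the "nearly 54 %", "nearly 75 %" and "nearly 90 %" figures quoted above.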
Correlation Clusters among different variables
The table and figure above show the correlation among variables such as Age, Sibsp, Parch, Fare and Body. Age is positively correlated with both Fare and Body. The highest correlation, 0.36, is between Sibsp and Parch, which means that the number of siblings or spouses aboard is positively correlated with the number of parents or children aboard. Likewise, Fare is positively correlated with Survived: the higher the passenger fare, the higher the chances of survival on the Titanic. Age and Sibsp are strongly negatively correlated with each other, and both are negatively correlated with Survived. We can infer that older passengers and passengers with more siblings or spouses aboard had lower chances of survival. The correlation cluster shows these relationships clearly: there is a clear correlation between Fare and Survived, and some correlation between Parch and Survived.
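For readers who want to see how a single entry of such a correlation table is obtained, the following is a minimal sketch of the Pearson correlation coefficient in plain Python. The `fare` and `survived` lists are hypothetical toy rows made up for the example, not the passenger data used in this report, and the helper name `pearson` is also an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy rows: fare paid and survival flag (1 = survived, 0 = did not).
fare = [7.25, 71.28, 8.05, 53.10, 26.55, 13.00]
survived = [0, 1, 0, 1, 1, 0]

r = pearson(fare, survived)
print(f"correlation between fare and survived: {r:.2f}")
```

A value close to +1 indicates that the two variables rise together, a value close to -1 indicates that one falls as the other rises, and a value near 0 indicates little linear relationship.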
3.0 Building Predictive Model
A predictive model is developed for the survival chances of passengers. The predictive modelling technique helps to identify and analyse in detail the chances of survival of the passengers on the Titanic. A decision tree analysis is carried out to predict survival. It starts by identifying the variables that have a significant relationship with the survival chances of passengers, and then examines the correlation among those variables. Four variables are found to be significant in determining survival: passenger fare, Sibsp (the number of siblings or spouses aboard), port of embarkation and Parch (the number of parents or children aboard).
The Titanic had three ports of embarkation: it started at Southampton and went on to Cherbourg in France, followed by Queenstown in Ireland. The figure below shows each of these variables in the decision tree and the relationships among them. The nodes of the decision tree identify which variables, or combinations of variables, are associated with the highest chances of survival on the Titanic. The passenger fare is split in two, one branch above 25.467 and the other below 25.467. The passenger fare is directly related to survival: the higher the fare, the higher the chances of survival.
Node 8 is formed by a passenger fare greater than 25.467, Sibsp less than or equal to 2 and Parch greater than 1. This node represents the highest survival chances of the passengers on the Titanic, with n = 15. Sibsp and Parch are each also split in two, and their combination with the other variables produces the nodes that help to predict survival. Node 4 is the next most significant node. It combines a passenger fare of less than 25.467 with Southampton as the port of embarkation, and has n = 100. In the tree, Southampton is the port with the highest chances of survival compared with Cherbourg and Queenstown. Node 4 therefore represents passengers who paid less than 25.467 and boarded at Southampton, and they have the next highest chances of survival after node 8.
Node 7 comes after node 4 and is built from three conditions: a passenger fare greater than 25.467, Sibsp less than or equal to 2 and Parch less than or equal to 1. This node has the next highest probability of survival, with n = 65, and is also significant for passenger survival on the Titanic. Node 3 is built from passengers who paid less than 25.467 and whose port of embarkation was Cherbourg or Queenstown. This node has low chances of survival. As already established, a lower fare means lower chances of survival, and embarking at Cherbourg or Queenstown rather than Southampton lowers them further, so this node ranks below nodes 8, 7 and 4. Node 9 is the least significant node and is formed by a passenger fare greater than 25.467 and Sibsp greater than 2. This node has the lowest chances of survival of all the nodes.
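The splits described for nodes 3, 4, 7, 8 and 9 can be written out as a small rule function. This is a sketch of the rules as stated in the text, not the model object produced by Rattle. The function name `assign_node`, the single-letter port codes ("S", "C", "Q") and the treatment of a fare exactly equal to 25.467 (sent to the low-fare branch) are assumptions made for the example.

```python
FARE_SPLIT = 25.467  # fare threshold quoted in the text


def assign_node(fare, embarked, sibsp, parch):
    """Return the terminal node number for one passenger, following the stated splits."""
    if fare <= FARE_SPLIT:
        # Assumption: a fare exactly on the threshold falls in the low-fare branch.
        # Low fare: the port of embarkation decides between node 4 and node 3.
        return 4 if embarked == "S" else 3
    if sibsp > 2:
        # High fare with more than two siblings or spouses aboard.
        return 9
    # High fare, Sibsp <= 2: Parch decides between node 8 and node 7.
    return 8 if parch > 1 else 7


# Hypothetical passengers, one per terminal node.
print(assign_node(fare=80.0, embarked="S", sibsp=1, parch=2))  # node 8
print(assign_node(fare=10.0, embarked="S", sibsp=0, parch=0))  # node 4
print(assign_node(fare=30.0, embarked="C", sibsp=0, parch=0))  # node 7
print(assign_node(fare=10.0, embarked="Q", sibsp=0, parch=0))  # node 3
print(assign_node(fare=40.0, embarked="S", sibsp=4, parch=2))  # node 9
```

According to the discussion above, the nodes rank from highest to lowest survival chances as 8, 4, 7, 3 and 9.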
4.0 Data Quality & Data Warehouse Architecture
Data is considered to be of good quality if it represents the real-world scenario and so helps in making accurate decisions. When the data is good, the analysis is accurate and decision making is more effective (Teisseire et al., 2007). The quality of data depends on how well it represents the actual scenario and on features such as accuracy, standardisation and consistency. Keeping data in a data warehouse increases the reliability and efficiency of decisions. The data warehouse architecture consists of different layers, which help to secure the data and to maintain its accuracy and reliability. A data warehouse holding high-quality data can be used for reporting and data analysis, and the typical ETL-based data warehouse stores the raw data from different external source systems (Wang et al., 2002).
Data mining and analytical processing, which act as business intelligence tools, and quick decision making all depend on the quality of the data. A few key factors, such as data reliability, data accuracy and data management, play an important role in keeping the data in the warehouse at high quality. To house its key functions, a data warehouse architecture is generally built on staging, data integration and access layers: the source systems supply their output, and the data is then cleaned, transformed and loaded into the data warehouse (Kumar et al., 2005).
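The staging, cleaning, transforming and loading steps described above can be illustrated with a very small pipeline. The sketch below is only a toy example under stated assumptions: the function names (`extract`, `clean`, `transform`, `load`), the field names and the three raw rows are all hypothetical, and a real warehouse would use a database rather than a Python list.

```python
def extract():
    """Staging: return raw rows as they might arrive from a source system (hypothetical)."""
    return [
        {"country": " USA ", "quarter": "Q1", "store_sales": "120.50"},
        {"country": "canada", "quarter": "Q1", "store_sales": ""},  # missing value
        {"country": "Mexico", "quarter": "q2", "store_sales": "80.00"},
    ]


def clean(rows):
    """Cleaning: trim whitespace and drop rows whose sales value is missing."""
    cleaned = []
    for row in rows:
        trimmed = {key: value.strip() for key, value in row.items()}
        if trimmed["store_sales"]:
            cleaned.append(trimmed)
    return cleaned


def transform(rows):
    """Transformation: standardise text fields and convert sales to a number."""
    return [
        {
            "country": row["country"].upper(),
            "quarter": row["quarter"].upper(),
            "store_sales": float(row["store_sales"]),
        }
        for row in rows
    ]


def load(rows, warehouse):
    """Loading: append the prepared rows to the warehouse table (a list here)."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(clean(extract())), [])
for row in warehouse:
    print(row)
```

Of the three raw rows, the one with a missing sales value is dropped during cleaning, so only accurate, standardised and consistent rows reach the warehouse.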
5.0 Creating a report and accompanying graph
5.1 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists the store values for each quarter of 1998 across country and state province and comment on key trends and patterns in this report
We analyse each of the three countries, namely Canada, the USA and Mexico, in terms of store cost, store sales and unit sales in each of the four quarters of 1998. In Canada, store sales were highest in quarter 2 and lowest in quarter 4. Quarter 2 and quarter 3 had similar unit sales and total sales in Canada in 1998.
In the USA, the first three quarters of 1998 had similar sales, in the range of 145000 to 149000, and, as in Canada, sales were lowest in the last quarter of 1998. Comparing the store sales and unit sales of the USA and Canada in 1998, the USA figures are much higher.
Lastly, in Mexico store sales were highest in quarter 3 of 1998 and lowest in quarter 4. All three countries have their lowest sales in the last quarter of 1998. USA store sales in 1998 are higher than those of Canada and Mexico combined. The USA is a large market in terms of geography and consumer base, which may be the reason for such high numbers compared with Mexico and Canada.
Below are the sales and store costs in tabular form for each of the four quarters of 1998. As discussed above, the numbers for the USA are higher than those of Mexico and Canada combined, and the geography and consumer base of the USA are the likely reason.
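The kind of country-by-quarter summary used in this section can be reproduced with a simple grouping step. The sketch below builds such a pivot in plain Python. The rows in `sales_rows` are hypothetical illustrative values, not the 1998 figures analysed in this report, and the function name `pivot_sales` is an assumption.

```python
from collections import defaultdict

# Hypothetical transaction rows: (country, quarter, store_sales).
sales_rows = [
    ("USA", "Q1", 100.0),
    ("USA", "Q1", 50.0),
    ("USA", "Q2", 110.0),
    ("Canada", "Q1", 20.0),
    ("Canada", "Q2", 25.0),
    ("Mexico", "Q1", 60.0),
]


def pivot_sales(rows):
    """Sum store sales for every (country, quarter) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for country, quarter, sales in rows:
        table[country][quarter] += sales
    return table


table = pivot_sales(sales_rows)
for country in sorted(table):
    for quarter in sorted(table[country]):
        print(f"{country:<8}{quarter:<4}{table[country][quarter]:>10.2f}")
```

Each country becomes a row group and each quarter a column, which is the same layout a spreadsheet pivot table or a Tableau cross-tab would give.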
5.2 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists for country and state province across product category and unit sales sub product category of breakfast foods and comment on key trends and patterns in this report
For the four breakfast foods, namely cereal, pancake mix, pancakes and waffles, each pie chart below shows the unit sales in the three countries Mexico, Canada and the USA. The USA alone accounts for more than 50 % of the total unit sales of each breakfast food. Canada has the lowest unit sales of each of the four breakfast foods.
The graph above clearly shows the demand and consumer base for breakfast foods in the USA compared with Mexico and Canada. In Canada, demand for these breakfast foods is low, which accounts for the lower sales. Cereal and Pancakes have the highest demand in the USA, which alone accounts for 53 % of their sales. Mexico's share of unit sales for each of the four breakfast foods ranges between 37 % and 42 %; Pancake Mix, at 42 %, is the highest for Mexico. The pie graph above also reflects the size of the market for breakfast foods in the USA, where demand is proportional to sales.
5.3 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists for the states of Oregon (OR) and Washington (WA) the total sales and total sales value in dollar terms and comment on key trends and patterns in this report.
Here two states, Oregon and Washington, are taken and their sales are analysed. Oregon accounted for 32 % of the unit sales and store sales, and Washington accounted for 68 %. The results are also shown in tabular form below: Washington accounts for more than two thirds of unit sales and store sales, while Oregon accounts for about one third. This also shows the larger consumer base in Washington compared with Oregon. Because Washington has a higher population than Oregon, demand is higher, which accounts for the higher sales in Washington.
5.4 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists by product categories of beer and wine and their product sub categories, by order of unit sales and comment on key trends and patterns in this report.
The bar graph below shows the unit sales of beer. The Good brand has the highest unit sales, accounting for 23 % of the total, followed closely by Pearl and Portsmouth. The shares of the different brands vary from 16 % to 23 %, so there is no significant difference in sales between the brands.
The pie chart below shows the unit sales of the different wine brands, with the Good brand accounting for 22 % of total unit sales. Demand for the Good brand is the highest in the wine category, as it is in the beer category. Wine accounts for nearly 74 % of total unit sales and beer for 26 %. The table below presents the data in tabular form. It shows that sales of the Good brand are the highest in both the beer and wine categories, and that the wine numbers are nearly three times the beer numbers.
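The statement that wine sells nearly three times as much as beer follows directly from the two reported shares. A quick check, using only the percentages quoted above (the variable names are chosen for the example):

```python
# Shares of total unit sales as reported in the text (percent).
wine_share = 74.0
beer_share = 26.0

# How many times larger the wine share is than the beer share.
ratio = wine_share / beer_share
print(f"wine to beer ratio: {ratio:.2f}")
```

The ratio is about 2.85, which is consistent with describing wine unit sales as nearly three times beer unit sales.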
This study explains the survival chances of passengers on the Titanic. Significant variables such as Age, Parch, Sibsp and Fare were identified using principal component techniques, and these variables were used to develop a predictive model that helps in understanding the survival chances of passengers. The Rattle data mining tool was used for this, and correlation coefficients between the variables were determined, which helps in comprehending and analysing the survival rates of passengers on the Titanic. The decision tree was developed from the significant variables, and each of its nodes represents a different chance of survival for passengers on the Titanic.
Poncelet, P., Masseglia, F., Teisseire, M. (2007) "Data Mining Patterns: New Methods and Applications," Information Science Reference, ISBN 978-1-59904-162-9.
Tan, P., Steinbach, M., Kumar, V. (2005) "Introduction to Data Mining," ISBN 0-321-32136-7.
Kahn, B., Strong, D., Wang, R. (2002) "Information Quality Management Benchmarks: Product and Service Performance," Communications of the ACM, April 2002. pp. 184–192.