INFO H515

1. Consider the cpus data from R package MASS.

We will use linear regression to investigate the relationship between variables in this data set and estimated performance (variable estperf). Do not use published performance as a predictor of performance in this problem.

a. Investigate the relationship between variables in the cpus dataset, both numerically and visually.
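For reference, a minimal sketch of the kind of numeric and visual investigation a) asks for; the column subset shown (which excludes name and the forbidden perf) is one reasonable choice, not the required answer:

```r
library(MASS)  # provides the cpus data frame

# Numeric summary: pairwise correlations among the numeric variables,
# excluding name (a factor) and perf (published performance, not allowed)
num_vars <- cpus[, c("syct", "mmin", "mmax", "cach", "chmin", "chmax", "estperf")]
round(cor(num_vars), 2)

# Visual summary: scatterplot matrix of the same variables
pairs(num_vars, main = "Pairwise relationships in the cpus data")
```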

b. Use either method commonly used in the book/lecture notes to build a linear regression model predicting estimated performance from predictors in the cpus dataset. Do not consider name in this modelling approach.

c. Create a residual plot using this model and comment on its features. Do any of the assumptions of linear regression seem to be violated? What might be done to adjust our model? Adjust the model if necessary by considering various residual plots, updating the model, and assessing residual plots using the updated model.

d. How well does the final model fit the data? Comment on some model fit criteria from the model built in c).

e. Interpret all variables in your final model using complete sentences, making sure to account for the fact that this may be a multivariable model. Give interpretations in terms of as meaningful of units as possible (it may not be possible to use seconds for cycle time - the answer is too large, but you may use MB instead of kB, for instance). Adjust interpretations as needed, both for units, and the fact that our outcome has been log-transformed (how do we get to the raw data values from a log transformation? Start by thinking: what is the inverse of the log function???)
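To illustrate the back-transformation hint: if the outcome is modeled on the natural-log scale, exp() is the inverse, so a slope b translates into a multiplicative effect of exp(b) on the raw scale. A minimal numeric sketch, where the slope value 0.002 and the mmax example are made up purely for illustration:

```r
# Suppose a fitted model were log(estperf) = b0 + b1 * mmax + ...,
# with a HYPOTHETICAL slope b1 = 0.002 per kB of maximum main memory.
b1 <- 0.002

# Effect of a 1 MB (1024 kB) increase in mmax on the raw estperf scale:
multiplier <- exp(b1 * 1024)
multiplier               # estperf is multiplied by this factor
(multiplier - 1) * 100   # i.e., roughly this percent increase
```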

f. Calculate indices that help assess multicollinearity between predictors in your final model. Is there evidence of multicollinearity? What does this imply, and should you take action? Take action if appropriate.
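One commonly used set of indices is the variance inflation factors. A sketch, assuming the car package is installed; the model formula shown is a placeholder, not your actual final model from c):

```r
library(car)  # provides vif()

# Placeholder model -- substitute your final model from c)
fit <- lm(log(estperf) ~ syct + mmin + mmax + cach, data = MASS::cpus)

# Rule of thumb: VIF values above 5 (or 10) suggest problematic multicollinearity
vif(fit)
```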

g. Are there any outliers or influential observations in this data? Calculate relevant indices or provide visualizations to justify your answer. Make sure to use rules of thumb discussed in class if necessary for interpretations.

2. Consider the birthwt data from R package MASS.

We will investigate the relationship between low birth weight and the predictors in the birthwt data using logistic regression and discriminant analysis.

a. Investigate the relationship between variables in the birthwt dataset. Do you see anything surprising? Use both numeric and visual summaries.

b. Fit a logistic regression model using methods discussed in class/the book, similar to problem 1). Be careful to understand each variable in birthwt to avoid including variables that are not logically acceptable for inclusion in the model.

c. What do you notice regarding the variables ptl and ftv? What is your logistic regression model in b) (perhaps before performing variable selection) implicitly assuming regarding these variables’ effects on the log odds of giving birth to a low weight baby? Are these assumptions realistic?

d. Create a new variable for ptl named ptl2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories.

e. Create a new variable for ftv named ftv2 which is more useful for analysis. Keep in mind that with very small sample sizes, it may be worthwhile to collapse multiple categories. Also, it may be helpful to form tables which summarize low birth weight probabilities by levels of the variable in order to better understand the relationship between probability of low birth weight and the newly created variable.
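A sketch of the kind of summary table described here, using low and ftv from MASS::birthwt; the particular collapsing of levels shown is illustrative, not the required answer:

```r
library(MASS)  # provides the birthwt data

# Proportion of low-birth-weight babies at each raw level of ftv
with(birthwt, tapply(low, ftv, mean))

# Illustrative collapsed version: 0, 1, and "2 or more" visits
ftv2 <- cut(birthwt$ftv, breaks = c(-Inf, 0, 1, Inf), labels = c("0", "1", "2+"))
table(ftv2)
with(birthwt, tapply(low, ftv2, mean))  # low-birth-weight probability by level
```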

f. Using the newly created variables in d) and e), reassess the logistic regression model arrived at in b) using the new versions of ftv and ptl instead of the old versions.

g. In a manner similar to the approach used in the book, split the birthwt data into a training and test set, where the test set is about 20% the size of the entire dataset. Then, using variables that are justifiable for inclusion in the discriminant analysis, fit LDA and QDA models to the training set and form confusion matrices, calculate the sensitivity, specificity, and accuracy of each method using the test set, and do the same for the logistic regression models built in f) and b). Which model performs the best? Remember you MUST set the seed using the TeachingDemos package in a manner similar to that done in the notes (but don’t use my name to set the seed!)
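A sketch of the mechanics involved, assuming MASS and TeachingDemos are installed; the seed string, the 20% split, and the predictor set in the lda() call are all placeholders to be replaced with your own justified choices:

```r
library(MASS)           # birthwt data, lda(), qda()
library(TeachingDemos)  # char2seed()

set.seed(char2seed("YourName"))  # placeholder; use your own name

n        <- nrow(birthwt)
test_idx <- sample(n, size = round(0.2 * n))
train    <- birthwt[-test_idx, ]
test     <- birthwt[test_idx, ]

# Illustrative LDA with a PLACEHOLDER predictor set
fit_lda <- lda(low ~ age + lwt + smoke, data = train)
pred    <- predict(fit_lda, newdata = test)$class

cm <- table(predicted = pred, actual = test$low)
cm
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / actual negatives
accuracy    <- sum(diag(cm)) / sum(cm)
```

The same predict-then-tabulate pattern applies to qda() and to glm() fits (thresholding predicted probabilities at 0.5).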

h. Using your final model from f), interpret the estimates for all covariates.

3. Consider the function below, which is intended to count the number of 2s in a numeric vector of length 1 or greater.

countTheTwos <- function(x) {
  if (length(x) == 0) {
    return("Object x is of length 0. No 2s will be found.")
  }
  assess2s <- function(x) {
    if (x == 2) {
      2
      count <- count + 1
    }
  }
  count <- 0
  for (i in 1:length(x)) {
    assess2s(x[i])
  }
  return(count)
}
countTheTwos(c(1,2,3,2,2,4,7,-10))
##  0

Make a very small modification to the function above using the assign function in order to make the function work correctly. Run your countTheTwos() function with argument c(1,2,3,2,2,4,7,-10) when finished.
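As a reminder of the tool involved (this standalone illustration is not the exam answer itself): assign() can write a value into a chosen environment, such as the calling frame, rather than the local one that ordinary <- uses:

```r
outer <- function() {
  count <- 0
  inner <- function() {
    # Ordinary <- would create a new local 'count' inside inner().
    # assign() with an explicit envir writes to the caller's frame instead.
    assign("count", count + 1, envir = parent.frame())
  }
  inner()
  inner()
  count
}
outer()  # returns 2
```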

4. In continuation of 3), assume that assess2s() is instead defined in the global environment:

countTheTwos <- function(x) {
  if (length(x) == 0) {
    return("Object x is of length 0. No 2s will be found.")
  }
  count <- 0
  for (i in 1:length(x)) {
    assess2s(x[i])
  }
  return(count)
}

assess2s <- function(x) {
  if (x == 2) {
    count <- count + 1
  }
}

countTheTwos(c(1,2,3,2,2,4,7,-10))

Make a minor modification to the functions above so that countTheTwos() again works as in 3). You will still need a modification similar to the one you used in 3), but you must also add an additional modification. Run your countTheTwos() function with argument c(1,2,3,2,2,4,7,-10) when finished.

5. Why not use superassignment, i.e. <<-, in 4)?
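To see what is at stake in this question: <<- searches the enclosing (lexical) environments for an existing binding and, failing that, assigns in the global environment. A small standalone illustration of that global side effect:

```r
count <- 0  # a global variable

bump <- function() {
  # No local or enclosing 'count' exists, so <<- reaches all the way
  # up and modifies the GLOBAL count -- a side effect that persists
  # after the function returns.
  count <<- count + 1
}

bump()
bump()
count  # the global count is now 2
```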

6. Parsing Log Files

Please download the offline.txt file from Canvas under Exam. This file was obtained from a book written by Duncan Temple Lang and Deborah Nolan titled “Data Science in R: A Case Studies Approach to Computational Reasoning and Problem-Solving in R”, which I highly recommend. This file is composed of data which can be used to train an algorithm for positional tracking of people and things inside stores, warehouses, etc. using indoor positioning systems (IPS). Essentially, the strength of the signal between a handheld device (laptop, cellular telephone, tablet, etc.) and a fixed access point (such as a router) is measured at already known positions in many locations around the building. This training data can be used to build an algorithm for estimating the position of a handheld device from future observations. Our goal in this problem is just to parse the raw data files and create a more reasonable data structure, perhaps for future use. This dataset uses media access control (MAC) addresses, which are in the form xx:xx:xx:xx:xx:xx. The data per observation are: t, the timestamp (seconds since midnight, January 1, 1970 UTC); id, the MAC address of the scanning device; pos, the real position; degree, the orientation, in degrees, of the user carrying the scanning device; and MACofResponseI, the MAC address of the Ith responder to the scan. The values of the MAC responses are in the form: dBm (signal strength), channel frequency, and mode. The mode takes on the value 3 for access points within the building (routers), and 1 for adhoc devices, i.e. other individual handheld devices.

Parse the file from Canvas in order to create a final dataset named scanAgg that contains one row per observation in the original file, but summarizes the responses into just two new variables, adhoc and nResponse, where adhoc denotes the number of adhoc responses to each scan, and nResponse denotes the total number of responses to each scan. Also, parse the file to create multiple rows per scan, one for each response, so that if there are 12 MAC address responses, this scanning device will have 12 rows associated with its timestamp in the final dataset. Call this dataset scanLong. Print out the first 20 rows of each dataset as your final proof of answer. Make sure you have not reordered any of the data - I am looking for summaries of the first 20 lines that would result if you completed the operations above successfully. If you are having speed issues in creating large data frames, you may want to consider the R package data.table, which was made specifically for larger datasets. You may also want to look at the function rbindlist() from this package.
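Without offline.txt in hand the exact line layout cannot be shown, but the general parsing pattern (readLines(), strsplit(), data.table::rbindlist()) might be sketched as follows; the ';'-separated layout, field order, and separators are all assumptions you will need to check against the real file:

```r
library(data.table)

# One data.table per scan line, one row per responding MAC.
# The layout below is ASSUMED -- adapt indices/separators to offline.txt.
parseScan <- function(line) {
  parts  <- strsplit(line, ";")[[1]]
  header <- parts[1:4]       # assumed order: t=..., id=..., pos=..., degree=...
  resps  <- parts[-(1:4)]    # assumed: MAC=dBm,channel,mode per response
  vals   <- strsplit(resps, "[=,]")
  data.table(
    t      = sub("^t=", "", header[1]),
    mac    = vapply(vals, `[`, "", 1),
    signal = vapply(vals, `[`, "", 2),
    mode   = vapply(vals, `[`, "", 4)
  )
}

if (file.exists("offline.txt")) {
  lines    <- readLines("offline.txt")
  lines    <- lines[!grepl("^#", lines)]           # drop comment lines, if any
  scanLong <- rbindlist(lapply(lines, parseScan))  # one row per response
  scanAgg  <- scanLong[, .(nResponse = .N, adhoc = sum(mode == "1")), by = t]
  print(head(scanLong, 20)); print(head(scanAgg, 20))
}
```

Because rbindlist() binds the per-line tables in order, the original row order is preserved, as the question requires.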

7. Experimenting with tree-based methods

In this problem, you will use CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of variables describing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam (mushrooms.csv) on Canvas. More information regarding this data can be found (including variable level designation for interpretation) at the following site: https://www.kaggle.com/uciml/mushroom-classification/data

a. Split the mushroom data into a training and test set, each consisting of 50% of the entire sample. Do this randomly, not based on positioning in the dataset (i.e. do not take the first half as your test set, etc., but instead, use the sample() function).
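The split in a) might be sketched as follows, assuming mushrooms.csv has been downloaded to the working directory and TeachingDemos is installed; the seed string is a placeholder:

```r
library(TeachingDemos)  # char2seed()

mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)

set.seed(char2seed("YourName"))  # placeholder; use your own name per the exam

n         <- nrow(mushrooms)
train_idx <- sample(n, size = floor(0.5 * n))  # random half, not positional
train     <- mushrooms[train_idx, ]
test      <- mushrooms[-train_idx, ]
```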

b. Using a CART, fit a classification tree to the training data and employ cost-complexity pruning with CV used to determine the optimal number of terminal nodes to include. Then test this “best” model on the test data set. Interpret the final best model using the tree structure, and visualize it / add text to represent the decisions.
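A sketch of b) using the tree package, assuming train and test from a) and that the edible/poisonous outcome column is named class (check your CSV's column names):

```r
library(tree)

fit_tree <- tree(class ~ ., data = train)  # 'class' assumed to be the outcome factor

# Cost-complexity pruning, with CV on misclassification error choosing the size
cv_fit    <- cv.tree(fit_tree, FUN = prune.misclass)
best_size <- cv_fit$size[which.min(cv_fit$dev)]
pruned    <- prune.misclass(fit_tree, best = best_size)

plot(pruned); text(pruned, pretty = 0)  # visualize and label the decisions

pred <- predict(pruned, newdata = test, type = "class")
mean(pred == test$class)                # test-set accuracy
```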

c. Use bootstrap aggregation on the training set to build a model which can predict edible vs poisonous status. Test your resulting bagged tree model on the testing data created in part a). Try some different values for B, the number of bagged trees, to assess the sensitivity of your results. How many trees seems sufficient? Create variable importance plots for this final model and interpret the findings.
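One standard route for c) is the randomForest package: bagging is a random forest that considers all p predictors at every split (mtry = p). A sketch, again assuming train/test from a) and an outcome column named class:

```r
library(randomForest)

p <- ncol(train) - 1  # number of predictors (all columns except 'class')

# Bagging = random forest with mtry = p; ntree plays the role of B
fit_bag <- randomForest(class ~ ., data = train, mtry = p, ntree = 500,
                        importance = TRUE)

pred <- predict(fit_bag, newdata = test)
mean(pred == test$class)  # test accuracy; refit with other ntree values for B

varImpPlot(fit_bag)       # variable importance plot
```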

d. Use random forests on the training set to build a model which can predict edible vs poisonous status. Test your resulting random forest model on the testing data created in part a). Try some alternative values for mtry, the number of predictors which are “tried” as optimal splitters at each node, and also B. Compare these models in terms of their test performance. Which Band mtrycombination seems best? Create variable importance plots for this final model and interpret the findings.
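The (mtry, B) comparison in d) might be organized as a small grid search; the particular values below are illustrative (sqrt(p) is the usual classification default for mtry):

```r
library(randomForest)

results <- expand.grid(mtry = c(2, 4, 6), ntree = c(100, 500))
results$accuracy <- NA
for (i in seq_len(nrow(results))) {
  fit <- randomForest(class ~ ., data = train,
                      mtry = results$mtry[i], ntree = results$ntree[i])
  results$accuracy[i] <- mean(predict(fit, newdata = test) == test$class)
}
results  # compare test performance across (mtry, B) combinations
```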

e. Use boosting on the training set to build a model which can predict edible vs poisonous status. Test your resulting boosted model on the testing data created in part a). You should assess alternative values for the shrinkage parameter λ, perhaps 0.001, 0.01, 0.1, etc. What do you notice about the relationship between λ and B? You may use “stumps” for trees in the boosting algorithm if you’d like, so you will not need to modify interaction.depth, and instead keep it set to 1. Which combination (of λ, B) seems best? Create variable importance plots for this final model and interpret the findings.
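A sketch of e) using the gbm package, assuming train/test from a), an outcome class with levels "e"/"p" (as in the Kaggle data), and one illustrative (λ, B) pair; you may need to drop any constant columns before fitting:

```r
library(gbm)

# gbm's bernoulli distribution needs a 0/1 outcome
train$y <- as.numeric(train$class == "p")

# Stumps (interaction.depth = 1); shrinkage is lambda, n.trees is B
fit_boost <- gbm(y ~ . - class, data = train, distribution = "bernoulli",
                 n.trees = 1000, shrinkage = 0.01, interaction.depth = 1)

probs <- predict(fit_boost, newdata = test, n.trees = 1000, type = "response")
pred  <- ifelse(probs > 0.5, "p", "e")
mean(pred == test$class)  # test accuracy for this (lambda, B) pair

summary(fit_boost)        # relative influence = variable importance
```

Refitting over a grid of shrinkage and n.trees values reveals the λ-B trade-off the question asks about: smaller λ generally needs larger B.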

f. Based on your results to b) through e), which statistical learning algorithm seems to yield the best accuracy results? Is this likely to be the same answer that other students will get on this exam, assuming you correctly used your name in setting the seed on part a) (set.seed(char2seed(“MyName”)))?

8. Using the Bootstrap to estimate standard errors

In this problem, we will employ bootstrapping to estimate the variability in two problems where standard software does not exist.

a. Using the Auto data from package ISLR, estimate the standard error of the Median mpg (miles per gallon) in the sample of cars with 4 cylinders.
b. Using the Auto data from package ISLR, estimate the standard error of the Median mpg (miles per gallon) in the sample of cars with 8 cylinders.
c. Using the Auto data from package ISLR, estimate the standard error of the difference in Median mpg (miles per gallon) between 4 cylinder and 8 cylinder cars.
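The three parts above can be sketched with the boot package; the statistic functions and R = 1000 are illustrative choices, and ISLR must be installed for the Auto data:

```r
library(ISLR)  # Auto data
library(boot)  # boot()

mpg4 <- Auto$mpg[Auto$cylinders == 4]
mpg8 <- Auto$mpg[Auto$cylinders == 8]

# a) bootstrap SE of the median mpg among 4-cylinder cars
med_fn <- function(data, idx) median(data[idx])
boot(mpg4, med_fn, R = 1000)

# b) same for 8-cylinder cars
boot(mpg8, med_fn, R = 1000)

# c) SE of the DIFFERENCE in medians: resample the combined groups jointly
grp <- data.frame(mpg = c(mpg4, mpg8),
                  cyl = rep(c(4, 8), c(length(mpg4), length(mpg8))))
diff_fn <- function(data, idx) {
  d <- data[idx, ]
  median(d$mpg[d$cyl == 4]) - median(d$mpg[d$cyl == 8])
}
boot(grp, diff_fn, R = 1000)
```

In each case the bootstrap standard error is reported in the "std. error" column of the boot() output.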

9. Write your own function for conducting 10-fold cross-validation to assess the test error in a model predicting estimated performance in the cpus dataset from R package MASS, using minimum main memory and maximum main memory. DO NOT worry about including interaction effects, or higher order effects as in the first exam. Just a main effect for each variable will be sufficient. You need to show how to use the createFolds() function from the R package caret, as explained in Lecture 8.
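A sketch of the required structure, assuming caret is installed; the fold handling follows createFolds()'s default of returning held-out index sets:

```r
library(MASS)   # cpus data
library(caret)  # createFolds()

set.seed(1)  # placeholder seed

cv10 <- function(data) {
  folds <- createFolds(data$estperf, k = 10)  # list of 10 held-out index sets
  mse <- numeric(10)
  for (k in seq_along(folds)) {
    test_idx <- folds[[k]]
    # main effects only: minimum and maximum main memory
    fit  <- lm(estperf ~ mmin + mmax, data = data[-test_idx, ])
    pred <- predict(fit, newdata = data[test_idx, ])
    mse[k] <- mean((data$estperf[test_idx] - pred)^2)
  }
  mean(mse)  # 10-fold CV estimate of test MSE
}
cv10(cpus)
```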
