Introduction

My independent variable is the number of vehicle miles traveled (in millions) per state in America. My dependent variable is the number of automobile-related deaths per state in America. These two variables are expected to be positively correlated. The reasoning is that the more miles that are traveled, the more opportunities there are for an accident to occur. Therefore, the more miles traveled by vehicles in a state, the more automobile accidents should occur. In many cases these accidents cause death, so it is necessary to examine the number of reported automobile-related deaths per state (General statistics).

This statistical investigation explores the relationship between the number of miles traveled (in millions) per state and the number of automobile-related deaths per state in America in the year 2015.

In order to come to a conclusion about this statistical relationship, I will first need to collect data online. I will need to collect at least 50 data points in order to provide satisfactory evidence of the relationship. It is also important to note that this data will need to come from a reliable source. Next, I will organize this data into a neat, clearly labeled table in Microsoft Excel. Then I will check whether this data follows a linear trend by inserting a trend line and finding its equation as well as the r value. After this has been done, I will carry out calculations by hand and with Excel functions to find the mean and standard deviation of each data set. The mean is the average of the data and will be used from this point on in other calculations, including the standard deviation. The standard deviation is necessary because it shows how far the measurements in a group are spread out from the mean, or expected value (Standard Deviation Formula). The lower the standard deviation, the more closely the values of that variable cluster around their mean. This gives insight into the spread of vehicle miles traveled (in millions) and of automobile-related deaths per state in America in 2015, while the strength of the association between the two variables is measured by the correlation coefficient.

Following this, I will display the data in a scatter plot. After this step has been completed, I will evaluate Pearson's correlation coefficient, or the r value. If this value is greater than .75, then I can conclude that the data has a strong relationship. I will also use a Chi-Squared test to evaluate whether the two variables are independent of each other. If the critical value is greater than the calculated value, then I will conclude that the null hypothesis should not be rejected and that the relationship between the two variables is not statistically significant. From this point, I will use all of the calculated information to conclude whether there is a strong relationship between the two variables. Finally, I will validate all of the math workings that I have completed in Excel and by hand, and fix and discuss any issues observed.

Data

The data for this project was found on the US Department of Transportation website, under the category “The federal highway administration statistics”. It is reported by the government, and the data is verified to ensure that it is accurate.

Sample (see Appendix One for the full data table)

12.5

Outlier Workings

In statistics, quartiles divide data into quarters based on their position on the number line. They are a tool for observing the distribution of the data set. Quartile 1 is the median of the lower half of the data. Quartile 2 is the median of the whole data set. Quartile 3 is the median of the upper half of the data. This information is used to find the interquartile range, which tells us where the bulk of the data lies. The position of Quartile 1 in the ordered data is found by multiplying n (the number of data points) by .25 (Excel QUARTILE Function). Refer to Appendix One for the full set of data. The position of Quartile 3 is found by multiplying n by .75. In order to analyze the significance of these points, it is necessary to find the interquartile range, which is found by subtracting the first quartile from the third. The interquartile range is then used to establish the upper and lower boundaries, and any values outside these boundaries are treated as outliers. I will use this information to eliminate outliers from my data set and continue to investigate the relationship between vehicle miles traveled (in millions) and automobile-related deaths in America per state in 2015. Outliers can distort the relationship seen in the data, so they are often removed in order to reach a reasonable conclusion.
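The outlier procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the report does not say how far the boundaries sit from the quartiles, so the common 1.5 × IQR rule is assumed here, and the `pairs` list is made-up data rather than the Federal Highway Administration figures. The `method="inclusive"` option interpolates between ordered values in the same way as Excel's QUARTILE function, so results should be comparable to the spreadsheet.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) outlier boundaries for a list of numbers.

    k=1.5 is the conventional multiplier; it is an assumption, not a
    value taken from the report.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: third quartile minus first
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical (miles in millions, deaths) pairs for illustration only.
pairs = [(10, 1), (20, 2), (30, 3), (40, 4), (50, 5),
         (60, 6), (70, 7), (80, 8), (900, 90)]

x_low, x_high = iqr_fences([x for x, _ in pairs])
y_low, y_high = iqr_fences([y for _, y in pairs])

# Keep a state only if both of its values fall inside the boundaries.
kept = [(x, y) for x, y in pairs
        if x_low <= x <= x_high and y_low <= y <= y_high]
print(kept)
```

Removing whole pairs, rather than single values, keeps each state's miles and deaths matched for the later regression and correlation steps.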

530.4375

Mean:

The mean is useful when calculating the standard deviation of the data set, which is necessary in order to determine how far the data lies from its expected value. The mean is calculated by taking the average of the data, that is, by dividing the sum of the values by the number of data points. It is a single value that can represent the whole of the data. This value is found after the outliers have been eliminated so that it is not skewed by a number that is too large or too small (Mean, Median, Mode, and Range).

The mean of the X values of the data set is 47,806.875. The mean of the Y values of this data set is 530.4375. See Appendix Two for workings.

Standard deviation of the X values: 34,419.1335

Standard deviation of the Y values: 377.7470442450

Standard Deviation:

Standard deviation is used to determine whether the data is close to the expected value or spread out over a wide range. It can also be used to compare the spread of two sets of data. If the standard deviation is low, the numbers in the data are close to the expected value, or average (Standard Deviation Formula). If the standard deviation is high, the numbers are far from, or deviate from, the expected value. The standard deviation of the X values is high, meaning that these values are somewhat far from their average. However, the standard deviation of the Y values is lower, meaning that the numbers in this data set are closer to their expected value, or average (Standard Deviation Formula). Because the two variables are measured in different units, each standard deviation should be judged against its own mean rather than directly against the other. See Appendix 2 for workings.
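As a rough illustration of the mean and standard deviation calculations, the sketch below works through both on a small made-up list. The report does not state whether the population or the sample form of the standard deviation was used, so the function supports both; the `sample` flag and the `miles` list are assumptions for this example only.

```python
import math

def mean(values):
    """Sum of the values divided by the number of data points."""
    return sum(values) / len(values)

def std_dev(values, sample=False):
    """Standard deviation of a list of numbers.

    sample=False divides by n (population form);
    sample=True divides by n - 1 (sample form).
    """
    m = mean(values)
    squared_deviations = sum((v - m) ** 2 for v in values)
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(squared_deviations / divisor)

# Hypothetical values for illustration only.
miles = [2, 4, 4, 4, 5, 5, 7, 9]

print(mean(miles))                   # 5.0
print(std_dev(miles))                # 2.0 (population form)
print(std_dev(miles, sample=True))   # slightly larger (sample form)
```

Whichever form is chosen, the same one should be used in the by-hand workings and in Excel so that the two results can be matched.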

Slope of the line of best fit: .010247525

Least Squares Regression:

In order to find the line of best fit, it is necessary to calculate the slope. This is done by using the formula mentioned above. This line illustrates the relationship between the data points. The relationship can then be evaluated by comparing the data points to the line. The closer the points are to the line, the closer they are to the values the line predicts. This allows us to further evaluate the relationship between the variables (Least Squares Regression). See Appendix 3 for workings.
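The slope and intercept of a least squares line can be sketched as follows. The standard textbook formulas are assumed here, since the report's own formula is not reproduced in the text: the slope is the sum of the products of the x and y deviations divided by the sum of the squared x deviations, and the intercept is chosen so the line passes through the point of means. The data is made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean  # line passes through (x_mean, y_mean)
    return slope, intercept

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
print(slope, intercept)  # about 0.6 and 2.2
```

A predicted value for any x is then `intercept + slope * x`, and the vertical distance between a data point and that prediction shows how far the point sits from the line.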

Pearson’s Correlation Coefficient:

The Pearson correlation coefficient is a statistic that measures the strength of the linear relationship between two variables. The formula above is used to find the r value, or coefficient. The correlation will be either strong, weak, or absent. If the value is .75 or higher, there is a strong correlation. If the value is lower than .75, there is a weak correlation. A value of zero means that there is no correlation (Calculating the Pearson product-Moment correlation coefficient). This is useful because it allows us to evaluate the strength of the relationship between the two variables. In this case, the r value of my data is .869. This means that the two variables are strongly correlated.
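A minimal sketch of the r value calculation is given below, assuming the standard product-moment formula (sum of products of deviations divided by the square root of the product of the two sums of squared deviations). The `describe` helper applies the .75 threshold used in this report; treating negative values by their size is an added assumption, and the sample data is made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def describe(r, threshold=0.75):
    """Label a correlation using the report's .75 cut-off."""
    if r == 0:
        return "none"
    # abs() so that a strong negative correlation is also labeled strong
    # (an assumption; the report only discusses positive values).
    return "strong" if abs(r) >= threshold else "weak"

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 3), describe(r))  # 0.775 strong
print(describe(0.869))           # the report's r value: strong
```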

Cell                Observed    Expected
Row 1, Column 1     22          12
Row 1, Column 2     2           12
Row 2, Column 1     2           12
Row 2, Column 2     22          12
Total               48          48

Contingency Table:

A contingency table divides the data into categories for each variable. It is used in statistics to summarize the relationship between several variables, in this case two. A contingency table is useful because it allows the variables to be observed simultaneously. The contingency table is used in tandem with the Chi Squared test and is a necessary step when conducting it (Cross-Tabulating Variables: How to Create a Contingency Table in Microsoft Excel). See Appendix Five for workings.
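One possible way to build a 2×2 contingency table from paired data is sketched below. The report does not state how the categories were defined, so this example assumes each variable is split at its median into "above" and "not above"; that choice is consistent with a table whose row and column totals are equal, but it is an assumption. The function name and the data are hypothetical.

```python
import statistics

def contingency_2x2(xs, ys):
    """Count pairs into a 2x2 table split at each variable's median.

    Row 0: x above its median, row 1: x at or below its median.
    Column 0: y above its median, column 1: y at or below its median.
    """
    x_cut = statistics.median(xs)
    y_cut = statistics.median(ys)
    table = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        row = 0 if x > x_cut else 1
        col = 0 if y > y_cut else 1
        table[row][col] += 1
    return table

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 2, 6, 3, 4, 5]

print(contingency_2x2(xs, ys))  # [[2, 1], [1, 2]]
```

When most pairs land in the two cells on the main diagonal, high values of one variable tend to go with high values of the other.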


Chi Squared with Yates Correction:

The chi squared test is used to determine the significance of the relationship between variables. It tests what is called the “null hypothesis”, which states that there is no significant difference between the expected and the observed results. In this case, the purpose of the test is to determine whether there is a significant relationship between the number of vehicle miles traveled and automobile-related deaths in the US in the year 2015. In order to conduct this test, I have arranged the data into a contingency table, which allows me to determine the expected values. Each row of the contingency table, as mentioned above, represents a category for one variable, and each column represents a category for the other variable. The test is conducted by calculating the expected value for each cell of the table using the formula listed above. The expected value is then compared to the observed value in order to obtain the calculated chi squared value. This calculated value is compared to the “critical value”, which is the point on the scale of the test statistic beyond which the null hypothesis is rejected. Yates's correction is also necessary in this case. The Yates correction accounts for the fact that the chi squared test used in this project is biased upwards for a 2×2 contingency table. Without the correction the calculated value would be too large, which could result in the inaccurate rejection of the null hypothesis (Yates Correction). See Appendix 5 for workings.
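The test can be sketched in Python using the observed counts from the contingency table in this report (22, 2, 2, 22). The sketch assumes the standard formulas: each expected value is the row total multiplied by the column total, divided by the grand total, and the Yates-corrected statistic is the sum over all cells of (|observed − expected| − 0.5)² / expected. The 5% significance level is an assumption, since the report does not state one; 3.841 is the tabulated critical value for one degree of freedom at that level.

```python
def expected_counts(observed):
    """Expected cell counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_squared_yates(observed):
    """Chi squared statistic with Yates's continuity correction."""
    expected = expected_counts(observed)
    return sum(
        (abs(o - e) - 0.5) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

# Observed counts taken from the report's contingency table.
observed = [[22, 2], [2, 22]]

# Assumed 5% significance level; a 2x2 table has 1 degree of freedom.
CRITICAL_VALUE = 3.841

statistic = chi_squared_yates(observed)
reject_null = statistic > CRITICAL_VALUE

print(expected_counts(observed))  # [[12.0, 12.0], [12.0, 12.0]]
print(round(statistic, 2))        # 30.08
print(reject_null)                # True
```

Each cell contributes (10 − 0.5)² / 12, so the corrected statistic is well above the critical value and the null hypothesis of independence is rejected under these assumptions.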

Overall Conclusion

There is a positive relationship between vehicle miles traveled (in millions) and automobile-related deaths in America (per state) in 2015. The chi squared test has shown that the variables are not independent of each other, because the calculated chi squared value is larger than the critical value. The r value shows that the data is close to the line of best fit, and therefore there is a strong positive relationship between the two variables.

Validity

Although there is a demonstrated positive relationship between the number of miles traveled (in millions) per state in the United States and the number of automobile-related deaths, there are factors that could cause this data to be slightly skewed. One factor that needs to be addressed is that the data for miles traveled has been collected from many different types of vehicles. This could potentially skew the data due to the differing level of safety of each vehicle. Additionally, the speed traveled by each vehicle could negatively affect the validity of the data. It is more dangerous to travel at higher speeds, which increases the chances of an accident or automobile-related death. The math workings of this project have been verified by matching the by-hand calculations against Microsoft Excel. This agreement is evidence that they are valid and can be relied upon.

Improvements

An improvement that could be made to this study, addressing the problem of the different types of vehicles used to collect data, is that the data could be collected from the manufacturer of one particular type of vehicle. This would make it possible to confirm that the differing levels of safety of different vehicles do not affect the number of automobile-related deaths.