Introduction



My independent variable is the number of miles traveled (in millions) per state in
America. My dependent variable is the number of automobile-related deaths in
America. These two variables are expected to be positively correlated. The
reasoning is that the more miles are traveled, the greater the chance of an
accident. Therefore, the more miles traveled by vehicles per state, the more
automobile accidents will occur.


In many cases, these accidents can cause death, so it is necessary to examine
the number of reported automobile related deaths per state (General

This statistical investigation explores the relationship between the number of
miles traveled (in millions) per state and the number of automobile-related
deaths in America in the year 2015.

In order to come to a conclusion about this statistical relationship, I will
first need to collect data online. I will need to collect at least 50 data
points in order to provide satisfactory evidence of the relationship. It is also
important to note that this data will need to come from a reliable source.
Next, I will organize this data into a clearly labeled, neat table in Microsoft
Excel. Then I will check whether the data follows a linear trend by inserting a
trend line and finding its equation as well as the r value. After this has been
done, I will carry out calculations by hand and with Excel functions to find
the mean and standard deviation of each data set. The mean is the average of
the data and will be used from this point on in other calculations, including
the standard deviation. The standard deviation is necessary because it shows
how far the measurements for a group are spread out from the mean, or expected
value (Standard Deviation Formula). The lower the standard deviation, the more
tightly the values cluster around the mean. This describes the spread of each
variable; the strength of the association between vehicle miles traveled (in
millions) and automobile-related deaths per state in America in 2015 is
measured separately by the correlation coefficient.
Following this, I will display the data in a scatter plot. After this step has
been completed, I will evaluate Pearson's correlation coefficient, or the r
value. If this value is greater than .75, then I can conclude that the data has
a strong relationship. I will also use a Chi-Squared test to evaluate the
association between the two variables. If the critical value is greater than
the calculated value, then I will conclude that the null hypothesis should not
be rejected and that the relationship between the two variables is not
statistically significant. From this point, I will use all of the calculated
information to conclude whether there is a strong relationship between the two
variables. Finally, I will validate all the math workings that I have completed
in Excel and by hand, and fix and discuss any observed issues.



The data for this project was found on the US Department of Transportation
website, under the category “The Federal Highway Administration statistics”. It
is reported by the government, and the data is verified and made sure to be

Sample (see Appendix One for the full data table)

















Outlier Workings



In statistics, quartiles divide data into quarters based on their position on
the number line. They are a tool for observing the distribution of the data
set. Quartile 1 is the median of the lower half of the data. Quartile 2 is the
median of the whole data set. Quartile 3 is the median of the upper half of the
data set. This information assists in finding the interquartile range, which
tells us where the bulk of the data lies. The position of Quartile 1 in the
ordered data is found by multiplying n (the number of data points) by .25
(Excel QUARTILE Function). Refer to Appendix One for the full set of data. The
position of Quartile 3 is found by multiplying n by .75. In order to analyze
the significance of these points, it is necessary to find the interquartile
range, which is found by subtracting the first quartile from the third. The
interquartile range is used to establish the upper and lower boundaries, and
data points outside these boundaries are treated as outliers. I will use this
information to eliminate outliers from my data set and continue to investigate
the relationship between vehicle miles traveled (in millions) and
automobile-related deaths in America per state in 2015. Outliers can distort
the relationship between the variables, so they are often removed in order to
reach a reasonable conclusion.
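The outlier procedure described above can be sketched in Python. This is only an illustration: the numbers are made up rather than taken from the study, the boundaries are assumed to follow the common 1.5 × IQR rule (the text does not state which multiplier was used), and the quartiles use the "inclusive" interpolation method of Python's statistics module, which may differ slightly from a by-hand position rule.

```python
import statistics

# Hypothetical values for illustration only; these are NOT the study's data.
miles = [12, 15, 18, 21, 24, 27, 30, 33, 36, 150]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(miles, n=4, method="inclusive")

# Interquartile range: third quartile minus first quartile.
iqr = q3 - q1

# Assumed boundary rule (not stated in the text): 1.5 * IQR beyond Q1 and Q3.
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr

# Split the data into retained points and outliers.
kept = [v for v in miles if lower_boundary <= v <= upper_boundary]
outliers = [v for v in miles if v < lower_boundary or v > upper_boundary]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Boundaries:", lower_boundary, upper_boundary)
print("Outliers:", outliers)
```

In this made-up sample only the value 150 falls outside the boundaries, so it is the only point that would be removed before the later calculations.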







The mean is useful when calculating the standard deviation of the data set,
which is necessary in order to determine how far the data is from its expected
value. The mean is calculated by dividing the sum of the values by the number
of data points. It is a single value that can represent the whole of the data.
This value is found after the outliers have been eliminated, so that it is not
skewed by a number that is too large or too small (Mean, Median, Mode, and
Range).

The mean of the x values of the data set is 47,806.875. The mean of the Y
values of this data set is 530.4375. See appendix two for workings.
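As a small illustration of the calculation described above, the mean can be computed directly from its definition. The helper name `mean` and the sample values are hypothetical and are not the study's data.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


# Hypothetical values for illustration only; NOT the study's data.
sample_x = [40000, 45000, 50000, 55000]
sample_y = [400, 500, 600, 700]

print("Mean of x:", mean(sample_x))
print("Mean of y:", mean(sample_y))
```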









Standard Deviation:


Standard deviation is used to determine whether the data is close to the
expected value or spread out over a wide range. It can also be used to compare
two sets of data. If the standard deviation is low, the numbers in the data are
close to the expected value, or average (Standard Deviation Formula).

If the standard deviation is high, the numbers are far from, or deviate from,
the expected value, or average. The standard deviation of the X values is high
compared to the rest of the data, meaning that these values are somewhat far
from the average. However, the standard deviation of the Y values is lower,
meaning that the numbers in this data set are closer to the expected value, or
average (Standard Deviation Formula).

See appendix 2 for workings.
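The by-hand calculation can be mirrored in a few lines of Python. This sketch assumes the population form of the standard deviation (dividing by n); the text does not say whether the population or the sample form (dividing by n - 1) was used, and the values below are made up for illustration.

```python
import math
import statistics


def population_std(values):
    """Population standard deviation: root of the mean squared deviation."""
    m = sum(values) / len(values)
    squared_deviations = [(v - m) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / len(values))


# Hypothetical values for illustration only; NOT the study's data.
values = [2, 4, 4, 4, 5, 5, 7, 9]

by_hand = population_std(values)
library = statistics.pstdev(values)  # standard-library cross-check

print("By hand:", by_hand)
print("Library:", library)
```

Comparing the hand-written function against `statistics.pstdev` plays the same role as checking the by-hand workings against Excel.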







Least Squares Regression:


In order to find the line of best fit, it is necessary to calculate the slope.
This is done by using the least squares regression formula. The resulting line
illustrates the relationship between the data points. The relationship can then
be evaluated by observing how close the data points lie to the line. The closer
they are, the closer they are to the values the line predicts. This allows us
to further evaluate the relationship between the variables (Least Squares
Regression). See appendix 3 for workings.
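A minimal sketch of the slope and intercept calculation is given here, assuming the standard least squares formulas (slope = sum of products of deviations divided by sum of squared x deviations, intercept = mean of y minus slope times mean of x). The function name and the data are hypothetical.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values for illustration only; NOT the study's data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares(xs, ys)
print("Line of best fit: y =", slope, "* x +", intercept)
```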








Pearson’s Correlation Coefficient:


The Pearson correlation coefficient is a statistical measure of the strength of
the linear relationship between two variables. Its formula is used to find the
r value, or coefficient. The result will show either a strong, weak, or no
correlation. If the value is .75 or higher, it has a strong correlation. If the
value is lower than .75, it has a weak correlation. A value of zero means that
there is no correlation (Calculating the Pearson product-Moment correlation
coefficient). This is useful because it allows us to evaluate the strength of
the relationship between the two variables. In this case, the r value of my
data is .869. This means that the two variables are strongly correlated.
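The r value can be reproduced with a short Python function. This is a sketch using the standard Pearson formula on made-up numbers, not the study's data; the .75 threshold is the one used in this investigation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only; NOT the study's data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
# Threshold taken from the text: .75 or higher counts as strong.
strength = "strong" if r >= 0.75 else "weak"
print("r =", round(r, 3), "->", strength)
```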




















Contingency Table:


A contingency table groups observations into categories based on different sets
of data. It is used in statistics to summarize the relationship between several
variables, in this case two. A contingency table is useful because it allows
the variables to be observed simultaneously. It is used in tandem with, and is
a necessary step in conducting, the Chi Squared test (Cross-Tabulating
Variables: How to Create a Contingency Table in Microsoft Excel). See appendix
five for workings.











Chi Squared with Yates Correction:


The chi squared test is used to determine the significance of the relationship
between variables. It tests what is called the “null hypothesis”, which states
that there is no significant difference between the expected and the observed
results.

In this case, the purpose of the test is to determine whether there is a
significant relationship between the number of vehicle miles traveled and
automobile-related deaths in the US in the year 2015. In order to conduct this
test, I have arranged the data into a contingency table, which allows me to
determine the expected values. Each row of the contingency table represents a
category of one variable, and each column a category of the other. The test is
conducted by calculating the expected value for each cell of the table from the
two nominal variables. The expected values are then compared to the observed
values in order to obtain the calculated chi squared value, which is compared
to the “critical value”. The critical value is the point on the scale of the
test statistic beyond which the null hypothesis is rejected. The Yates
correction is also necessary in this case. It accounts for the fact that the
chi squared test used in this project is biased upwards for a 2×2 contingency
table. This bias can make the test inaccurate because it inflates the test
statistic (Yates Correction), which could result in the inaccurate rejection of
the null hypothesis. See appendix 5 for workings.
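The test can be sketched in Python for a 2×2 table. The observed counts below are invented for illustration and are not the study's contingency table. The sketch also assumes a 5% significance level, which the text does not state; for one degree of freedom the corresponding critical value is 3.841.

```python
def chi_squared_yates(table):
    """Chi squared statistic with Yates correction for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / grand_total
            # Yates correction: subtract 0.5 from |observed - expected|.
            chi2 += (abs(observed - expected) - 0.5) ** 2 / expected
    return chi2


# Hypothetical observed counts for illustration only; NOT the study's table.
observed = [[20, 5],
            [5, 20]]

# Assumed 5% significance level with 1 degree of freedom for a 2x2 table.
CRITICAL_VALUE = 3.841

chi2 = chi_squared_yates(observed)
reject_null = chi2 > CRITICAL_VALUE
print("Chi squared (Yates):", round(chi2, 2))
print("Reject null hypothesis:", reject_null)
```

For these made-up counts the calculated value exceeds the critical value, so the null hypothesis would be rejected, which is the same comparison used in the conclusion.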

Overall Conclusion

There is a positive relationship between vehicle miles traveled (in millions)
and automobile-related deaths in America (per state) in 2015. The chi squared
test has shown that the variables are not independent of each other, because
the calculated chi squared value is larger than the critical value. The r value
shows that the data lies close to the line of best fit; therefore, there is a
positive relationship between the two variables.


Although a positive relationship has been shown between the number of miles
traveled (in millions) per state in the United States and the number of
automobile-related deaths, there are factors that could cause this data to be
slightly skewed. One factor that needs to be addressed is that the data on
miles traveled has been collected from many different types of vehicles. This
could skew the data because the level of safety differs from vehicle to
vehicle. Additionally, the speed traveled by each vehicle could negatively
affect the validity of the data. It is more dangerous to travel at higher
speeds, which increases the chances of an accident or automobile-related death.
The math workings of this project have been verified by matching the by-hand
calculations against Microsoft Excel. This agreement supports their validity
and reliability.


One improvement that could be made to this study, addressing the problem of the
different types of vehicles used to collect data, is that the data could be
collected from the manufacturer of one certain type of vehicle. This would
confirm that the differing levels of safety between vehicles do not affect the
number of automobile-related deaths.