Science 480 Research Methods in Science
Study Guide :: Unit 2
Statistics
People from all walks of life seem to hold unusually strong opinions about the study of Statistics, ranging from utter rejection to begrudging acceptance.
For example, one 20th‑century mathematician allegedly wrote:
Statistics: the mathematical theory of ignorance.
A 19th‑century poet, novelist, and literary critic, best known as a collector of folk and fairy tales, once observed:
He uses statistics as a drunken man uses lamp posts—for support rather than illumination.
The English author, futurist, historian, and teacher, who is best remembered for his science fiction novels and whom some call the ‘Father of Science Fiction,’ was said to have claimed that:
Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
However, it was the ‘Lady of the Lamp’ who put her thoughts about statistical methods into action. Florence Nightingale (1820‑1910)—recognized by many as the founder of modern nursing—actually pioneered the use of social statistics by using data she had collected when she had encountered unsanitary and under‑supplied hospital conditions. For example, she developed graphical techniques to illustrate that, during the Crimean War (1853‑1856), more soldiers had died as a result of unsanitary conditions than had been killed in combat. Ultimately, through such statistical presentations to the Houses of the British Parliament of the time and through their subsequent intervention, she was instrumental in improving these conditions and, undoubtedly, saving numerous lives.
For these and many other reasons, Statistics has become an important discipline of study and research in our modern world—despite its relatively recent emergence as a field of applied mathematics. Today, many academic and professional disciplines consider the knowledge of statistical methods to be highly desirable for forays into many research endeavours, including the presentation of large amounts of data in a succinct, descriptive form.
It is expected that you have already obtained credit for an introductory Statistics course. This section is simply a fast‑forward journey through the concepts and methods generally covered in such a course. Its purpose is to provide a top‑down view of the tools that statistics affords, helping to lay a mathematical foundation for data analysis in a wide array of research endeavours.
Therefore, as you progress through the various sections of this unit, think of an interesting research project (if you haven’t already done so). Explore what kind(s) of data you will be collecting, determine which statistical (or other) methods you are going to require to analyze them, and then experiment with possible ways of presenting the analyzed data in coherent chunks to make visualization easier.
In Unit 4, you will be asked to design a research proposal. Part of that proposal should contain a description of any statistical (if applicable) methods you intend to use for the analysis of the data you have collected in the hopes of answering your research questions. For a particularly useful video on how to define your variables and which analysis techniques are most appropriate for that purpose, please see Choosing the Appropriate Statistical Analysis Technique (no software involved) (RMUoHP).
After finishing this unit, do a cursory read of section 4.2 of Unit 4 in this Study Guide, and watch the online videos for direction on how to present scientific data in an interesting, concise, and accurate way.
Unit 2 is divided into nine sections, loosely based on the textbook. Most sections begin with an opening discussion, followed by
- Learning Outcomes
- Readings (from the textbook)
- Exercises (for practice or for credit)
- Online Maple TA Study Sessions (to help reinforce the statistical ideas for the purposes of research).
- These self‑assessment exercises are designed to provide a review of introductory statistical concepts, terminology, and some methods of data analysis, both descriptive and inferential. Each session may be studied an unlimited number of times. The results of each attempt are not recorded. This site requires separate credentials for access. To be registered in this third party tool, please contact Julie Peschke (juliep@athabascau.ca) and request access to the SCIE 480 self‑assessment study sessions by supplying a) your first and last names, b) your AU student ID number and c) the name of the course.
- Terms to Understand (important and relevant statistical terminology)
- Supplementary Resource Materials (relevance specified for each section)
- Reference Materials (relevance specified for each section)
Assignment 2 is divided into a Discussion, a Written Exercise, and an Online Quiz, worth a total of 25% towards your final grade in the course.
2.1: Motivations for Statistics
(Why do we do this at all?)
Learning Outcomes
After completing this section, you should be able to
- list at least four reasons for including statistical methods, both descriptive and inferential, in data‑driven research.
- explain the importance of measurements, their accuracy and their significance.
- optionally work with Microsoft Excel as a data analysis tool.
Reading
Study Chapter 3.1 (pp. 51‑55) of the textbook.
Appendix B: Extract from Galileo’s final book, Dialogues on Two New Sciences (pp. 201‑203)
Online Maple TA Study Sessions
See Using Statistical Analysis in Research and
Organizing and Graphing Data
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Note: The Unit 2 online quiz is based on a selection of questions from the Online Maple TA Study Sessions throughout this unit.
Terms to Understand
measurement error, distributed quantities, descriptive statistic, confidence, significance
Discussion (Part I of Assignment 2)
In your own words, write a short paragraph (maximum 150 words) on your perspective of the role of statistics in scientific research in your particular disciplinary interest. Post it on the A2 - Part I: Discussion and compare notes with your fellow students (2%).
Note that you will be able to see other students’ answers only after you have posted your own. You are encouraged to comment on each other’s posts.
Supplementary Resource Materials and Reference Materials
2.2: Reducing many numbers to few
(I call that number crunching!)
Learning Outcomes
After completing this section, you should be able to
- recognize, understand and use algebraic and summation notation for statistical quantities.
- define and calculate the average or mean and the median of a finite collection of numbers.
- describe, define and calculate the standard deviation of a finite collection of numbers.
- take a set of numerical data, group it into bins (or ranges) of values, and then create a histogram of the data set.
- interpret, in context, the meaning of the histogram of a given data set.
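The quantities named in these outcomes can be tried out on a small data set. The following is a minimal sketch using only Python's standard library; the data values, the bin width, and the helper name `histogram_counts` are made up for illustration and do not come from the textbook.

```python
import statistics
from collections import Counter

# Hypothetical data set (invented for illustration only).
data = [2.1, 3.4, 3.9, 4.0, 4.4, 5.2, 5.8, 6.1, 7.3, 9.8]

mean = statistics.mean(data)             # sum of the values divided by N
median = statistics.median(data)         # middle value of the sorted data
sample_sd = statistics.stdev(data)       # sample standard deviation (divides by N - 1)
population_sd = statistics.pstdev(data)  # population standard deviation (divides by N)


def histogram_counts(values, bin_width, start):
    """Count how many values fall into each bin [start + k*w, start + (k+1)*w)."""
    counts = Counter(int((v - start) // bin_width) for v in values)
    # Map the left edge of each non-empty bin to its count.
    return {start + k * bin_width: counts[k] for k in sorted(counts)}


bins = histogram_counts(data, bin_width=2.0, start=2.0)

print(f"mean = {mean:.2f}, median = {median:.2f}")
print(f"sample SD = {sample_sd:.2f}, population SD = {population_sd:.2f}")
for left_edge, count in bins.items():
    # A text histogram: one '#' per data value in the bin.
    print(f"[{left_edge:4.1f}, {left_edge + 2.0:4.1f})  {'#' * count}")
```

Note that the two standard deviations differ only in the divisor. Which one is appropriate depends on whether the numbers are treated as a sample or as the whole population.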
Reading
Chapter 3.2 (pp. 55‑61) of the textbook.
Practice Exercise
Question 3.3 on page 107 of the textbook
Online Maple TA Study Sessions
See Numerical Descriptive Measures: statistical number crunching
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
quantitative data set, mean of a data set, average of a data set, median of a data set, histogram of a data set, standard deviation of a data set
Supplementary Resource Materials
2.3: Probability distributions
(theoretical likelihood of events occurring if the experiment is conducted many times)
Classical probability techniques measure the exact, or theoretical, probability of an event happening. They are based on a finite set of possible outcomes (a finite discrete variable), and the resulting probabilities are expressible as fractions between 0 and 1.
In the case that the response/outcome variable is continuous or infinitely discrete (countable but not finite, such as the counting numbers), the probability of an event happening is often calculated using ‘relative frequencies’ of that event based on previously collected data (excluding census data). However, the relative frequencies of such events are not exact probabilities; they are approximate probabilities. The Law of Large Numbers justifies using them to approximate the actual or theoretical probability of such an event happening randomly. It is stated as follows:
If an experiment is repeated again and again, the probability of an event calculated from its relative frequency of occurrence will approach its actual or theoretical probability.
Note that the various probability distributions discussed in this section have been derived taking into account certain types of random event occurrence patterns using the Law of Large Numbers.
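The Law of Large Numbers can be watched in action with a short simulation. The sketch below is illustrative only: it assumes a fair six‑sided die (theoretical probability of a six is 1/6), and the function name `relative_frequency` and the fixed seed are choices made here for reproducibility.

```python
import random


def relative_frequency(n_rolls, seed=0):
    """Roll a simulated fair die n_rolls times; return the fraction of sixes."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    sixes = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 6)
    return sixes / n_rolls


theoretical = 1 / 6
for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = relative_frequency(n)
    # The gap between estimate and theory tends to shrink as n grows.
    print(f"N = {n:>7}: relative frequency = {estimate:.4f}, "
          f"difference from 1/6 = {abs(estimate - theoretical):.4f}")
```

The shrinkage is a tendency rather than a guarantee at every step, so a larger run can occasionally land slightly farther from 1/6 than a smaller one.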
Learning Outcomes
After completing this section, you should be able to
- define a probability distribution and list its primary properties.
- evaluate the sums and products of probabilities.
- describe a discrete probability distribution and provide an example of one.
- list the properties of a continuous probability distribution and provide an example of one.
- describe the properties and the uses of the binomial distribution and evaluate probabilities of events using it.
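As a concrete instance of a discrete probability distribution, binomial probabilities can be computed directly from the standard formula $P(X=k)=\binom{n}{k}p^{k}(1-p)^{n-k}$. This is a minimal sketch; the scenario (10 tosses of a fair coin) and the function name `binomial_pmf` are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # hypothetical experiment: 10 tosses of a fair coin

exactly_three = binomial_pmf(3, n, p)
at_most_three = sum(binomial_pmf(k, n, p) for k in range(0, 4))  # sum rule for disjoint events
total = sum(binomial_pmf(k, n, p) for k in range(0, n + 1))      # all probabilities sum to 1

print(f"P(X = 3)  = {exactly_three:.7f}")
print(f"P(X <= 3) = {at_most_three:.6f}")
print(f"Sum over all outcomes = {total:.6f}")
```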
Reading
Study Chapter 3.3 (pp. 61‑70) of the textbook.
Review
Think about these various kinds of probability distributions for hypothesis testing and look over some of the Online Maple TA Study Sessions below to achieve a better top‑down understanding as to why the notion of a probability distribution is essential to the validation of certain research experiments.
Online Maple TA Study Sessions
See Calculating Probabilities: basic definitions and rules;
Probability Distributions of Discrete Random Variables;
Discrete Probability Distributions: the binomial distribution; and
Probability Distributions of Continuous Random Variables
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
probability distribution, sample space of a probability distribution, sums and products of probabilities, discrete probability distribution, binomial distribution, continuous probability distribution, the normal distribution (sometimes called the Gaussian distribution or the bell curve)
Supplementary Resource Materials
2.4: Connecting data and probability distributions
(how to relate your sample data to a suitable theoretical probability distribution)
Learning Outcomes
After completing this section, you should be able to
- explain the meaning of the formulas in the summary boxes on pages 73 and 75 of the textbook.
- describe the Normal Distribution by stating its primary characteristics and its graphical shape.
Reading
Study Chapter 3.4 (pp. 70‑75) of the textbook.
Online Maple TA Study Sessions
See Probability Distributions of Discrete Random Variables and
Probability Distributions of Continuous Random Variables
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
population mean and standard deviation/variance of a theoretical discrete probability distribution and of a theoretical continuous probability distribution; the Normal Distribution
Supplementary Resource Materials
2.5: What happens to sample averages as the sample size $(N)$ increases
In this section, we find the mean and standard deviation of the sampling distribution of the sample mean to determine how the sample mean behaves with respect to variation in sample size. For any given population, the population mean is a fixed quantity but the means of the samples extracted from it vary. Therefore, we may consider the sample mean from a single population as a random variable. This led researchers to think about the probability distribution of the sample mean treated as a random variable—which, over time, came to be called the sampling distribution of the sample mean.
Sampling error is the difference between the value of a sample statistic and the value of the corresponding population parameter. In the case of the sample mean, the sampling error of the mean = sample mean (a variable) − population mean (a constant), that is, $\bar{x}-\mu$.
The standard error is the standard deviation of the sampling distribution of a statistic (for example, a sample mean or a sample proportion). The term may also be used to refer to an estimate of that standard deviation, derived from a particular sample used to compute the estimate.
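For independent observations drawn from a population with standard deviation $\sigma$, the standard error of the sample mean is $\sigma/\sqrt{N}$. The simulation below is a sketch that checks this against repeated sampling. The normal population with mean 50 and standard deviation 10, the number of repeated samples, and the function name are all hypothetical choices.

```python
import random
import statistics
from math import sqrt


def simulated_standard_error(sample_size, n_samples=2000, mu=50.0, sigma=10.0, seed=1):
    """Draw n_samples samples of the given size; return the SD of their sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)


sigma = 10.0
for n in (4, 25, 100):
    simulated = simulated_standard_error(n, sigma=sigma)
    theoretical = sigma / sqrt(n)
    print(f"N = {n:>3}: simulated SE = {simulated:.3f}, sigma/sqrt(N) = {theoretical:.3f}")
```

Quadrupling the sample size only halves the standard error, which is why gains in precision become expensive as samples grow.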
Learning Outcomes
After completing this section, you should be able to
- estimate the population standard deviation of a random variable from the sample standard deviation of a sample taken from that population (p. 80).
- improve the accuracy of an experiment involving a random variable by using the standard error of the variable (pp. 80‑81).
Reading
Study Chapter 3.5 (pp. 75‑81) of the textbook.
Review
Think about the notion of a sampling distribution for a statistic. Watch some of the online tutorials below to achieve a better top‑down understanding as to why sampling distributions are fundamental to the procedures for hypothesis testing.
Online Maple TA Study Sessions
See Sampling Distributions: on the road to hypothesis testing
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
sampling distribution of a statistic, sampling error of a statistic, standard error of a statistic
Supplementary Resource Materials
2.6: The Central Limit Theorem
Why is the Central Limit Theorem so central to the study of statistics? It turns out that sample means from a single population, taken collectively, are actually quite well‑behaved statistically. If you think about it, averages do tend to iron out the kinks in the shape of the data caused, firstly, by certain anomalies—or the outliers, as statisticians call them—and, secondly, by incidences of extreme skewness.
In fact, it has been discovered that:
- If the population is normally distributed (in a bell‑shaped curve about its mean), then the sampling distribution of the mean is also normal, whatever the sample size;
- If the population is not normally distributed but the sample size is large (a common rule of thumb is $N\ge 30$), then the shape of the sampling distribution of the mean is approximately normal! That is the conclusion of the Central Limit Theorem.
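The second statement can be explored numerically. The sketch below assumes a strongly right‑skewed population (an exponential distribution with mean 1, chosen here purely for illustration) and compares the skewness of individual observations with the skewness of sample means for $N=30$. The helper names and the seed are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Average cubed z-score; roughly 0 for a symmetric, bell-shaped data set."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def sample_means(sample_size, n_samples=5000, seed=2):
    """Means of n_samples samples drawn from a skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


skew_single = skewness(sample_means(1))    # N = 1: the raw, skewed population
skew_thirty = skewness(sample_means(30))   # N = 30: sample means are far more symmetric

print(f"skewness of individual observations: {skew_single:.2f}")
print(f"skewness of means of samples of 30:  {skew_thirty:.2f}")
```

The means for $N=30$ are not perfectly symmetric, which is a reminder that "approximately normal" is an approximation whose quality depends on how skewed the population is.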
Learning Outcomes
After completing this section, you should be able to
- articulate the statement, the meaning, and implications of the Central Limit Theorem.
Reading
Study Chapter 3.6 (pp. 81‑84) of the textbook.
Online Maple TA Study Sessions
See Sampling Distributions: on the road to hypothesis testing
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
The Central Limit Theorem
Supplementary Resource Materials
Sections 2.6.1‑4 How the Standard Normal Distribution Leads to Hypothesis Testing
(following the path from the sampling distribution of a statistic to the normal distribution to the standard normal distribution to hypothesis testing)
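To make this path concrete, here is a minimal sketch of a one‑sample, two‑tailed z‑test. It assumes the population standard deviation is known and the sampling distribution of the mean is (approximately) normal. The numbers and the function name `one_sample_z_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-tailed p-value for H0: population mean = mu0."""
    standard_error = sigma / sqrt(n)          # SD of the sampling distribution of the mean
    z = (sample_mean - mu0) / standard_error  # standardize: distance from mu0 in standard errors
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails of the standard normal
    return z, p_value


# Hypothetical scenario: sample of 100 with mean 103, testing mu0 = 100, sigma = 15.
z, p_value = one_sample_z_test(sample_mean=103, mu0=100, sigma=15, n=100)
alpha = 0.05  # level of significance chosen before looking at the data

print(f"z = {z:.2f}, two-tailed p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```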
Learning Outcomes
After completing this section, you should be able to
- describe the properties of the sampling distribution of the sample mean.
- describe how the standard normal distribution is derived from the normal distribution.
- list and describe the properties of the standard normal distribution.
- describe how the properties of the standard normal distribution lead to the Empirical Rule, which gives tighter bounds than Chebyshev’s Theorem for a bell‑shaped distribution (a normal curve). (Chebyshev’s Theorem is “For any number $k>1$, at least $1-1/k^{2}$ of the data values lie within $k$ standard deviations of the mean.”)
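The relationship between Chebyshev’s Theorem and the Empirical Rule can be checked numerically. The following sketch compares Chebyshev’s lower bound, which holds for any distribution, with the exact fraction for a normal curve. The function names are chosen here for illustration.

```python
from statistics import NormalDist


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (any distribution)."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k**2


def normal_fraction_within(k):
    """Exact fraction of a normal distribution within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


print(f"k = 1: normal {normal_fraction_within(1):.4f} (Chebyshev gives no bound for k = 1)")
for k in (2, 3):
    print(f"k = {k}: Chebyshev at least {chebyshev_lower_bound(k):.4f}, "
          f"normal {normal_fraction_within(k):.4f}")
```

The normal fractions (about 68%, 95%, and 99.7%) are the Empirical Rule. They exceed Chebyshev’s guarantees because they rely on the extra assumption of a bell‑shaped distribution.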
Reading
Study Chapter 3.6.1‑3.6.4 (pp. 84‑89) of the textbook.
Practice Exercise
Try explaining hypothesis testing to a friend, using a specific scenario which may occur in the discipline of interest to you.
Online Maple TA Study Sessions
See Hypothesis Testing for ONE population
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
the normal distribution, the standard normal distribution, level of significance of a test, $p$‑value of a test, one‑ and two‑tailed tests of hypotheses, one sample tests for the mean and proportion
Supplementary Resource Materials and Reference Materials
Sections 2.6.5‑2.6.6 Confidence Intervals and Significance of a Test
Learning Outcomes
After completing this section, you should be able to
- describe and evaluate a confidence interval of a population parameter, given a specific level of confidence.
- explain the meaning of the significance of a hypothesis test.
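A confidence interval for a population mean can be sketched as "sample mean plus or minus a margin of error". The example below uses the standard normal critical value, which is a reasonable approximation only for fairly large samples; for small samples a Student’s t critical value would normally be used instead. The simulated data and the function name `z_confidence_interval` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def z_confidence_interval(sample, confidence=0.95):
    """Large-sample confidence interval for the population mean."""
    n = len(sample)
    x_bar = fmean(sample)
    standard_error = stdev(sample) / sqrt(n)  # estimated from the sample itself
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin_of_error = z_star * standard_error
    return x_bar - margin_of_error, x_bar + margin_of_error


# Hypothetical sample of 40 measurements (simulated, not real data).
rng = random.Random(3)
sample = [rng.gauss(100, 15) for _ in range(40)]

low95, high95 = z_confidence_interval(sample, 0.95)
low99, high99 = z_confidence_interval(sample, 0.99)
print(f"95% CI: ({low95:.2f}, {high95:.2f})")
print(f"99% CI: ({low99:.2f}, {high99:.2f})")
```

A higher level of confidence produces a wider interval: being more certain of capturing the parameter costs precision.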
Reading
Study Chapter 3.6.5‑3.6.7 (pp. 89‑92) of the textbook.
Practice Exercise
Question 3.4, parts a‑h, on pages 107‑108 of the textbook
Online Maple TA Study Sessions
See Estimation of Parameters and
Hypothesis Testing for ONE population
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
confidence intervals for a true population parameter, upper and lower limits of a confidence interval, margin of error, and Type I and Type II errors in hypothesis testing
Supplementary Resource Materials
2.7: Comparing many samples
(Introducing two‑sample hypothesis tests)
What if a researcher wants to compare two population means $({{\mu }_{1}}\ \text{and}\ {{\mu }_{2}})$ or proportions $({{p}_{1}}\ \text{and}\ {{p}_{2}})$? The simplest way to do this is to treat the difference of the two population means $({{\mu }_{1}}-{{\mu }_{2}})$ or proportions $({{p}_{1}}-{{p}_{2}})$ as a single difference parameter and then test it against the difference of the sample means $({{\bar{x}}_{1}}-{{\bar{x}}_{2}})$ or sample proportions $({{\hat{p}}_{1}}-{{\hat{p}}_{2}})$, respectively, as the statistic of comparison.
Note that the theoretical basis of these two‑sample hypothesis tests has its foundations in the shape and the means of the corresponding sampling distributions, namely the sampling distributions of ${{\bar{x}}_{1}}-{{\bar{x}}_{2}}$ and ${{\hat{p}}_{1}}-{{\hat{p}}_{2}}$, respectively.
Learning Outcomes
After completing this section, you should be able to
- articulate the conditions under which a z‑test versus a Student’s t test can be used in hypothesis tests involving two population parameters.
- explain when to use a pooled‑variance Student’s t test and when to use an unpooled‑variance Student’s t test.
- define and calculate the degrees of freedom for a t test.
- test a difference of population means.
- test a difference of population proportions.
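The difference between the pooled‑variance and unpooled‑variance tests lies in how the standard error and the degrees of freedom are computed. This sketch shows the standard formulas for both, using the Welch‑Satterthwaite approximation for the unpooled degrees of freedom. The two small data sets and the function names are invented for illustration. The standard library has no Student’s t distribution, so the sketch stops at the test statistic and degrees of freedom.

```python
from math import sqrt
from statistics import fmean, variance


def pooled_t(sample1, sample2):
    """t statistic and degrees of freedom assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)  # sample variances (N - 1 divisor)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (fmean(sample1) - fmean(sample2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def unpooled_t(sample1, sample2):
    """t statistic and approximate degrees of freedom without assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    a, b = variance(sample1) / n1, variance(sample2) / n2
    t = (fmean(sample1) - fmean(sample2)) / sqrt(a + b)
    df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))  # Welch-Satterthwaite
    return t, df


# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.8, 6.0, 5.5, 5.9]
group_b = [4.2, 4.9, 4.4, 5.0, 4.1]

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_unpooled, df_unpooled = unpooled_t(group_a, group_b)
print(f"pooled:   t = {t_pooled:.3f}, df = {df_pooled}")
print(f"unpooled: t = {t_unpooled:.3f}, df = {df_unpooled:.2f}")
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ. With unequal sizes and unequal variances the two tests can disagree noticeably.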
Reading
Study Chapter 3.7 (pp. 92‑97) of the textbook.
Practice Exercise
Read question 3.5 on pages 108‑109 of the textbook.
Using the data collected, if you were to test the hypothesis that taking 500 mg of Vitamin C per day helps to prevent the catching of colds, then
- What significance level would, in your opinion, be the most appropriate for a test of this kind?
- What would your null and alternative hypotheses be? Be sure to define your parameters.
- Which distribution would you use in this test? Justify the reasons for your choice.
- What would be the test statistic?
Online Maple TA Study Sessions
See Hypothesis Testing for TWO Populations
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
pooled‑variance Student’s t test, unpooled‑variance Student’s t test
Supplementary Resource Materials and Reference Materials
2.8: Data with categorical variables
(The role of the Chi‑squared tests)
Learning Outcomes
After completing this section, you should be able to
- list the properties of the Chi‑squared distribution and how its shape varies with the number of degrees of freedom of the test.
- define and evaluate the number of degrees of freedom of a Chi‑squared hypothesis test.
- perform Chi‑squared tests on categorical data.
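As an illustration of the last two outcomes, the Chi‑squared statistic for a test of independence sums $(O-E)^{2}/E$ over every cell of a contingency table, with $(r-1)(c-1)$ degrees of freedom. The sketch below is a minimal version of that calculation. The observed counts and the function name are hypothetical, and the resulting statistic would still need to be compared with a critical value from a Chi‑squared table.

```python
def chi_squared_independence(observed):
    """Chi-squared statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical 2 x 2 table: rows are two groups, columns are two response categories.
table = [[20, 30],
         [30, 20]]
statistic, df = chi_squared_independence(table)
print(f"chi-squared = {statistic:.2f}, degrees of freedom = {df}")
```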
Reading
Study Chapter 3.8 (pp. 97‑105) of the textbook.
Online Maple TA Study Sessions
See The Chi‑squared Tests: for categorical data
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
mean and standard deviation of a Chi‑squared distribution, the degrees of freedom of a Chi‑squared hypothesis test
Supplementary Resource Materials and Reference Materials
Written Exercise (Part II of Assignment 2)
For a given published research paper, you are asked to analyze the statistical procedures and their ensuing implications. (15%)
2.9: Other statistical tests
(the study of ways to use statistical analysis in research has just begun)
Learning Outcomes
After completing this section, you should be able to
- list at least two other statistical methods for analyzing quantitative data, stating the contexts in which each would be appropriate.
Reading
Study Chapter 3.9 (p. 105) of the textbook
Online Maple TA Study Sessions
See Review: Choosing the Right Statistical Test
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
Terms to Understand
simple regression analysis, multiple regression analysis, the analysis of variance (ANOVA) tests
Supplementary Resource Materials
Online Quiz (Part III of Assignment 2)
(to be submitted through the Maple TA assessment site)
https://mapleta10.athabascau.ca/mapleta/modules/ClassHomepage.do?cid=35
In each section of Unit 2, a review study session of the statistical ideas, terminology, and methods was provided for practice purposes. The online comprehensive quiz part of Assignment 2 is a dynamically generated selection from these review exercises, covering all sections in Unit 2. There will be a time limit of one and one‑half hours (1 hour 30 minutes) to complete the quiz. Only one (1) attempt is allowed. (8%)