Scatter diagrams, correlation coefficients, regression lines, interpolation and extrapolation.
When we collect data on two variables — such as hours of revision and exam scores, or temperature and ice cream sales — we want to know: is there a relationship, and can we use it to make predictions?
Correlation measures the strength and direction of a linear relationship. Regression gives us an equation to predict one variable from the other. Together, they are among the most widely used tools in data analysis, from medical research to economics.
A scatter diagram (or scatter plot) displays bivariate data as points on a coordinate grid. The explanatory variable (independent) goes on the x-axis; the response variable (dependent) goes on the y-axis.
By looking at a scatter diagram, you can identify:
The product moment correlation coefficient (PMCC), r, quantifies the strength and direction of a linear relationship:
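$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where:
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \quad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \quad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$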
| Value of r | Interpretation |
|---|---|
| r=1 | Perfect positive linear correlation |
| 0.7≤r<1 | Strong positive linear correlation |
| 0.3≤r<0.7 | Moderate positive linear correlation |
| 0<r<0.3 | Weak positive linear correlation |
| r=0 | No linear correlation |
| −1<r<0 | Negative linear correlation (mirror of above) |
| r=−1 | Perfect negative linear correlation |
r measures linear correlation only. A set of data with a clear curved pattern can have r≈0 because r does not detect non-linear relationships.
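As a rough illustration of this point, the sketch below computes r for a small data set in which y depends on x exactly, but through a curve. The data and the helper name `pmcc` are made up for this example and are not part of the lesson.

```python
from math import sqrt


def pmcc(xs, ys):
    """Product moment correlation coefficient from raw paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)


# Illustrative data (invented): a perfect curve y = x^2, symmetric about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pmcc(xs, ys)
print(round(r_curve, 3))
```

Here y is completely determined by x, yet r comes out as zero because the rising and falling halves of the curve cancel in the sum for Sxy. A scatter diagram would show the pattern that r misses.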
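For a data set with $n = 8$: $\sum x = 40$, $\sum y = 80$, $\sum x^2 = 220$, $\sum xy = 430$, $\sum y^2 = 900$. Find the PMCC.
$$S_{xx} = 220 - \frac{40^2}{8} = 220 - 200 = 20$$
$$S_{yy} = 900 - \frac{80^2}{8} = 900 - 800 = 100$$
$$S_{xy} = 430 - \frac{40 \times 80}{8} = 430 - 400 = 30$$
$$r = \frac{30}{\sqrt{20 \times 100}} = \frac{30}{\sqrt{2000}} = \frac{30}{44.72} = 0.671$$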
This indicates a moderate positive linear correlation.
Sense-check: r must always satisfy −1≤r≤1. If your answer falls outside this range, recheck your summary statistics.
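The calculation above can be checked with a few lines of Python. This is a minimal sketch working from summary statistics only; the function name `pmcc_from_sums` and the small rounding tolerance in the range check are assumptions made for this example.

```python
from math import sqrt


def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """PMCC from summary statistics, with the sense-check built in."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    r = s_xy / sqrt(s_xx * s_yy)
    # r must lie between -1 and 1 (allowing a tiny floating-point margin).
    if abs(r) > 1 + 1e-9:
        raise ValueError(f"r = {r:.3f} is outside [-1, 1]; recheck the summary statistics")
    return r


# Summary statistics from the worked example: n = 8.
r = pmcc_from_sums(n=8, sum_x=40, sum_y=80, sum_x2=220, sum_y2=900, sum_xy=430)
print(round(r, 3))  # 0.671
```

Raising an error when r falls outside the valid range mirrors the sense-check: an impossible value nearly always points to a mistyped sum.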
The least squares regression line minimises the sum of the squared vertical distances between the data points and the line. Its equation is:
y=a+bx
where:
$$b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
The gradient b tells us the average change in y for each unit increase in x. The line always passes through $(\bar{x}, \bar{y})$.
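To make the "least squares" idea concrete, the sketch below fits a line to a small made-up data set, then checks two properties stated above: nudging the intercept or gradient makes the sum of squared vertical distances larger, and the line passes through the mean point. The data and the function names `fit_line` and `sum_squared_residuals` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centred sums; algebraically equal to the summary-statistic forms of Sxx and Sxy.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data, invented for this sketch.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)

# Nudging the intercept or gradient either way should increase the total.
nudged = [
    sum_squared_residuals(a + da, b + db, xs, ys)
    for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]
]
print(all(s > best for s in nudged))

# The fitted line passes through the mean point.
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
print(abs((a + b * x_bar) - y_bar) < 1e-12)
```

Both checks print `True` for this data. Testing four nudges is only a spot check, not a proof that the line is optimal.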
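Using the data from the PMCC example above ($n = 8$, $\bar{x} = 5$, $\bar{y} = 10$, $S_{xx} = 20$, $S_{xy} = 30$), find the equation of the regression line of y on x.
$$b = \frac{S_{xy}}{S_{xx}} = \frac{30}{20} = 1.5$$
$$a = \bar{y} - b\bar{x} = 10 - 1.5 \times 5 = 10 - 7.5 = 2.5$$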
The regression line is y=2.5+1.5x.
Interpretation: For each one-unit increase in x, y increases by 1.5 on average.
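The same working can be expressed as a short function that takes summary statistics and returns the intercept and gradient. This is a sketch only; the name `regression_from_sums` is an assumption for this example.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the regression line of y on x, y = a + b*x."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / s_xx          # gradient
    a = y_bar - b * x_bar    # intercept, so the line passes through the mean point
    return a, b


# Summary statistics from the worked example: n = 8.
a, b = regression_from_sums(n=8, sum_x=40, sum_y=80, sum_x2=220, sum_xy=430)
print(f"y = {a} + {b}x")  # y = 2.5 + 1.5x
```

The sum of y squared is not needed here: the regression line of y on x depends only on Sxx and Sxy.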
The regression line relating temperature (x °C) and number of cold drinks sold (y) at a cafe is y=12+3.5x, based on data for temperatures between 10 °C and 30 °C.
(a) Predict the number of drinks sold when the temperature is 20 °C.
y=12+3.5×20=12+70=82
This is interpolation (20 °C is within the data range 10–30 °C), so the prediction is likely to be reliable.
(b) Predict the number of drinks sold when the temperature is 40 °C.
y=12+3.5×40=12+140=152
This is extrapolation (40 °C is outside the data range), so the prediction is unreliable. The linear relationship may not hold at extreme temperatures.
(c) Interpret the gradient.
For each additional 1 °C rise in temperature, the number of cold drinks sold increases by 3.5 on average.
Interpolation uses the model within the range of the observed data — this is generally reliable. Extrapolation uses the model beyond the data range — this is unreliable because there is no evidence the pattern continues.
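One way to build this habit into a calculation is to have the prediction report whether it is an interpolation or an extrapolation. The sketch below uses the cold drinks model from the example; the function name `predict` and its parameters are assumptions for illustration.

```python
def predict(x, a, b, x_min, x_max):
    """Predict y = a + b*x and label the prediction by where x falls."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Cold drinks model y = 12 + 3.5x, fitted on temperatures from 10 to 30 degrees C.
print(predict(20, a=12, b=3.5, x_min=10, x_max=30))  # (82.0, 'interpolation')
print(predict(40, a=12, b=3.5, x_min=10, x_max=30))  # (152.0, 'extrapolation')
```

The label does not change the number; it only flags how much trust the number deserves.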
Data for advertising spend (x in £1000s) and sales (y in £1000s) is coded using p=x−5 and q=y−20. Given that the regression line of q on p is q=2.4+1.8p, find the regression line of y on x.
Substitute back: p=x−5 and q=y−20.
y−20=2.4+1.8(x−5)
y−20=2.4+1.8x−9
y=2.4+1.8x−9+20=1.8x+13.4
The regression line of y on x is y=13.4+1.8x.
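A coding that only adds or subtracts constants, as here, does not change the gradient: the gradient of y on x equals the gradient of q on p, and only the intercept changes. If the coding also rescales a variable (for example p = x/10), the gradient changes too.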
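The substitution above can be sketched as a small function for codings of the form p = x − c and q = y − d, where only constants are subtracted. The name `decode_shift_coding` and its parameter names are hypothetical.

```python
def decode_shift_coding(a_coded, b_coded, x_shift, y_shift):
    """Convert q = a_coded + b_coded*p into y = a + b*x.

    Assumes the coding is p = x - x_shift and q = y - y_shift (shifts only).
    """
    b = b_coded                                 # gradient is unchanged by a shift
    a = a_coded - b_coded * x_shift + y_shift   # from y - y_shift = a_coded + b_coded*(x - x_shift)
    return a, b


# Worked example: q = 2.4 + 1.8p with p = x - 5 and q = y - 20.
a, b = decode_shift_coding(a_coded=2.4, b_coded=1.8, x_shift=5, y_shift=20)
print(round(a, 2), b)  # 13.4 1.8
```

The intercept is rounded before printing only to hide floating-point noise.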
The table below shows data for six students’ hours of study (x) and test scores (y).
Summary statistics: $n = 6$, $\sum x = 30$, $\sum y = 420$, $\sum x^2 = 190$, $\sum xy = 2350$, $\sum y^2 = 30400$.
(a) Find the equation of the regression line of y on x.
(b) Interpret the gradient in context.
(c) Estimate the test score for a student who studies for 7 hours. Comment on the reliability of your estimate.
(a)
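$$\bar{x} = \frac{30}{6} = 5, \quad \bar{y} = \frac{420}{6} = 70$$
$$S_{xx} = 190 - \frac{30^2}{6} = 190 - 150 = 40$$
$$S_{xy} = 2350 - \frac{30 \times 420}{6} = 2350 - 2100 = 250$$
$$b = \frac{250}{40} = 6.25, \quad a = 70 - 6.25 \times 5 = 70 - 31.25 = 38.75$$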
The regression line is y=38.75+6.25x.
(b) For each additional hour of study, the test score increases by 6.25 marks on average.
(c) y=38.75+6.25×7=38.75+43.75=82.5 marks.
If 7 hours is within the range of the data, this is interpolation and reasonably reliable. If 7 hours is outside the range, the estimate is less reliable (extrapolation).
| Concept | What to remember |
|---|---|
| PMCC r | Always between −1 and +1; measures linear correlation only |
| Gradient b | Interpret in context with units |
| Intercept a | Often has no practical meaning (e.g. “zero hours of sunshine”) |
| Interpolation | Within data range — reliable |
| Extrapolation | Outside data range — unreliable |
| Causation | Correlation ≠ causation; look for lurking variables |
| Coded data | Adding or subtracting a constant leaves the gradient unchanged and changes the intercept; rescaling changes the gradient too |
| Regression direction | y on x predicts y; x on y predicts x |
Final exam tip: If asked to “comment on reliability”, mention whether the prediction is interpolation or extrapolation AND whether the PMCC is strong enough to justify using the model.
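The "regression direction" row of the summary table can be illustrated numerically. The sketch below reuses the summary statistics from the first worked example and compares the two lines. The formula for the gradient of x on y (Sxy divided by Syy) is a standard result that this lesson does not state, so treat it as an assumption here.

```python
# Summary statistics from the lesson's first worked example.
x_bar, y_bar = 5, 10
s_xx, s_yy, s_xy = 20, 100, 30

# y on x: minimises vertical distances, used to predict y from x.
b_yx = s_xy / s_xx
a_yx = y_bar - b_yx * x_bar

# x on y: minimises horizontal distances, used to predict x from y.
# Assumption: x = c + d*y with d = Sxy / Syy (standard, but not given in the lesson).
d_xy = s_xy / s_yy
c_xy = x_bar - d_xy * y_bar

print(f"y on x: y = {a_yx} + {b_yx}x")
print(f"x on y: x = {c_xy} + {d_xy}y")

# The product of the two gradients equals r squared.
r_squared = s_xy ** 2 / (s_xx * s_yy)
print(abs(b_yx * d_xy - r_squared) < 1e-12)
```

Rearranging the second line to make y the subject does not give back the first line, so the two are not interchangeable: use y on x to predict y, and x on y to predict x. The lines coincide only when the correlation is perfect.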