30._Correlation/01-Introducing_Correlation.en.srt
1
00:00:00,000 --> 00:00:06,000
In this unit, I'll teach you about a term called correlation.
2
00:00:06,000 --> 00:00:12,000
It's an important statistical term, and by the end of this unit, you'll be able to use it yourself.
3
00:00:12,000 --> 00:00:14,000
Here's the fundamental problem.
4
00:00:14,000 --> 00:00:18,000
Sometimes, data lines up very nicely, like these points over here.
5
00:00:18,000 --> 00:00:22,000
Other times, two variables seem to be utterly unrelated.
6
00:00:22,000 --> 00:00:26,000
Correlation is a measure that ranges from -1 all the way to 1
7
00:00:26,000 --> 00:00:29,000
and tells us how well the data is described by a line.
8
00:00:29,000 --> 00:00:32,000
In both cases, you can fit a line but in one case
9
00:00:32,000 --> 00:00:36,000
it would be a really good description of the data, whereas in the other it wouldn't.
10
00:00:36,000 --> 00:00:42,000
So the correlation coefficient, which we call r, is 1 if the data is perfectly aligned on a line.
11
00:00:42,000 --> 00:00:48,000
It's 0 if there seems to be no relation between the two axes of the data, and it can also be -1,
12
00:00:48,000 --> 00:00:53,000
in the case where the data is in fact still perfectly aligned but there's a negative relationship
13
00:00:53,000 --> 00:00:55,000
between one variable and the other variable.
14
00:00:55,000 --> 00:00:57,000
Let me see if you got this.
15
00:00:57,000 --> 00:01:03,000
Here are three data sets--in one, we have a strongly positive correlation; in another,
16
00:01:03,000 --> 00:01:08,000
it's about zero; and in the third one, it will be a negative correlation.
17
00:01:08,000 --> 00:01:11,000
Which of these cases best describes each of those conditions?
18
00:01:11,000 --> 99:59:59,000
Use each condition exactly once here.
30._Correlation/02-Introducing_Correlation_Solution.en.srt
1
00:00:00,000 --> 00:00:05,000
And the answer goes like this--the reason being that here we have a clear line.
2
00:00:05,000 --> 00:00:11,000
In fact, r will probably be one and it's positive. Both variables grow at the same time.
3
00:00:11,000 --> 00:00:15,000
With this U shape, there might be a dependence but it's not linear.
4
00:00:15,000 --> 00:00:18,000
If we fit the best line, it's going to be flat, and in this flat line
5
00:00:18,000 --> 00:00:21,000
there is really no dependence between the x and the y variable.
6
00:00:21,000 --> 00:00:26,000
So knowing x does not tell you about y, and vice versa. That tends to give r=0,
7
00:00:26,000 --> 00:00:29,000
and in this last case, there is a negatively sloped line.
8
00:00:29,000 --> 00:00:35,000
The best line fit will go something like this and r might be as small as -0.2, but it is negative,
9
00:00:35,000 --> 99:59:59,000
the reason being that the line down here slopes downward.
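The three quiz cases can be sketched numerically. This is a minimal illustration, not the data shown in the video: the three data sets below are made up to mimic a clear rising line, a U shape, and a loose downward trend, and the helper name `pearson_r` is hypothetical. It uses the formula for r that is introduced later in this unit.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data, chosen only to mimic the three shapes in the quiz.
line_r = pearson_r([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])        # points on a rising line
u_shape_r = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x squared
noisy_down_r = pearson_r([1, 2, 3, 4, 5], [5, 1, 4, 2, 3])  # loose downward trend

print(line_r, u_shape_r, noisy_down_r)
```

With these particular numbers, the rising line gives r = 1, the symmetric U shape gives r = 0 even though y clearly depends on x, and the loose downward trend gives a small negative value.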
30._Correlation/03-Correlation_From_Regression.en.srt
1
00:00:00,000 --> 00:00:08,000
Let me ask a different quiz question--suppose you ran linear regression and found that b=4 and a=-3.
2
00:00:08,000 --> 00:00:12,000
To remind you, this describes the following linear relationship between x and y.
3
00:00:12,000 --> 00:00:15,000
My question is about the correlation coefficient r.
4
00:00:15,000 --> 99:59:59,000
Will r be positive, negative, zero, or can't we tell? Check exactly one box.
30._Correlation/04-Correlation_From_Regression_Solution.en.srt
1
00:00:00,000 --> 00:00:03,000
I should say this is not an easy question. And the answer is positive.
2
00:00:03,000 --> 00:00:08,000
Whatever the data was, this is roughly what the line looks like,
3
00:00:08,000 --> 00:00:11,000
and that means in the data, there is a tendency that as you increase x
4
00:00:11,000 --> 00:00:15,000
there will be an increase in y, so that means the correlation will be positive.
5
00:00:15,000 --> 00:00:21,000
We can't tell what value it is. If the data fits exactly onto the line, then it would be one.
6
00:00:21,000 --> 00:00:25,000
It could also be that the data has enormous deviation from this line,
7
00:00:25,000 --> 00:00:27,000
and this is the best fitting line, in which case
8
00:00:27,000 --> 00:00:32,000
the correlation coefficient will be still larger than zero but it might be much closer to zero,
9
00:00:32,000 --> 99:59:59,000
depending on the amount of deviation of those points from the line.
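A small sketch of why the sign of r follows the sign of the slope while its size depends on the scatter. It assumes that b is the slope and a the intercept, so the fitted line is y = 4x - 3; the x values and the `noise` list are made up for illustration, and `slope_and_r` is a hypothetical helper.

```python
import math

def slope_and_r(xs, ys):
    """Return (b, r): the least-squares slope and the correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # b and r share the numerator sxy, and both denominators are positive,
    # so b and r always have the same sign.
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

xs = [0, 1, 2, 3, 4]
noise = [2, -3, 1, -2, 2]  # hypothetical deviations, made up for this sketch

exact = [4 * x - 3 for x in xs]                         # exactly on y = 4x - 3
small = [4 * x - 3 + e for x, e in zip(xs, noise)]      # small deviations
large = [4 * x - 3 + 5 * e for x, e in zip(xs, noise)]  # five times larger

for label, ys in [("exact", exact), ("small", small), ("large", large)]:
    b, r = slope_and_r(xs, ys)
    print(label, round(b, 2), round(r, 3))
```

In all three cases the slope and r come out positive; r is 1 for the exact line and moves toward 0 as the deviations grow.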
30._Correlation/05-Correlation_Formula.en.srt
1
00:00:00,000 --> 00:00:04,000
To summarize, the correlation coefficient, which you're just about to learn to compute,
2
00:00:04,000 --> 00:00:12,000
is a value between -1 and 1 that tells us how related or correlated two variables are,
3
00:00:12,000 --> 00:00:17,000
and both 1 and -1 stand for perfectly linear data.
4
00:00:17,000 --> 00:00:23,000
In the case of +1, we know that this line increases in x and y simultaneously
5
00:00:23,000 --> 00:00:26,000
whereas for -1, we have the inverse effect.
6
00:00:26,000 --> 00:00:29,000
Let's compute r--my favorite way to compute it is very similar
7
00:00:29,000 --> 00:00:32,000
to the way we computed b in linear regression.
8
00:00:32,000 --> 00:00:38,000
It takes the sum over all data points of the product of (xi - x-bar)
9
00:00:38,000 --> 00:00:46,000
and (yi - y-bar) for each data point, and then we have to normalize.
10
00:00:46,000 --> 00:00:51,000
This could be any value. It isn't necessarily between -1 and 1.
11
00:00:51,000 --> 00:01:03,000
We normalize by √(Σ(xi - x-bar)² · Σ(yi - y-bar)²): the sum over all i of the squared x deviations times the same expression for y, under one square root.
12
00:01:03,000 --> 00:01:06,000
Now this looks a little bit wild, and this is probably the worst formula
13
00:01:06,000 --> 00:01:10,000
you've encountered in this class in terms of complexity, but once you dive in,
14
00:01:10,000 --> 00:01:15,000
you'll realize this is really related to a lot of stuff you've seen before, such as variances.
15
00:01:15,000 --> 00:01:20,000
This is the quintessential term that occurs in the variance calculation of x.
16
00:01:20,000 --> 00:01:23,000
All that's missing is the normalizer.
17
00:01:23,000 --> 00:01:27,000
Same over here, this is the variance of y, again without the normalizer.
18
00:01:27,000 --> 00:01:31,000
We take the product of the variance of x and the variance of y, modulo the normalizers,
19
00:01:31,000 --> 00:01:34,000
and we get something quadratic even in variance space.
20
00:01:34,000 --> 00:01:40,000
This over here is kind of a mixed variance. This is often called the covariance.
21
00:01:40,000 --> 00:01:43,000
But notice that there is also a normalizer missing in this thing over here.
22
00:01:43,000 --> 00:01:47,000
In fact, the missing normalizers on top and bottom of this bar
23
00:01:47,000 --> 00:01:49,000
cancel each other out; hence, I just omitted them.
24
00:01:49,000 --> 00:01:57,000
But this one is just like the variance calculation but it mixes x's and y's whereas these are x² and y².
25
00:01:57,000 --> 00:02:00,000
This is often called the covariance once you've normalized it, because
26
00:02:00,000 --> 00:02:05,000
it is the variance calculation of two co-occurring variables.
27
00:02:05,000 --> 00:02:07,000
These are the variances.
28
00:02:07,000 --> 00:02:11,000
So what this really tells you is kind of the ratio of how much these two things co-evolve,
29
00:02:11,000 --> 00:02:17,000
how much the errors correspond, normalized by the magnitudes of the errors individually,
30
00:02:17,000 --> 00:02:20,000
and when the ratio becomes 1, we have a perfect correlation.
31
00:02:20,000 --> 00:02:27,000
When the ratio becomes 0, then the numerator is 0, which means our errors cancel each other out.
32
00:02:27,000 --> 00:02:30,000
That is, the errors behave very differently for x and for y under any linear model.
33
00:02:30,000 --> 99:59:59,000
So this complicated formula is what's called the correlation coefficient r. So let's try this out.
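Written out, the formula described in this segment looks as follows. The symbol n for the number of data points is an assumption added here; the second line only spells out the remark that the normalizers on top and bottom cancel.

```latex
r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i} (x_i - \bar{x})^2 \; \sum_{i} (y_i - \bar{y})^2}}

% With the normalizers 1/n kept in place, the same ratio reads
% covariance over the square root of the product of the variances:
r = \frac{\frac{1}{n}\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\frac{1}{n}\sum_{i} (x_i - \bar{x})^2 \cdot \frac{1}{n}\sum_{i} (y_i - \bar{y})^2}}
```

The factor 1/n in the numerator and the factor sqrt(1/n · 1/n) = 1/n in the denominator cancel, which is why they can be omitted.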
30._Correlation/06-Compute_Correlation_1.en.srt
1
00:00:00,000 --> 00:00:06,000
Let me give you a data set: x=3, 4, 5, and for those x's we get y=7, 8, and 9.
2
00:00:06,000 --> 00:00:11,000
The first data item would be (3, 7). Second (4, 8). Third (5, 9).
3
00:00:11,000 --> 00:00:15,000
It's easy to see that the mean x-bar is 4, mean y-bar is 8
4
00:00:15,000 --> 00:00:25,000
and this gives us x - x-bar and y - y-bar, the new numbers -1, 0, 1 and -1, 0, 1 here again.
5
00:00:25,000 --> 00:00:29,000
So these are the three mean-normalized data points.
6
00:00:29,000 --> 99:59:59,000
Let's now compute these three values over here for this example. Give me the first one.
30._Correlation/07-Compute_Correlation_1_Solution.en.srt
1
00:00:00,000 --> 00:00:05,000
The answer is 2--if we multiply, for each data point, the first expression by the second expression,
2
00:00:05,000 --> 99:59:59,000
we get -1 × -1, which is 1, then 0, and 1 again, and that adds up to 2.
30._Correlation/08-Compute_Correlation_2.en.srt
1
00:00:00,000 --> 00:00:04,000
Please calculate the expression over here and put it in this box.
30._Correlation/09-Compute_Correlation_2_Solution.en.srt
1
00:00:00,000 --> 00:00:04,000
And the answer is 2 again. (-1)² + 0² + 1² = 2.
2
00:00:04,000 --> 99:59:59,000
And the third one will give you the exact same 2 as before. I'll just do this for you.
30._Correlation/10-Compute_Correlation_3.en.srt
1
00:00:00,000 --> 00:00:03,000
And we now work it out. What do you think is the answer?
30._Correlation/11-Compute_Correlation_3_Solution.en.srt
1
00:00:00,000 --> 00:00:07,000
And yes, the answer is 1. 2 × 2 is 4. The square root of this is 2 again. 2/2 gives us 1.
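The three boxes and the final ratio for this data set can be reproduced step by step. A minimal sketch; the variable names are chosen here for illustration.

```python
import math

xs = [3, 4, 5]
ys = [7, 8, 9]

x_bar = sum(xs) / len(xs)  # 4.0
y_bar = sum(ys) / len(ys)  # 8.0

dx = [x - x_bar for x in xs]  # [-1.0, 0.0, 1.0]
dy = [y - y_bar for y in ys]  # [-1.0, 0.0, 1.0]

numerator = sum(a * b for a, b in zip(dx, dy))  # first box: 2.0
sum_sq_x = sum(a * a for a in dx)               # second box: 2.0
sum_sq_y = sum(b * b for b in dy)               # third box: 2.0

r = numerator / math.sqrt(sum_sq_x * sum_sq_y)  # 2 / sqrt(4) = 1.0
print(numerator, sum_sq_x, sum_sq_y, r)
```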
30._Correlation/12-Guess_R.en.srt
1
00:00:00,000 --> 00:00:06,000
Let's now work with a different data set. We write 2, 5, 8 for y, which gives us a mean of 5 for y.
2
00:00:06,000 --> 00:00:10,000
Before doing any calculations, let's take a guess what r might be.
3
00:00:10,000 --> 00:00:14,000
Is r going to be 1, 3, 2, or 0?
4
00:00:14,000 --> 99:59:59,000
Check one of these boxes over here.
30._Correlation/13-Guess_R_Solution.en.srt
1
00:00:00,000 --> 00:00:06,000
One is actually correct, and by virtue of what I told you before, you can figure out that r has to be 1.
2
00:00:06,000 --> 00:00:11,000
The reason is we know that r is between -1 and +1 so it can't be 3 or 2.
3
00:00:11,000 --> 00:00:15,000
And 0 is kind of this pessimistic case where there's no relationship whatsoever
4
00:00:15,000 --> 00:00:18,000
between the data, but clearly this data fits on a line.
5
00:00:18,000 --> 00:00:22,000
In fact, when it fits on a line, no matter what the steepness of the line is, as long as it isn't flat,
6
00:00:22,000 --> 00:00:26,000
and there's a positive relationship, no matter how small or how large,
7
00:00:26,000 --> 99:59:59,000
it's going to be 1--so this is the correct answer.
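The claim that steepness does not matter can be checked with a short sketch. The slopes and the intercept of 2 below are arbitrary choices made for illustration, x is assumed to still be 3, 4, 5, and `pearson_r` is a hypothetical helper that returns None for a flat line, where the formula would divide by zero.

```python
import math

def pearson_r(xs, ys):
    """Return r, or None when either variable has no spread (r is undefined)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

xs = [3, 4, 5]
results = {}
for slope in [0.1, 1, 3, 100, -3, 0]:
    ys = [slope * x + 2 for x in xs]  # intercept 2 is arbitrary
    results[slope] = pearson_r(xs, ys)
    print(slope, results[slope])

# The quiz data itself: y = 2, 5, 8 lies on a line with slope 3.
quiz_r = pearson_r(xs, [2, 5, 8])
print(quiz_r)
```

Every positive slope gives r equal to 1 up to floating-point rounding, the negative slope gives -1, and the flat line has no defined r in this sketch.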
30._Correlation/14-Compute_Actual_1.en.srt
1
00:00:00,000 --> 00:00:03,000
Let's see if you can find this. I filled out this table for you.
2
00:00:03,000 --> 99:59:59,000
Now you can use this table to calculate the numerator of the fraction over here.
30._Correlation/15-Compute_Actual_1_Solution.en.srt
1
00:00:00,000 --> 00:00:08,000
And the answer is 6. -1 × -3 is 3. Add to it 0. Add to it another 3. It ends up being 6.
30._Correlation/16-Compute_Actual_2.en.srt
1
00:00:00,000 --> 00:00:02,000
What's the value over here?
30._Correlation/17-Compute_Actual_2_Solution.en.srt
1
00:00:00,000 --> 00:00:06,000
Clearly, it's 2. (-1)² is 1, then 0, then another 1. Add those together and you have 2.
30._Correlation/18-Compute_Actual_3.en.srt
1
00:00:00,000 --> 00:00:02,000
And how about the value over here?
30._Correlation/19-Compute_Actual_3_Solution.en.srt
1
00:00:00,000 --> 00:00:07,000
It's 18. The reason being (-3)² is 9. Add to it 0. Add to it another 9. We get 18.
30._Correlation/20-Compute_Actual_4.en.srt
1
00:00:00,000 --> 00:00:02,000
So what do we get down here?
30._Correlation/21-Compute_Actual_4_Solution.en.srt
1
00:00:00,000 --> 00:00:06,000
Well, 18 × 2 makes 36. Square root of this is 6. 6/6 is 1.
2
00:00:06,000 --> 00:00:09,000
We've just proven to ourselves that 1 is the correlation
3
00:00:09,000 --> 99:59:59,000
coefficient even for this data set over here.
30._Correlation/22-Another_Example_1.en.srt
1
00:00:00,000 --> 00:00:05,000
Now, how about we switch the order of the y from 2, 5, 8 to 8, 5, 2?
2
00:00:05,000 --> 00:00:08,000
The mean stays the same, but several of these values over here change.
3
00:00:08,000 --> 00:00:11,000
And before I compute it, let me again test your intuition.
4
00:00:11,000 --> 99:59:59,000
Is r larger than 0? r = 0? Or r smaller than 0? Pick one.
30._Correlation/23-Another_Example_1_Solution.en.srt
1
00:00:00,000 --> 00:00:03,000
And the answer is smaller than 0. There's a negative correlation.
2
00:00:03,000 --> 00:00:05,000
When x goes up, y goes down.
3
00:00:05,000 --> 00:00:10,000
I think if you draw out the data, you get something like this for x and y.
4
00:00:10,000 --> 99:59:59,000
And that data fits perfectly on a line, which makes me believe r = -1, a perfect negative correlation.
30._Correlation/24-Another_Example_2.en.srt
1
00:00:00,000 --> 00:00:03,000
Let me just fill in the table over here. 3, 0, -3.
2
00:00:03,000 --> 00:00:07,000
In our equation, this term doesn't change because it only depends on x
3
00:00:07,000 --> 00:00:09,000
and I haven't changed x at all.
4
00:00:09,000 --> 99:59:59,000
But let's compute the numerator over here.
30._Correlation/25-Another_Example_2_Solution.en.srt
1
00:00:00,000 --> 00:00:09,000
The answer is -6. -1 × 3 is -3. Add to it 0. Add to it another -3. You get -6.
30._Correlation/26-Another_Example_3.en.srt
1
00:00:00,000 --> 00:00:02,000
What's the value over here?
30._Correlation/27-Another_Example_3_Solution.en.srt
1
00:00:00,000 --> 00:00:06,000
And just like before, it's 18. We reordered the numbers, but the sum is still 9+9.
30._Correlation/28-Another_Example_4.en.srt
1
00:00:00,000 --> 00:00:02,000
So what do we get over here?
30._Correlation/29-Another_Example_4_Solution.en.srt
1
00:00:00,000 --> 00:00:07,000
It's -1. Again, 2 × 18 gives us 36. Square root is 6. -6/6 gives us -1.
2
00:00:07,000 --> 99:59:59,000
This is the correct correlation.
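Reversing the order of y flips the sign of the numerator and leaves both sums of squares unchanged, which a short sketch can confirm. It assumes x is still 3, 4, 5 and that the y values are 2, 5, 8 and their reversal 8, 5, 2, consistent with the table entries 3, 0, -3; `correlation_parts` is a hypothetical helper.

```python
import math

def correlation_parts(xs, ys):
    """Return (numerator, sum_sq_x, sum_sq_y, r) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy, sxx, syy, sxy / math.sqrt(sxx * syy)

xs = [3, 4, 5]
rising = correlation_parts(xs, [2, 5, 8])   # (6.0, 2.0, 18.0, 1.0)
falling = correlation_parts(xs, [8, 5, 2])  # (-6.0, 2.0, 18.0, -1.0)

print(rising)
print(falling)
```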
30._Correlation/30-R_Intuition.en.srt
1
00:00:00,000 --> 00:00:07,000
Now, let's do something tricky. Let's use 8, 5, 8 for y, which gives us a mean of 7 for y.
2
00:00:07,000 --> 00:00:11,000
And the following table down here 1, -2, 1. Let me test your intuition.
3
00:00:11,000 --> 00:00:16,000
Is r larger than 0? r = 0? Or r smaller than 0?
4
00:00:16,000 --> 99:59:59,000
Which means positive correlation, no correlation, negative correlation. Pick one.
30._Correlation/31-R_Intuition_Solution.en.srt
1
00:00:00,000 --> 00:00:03,000
And you could do the intuition thing which is you can look at this
2
00:00:03,000 --> 00:00:07,000
and arrive at the point that there's no correlation and this is correct.
3
00:00:07,000 --> 00:00:12,000
It's a little bit tricky, but you can see that as x goes up from 3 to 4, y shrinks.
4
00:00:12,000 --> 00:00:17,000
Then as x goes from 4 to 5, the opposite happens to y, and it increases by the same amount.
5
00:00:17,000 --> 00:00:22,000
That means our data set will look as follows--up, down, and up, and as it happens,
6
00:00:22,000 --> 00:00:27,000
we already saw that the best fit is the horizontal line, and the horizontal line just means
7
00:00:27,000 --> 00:00:31,000
x and y are completely independent for this best linear fit.
8
00:00:31,000 --> 00:00:36,000
Knowing x tells you nothing about y and knowing y tells you nothing about x,
9
00:00:36,000 --> 00:00:40,000
and this kind of independence leads to a coefficient of 0.
10
00:00:40,000 --> 00:00:47,000
Let's check it. What's the field over here? It's 0. It's 0 because -1 + 0 + 1 = 0.
11
00:00:47,000 --> 00:00:53,000
That's interesting, even though this field over here ends up being 6: 1² + (-2)² + 1².
12
00:00:53,000 --> 99:59:59,000
Zero divided by anything nonzero gives us 0 in the end. So let's look at one final case.
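The same check in code, as a minimal sketch with x assumed to still be 3, 4, 5: the products of the deviations cancel, so the numerator is 0 while both sums of squares stay positive.

```python
import math

xs = [3, 4, 5]
ys = [8, 5, 8]

x_bar = sum(xs) / len(xs)  # 4.0
y_bar = sum(ys) / len(ys)  # 7.0

dx = [x - x_bar for x in xs]  # deviations of x: -1, 0, 1
dy = [y - y_bar for y in ys]  # deviations of y: 1, -2, 1

products = [a * b for a, b in zip(dx, dy)]  # -1, 0, 1
numerator = sum(products)                   # the terms cancel to 0
sum_sq_x = sum(a * a for a in dx)           # 2
sum_sq_y = sum(b * b for b in dy)           # 6

r = numerator / math.sqrt(sum_sq_x * sum_sq_y)
print(numerator, sum_sq_x, sum_sq_y, r)
```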
30._Correlation/32-Final_Example_1.en.srt
1
00:00:00,000 --> 00:00:07,000
This is our data. Now y goes from 8 to 3 up to 7. It looks something like this--8, 3, back to 7.
2
00:00:07,000 --> 99:59:59,000
Clearly, it doesn't look very correlated. Check your intuition. Is r larger than 0, equal to 0, or smaller than 0?
30._Correlation/33-Final_Example_1_Solution.en.srt
1
00:00:00,000 --> 00:00:03,000
And the answer ends up being smaller than 0.
2
00:00:03,000 --> 00:00:08,000
You can see from the data points that if this blue point over here were up here,
3
00:00:08,000 --> 00:00:12,000
then the horizontal line would be the best fit, but the blue point is a little bit lower,
4
00:00:12,000 --> 99:59:59,000
so a slightly downward-tilted line, like this one over here, ends up being a better fit.
30._Correlation/34-Final_Example_2.en.srt
1
00:00:00,000 --> 00:00:04,000
So let's look at this and compute the mean y value, which is 6.
2
00:00:04,000 --> 00:00:08,000
I'll fill in this table for you: 2, -3, 1, which is this row over here
3
00:00:08,000 --> 99:59:59,000
minus 6, and now give me the first value over here.
30._Correlation/35-Final_Example_2_Solution.en.srt
1
00:00:00,000 --> 00:00:06,000
It's -1. -2+0+1=-1.
30._Correlation/36-Final_Example_3.en.srt
1
00:00:00,000 --> 00:00:03,000
Give me this value over here.
30._Correlation/37-Final_Example_3_Solution.en.srt
1
00:00:00,000 --> 00:00:06,000
Now, I think it's 14--2² is 4 plus 9 is 13 plus 1 is 14.
30._Correlation/38-Final_Example_4.en.srt
1
00:00:00,000 --> 00:00:03,000
Let's go to the final value over here. What is it?
30._Correlation/39-Final_Example_4_Solution.en.srt
1
00:00:00,000 --> 00:00:05,000
And now we need a calculator--you have the square root of 28 and you divide -1
2
00:00:05,000 --> 00:00:11,000
by that square root and get approximately -0.189 when I work this out.
3
00:00:11,000 --> 00:00:17,000
There's a negative correlation, but it's weak. The data really isn't well described by a linear function.
4
00:00:17,000 --> 99:59:59,000
If the data instead were to lie exactly on this line, then r would be -1, a perfect negative correlation.
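The calculator step can be sketched as follows, again assuming x is 3, 4, 5. The denominator is the square root of 2 × 14 = 28.

```python
import math

xs = [3, 4, 5]
ys = [8, 3, 7]

x_bar = sum(xs) / len(xs)  # 4.0
y_bar = sum(ys) / len(ys)  # 6.0

# One (x deviation, y deviation) pair per data point.
rows = [(x - x_bar, y - y_bar) for x, y in zip(xs, ys)]

numerator = sum(dx * dy for dx, dy in rows)  # -2 + 0 + 1 = -1
sum_sq_x = sum(dx * dx for dx, _ in rows)    # 1 + 0 + 1 = 2
sum_sq_y = sum(dy * dy for _, dy in rows)    # 4 + 9 + 1 = 14

r = numerator / math.sqrt(sum_sq_x * sum_sq_y)  # -1 / sqrt(28)
print(round(r, 3))  # -0.189
```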
30._Correlation/40-Summary.en.srt
1
00:00:00,000 --> 00:00:04,000
In summary, you learned about the correlation coefficient.
2
00:00:04,000 --> 00:00:06,000
It's larger than 0
3
00:00:06,000 --> 00:00:09,000
if there's a positive relationship between x and y.
4
00:00:09,000 --> 00:00:12,000
It's smaller than 0 if the relationship is negative.
5
00:00:12,000 --> 00:00:15,000
It's equal to 0 if there is no linear relationship.
6
00:00:15,000 --> 00:00:22,000
The magnitude of r goes to 1 as the relationship becomes increasingly linear
7
00:00:22,000 --> 00:00:25,000
without any noise or any deviation from the line.
8
00:00:25,000 --> 00:00:27,000
This is a powerful measure.
9
00:00:27,000 --> 00:00:30,000
For any data set with multiple variables,
10
00:00:30,000 --> 00:00:34,000
you can now tell how much the variables relate to each other.
11
00:00:34,000 --> 00:00:38,000
If someone, for example, shows you the salary of a person and the age of a person,
12
00:00:38,000 --> 00:00:43,000
you can say they're really correlated, or you could say they're not correlated at all.
13
00:00:43,000 --> 00:00:47,000
Whatever you say, with this--what I believe to be a very simple formula,
14
00:00:47,000 --> 00:00:52,000
the formula right over here--you can now compute for any data set how much x and y relate.
15
00:00:52,000 --> 00:00:56,000
That's called the correlation, and it's really an important lesson in statistics.
16
00:00:56,000 --> 00:01:00,000
I use it all the time to inspect data and to make a statement
17
00:01:00,000 --> 99:59:59,000
about how much two variables relate to each other in a linear way. Thank you.