What is the difference between correlation and pairwise correlation in Stata?
In Stata, corr and pwcorr differ in how they handle missing values.
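First, decide what you want. Should each correlation use every observation that has values for that particular pair of variables, even if other variables are missing? Or do you want a complete case analysis, in which every correlation uses only the observations with no missing values on any of the variables?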
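This Statalist post helped me to understand the difference between correlation (corr) and pairwise correlation (pwcorr) in Stata. In general, corr uses listwise (casewise) deletion: an observation that is missing on any of the listed variables is dropped from every correlation. pwcorr uses pairwise deletion: each correlation is computed from all observations that are non-missing for that particular pair of variables.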
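One way to write the distinction down is sketched here. The notation ($S$, $S_{xy}$, $\bar{x}_S$) is my own shorthand and does not come from the Stata documentation. Both commands compute the same Pearson correlation; they only differ in which set of observations goes into it.

```latex
% Pearson correlation of x and y computed over a set S of observations;
% \bar{x}_S and \bar{y}_S are the means of x and y over that same set.
\[
r_{xy}(S) =
  \frac{\sum_{i \in S} (x_i - \bar{x}_S)(y_i - \bar{y}_S)}
       {\sqrt{\sum_{i \in S} (x_i - \bar{x}_S)^2}\,
        \sqrt{\sum_{i \in S} (y_i - \bar{y}_S)^2}}
\]

% corr (listwise deletion): one set, shared by every pair of variables.
\[
S_{\text{listwise}} = \{\, i : \text{every listed variable is non-missing for } i \,\}
\]

% pwcorr (pairwise deletion): a separate set for each pair (x, y).
\[
S_{xy} = \{\, i : x_i \text{ and } y_i \text{ are both non-missing} \,\}
\]
```

When no listed variable has missing values, the two sets coincide for every pair, so the two commands return the same correlations.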
Let’s expand on the Statalist example using Stata.
clear // clear any data from memory
sysuse auto // load the auto dataset shipped with Stata
desc // describe the data
keep rep78 price trunk // keep only the three variables used in the example
desc // check that only these variables remain
codebook // inspect the contents of each variable
misstable summarize // summarize missing values (lists only variables that have them)
// rep78 contains 5 missing values
tab rep78, m // tabulate rep78, including missing values
Here’s the output of our tabulation of rep78.
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00
The tabulation confirms that rep78 has 5 missing values. Because misstable summarize listed no other variable, price and trunk are complete.
Assessing correlations with corr
corr rep78 price trunk
. corr rep78 price trunk
(obs=69)
             |    rep78    price    trunk
-------------+---------------------------
       rep78 |   1.0000
       price |   0.0066   1.0000
       trunk |  -0.1572   0.3232   1.0000
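We can see that corr used only 69 observations for every correlation. The 5 observations with missing rep78 were dropped from all comparisons, including the correlation between price and trunk, which themselves have no missing values.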
Assessing pairwise correlations with pwcorr
pwcorr rep78 price trunk, obs //I'm adding the obs option to see the number of observations for each entry
. pwcorr rep78 price trunk, obs
             |    rep78    price    trunk
-------------+---------------------------
       rep78 |   1.0000
             |       69
             |
       price |   0.0066   1.0000
             |       69       74
             |
       trunk |  -0.1572   0.3143   1.0000
             |       69       74       74
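As you can see, pwcorr uses all observations available for each pair of variables. The correlations involving rep78 are still based on 69 observations, but the correlation between price and trunk now uses all 74, which is why it changes from 0.3232 to 0.3143.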
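To see where the 69 and 74 come from, you can count the usable observations for each pair yourself. This is a minimal sketch that assumes the three-variable auto data from above are still in memory. The counts in the comments follow from the output already shown: 74 observations, 5 of them missing rep78.

```stata
* missing(a, b, ...) is true if ANY of its arguments is missing,
* so !missing(a, b) selects observations where both are present.
count if !missing(rep78, price)          // pair used for rep78-price: 69
count if !missing(rep78, trunk)          // pair used for rep78-trunk: 69
count if !missing(price, trunk)          // pair used for price-trunk: 74
count if !missing(rep78, price, trunk)   // complete cases used by corr: 69
```

The first three counts match the per-cell counts reported by pwcorr with the obs option, and the last one matches the single (obs=69) reported by corr.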
Finally, let’s follow the example from Statalist to reproduce the results from corr with the pwcorr command.
pwcorr rep78 price trunk if !missing(rep78, price, trunk) //excluding missing for any of the variables
             |    rep78    price    trunk
-------------+---------------------------
       rep78 |   1.0000
       price |   0.0066   1.0000
       trunk |  -0.1572   0.3232   1.0000
Since I’m often interested in complete case analyses, using corr is my best bet.
However, whether you use corr or pwcorr may depend on where you are in your analyses. For example, if you have already narrowed your dataset down to your analytic sample and no missing values remain on the variables of interest, then corr and pwcorr will give the same correlations.
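You can check this for yourself by restricting the data to complete cases first. This is only a sketch, using the same three auto variables as above; preserve and restore are there so that dropping observations does not permanently change the data in memory.

```stata
sysuse auto, clear
keep rep78 price trunk

preserve                                  // take a snapshot so the drop below can be undone
drop if missing(rep78, price, trunk)      // keep complete cases only
corr rep78 price trunk                    // listwise deletion, nothing left to delete
pwcorr rep78 price trunk, obs             // pairwise deletion, every cell has the same N
restore                                   // bring back the full 74 observations
```

With no missing values left, the set of observations is the same for every pair, so the two correlation matrices should agree.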
You can browse additional options for corr or pwcorr here.