What is the difference between correlation and pairwise correlation in Stata?

In Stata, corr and pwcorr differ in how they handle missing values.

First, decide what you need. Do you want each correlation computed from every observation available for that pair of variables, even when another variable is missing? Or do you want a complete case analysis, in which every correlation is based on the same set of observations?

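This Statalist post helped me to understand the difference between correlation (corr) and pairwise correlation (pwcorr) in Stata. In short, corr uses listwise (casewise) deletion: it drops any observation with a missing value on any of the listed variables. pwcorr uses pairwise deletion: each correlation is computed from all observations that are nonmissing on that particular pair of variables.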

Let’s take the Statalist example and expand on it in Stata.

clear //clears Stata memory
sysuse auto //uses the shipped auto data from Stata
desc //describes the data

keep rep78 price trunk //let's focus on the three variables in the example by keeping only them in memory
desc //let's check to be sure we kept these variables

codebook //let's see the data contents for each variable
misstable summarize //summarizes the variables that have missing values
// rep78 contains 5 missing values

tab rep78, m //let's check again for missing

Here’s the output of our tabulation of rep78.

     Repair |
Record 1978 |      Freq.     Percent        Cum.
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
      Total |         74      100.00

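The tabulation confirms the 5 missing values in rep78. Since misstable summarize lists only variables that have missing values and it listed only rep78, price and trunk are complete.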
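
To check each variable directly rather than from a single tabulation, a short loop can count missing values per variable. This is a minimal sketch that is not part of the Statalist example; it reloads the auto data so it runs on its own, and it should report 5 for rep78 and 0 for price and trunk.

```stata
sysuse auto, clear                      // reload the shipped auto data
foreach v of varlist rep78 price trunk {
    quietly count if missing(`v')       // count leaves its result in r(N)
    display "`v': " r(N) " missing"
}
```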

Assessing correlations with corr

corr rep78 price trunk

. corr rep78 price trunk
(obs=69)

             |    rep78    price    trunk
       rep78 |   1.0000
       price |   0.0066   1.0000
       trunk |  -0.1572   0.3232   1.0000

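corr used only 69 observations, the cars with no missing value on any of the three variables, for every correlation in the matrix. The 5 cars with a missing rep78 were dropped even from the price-trunk correlation.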
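
The sample size corr used can also be pulled from its stored results instead of being read off the screen. A minimal sketch, again on the auto data; r(N) and r(C) are results that correlate leaves behind after it runs.

```stata
sysuse auto, clear
quietly corr rep78 price trunk
display r(N)          // number of observations used for the whole matrix
matrix list r(C)      // the correlation matrix itself
```

This is handy in a do-file when you want to record or check the analytic sample size.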

Assessing pairwise correlations with pwcorr

pwcorr rep78 price trunk, obs //I'm adding the obs option to see the number of observations for each entry

. pwcorr rep78 price trunk, obs

             |    rep78    price    trunk
       rep78 |   1.0000
             |       69
       price |   0.0066   1.0000
             |       69       74
       trunk |  -0.1572   0.3143   1.0000
             |       69       74       74

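As you can see, pwcorr uses every observation available for each pair of variables: the correlations involving rep78 are based on 69 observations, while the price-trunk correlation uses all 74. That is why the price-trunk correlation here (0.3143) differs from the corr result (0.3232), which was limited to the 69 complete cases.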
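
The per-pair sample sizes can be reproduced by hand, which makes the pairwise logic explicit. This is a sketch on the same auto data, not something from the Statalist post. It counts the observations that are nonmissing on each pair, then runs corr on price and trunk alone: with no missing values on those two variables, corr should use all 74 observations and match the pwcorr entry.

```stata
sysuse auto, clear

// observations available for each pair of variables
quietly count if !missing(rep78, price)
display "rep78-price: " r(N)
quietly count if !missing(rep78, trunk)
display "rep78-trunk: " r(N)
quietly count if !missing(price, trunk)
display "price-trunk: " r(N)

// corr on just these two variables drops nothing
corr price trunk
```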

Finally, let’s follow the example from Statalist to reproduce the results from corr with the pwcorr command.

pwcorr rep78 price trunk if !missing(rep78, price, trunk) //excluding missing for any of the variables

      |    rep78    price    trunk
rep78 |   1.0000 
price |   0.0066   1.0000 
trunk |  -0.1572   0.3232   1.0000 

Since I’m often interested in complete case analyses, using corr is my best bet.

However, whether you use corr or pwcorr may depend on where you are in your analyses. For example, if you have already restricted your dataset to your analytic sample and it has no missing values on the variables of interest, then corr and pwcorr will give identical results.
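
As an illustration of that last point, here is a minimal sketch that first restricts the auto data to complete cases and then runs both commands. The keep step is my own addition for demonstration; note that it drops observations from memory.

```stata
sysuse auto, clear
keep if !missing(rep78, price, trunk)   // keep only complete cases

corr rep78 price trunk                  // listwise deletion has nothing left to drop
pwcorr rep78 price trunk, obs           // every pair now uses the same observations
```

Both tables should now show the same correlations, and the obs option should report the same count in every cell.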

You can browse additional options for corr and pwcorr by typing help correlate in Stata.
