1.1 Analyzing Categorical Data
b. Frequency Tables
Download the .rmd file, which you can run yourself in your installation of R, here.
In this tutorial, we’re going to review how to make frequency tables in R. We’ll make two kinds of frequency tables – a one-way table (which we already looked at in the previous tutorial) and a two-way table, which will be much more useful.
For this tutorial, we’ll create a dataset of Maryland colleges & universities. We’ll put in a little more information than we need – maybe we can reuse this dataset later!
#creating the dataframe by typing the values in manually
maryland.college.df = data.frame(
name = c("Capitol College", "Notre Dame of Maryland", "Goucher College", "Hood College", "Johns Hopkins University", "Loyola University Maryland", "Maryland Institute College of Art", "McDaniel College", "Morgan State University", "Mount Saint Mary's University", "National Labor College", "Saint John's College-Annapolis", "Saint Mary's College of Maryland", "Saint Mary's Seminary & University", "Stevenson University", "United States Naval Academy", "Bowie State University", "Coppin State University", "Frostburg State University", "Salisbury University", "Towson University", "University of Baltimore", "University of Maryland-Baltimore", "University of Maryland-Baltimore County", "University of Maryland-College Park", "University of Maryland-Eastern Shore", "University of Maryland-University College", "Washington Adventist University", "Washington College"),
type = c("Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Private", "Federal", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Private", "Private"),
size.undergrad = c(441, 1230, 1449, 1387, 6251, 4004, 1778, 1692, 6252, 1741, 1364, 443, 1819, 294, 3847, 4576, 4368, 4000, 4704, 8004, 18779, 3526, 746, 11136, 26658, 3531, 26740, 1011, 1483),
tuition.instate = c(22176, 33100, 40558, 34120, 47060, 44255, 42390, 38350, 7378, 36021, 9528, 47826, 14874, NA, 28980, NA, 6971, 5076, 7982, 8560, 8342, 7838, 9680, 10068, 9427, 7287, 6552, 21395, 42592),
tuition.outstate = c(22176, 33100, 40558, 34120, 47060, 44255, 42390, 38350, 16862, 36021, 9528, 47826, 28674, NA, 28980, NA, 17538, 10010, 19274, 16906, 20020, 16288, 31210, 21642, 29720, 16331, 12336, 21395, 42592)
)
This dataframe is significantly more complicated than the one we used last time, and it took a long time to type out. Don’t worry: in the future, we’ll import datasets from files rather than typing them out, but I wanted you to see that you always have the option of entering the values manually.
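As a preview of that file-based workflow, here is a minimal sketch using base R's write.csv() and read.csv(). The temporary file and the three-row example.df are illustrative assumptions made so the snippet runs on its own; they are not part of this tutorial's dataset.

```r
# A tiny illustrative dataframe (three rows copied from the data above).
example.df = data.frame(
  name = c("Goucher College", "Hood College", "Towson University"),
  type = c("Private", "Private", "Public"),
  size.undergrad = c(1449, 1387, 18779)
)

# Write it to a temporary CSV file so the example is self-contained.
csv.path = tempfile(fileext = ".csv")
write.csv(example.df, csv.path, row.names = FALSE)

# Read the file back in, as you would with a dataset you downloaded.
colleges.from.file = read.csv(csv.path)
colleges.from.file
```

With a real dataset, you would replace csv.path with the location of the file on your own computer.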
Let’s have a look at this dataframe:
maryland.college.df
## name type size.undergrad
## 1 Capitol College Private 441
## 2 Notre Dame of Maryland Private 1230
## 3 Goucher College Private 1449
## 4 Hood College Private 1387
## 5 Johns Hopkins University Private 6251
## 6 Loyola University Maryland Private 4004
## 7 Maryland Institute College of Art Private 1778
## 8 McDaniel College Private 1692
## 9 Morgan State University Private 6252
## 10 Mount Saint Mary's University Private 1741
## 11 National Labor College Private 1364
## 12 Saint John's College-Annapolis Private 443
## 13 Saint Mary's College of Maryland Private 1819
## 14 Saint Mary's Seminary & University Private 294
## 15 Stevenson University Private 3847
## 16 United States Naval Academy Federal 4576
## 17 Bowie State University Public 4368
## 18 Coppin State University Public 4000
## 19 Frostburg State University Public 4704
## 20 Salisbury University Public 8004
## 21 Towson University Public 18779
## 22 University of Baltimore Public 3526
## 23 University of Maryland-Baltimore Public 746
## 24 University of Maryland-Baltimore County Public 11136
## 25 University of Maryland-College Park Public 26658
## 26 University of Maryland-Eastern Shore Public 3531
## 27 University of Maryland-University College Public 26740
## 28 Washington Adventist University Private 1011
## 29 Washington College Private 1483
## tuition.instate tuition.outstate
## 1 22176 22176
## 2 33100 33100
## 3 40558 40558
## 4 34120 34120
## 5 47060 47060
## 6 44255 44255
## 7 42390 42390
## 8 38350 38350
## 9 7378 16862
## 10 36021 36021
## 11 9528 9528
## 12 47826 47826
## 13 14874 28674
## 14 NA NA
## 15 28980 28980
## 16 NA NA
## 17 6971 17538
## 18 5076 10010
## 19 7982 19274
## 20 8560 16906
## 21 8342 20020
## 22 7838 16288
## 23 9680 31210
## 24 10068 21642
## 25 9427 29720
## 26 7287 16331
## 27 6552 12336
## 28 21395 21395
## 29 42592 42592
The first thing that we can do is look at the distribution of type of college using a one-way frequency table. Many states might have two categories here – Maryland has 3, because the Naval Academy is a federal college rather than public.
table(maryland.college.df$type)
##
## Federal Private  Public
##       1      17      11
This is a pretty good starting table, as it presents the frequency of each level of the categorical variable type. However, it doesn’t give us some of the things we’re used to seeing in tables from AP Statistics: for example, it doesn’t give us a total number of observations. We could calculate this ourselves without too much trouble, but it’s also possible to show this (and a heap of other information) automatically using an R package called epiDisplay.
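For reference, calculating the total ourselves in base R might look like the sketch below. It rebuilds the type column as a standalone vector (an assumption made so the snippet runs on its own) and uses the base functions addmargins() and prop.table().

```r
# Standalone copy of the type column: 17 Private, 1 Federal, 11 Public.
type = c(rep("Private", 17), "Federal", rep("Public", 11))

# One-way frequency table, as before.
type.counts = table(type)

# addmargins() appends a "Sum" entry holding the total number of observations.
counts.with.total = addmargins(type.counts)
counts.with.total

# prop.table() converts counts to proportions; multiply by 100 for percentages.
type.percent = round(100 * prop.table(type.counts), 1)
type.percent
```

The percentages should match the Percent column that tab1() reports later in this tutorial.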
R note: R can use functions from downloadable code packages. When you want to use a function from a package, you first have to install it (if you haven’t already done so) and then load it with library() each time you want to use it.
The code below will install the epiDisplay package, but it is currently “commented out” because it has the # symbol in front of it. Just remove the # and run the chunk to install the epiDisplay package!
#install.packages('epiDisplay')
You should only need to run the above code once. The chunk below will use a function within epiDisplay called tab1 that will create a much more detailed table:
library(epiDisplay)
## Warning: package 'epiDisplay' was built under R version 4.1.3
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## Loading required package: nnet
tab1(maryland.college.df$type, graph=FALSE, cum.percent = FALSE)
## maryland.college.df$type :
##         Frequency Percent
## Federal         1     3.4
## Private        17    58.6
## Public         11    37.9
##   Total        29   100.0
Neat! Now we have a table that shows us a total for our type variable and also gives us percentages for each category.
You’ll notice that I included the maryland.college.df$type variable as an input argument to tab1(), but also included two other arguments: graph=FALSE and cum.percent=FALSE.
graph=FALSE prevents tab1() from printing out a bar chart along with its table. You can add a bar chart simply by changing this argument to graph=TRUE.
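If you only want the chart, a comparable bar chart can be drawn without any package. The sketch below uses base R's barplot() on a standalone copy of the type column (an assumption so the snippet runs on its own); the title and axis labels are illustrative choices and will not look identical to the chart tab1() draws.

```r
# Standalone copy of the type column: 17 Private, 1 Federal, 11 Public.
type = c(rep("Private", 17), "Federal", rep("Public", 11))
type.counts = table(type)

# barplot() draws one bar per category and returns the bar midpoints.
bar.midpoints = barplot(type.counts,
                        main = "Maryland colleges by type",
                        xlab = "type",
                        ylab = "Frequency")
```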
Similarly, we can add cumulative percentages across categories by changing the other argument to cum.percent=TRUE. This isn’t super meaningful in this case, but let’s look at what it adds:
library(epiDisplay)
tab1(maryland.college.df$type, graph=FALSE, cum.percent = TRUE)
## maryland.college.df$type :
##         Frequency Percent Cum. percent
## Federal         1     3.4          3.4
## Private        17    58.6         62.1
## Public         11    37.9        100.0
##   Total        29   100.0        100.0
As you can see, cum.percent=TRUE adds a running total of the percentages: each row shows the percentage of observations that fall in that type category or in any of the categories listed above it.
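To see where the cumulative column comes from, the same running total can be reproduced in base R with cumsum(). This is only a sketch on a standalone copy of the type column (an assumption so the snippet runs on its own), not a description of how tab1() computes it internally.

```r
# Standalone copy of the type column: 17 Private, 1 Federal, 11 Public.
type = c(rep("Private", 17), "Federal", rep("Public", 11))

# Percentage of observations in each category (alphabetical order).
type.percent = 100 * prop.table(table(type))

# cumsum() adds each percentage to the sum of the ones before it.
cum.percent = round(cumsum(type.percent), 1)
cum.percent
```

The three values should match the Cum. percent column above: 3.4, 62.1, and 100.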
We can also sort the table in decreasing order of frequency by adding an argument, sort.group = "decreasing".
library(epiDisplay)
tab1(maryland.college.df$type, graph=FALSE, cum.percent = TRUE, sort.group="decreasing")
## maryland.college.df$type :
##         Frequency Percent Cum. percent
## Private        17    58.6         58.6
## Public         11    37.9         96.6
## Federal         1     3.4        100.0
##   Total        29   100.0        100.0
This simply reorders the table so that Private, the largest category within type, is first, followed by Public, the second-largest.
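The same reordering is available for a plain table() result through base R's sort(). The sketch below again uses a standalone copy of the type column, which is an assumption made so the snippet runs on its own.

```r
# Standalone copy of the type column: 17 Private, 1 Federal, 11 Public.
type = c(rep("Private", 17), "Federal", rep("Public", 11))

# sort() with decreasing = TRUE puts the most frequent category first.
sorted.counts = sort(table(type), decreasing = TRUE)
sorted.counts
```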
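Next, we’re going to look at creating a two-way frequency table. Before we do this, though, we’re going to create a new variable, size.category, which equals "Large" if the school has more than 10,000 undergraduates and "Small" if the school has fewer than 10,000 undergraduates. (A school with exactly 10,000 undergraduates would get neither label and would be left as NA, but no school in this dataset has exactly 10,000.)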
This is a good chance to see how you can add a variable to a dataframe:
#if size.undergrad is less than 10,000, set size.category = "Small"
maryland.college.df$size.category[maryland.college.df$size.undergrad < 10000] = "Small"
#if size.undergrad is more than 10,000, set size.category = "Large"
maryland.college.df$size.category[maryland.college.df$size.undergrad > 10000] = "Large"
#print a table summarizing size.category
table(maryland.college.df$size.category)
##
## Large Small
##     4    25
The [] symbols are typically used to select some subset of a variable or dataframe. Here, we used them to select only the rows where size.undergrad meets a condition, and then assigned a label to size.category in just those rows. (If this is too much code for you, don’t worry: you don’t have to do a whole lot of this in R, for the most part!)
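Here is a small sketch of what [] is doing, followed by a one-line alternative using base R's ifelse(). The five-value size.undergrad vector is made up for illustration, and treating exactly 10,000 as "Large" is an assumption of this sketch rather than the rule used above.

```r
# Illustrative enrollments, including one school with exactly 10,000.
size.undergrad = c(441, 6251, 10000, 18779, 26740)

# A condition inside [] keeps only the values where the condition is TRUE.
large.sizes = size.undergrad[size.undergrad > 10000]
large.sizes

# ifelse() labels every value in a single step, so nothing is left as NA.
# Assumption: schools with exactly 10,000 undergraduates count as "Large".
size.category = ifelse(size.undergrad >= 10000, "Large", "Small")
size.category
```

The two-step approach in the tutorial and the ifelse() version give the same labels whenever no value is exactly 10,000.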
Okay, now we’re ready to make a two-way table. Let’s start simple, first!
#basic two-way table using the table() function
table(maryland.college.df$type, maryland.college.df$size.category)
##
##           Large Small
##   Federal     0     1
##   Private     0    17
##   Public      4     7
Sadly, tab1() is only for one-way tables, so it can’t help us here. However, we can install a new package to help us out! Remember: just remove the # from in front of the following line of code and run the chunk once to install the gmodels package.
#install.packages('gmodels')
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.1.3
##
## Attaching package: 'gmodels'
## The following object is masked from 'package:epiDisplay':
##
## ci
CrossTable(maryland.college.df$type, maryland.college.df$size.category,
expected=FALSE,
prop.chisq=FALSE,
prop.t=FALSE,
prop.r=TRUE,
prop.c=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 29
##
##
## | maryland.college.df$size.category
## maryland.college.df$type | Large | Small | Row Total |
## -------------------------|-----------|-----------|-----------|
## Federal | 0 | 1 | 1 |
## | 0.000 | 1.000 | 0.034 |
## -------------------------|-----------|-----------|-----------|
## Private | 0 | 17 | 17 |
## | 0.000 | 1.000 | 0.586 |
## -------------------------|-----------|-----------|-----------|
## Public | 4 | 7 | 11 |
## | 0.364 | 0.636 | 0.379 |
## -------------------------|-----------|-----------|-----------|
## Column Total | 4 | 25 | 29 |
## -------------------------|-----------|-----------|-----------|
##
##
With the current settings, the CrossTable() function takes the two categorical variables and shows us the relative frequencies by row. So, we can see that 100% of Federal colleges are Small and that 63.6% of Public colleges are Small.
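The row relative frequencies can be checked in base R with prop.table() and margin = 1. The sketch below rebuilds the two columns as standalone vectors with the same counts as the table above (an assumption so the snippet runs on its own).

```r
# Standalone vectors matching the counts above:
# 1 Federal (Small), 17 Private (Small), 11 Public (4 Large, 7 Small).
type = c("Federal", rep("Private", 17), rep("Public", 11))
size.category = c("Small", rep("Small", 17), rep("Large", 4), rep("Small", 7))

# Two-way table of counts.
two.way = table(type, size.category)

# margin = 1 divides each count by its row total.
row.props = prop.table(two.way, margin = 1)
round(row.props, 3)
```

Each row of the result sums to 1, and the Public row should show 0.364 and 0.636.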
We can add other relative frequencies to the CrossTable() function by switching other arguments to TRUE. Let’s add all of the relative frequencies now!
library(gmodels)
CrossTable(maryland.college.df$type, maryland.college.df$size.category,
expected=FALSE,
prop.chisq=FALSE,
prop.t=TRUE,
prop.r=TRUE,
prop.c=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 29
##
##
## | maryland.college.df$size.category
## maryland.college.df$type | Large | Small | Row Total |
## -------------------------|-----------|-----------|-----------|
## Federal | 0 | 1 | 1 |
## | 0.000 | 1.000 | 0.034 |
## | 0.000 | 0.040 | |
## | 0.000 | 0.034 | |
## -------------------------|-----------|-----------|-----------|
## Private | 0 | 17 | 17 |
## | 0.000 | 1.000 | 0.586 |
## | 0.000 | 0.680 | |
## | 0.000 | 0.586 | |
## -------------------------|-----------|-----------|-----------|
## Public | 4 | 7 | 11 |
## | 0.364 | 0.636 | 0.379 |
## | 1.000 | 0.280 | |
## | 0.138 | 0.241 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 4 | 25 | 29 |
## | 0.138 | 0.862 | |
## -------------------------|-----------|-----------|-----------|
##
##
Following the order shown under the Cell Contents legend, each cell now contains the number of observations, the conditional relative frequency within the row, the conditional relative frequency within the column, and finally the proportion out of the total number of observations.
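The column and table-total proportions can also be reproduced in base R, which is a handy way to check your reading of the CrossTable() output. As a sketch, the code below rebuilds the two columns as standalone vectors with the same counts (an assumption so the snippet runs on its own).

```r
# Standalone vectors matching the counts above:
# 1 Federal (Small), 17 Private (Small), 11 Public (4 Large, 7 Small).
type = c("Federal", rep("Private", 17), rep("Public", 11))
size.category = c("Small", rep("Small", 17), rep("Large", 4), rep("Small", 7))
two.way = table(type, size.category)

# Counts with row and column totals added (labelled "Sum").
counts.with.totals = addmargins(two.way)
counts.with.totals

# margin = 2 divides each count by its column total.
col.props = prop.table(two.way, margin = 2)
round(col.props, 3)

# With no margin, each count is divided by the table total (29).
total.props = prop.table(two.way)
round(total.props, 3)
```

For example, the Public and Small cell should show 0.280 of its column and 0.241 of the whole table, matching the third and fourth lines of that cell in the output above.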