1.1 Analyzing Categorical Data
a. Total and Conditional Counts
We’ll start with some basic skills, like counting the number of observations in a dataset.
Author’s note: there are a bazillion different ways to get a total or conditional count in R. This is one, fairly simple way for each that (hopefully!) will be accessible for all R users, and doesn’t involve installing any packages.
For now, we’ll create a single variable in-line. Later, when our analyses get a bit more complex, we can import bigger datasets.
#creating the data in-line
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red", "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)
We’ve created a variable called car.colors, which I’ve completely made up, that lists off a bunch of colors of hypothetical cars I observed. I also created a “dataframe” object, car.df, which is the kind of object we’ll often use in R to store variables.
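If you’d like to peek inside the dataframe before doing anything with it, here is a minimal sketch using a few base R functions. The names df.dims and df.names are ones I made up for this illustration, and the data is recreated at the top so the chunk runs on its own.

```r
# Recreate the made-up data so this chunk is self-contained
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red",
               "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

# Show the first three rows of the dataframe
head(car.df, n = 3)

# Dimensions: number of rows (observations), then number of columns (variables)
df.dims = dim(car.df)
df.dims

# Names of the variables stored inside the dataframe
df.names = names(car.df)
df.names
```

With this dataset, the dimensions should come back as 16 rows and 1 column, and the only variable name is car.colors.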
We can quickly get a look at the frequencies of each color using the table() function.
table(car.df$car.colors)
##
##  black   blue   pink    red silver  white yellow
##      3      5      1      3      1      1      2
I want to pause to notice two things that we did in the previous chunk of code:
- We used the table() function, with an input inside the parentheses. Functions in R are similar to math functions, in the sense that they take in an input or multiple inputs (usually called “arguments”) and give back an output.
- We referred to car.colors, a variable within the car.df dataframe, using the dollar-sign operator $. This operator tells R to look inside a particular dataframe (or other object) for the variable we name.
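To make both of those points a bit more concrete, here is a small sketch. It uses sort(), a base R function that takes more than one argument, and the double-bracket operator [[ ]], which is another base R way to pull a variable out of a dataframe. The names color.counts, sorted.counts, and same.variable are just ones I made up for this example.

```r
# Recreate the made-up data so this chunk is self-contained
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red",
               "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

# table() takes one argument here: the variable we want to count
color.counts = table(car.df$car.colors)

# sort() takes two arguments here: the table, plus decreasing = TRUE
# to put the most common color first
sorted.counts = sort(color.counts, decreasing = TRUE)
sorted.counts

# $ and [[ ]] both pull the same variable out of the dataframe
same.variable = identical(car.df$car.colors, car.df[["car.colors"]])
same.variable
```

The sorted table should list blue first, since it is the most common color, and same.variable should be TRUE.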
If we’d like to count the total number of cars in the dataset, we could do the following:
#count the number of rows in the car.df dataset
nrow(car.df)
## [1] 16
Great! Now, if we wanted a conditional count for the number of cars in the dataset that are blue:
#count the number of rows in the car.df dataset where car.colors=="blue"
sum(with(car.df, car.colors=="blue"))
## [1] 5
If you’ve done any work with if statements or coding in general before, the above conditional statement is probably fairly recognizable. If it isn’t, I’ll break it down: our with() function is looking through the car.df dataset, checking each row to see whether the car.colors variable equals "blue", and returning a TRUE anytime it finds one (and a FALSE otherwise). Outside of that, the sum() function treats each TRUE as a 1 and each FALSE as a 0, so it adds up the 5 TRUE values and returns 5.
If you run just the with() part without the sum() function around it, you can see why summing the TRUE values is helpful:
with(car.df, car.colors=="blue")
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [13] TRUE TRUE TRUE FALSE
You can see which cars are and aren’t blue, but you can’t see a total conditional count without the sum() function!
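If the idea of adding up TRUE values still feels a little odd, this sketch shows what is happening behind the scenes. It is only one way to look at it, using the base R functions as.integer() and which(). The names is.blue, blue.as.numbers, blue.rows, and blue.count are made up for this example.

```r
# Recreate the made-up data so this chunk is self-contained
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red",
               "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

# Store the TRUE/FALSE values instead of printing them
is.blue = with(car.df, car.colors == "blue")

# Converting to integers shows the 1s and 0s that sum() adds up
blue.as.numbers = as.integer(is.blue)
blue.as.numbers

# which() reports the positions of the TRUE values (the rows holding blue cars)
blue.rows = which(is.blue)
blue.rows

# The same conditional count, written with $ instead of with()
blue.count = sum(car.df$car.colors == "blue")
blue.count
```

The blue cars sit in rows 4, 11, 13, 14, and 15, which lines up with the TRUE values in the output above, and the count is still 5.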
Finally, we might be interested in calculating the relative frequency of blue cars in our dataset. Instead of plugging the values into a calculator to calculate $\frac{blue}{total}$, we can let R do the math.
blue = sum(with(car.df, car.colors=="blue"))
total = nrow(car.df)
blue/total
## [1] 0.3125
What we just did was:
- Create a variable called blue equal to our previous conditional count for blue cars. Notice that, since we’re just defining a variable for R, it doesn’t print out its value for us here.
- Create a variable called total equal to our previous total count.
- Divide blue/total. Here, since we are not defining a new variable, R does print out the value for us!
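As a cross-check on the relative frequency we just calculated, here are two other base R routes to the same number. This is a sketch rather than the “right” way to do it: mean() works because R treats TRUE as 1 and FALSE as 0, and prop.table() turns a table of counts into proportions. The names blue.prop and color.props are made up for this example.

```r
# Recreate the made-up data so this chunk is self-contained
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red",
               "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

# The mean of a TRUE/FALSE vector is the proportion of TRUE values
blue.prop = mean(car.df$car.colors == "blue")
blue.prop

# prop.table() converts the counts for every color into relative frequencies
color.props = prop.table(table(car.df$car.colors))
color.props
```

Both approaches should agree with blue/total: 5 blue cars out of 16 gives 0.3125.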