1.1 Analyzing Categorical Data

a. Total and Conditional Counts

We’ll start with some basic skills, like counting the number of observations in a dataset.

Author’s note: there are a bazillion different ways to get a total or conditional count in R. What follows is one fairly simple way for each that (hopefully!) will be accessible to all R users and doesn’t involve installing any packages.

For now, we’ll create a single variable in-line. Later, when our analyses get a bit more complex, we can import bigger datasets.

#creating the data in-line
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red", "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

We’ve created a variable called car.colors, which I’ve completely made up, that lists the colors of some hypothetical cars I observed. I also created a “dataframe” object, car.df, which is the kind of object we’ll often use in R to hold our variables.

We can quickly get a look at the frequencies of each color using the table() function.

table(car.df$car.colors)

## 
##  black   blue   pink    red silver  white yellow 
##      3      5      1      3      1      1      2

I want to pause to notice two things that we did in the previous chunk of code:

We used the table() function, with an input inside the parentheses. Functions in R are similar to math functions, in the sense that they take in an input or multiple inputs (usually called “arguments”).
We referred to car.colors, a variable within the car.df dataframe, using the dollar-sign operator $. This operator tells R to look inside a particular dataframe (or other object) and pull out the variable with that name.
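
To make the $ operator a little more concrete, here is a minimal sketch showing that it pulls out the same column as double-bracket indexing does. It rebuilds car.df so that it runs on its own; the names by.dollar and by.bracket are made up purely for this illustration.

```r
# Rebuild the example data so this chunk runs on its own
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red",
               "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

# Two ways to extract the same column from the dataframe
by.dollar = car.df$car.colors         # dollar-sign operator
by.bracket = car.df[["car.colors"]]   # double-bracket indexing by name

# identical() returns TRUE when the two objects are exactly the same
identical(by.dollar, by.bracket)

# Either one can be passed to table() as its argument
table(by.bracket)
```

Both approaches should give the same frequency table as before; the $ form is simply shorter to type.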

If we’d like to count the total number of cars in the dataset, we could do the following:

#count the number of rows in the car.df dataset
nrow(car.df)

## [1] 16

Great! Now, suppose we want a conditional count: the number of cars in the dataset that are blue.

#count the number of rows in the car.df dataset where car.colors=="blue"
sum(with(car.df, car.colors=="blue"))

## [1] 5

If you’ve done any work with if statements or coding in general before, the above conditional statement is probably fairly recognizable. If it isn’t, I’ll break it down: our with() function checks every row of the car.df dataset, returning TRUE when the car.colors variable equals "blue" and FALSE when it doesn’t. Outside of that, the sum() function treats each TRUE as 1 and each FALSE as 0, so adding them up counts the 5 TRUE values and returns 5.

If you run just the with() part without the sum() function around it, you can see why summing the TRUE values is helpful:

with(car.df, car.colors=="blue")

##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [13]  TRUE  TRUE  TRUE FALSE

You can see which cars are and aren’t blue, but you can’t see a total conditional count without the sum() function!
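
As one of those bazillion other ways, the same count can be sketched without with() by using the $ operator directly. This is only an illustrative alternative, not a replacement for the approach above, and the variable name is.blue is made up for the example.

```r
# Rebuild the example data so this chunk runs on its own
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red",
               "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

# One TRUE/FALSE value per row: is this car blue?
is.blue = car.df$car.colors == "blue"

# sum() treats TRUE as 1 and FALSE as 0, so this counts the TRUE values
sum(is.blue)

# which() reports the row positions of the TRUE values
which(is.blue)
```

Here sum(is.blue) should again return 5, and which(is.blue) should point to rows 4, 11, 13, 14, and 15, matching the TRUE positions in the output above.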

Finally, we might be interested in calculating the relative frequency of blue cars in our dataset. Instead of plugging the values into a calculator to compute $\frac{\text{blue}}{\text{total}}$, we can let R do the math.

blue = sum(with(car.df, car.colors=="blue"))
total = nrow(car.df)
blue/total

## [1] 0.3125

What we just did was:

Create a variable called blue equal to our previous conditional count for blue cars. Notice that, since we’re just defining a variable for R, it doesn’t print out its value for us here.
Create a variable called total equal to our previous total count.
Divide blue/total. Here, since we are not defining a new variable, R does print out the value for us!
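
If you’d like the relative frequency of every color at once, one possible shortcut (a sketch, not the only way) uses two other base R functions: mean(), which gives the proportion of TRUE values in a TRUE/FALSE vector, and prop.table(), which converts a table of counts into proportions. The names blue.prop and color.props are made up for this example.

```r
# Rebuild the example data so this chunk runs on its own
car.colors = c("red", "black", "red", "blue", "white", "pink", "yellow", "red",
               "black", "black", "blue", "yellow", "blue", "blue", "blue", "silver")
car.df = data.frame(car.colors)

# The mean of a TRUE/FALSE vector is the proportion of TRUE values,
# which is the same as blue/total
blue.prop = mean(car.df$car.colors == "blue")
blue.prop

# prop.table() divides each count in the table by the total count
color.props = prop.table(table(car.df$car.colors))
color.props
```

The value of blue.prop should match the 0.3125 we calculated by hand, and the proportions in color.props should add up to 1.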