4 dplyr functions for Data Analysis

Diwash Shrestha
Feb 20, 2021


You are probably here because you have already started coding in R and are familiar with terms such as packages and functions, and now you want to learn the code used for data analysis. This post explains what the dplyr functions are and how they can be used for data analysis in R.

dplyr is a package from the tidyverse used for data manipulation and analysis.

Let’s begin,

Note: This post uses mtcars, a dataset that is built into R.

Let’s load the packages and data.

## loading package
library(dplyr)
## loading data
mtcars
glimpse(mtcars)

glimpse() is a dplyr function that shows the dimensions of the dataset and, for each column, its data type and the first few values.

mtcars has 32 rows and 11 columns. The data was extracted from the 1974 Motor Trend US magazine and covers fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

1. select()

select() picks columns from a dataframe; you pass it the names of the columns you want to keep.

mtcars %>% select(cyl)

The “%>%” operator is called a pipe. It forwards a value, or the result of an expression, into the next function call as its first argument. In the code above, the mtcars dataframe is passed to select() using the pipe.
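To make the pipe concrete, here is a small sketch showing that the piped form and the ordinary function call give the same result. The variable names `piped` and `direct` are only for illustration.

```r
library(dplyr)

# Piped form: mtcars becomes the first argument of select()
piped <- mtcars %>% select(cyl)

# Equivalent direct call without the pipe
direct <- select(mtcars, cyl)

# Compare the two results
identical(piped, direct)
```

The pipe mostly pays off when several steps are chained, because the code then reads top to bottom instead of inside out.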

We can select multiple columns from a dataframe by passing the column names separated by commas.

mtcars %>% select(cyl, disp, mpg)

If a dataframe has a large number of columns and we want to keep everything except a particular column or group of columns, we can put the “!” operator just before the column name or group of columns.

mtcars %>% select(!cyl)
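To drop a group of columns rather than a single one, the names can be wrapped in c() before negating. This is a minimal sketch, assuming dplyr 1.0.0 or later (where “!” is supported inside select()); the name `without_engine` is just an illustrative choice.

```r
library(dplyr)

# Drop a group of columns by wrapping their names in c() and negating
without_engine <- mtcars %>% select(!c(cyl, disp, hp))

# The original mtcars has 11 columns, so 8 remain after dropping 3
ncol(without_engine)
```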

2. mutate()

In most data analysis projects, we need to derive new data from existing data, for instance computing age from date of birth. mutate() creates new columns and adds them to the given dataframe.

In the mtcars data frame, the weight of each car (the wt column) is given in units of 1000 lbs, following US customary units.

Let’s create a new column that holds the weight in kg using the mutate() function.

# wt is in units of 1000 lbs; 1000 lbs = 453.592 kg
mtcars <- mtcars %>% mutate(wt_kg = wt * 453.592)
mtcars
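mutate() can also create several columns in one call, and a later column may refer to one defined just before it. The sketch below illustrates this; `cars_metric` and `wt_tonne` are hypothetical names introduced only for this example.

```r
library(dplyr)

# Create two columns in one mutate() call; wt_tonne reuses wt_kg
cars_metric <- mtcars %>%
  mutate(
    wt_kg    = wt * 453.592,  # wt is recorded in units of 1000 lbs
    wt_tonne = wt_kg / 1000   # 1 tonne = 1000 kg
  )

# Look at the original and derived weight columns side by side
cars_metric %>% select(wt, wt_kg, wt_tonne) %>% head(3)
```

Assigning the result to a new name leaves the built-in mtcars untouched, which can be handy while experimenting.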

3. filter()

As the name suggests, filter() keeps only the rows of a dataframe that satisfy a condition, written with comparison operators such as equal to “==” or greater than “>”.

Let’s find the cars that weigh less than 1000 kg.

mtcars %>% filter(wt_kg < 1000)

The result shows that 6 cars weigh less than 1000 kg.
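Rather than counting rows by eye, the number of matches can be checked with nrow(). The following is a small self-contained sketch; it recomputes wt_kg so it runs on the unmodified mtcars, and `light_cars` is an illustrative name.

```r
library(dplyr)

# Recreate the weight in kg, then keep cars under 1000 kg
light_cars <- mtcars %>%
  mutate(wt_kg = wt * 453.592) %>%
  filter(wt_kg < 1000)

# Number of cars that satisfy the condition
nrow(light_cars)
```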

We can use the and “&” and or “|” operators to apply several conditions at the same time. For example, we can find the cars that weigh more than 1000 kg and have 6-cylinder engines.

mtcars %>%
  filter(wt_kg > 1000 & cyl == 6)

filter() is very useful whenever we want to subset rows based on column values.
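The or “|” operator works the same way as “&”. As a brief sketch, the code below keeps cars with either 4 or 6 cylinders, and also shows the base R `%in%` operator as a more compact way to write the same condition. The variable names are illustrative.

```r
library(dplyr)

# Keep cars that have either 4 or 6 cylinders
small_engines <- mtcars %>% filter(cyl == 4 | cyl == 6)

# The same condition, written more compactly with %in%
small_engines_in <- mtcars %>% filter(cyl %in% c(4, 6))

# Both versions select the same number of rows
nrow(small_engines)
nrow(small_engines_in)
```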

4. summarise()

In descriptive data analysis, we often need the average, sum or count of a column. summarise() collapses a dataframe into summary values using functions such as mean() and sum(). Inside summarise(), we call the summary function on the column we want to summarise and give the result a name.

Let’s find the average mpg for the given cars.

mtcars %>%
  summarise(avg_mpg = mean(mpg))

We can compute more than one summary value in a single call by separating the expressions with commas.

mtcars %>%
  summarise(avg_mpg = mean(mpg), count = n())

Using the group_by() function, we can compute a summary of one column separately for each value of another column. For example, we can find the average mpg for each value of the cyl variable.

# group mtcars based on cyl column
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
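Grouping and multiple summaries can be combined. The sketch below computes both the average mpg and the number of cars for each cyl value; `by_cyl` is an illustrative name.

```r
library(dplyr)

# One row per cyl value, with two summary columns
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),  # mean fuel economy within each group
    count   = n()         # number of cars in each group
  )

by_cyl
```

Including the count alongside the mean helps show how many cars each average is based on.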

Conclusion:

In this article, I covered four dplyr functions, select(), mutate(), filter() and summarise(), and how to use them for data analysis and manipulation.
