Introduction
dformula allows to easily modify, transform, add and extrapolate using the basic R formula. The operations on the data are the following:
Operation | Function |
---|---|
Add new variables | add() |
Transform existing variables | transform() |
Rename existing variables | rename() |
Selection rows and columns | select() |
Removing row and column | remove |
The formula is composed of two part:
the right-hand side shows the names of the columns of the data and the left-hand side the transformation or the new variables to insert in the data.
The I()
is used in the right-hand side to indicate the
type of transformation of the existing variable. In this function, we
can insert logical statement, function implemented in R or user build
function.
For example:
$$ var\_name_1 + var\_name_2 \sim I(log(var\_name_1)) + I(var\_name_2 == "something") $$
the two variable and are transformed in or selected to be equal to $"something"$.
In the same fashion of SQL, we have the from
argument,
the input data, and the as
argument, the new name of the
variables, after transformation, selection or addition.
The CRAN version can be loaded
or the development version from GitHub:
remotes::install_github('serafinialessio/dformula')
The data are available in the package will be used in this overview
data("population_data")
pop_data <- population_data
which describes the Population and Area of world countries
str(pop_data)
## 'data.frame': 159 obs. of 3 variables:
## $ Country : chr "China" "India" "United States" "Indonesia" ...
## $ Population: num 1.44e+09 1.38e+09 3.31e+08 2.74e+08 2.21e+08 ...
## $ Area : num 9388211 2973190 9147420 1811570 771 ...
Adding
variables
The add()
function inserts new variables starting from
the existing columns in the data.
Suppose we want to calculate population density and attach this to the original dataset
new_pop <- add(from = pop_data, formula = ~ I(Population / Area))
head(new_pop)
## Country Population Area Var.1
## 1 China 1439323776 9388211.00 153.31183
## 2 India 1380004385 2973190.00 464.14941
## 3 United States 331002651 9147420.00 36.18536
## 4 Indonesia 273523615 1811570.00 150.98705
## 5 Pakistan 220892340 770.88 286545.68804
## 6 Brazil 212559417 8358140.00 25.43143
and give a name to this new variable
new_pop <- add(from = pop_data, formula = ~ I(Population / Area), as = "pop_density")
head(new_pop)
## Country Population Area pop_density
## 1 China 1439323776 9388211.00 153.31183
## 2 India 1380004385 2973190.00 464.14941
## 3 United States 331002651 9147420.00 36.18536
## 4 Indonesia 273523615 1811570.00 150.98705
## 5 Pakistan 220892340 770.88 286545.68804
## 6 Brazil 212559417 8358140.00 25.43143
Multiple variable can be added with a single formula
new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)))
head(new_pop)
## Country Population Area Var.1 Var.2
## 1 China 1439323776 9388211.00 153.31183 16.054965
## 2 India 1380004385 2973190.00 464.14941 14.905146
## 3 United States 331002651 9147420.00 36.18536 16.028982
## 4 Indonesia 273523615 1811570.00 150.98705 14.409704
## 5 Pakistan 220892340 770.88 286545.68804 6.647533
## 6 Brazil 212559417 8358140.00 25.43143 15.938746
and with new names
new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)),
as = c("pop_density", "log_area"))
head(new_pop)
## Country Population Area pop_density log_area
## 1 China 1439323776 9388211.00 153.31183 16.054965
## 2 India 1380004385 2973190.00 464.14941 14.905146
## 3 United States 331002651 9147420.00 36.18536 16.028982
## 4 Indonesia 273523615 1811570.00 150.98705 14.409704
## 5 Pakistan 220892340 770.88 286545.68804 6.647533
## 6 Brazil 212559417 8358140.00 25.43143 15.938746
If we have one transformation applied to a group of variables, we do not specify the function multiple times
new_pop <- add(from = pop_data, formula = Population + Area ~ log())
head(new_pop)
## Country Population Area Var.1 Var.2
## 1 China 1439323776 9388211.00 21.08744 16.054965
## 2 India 1380004385 2973190.00 21.04535 14.905146
## 3 United States 331002651 9147420.00 19.61764 16.028982
## 4 Indonesia 273523615 1811570.00 19.42690 14.409704
## 5 Pakistan 220892340 770.88 19.21319 6.647533
## 6 Brazil 212559417 8358140.00 19.17473 15.938746
and with new column names
new_pop <- add(from = pop_data, formula = Population + Area ~ log(),
as = c("log_pop", "log_area"))
head(new_pop)
## Country Population Area log_pop log_area
## 1 China 1439323776 9388211.00 21.08744 16.054965
## 2 India 1380004385 2973190.00 21.04535 14.905146
## 3 United States 331002651 9147420.00 19.61764 16.028982
## 4 Indonesia 273523615 1811570.00 19.42690 14.409704
## 5 Pakistan 220892340 770.88 19.21319 6.647533
## 6 Brazil 212559417 8358140.00 19.17473 15.938746
Suppose we want to add a numerical id for the
countries at the beginning of the dataset, using the
position
argument
new_pop <- add(from = pop_data,
formula = ~ I(1:nrow(new_pop)),
position = "left", as = "id")
head(new_pop)
## id Country Population Area
## 1 1 China 1439323776 9388211.00
## 2 2 India 1380004385 2973190.00
## 3 3 United States 331002651 9147420.00
## 4 4 Indonesia 273523615 1811570.00
## 5 5 Pakistan 220892340 770.88
## 6 6 Brazil 212559417 8358140.00
We can also add a constant variable. For example the year of the observation
new_pop <- add(from = pop_data, formula = ~ C("2020"), position = "left")
head(new_pop)
## Var.1 Var.2 Var.3 Country Population Area
## 1 2020 2020 2020 China 1439323776 9388211.00
## 2 2020 2020 2020 India 1380004385 2973190.00
## 3 2020 2020 2020 United States 331002651 9147420.00
## 4 2020 2020 2020 Indonesia 273523615 1811570.00
## 5 2020 2020 2020 Pakistan 220892340 770.88
## 6 2020 2020 2020 Brazil 212559417 8358140.00
or both
new_pop <- add(from = pop_data,
formula = ~ I(1:nrow(new_pop)) + C("2020"),
position = "left", as = c("ids", "year"))
head(new_pop)
## ids year Country Population Area
## 1 1 2020 China 1439323776 9388211.00
## 2 2 2020 India 1380004385 2973190.00
## 3 3 2020 United States 331002651 9147420.00
## 4 4 2020 Indonesia 273523615 1811570.00
## 5 5 2020 Pakistan 220892340 770.88
## 6 6 2020 Brazil 212559417 8358140.00
The C()
construct add a constant for all the rows
We can be interested in having a dummy variable, i.e. a variable equal to if some event happen or otherwise. For example, we suppose to build a dummy variables with the most populated countries. In this we suppose countries with more than million of people.
new_pop <- add(from = pop_data, formula = ~ I(Population > 100000000))
head(new_pop)
## Country Population Area Var.1
## 1 China 1439323776 9388211.00 1
## 2 India 1380004385 2973190.00 1
## 3 United States 331002651 9147420.00 1
## 4 Indonesia 273523615 1811570.00 1
## 5 Pakistan 220892340 770.88 1
## 6 Brazil 212559417 8358140.00 1
or two variables one with the most populated countries and the other with the biggest extended countries
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000) + I(Area > 8000000))
head(new_pop)
## Country Population Area Var.1 Var.2
## 1 China 1439323776 9388211.00 1 1
## 2 India 1380004385 2973190.00 1 0
## 3 United States 331002651 9147420.00 1 1
## 4 Indonesia 273523615 1811570.00 1 0
## 5 Pakistan 220892340 770.88 1 0
## 6 Brazil 212559417 8358140.00 1 1
or a variable indicating the most populated and the biggest countries togheter
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)
## Country Population Area Var.1
## 1 China 1439323776 9388211.00 1
## 2 India 1380004385 2973190.00 0
## 3 United States 331002651 9147420.00 1
## 4 Indonesia 273523615 1811570.00 0
## 5 Pakistan 220892340 770.88 0
## 6 Brazil 212559417 8358140.00 1
If we want obtain a boolean vector, as an interrogation, setting to
TRUE
the argument logic_convert
the function
will return a boolean vector
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000),
logic_convert = FALSE, as = "most_populated")
head(new_pop)
## Country Population Area most_populated
## 1 China 1439323776 9388211.00 TRUE
## 2 India 1380004385 2973190.00 TRUE
## 3 United States 331002651 9147420.00 TRUE
## 4 Indonesia 273523615 1811570.00 TRUE
## 5 Pakistan 220892340 770.88 TRUE
## 6 Brazil 212559417 8358140.00 TRUE
Transform
variables
The transform()
function modifies existing variables in
the dataset.
Suppose we want to change the scale on the Population
new_pop <- transform(from = pop_data,
formula = Population ~ I(Population/10000))
head(new_pop)
## Country Population Area
## 1 China 143932.38 9388211.00
## 2 India 138000.44 2973190.00
## 3 United States 33100.27 9147420.00
## 4 Indonesia 27352.36 1811570.00
## 5 Pakistan 22089.23 770.88
## 6 Brazil 21255.94 8358140.00
or we want a logarithmic transformation, renaming the variable
new_pop <- transform(from = pop_data,
formula = Population ~ I(log(Population)),
as = "log_pop")
head(new_pop)
##
## 1 China 21.08744 9388211.00
## 2 India 21.04535 2973190.00
## 3 United States 19.61764 9147420.00
## 4 Indonesia 19.42690 1811570.00
## 5 Pakistan 19.21319 770.88
## 6 Brazil 19.17473 8358140.00
With a single formula multiple variables can be transformed, as showed before.
new_pop <- transform(from = pop_data,
formula = Population + Area~ I(log()))
head(new_pop)
## Country Population Area
## 1 China 21.08744 16.054965
## 2 India 21.04535 14.905146
## 3 United States 19.61764 16.028982
## 4 Indonesia 19.42690 14.409704
## 5 Pakistan 19.21319 6.647533
## 6 Brazil 19.17473 15.938746
We can also transformed multiple variables with multiple transformations
new_pop <- transform(from = pop_data,
formula = Population + Area ~ I(Population > 100000000) + I(log(Area)))
head(new_pop)
## Country Population Area
## 1 China 1 16.054965
## 2 India 1 14.905146
## 3 United States 1 16.028982
## 4 Indonesia 1 14.409704
## 5 Pakistan 1 6.647533
## 6 Brazil 1 15.938746
Rename
variables
The rename()
function may be used to change names of
existing variables, for example
new_pop <- rename(from = pop_data, formula = Population ~ pop )
head(new_pop)
## Country pop Area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00
or multiple variables
new_pop <- rename(from = pop_data, formula = Population + Area ~ pop + area)
head(new_pop)
## Country pop area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00
Select
variables and rows
In the same fashion of SQL, the select()
function first
select the rows, given a statement, and then shows the select
variables.
The first part of the formula are the columns to select, as the previous functions, and the right-hand side of the formula, the condition part, will select the rows.
Suppose to want to select only the most populated countries
new_pop <- select(from = pop_data,
formula = . ~ I(Population > 100000000))
head(new_pop)
## Country Population Area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00
you can also add .
to returns all variables instead of
nothing.
We want only the name of the most populated countries
new_pop <- select(from = pop_data,
formula = Country ~ I(Population > 100000000))
head(new_pop)
## Country
## 1 China
## 2 India
## 3 United States
## 4 Indonesia
## 5 Pakistan
## 6 Brazil
We might be interest in only the most populated and biggest countries
new_pop <- select(from = pop_data,
formula = . ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)
## Country Population Area
## 1 China 1439323776 9388211
## 3 United States 331002651 9147420
## 6 Brazil 212559417 8358140
## 9 Russia 145934462 16376870
or both
new_pop <- select(from = pop_data,
formula = ~ I(Population > 100000000 | Area > 8000000))
head(new_pop)
## Country Population Area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00
by selecting only the names
Remove
variables
The remove()
function has the same syntax of
select()
function, but now the rows and columns will be
removed.
Handling Missing Values
In all the functions, except for rename
, the argument
na.remove
will remove all the rows with missing values,
after adding, transforming or selecting the rows.
The remove
function, can be employed to remove all the
rows with at least a missing observation,
data("airquality")
dt <- airquality
dt_new <- remove(from = dt,formula = .~., na.remove = TRUE)
head(dt_new)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
If we are interested to focus on the observation with missing values,
the na.return = TRUE
arguments of select
function will return only the incomplete rows after the selection