Chapter 4 Data manipulation
4.1 apply functions
This chapter is adapted based on the tutorial “apply(), lapply(), sapply(), tapply() Function in R with Examples” and RDocumentation apply functions.
4.1.1 apply()
apply()
takes Data frame or matrix as an input and gives output in vector, list or array. Apply function is used for matrix, it is designed to avoid explicit uses of loop constructs.
apply(X, MARGIN, FUN)
Here:
-x: an array or matrix
-MARGIN: take a value or range between 1 and 2 to define where to apply the function:
-MARGIN=1: the manipulation is performed on rows
-MARGIN=2: the manipulation is performed on columns
-MARGIN=c(1,2) the manipulation is performed on rows and columns
-FUN: tells which function to apply. Built functions like mean, median, sum, min, max and even user-defined functions can be applied>
Example1:
<- matrix(C<-(1:6),nrow=2, ncol=3)
m1
m1<- apply(m1, 2, sum)
m1_colsum m1_colsum
If you want to perform the function in each cell, you can set the margin to 1:2
or c(1,2)
apply(m1, 1:2, function(x) x+3)
Example2: a function to find standard error was created, then passed into an apply function to find the standard error of each column.
<- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)
m2
<- function(x){
st.err sd(x)/sqrt(length(x))
}
apply(m2, 2, st.err)
4.1.2 lapply()
lapply()
in R takes list, vector or data frame as input and gives output in list.
lapply(X, FUN)
Arguments:
-X: A vector or an object
-FUN: Function applied to each element of x
Example:
<-c(1:9)
A<-c(1:12)
B<-c(1:15)
C<-list(A,B,C)
mylistlapply(mylist, sum)
4.1.3 sapply()
sapply()
works just like lapply, but will simplify the output if possible. This means that instead of returning a list like lapply, it will return a vector instead.
Example:
sapply(mylist, sum)
4.1.4 tapply()
Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
Example1:
<- as.data.frame(cbind(c(1,1,1,1,1,2,2,2,2,2), m2))
tdata tapply(tdata$V2, tdata$V1, mean)
Example2:
<- tapply(tdata$V2, tdata$V1, function(x) c(mean(x), sd(x))) summary
4.1.5 mapply()
mapply()
is a multivariate version of sapply ()
, the syntax of mapply is a little bit different.
mapply(fun, arg1, arg2, arg3, …)
Example:
mapply('*', 1:10, 5:14)
4.2 Pattern matching
4.2.1 Find strings
grep(pattern, string)
returns by default a vector of indices.
grepl(pattern, string)
returns by default a vector of logical values.
In grep()
, if you want to have the actual matching element values, set the option value to TRUE by value=TRUE. Pattern matching in R defaults to be case sensitive. You can use ignore.case = TRUE to avoid case sensitivivity.
<- c("Abcd", "cdab", "ca bd")
strings grep("ab", strings)
grepl("ab", strings)
grep("ab", strings, value = FALSE)
grep("ab", strings, value = TRUE)
grep("ab", strings, value = TRUE,ignore.case = TRUE)
4.2.2 Find and replace patterns
gsub(pattern, replacement, string)
returns the modified string after replacing every pattern occurrence with replacement in string.
<- c("u1000", "u1001", "u1002")
studentID gsub("u", "U", studentID)
4.2.3 Regular expression
The content in this chapter is adpted based on “Regular Expressions in R” and CRAN documentation “Regular expressions”.
Basic Quantifiers specify how many times that the preceding pattern should occur.
*
: matches at least 0 times.+
: matches at least 1 times.?
: matches at most 1 times.{n}
: matches exactly n times.{n,}
: matches at least n times.{,m}
: matches at most m times.{n,m}
: matches between n and m times.
<- c("a", "ab", "acb", "accb", "acccb", "accccb")
strings grep("ac*b", strings, value = TRUE)
grep("ac+b", strings, value = TRUE)
grep("ac?b", strings, value = TRUE)
grep("ac{2}b", strings, value = TRUE)
grep("ac{2,}b", strings, value = TRUE)
grep("ac{2,3}b", strings, value = TRUE)
^
: Start of the string.$
: End of the string.\\<
: Beginning of a word\\>
: End of a word
<- c("abcd", "cdab", "cabd", "c abd", "abc","*ab")
strings grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
grep("\\<a", strings, value = TRUE)
grep("c\\>", strings, value = TRUE)
.
: Any non-empty character[...]
: a permitted character list. Use-
inside the brackets to specify a range of characters.[^...]
: an excluded character list. Match any characters except those inside the square brackets.|
: an OR operator, matches patterns on either side of the |.\
: Escape metacharacters in regular expression, i.e.
$ * + . ? [ ] ^ { } | ( ) \`
<- c("^ab", "ab", "abc", "abd", " abc d", "abe", "ab 12")
strings grep("ab.", strings, value = TRUE)
grep("ab[c-e]", strings, value = TRUE)
grep("ab[^c]", strings, value = TRUE)
grep("\\^ab", strings, value = TRUE)
grep("abc|abd", strings, value = TRUE)
Advance
[[:digit:]]
or\\d
or[0-9]
: digits 0 1 2 3 4 5 6 7 8 9\\D
or[^0-9]
: non-digits[[:lower:]]
or[a-z]
: lower-case letters[[:upper:]]
or[A-Z]
: upper-case letters[[:alpha:]]
or[A-z]
: alphabetic characters[[:alnum:]]
or[A-z0-9]
: alphanumeric characters\\w
: word characters include alphanumeric characters0-9,a-z,A-Z
,-
and underscores_
\\W
: non-word characters[[:blank:]]
: space and tab[[:space:]]
or\\s
: space characters: tab, newline, vertical tab, form feed, carriage return, space\\S
: not space characters[[:punct:]]
: punctuation characters
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _
{ | } ~`
[[:cntrl:]]
or\\c
: control characters, like\n
or\r
etc.
Regular expressions are a concise and flexible tool for describing patterns in strings. If you are interested, there are more examples in the tutorial provided by Cran. https://cran.rstudio.com/web/packages/stringr/vignettes/regular-expressions.html
4.3 Example
This exercise demonstrates how to use apply and tapply to summarize information from a table. More information of apply functions can be found in 3.1.
Using apply()
to get the mean and sd on every column
<- apply(mtcars[,1:7], 2, mean)
mtcars.mean <- apply(mtcars[,1:7], 2, sd) mtcars.sd
Using tapply()
to get the maxmimum mpg per vs type
<- tapply(mtcars$mpg, mtcars$vs, max) mtcars.max.vs
Filtering
Using grep()
to select names with Merc
grep("Merc", rownames(mtcars)),] mtcars[
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
combining tapply()
and apply()
.
apply(mtcars[,1:7],2,function(x) tapply(x, mtcars$gear, sum))
## mpg cyl disp hp drat wt qsec
## 3 241.6 112 4894.5 2642 46.99 58.389 265.38
## 4 294.4 56 1476.2 1074 48.52 31.400 227.58
## 5 106.9 30 1012.4 978 19.58 13.163 78.20
Using filter to select disp larger than 200.
$disp > 200,] mtcars[mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
4.4 Exercise
Load dataset mpg
library(ggplot2)
data("mpg")
Quiz
Q1 How many different car manufacturers are present?
Tip: Use functions: length() and unique()
Q2 How many SUV’s (class) does each car manufacturer have?
Tip: Use table()
Q3 Create a new data.table with only automatic transmission type (trans) and 4 cylinders (cyl)
Tip: Use grepl()
Q4 Create a new column called trans_subtype, with the transmission subtype information (the value inside the brackets)
Tip: Use gsub
Q5 For toyota’s “toyota tacoma 4wd” model, rename them to only “tacoma 4wd”
Tip: Try use gsub
Q6 Create a new column called ID, with the first 3 characters of the manufacturer name capitalised with year of manufacturer separated by a _
Tip: Use toupper() and substr() and paste0() E.g. audi 1999 --> AUD_1999
Q7 How many different model names have number(s) in their name for each manufacturer?
Tips: Use table() and grepl()
Q8 Find the average city mileage (cty) for each manufacturer type
Tip: use sapply
Q9 What is the average city mileage (cty) for models with number(s) in the name for each manufacturer?
Q10 Create a new data frame called mpg_summary with the following columns: - (A) Manufacturer name - (B) number of different models for that manufacturer - (C) average displ - (D) manufacturing range (e.g. 1999,2000,2003 –> 1999-2003) - (E) average cyl - (F) most frequent transmission type (if there is a tie, print out all separated by |) - (G) least frequent drv (if there is a tie, print out all separated by |) - (H) Max cty - (I) Min Hwy - (J) All fuel types (fl). Separate by | (E.g. f,r,e –> “f|r|e”) - (K) Class with the longest name - (L) Transition sub-type with the largest number (if there is tie, print out all separated by |)
Tips: Use this to set up your new dataframe:
mpg_summary <- data.frame(A=character(), B=character(), C=character(), D=character(), E=character(), F=character(), G=character(), H=character(),
I=character(), J=character(), K=character(), L=character())
Q11 Write a function that takes in a manufacturer name and print out the highway mileage (hwy) range for each car model