Chapter 4 Data manipulation

4.1 apply functions

This chapter is adapted based on the tutorial “apply(), lapply(), sapply(), tapply() Function in R with Examples” and RDocumentation apply functions.

4.1.1 apply()

apply() takes Data frame or matrix as an input and gives output in vector, list or array. Apply function is used for matrix, it is designed to avoid explicit uses of loop constructs.

apply(X, MARGIN, FUN)
Here:
-x: an array or matrix
-MARGIN:  take a value or range between 1 and 2 to define where to apply the function:
-MARGIN=1: the manipulation is performed on rows
-MARGIN=2: the manipulation is performed on columns
-MARGIN=c(1,2) the manipulation is performed on rows and columns
-FUN: tells which function to apply. Built functions like mean, median, sum, min, max and even user-defined functions can be applied>

Example1:

m1 <- matrix(C<-(1:6),nrow=2, ncol=3)
m1
m1_colsum <- apply(m1, 2, sum)
m1_colsum 

If you want to perform the function in each cell, you can set the margin to 1:2 or c(1,2)

apply(m1, 1:2, function(x) x+3)

Example2: a function to find standard error was created, then passed into an apply function to find the standard error of each column.

m2 <- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)

st.err <- function(x){
  sd(x)/sqrt(length(x))
}

apply(m2, 2, st.err)

4.1.2 lapply()

lapply() in R takes list, vector or data frame as input and gives output in list.

lapply(X, FUN)
Arguments:
-X: A vector or an object
-FUN: Function applied to each element of x

Example:

A<-c(1:9)
B<-c(1:12)
C<-c(1:15)
mylist<-list(A,B,C)
lapply(mylist, sum)

4.1.3 sapply()

sapply() works just like lapply, but will simplify the output if possible. This means that instead of returning a list like lapply, it will return a vector instead. Example:

sapply(mylist, sum)

4.1.4 tapply()

Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

Example1:

tdata <- as.data.frame(cbind(c(1,1,1,1,1,2,2,2,2,2), m2))
tapply(tdata$V2, tdata$V1, mean)

Example2:

summary <- tapply(tdata$V2, tdata$V1, function(x) c(mean(x), sd(x)))

4.1.5 mapply()

mapply() is a multivariate version of sapply (), the syntax of mapply is a little bit different.

mapply(fun, arg1, arg2, arg3, …)

Example:

mapply('*', 1:10, 5:14)

4.2 Pattern matching

4.2.1 Find strings

grep(pattern, string) returns by default a vector of indices.

grepl(pattern, string) returns by default a vector of logical values. In grep(), if you want to have the actual matching element values, set the option value to TRUE by value=TRUE. Pattern matching in R defaults to be case sensitive. You can use ignore.case = TRUE to avoid case sensitivivity.

strings <- c("Abcd", "cdab", "ca bd")
grep("ab", strings)
grepl("ab", strings)
grep("ab", strings, value = FALSE)
grep("ab", strings, value = TRUE)
grep("ab", strings, value = TRUE,ignore.case = TRUE)

4.2.2 Find and replace patterns

gsub(pattern, replacement, string) returns the modified string after replacing every pattern occurrence with replacement in string.

studentID <- c("u1000", "u1001", "u1002")
gsub("u", "U", studentID)

4.2.3 Regular expression

The content in this chapter is adpted based on “Regular Expressions in R” and CRAN documentation “Regular expressions”.

Basic Quantifiers specify how many times that the preceding pattern should occur.

  • * : matches at least 0 times.
  • + : matches at least 1 times.
  • ? : matches at most 1 times.
  • {n} : matches exactly n times.
  • {n,} : matches at least n times.
  • {,m} : matches at most m times.
  • {n,m} : matches between n and m times.
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
grep("ac*b", strings, value = TRUE)
grep("ac+b", strings, value = TRUE)
grep("ac?b", strings, value = TRUE)
grep("ac{2}b", strings, value = TRUE)
grep("ac{2,}b", strings, value = TRUE)
grep("ac{2,3}b", strings, value = TRUE)
  • ^ : Start of the string.
  • $ : End of the string.
  • \\< : Beginning of a word
  • \\> : End of a word
strings <- c("abcd", "cdab", "cabd", "c abd", "abc","*ab")
grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
grep("\\<a", strings, value = TRUE)
grep("c\\>", strings, value = TRUE)
  • . : Any non-empty character

  • [...] : a permitted character list. Use - inside the brackets to specify a range of characters.

  • [^...] : an excluded character list. Match any characters except those inside the square brackets.

  • | : an OR operator, matches patterns on either side of the |.

  • \: Escape metacharacters in regular expression, i.e.

$ * + . ? [ ] ^ { } | ( ) \`
strings <- c("^ab", "ab", "abc", "abd", " abc d", "abe", "ab 12")
grep("ab.", strings, value = TRUE)
grep("ab[c-e]", strings, value = TRUE)
grep("ab[^c]", strings, value = TRUE)
grep("\\^ab", strings, value = TRUE)
grep("abc|abd", strings, value = TRUE)

Advance

  • [[:digit:]] or \\d or [0-9] : digits 0 1 2 3 4 5 6 7 8 9

  • \\D or [^0-9] : non-digits

  • [[:lower:]] or [a-z] : lower-case letters

  • [[:upper:]] or [A-Z] : upper-case letters

  • [[:alpha:]] or [A-z] : alphabetic characters

  • [[:alnum:]] or [A-z0-9] : alphanumeric characters

  • \\w : word characters include alphanumeric characters 0-9,a-z,A-Z, - and underscores _

  • \\W : non-word characters

  • [[:blank:]] : space and tab

  • [[:space:]] or \\s : space characters: tab, newline, vertical tab, form feed, carriage return, space

  • \\S : not space characters

  • [[:punct:]] : punctuation characters

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ ] ^ _ { | } ~`

  • [[:cntrl:]] or \\c : control characters, like \n or \r etc.

Regular expressions are a concise and flexible tool for describing patterns in strings. If you are interested, there are more examples in the tutorial provided by Cran. https://cran.rstudio.com/web/packages/stringr/vignettes/regular-expressions.html

4.3 Example

This exercise demonstrates how to use apply and tapply to summarize information from a table. More information of apply functions can be found in 3.1.

Using apply() to get the mean and sd on every column

mtcars.mean <- apply(mtcars[,1:7], 2, mean)
mtcars.sd <- apply(mtcars[,1:7], 2, sd)

Using tapply() to get the maxmimum mpg per vs type

mtcars.max.vs <- tapply(mtcars$mpg, mtcars$vs, max)

Filtering

Using grep() to select names with Merc

mtcars[grep("Merc", rownames(mtcars)),]
##              mpg cyl  disp  hp drat   wt qsec vs am gear carb
## Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
## Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
## Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
## Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
## Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
## Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
## Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3

combining tapply() and apply().

apply(mtcars[,1:7],2,function(x) tapply(x, mtcars$gear, sum))
##     mpg cyl   disp   hp  drat     wt   qsec
## 3 241.6 112 4894.5 2642 46.99 58.389 265.38
## 4 294.4  56 1476.2 1074 48.52 31.400 227.58
## 5 106.9  30 1012.4  978 19.58 13.163  78.20

Using filter to select disp larger than 200.

mtcars[mtcars$disp > 200,]
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

4.4 Exercise

Load dataset mpg

library(ggplot2)
data("mpg")

Quiz

Q1 How many different car manufacturers are present? Tip: Use functions: length() and unique()

Q2 How many SUV’s (class) does each car manufacturer have? Tip: Use table()

Q3 Create a new data.table with only automatic transmission type (trans) and 4 cylinders (cyl) Tip: Use grepl()

Q4 Create a new column called trans_subtype, with the transmission subtype information (the value inside the brackets) Tip: Use gsub

Q5 For toyota’s “toyota tacoma 4wd” model, rename them to only “tacoma 4wd” Tip: Try use gsub

Q6 Create a new column called ID, with the first 3 characters of the manufacturer name capitalised with year of manufacturer separated by a _ Tip: Use toupper() and substr() and paste0() E.g. audi 1999 --> AUD_1999

Q7 How many different model names have number(s) in their name for each manufacturer? Tips: Use table() and grepl()

Q8 Find the average city mileage (cty) for each manufacturer type Tip: use sapply

Q9 What is the average city mileage (cty) for models with number(s) in the name for each manufacturer?

Q10 Create a new data frame called mpg_summary with the following columns: - (A) Manufacturer name - (B) number of different models for that manufacturer - (C) average displ - (D) manufacturing range (e.g. 1999,2000,2003 –> 1999-2003) - (E) average cyl - (F) most frequent transmission type (if there is a tie, print out all separated by |) - (G) least frequent drv (if there is a tie, print out all separated by |) - (H) Max cty - (I) Min Hwy - (J) All fuel types (fl). Separate by | (E.g. f,r,e –> “f|r|e”) - (K) Class with the longest name - (L) Transition sub-type with the largest number (if there is tie, print out all separated by |)

Tips: Use this to set up your new dataframe:

mpg_summary <- data.frame(A=character(), B=character(), C=character(), D=character(), E=character(), F=character(), G=character(), H=character(),
I=character(), J=character(), K=character(), L=character())

Q11 Write a function that takes in a manufacturer name and print out the highway mileage (hwy) range for each car model