Introduction to R

Load and explore data:

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
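
Printing all 32 rows is manageable here, but a compact view is usually enough for larger data frames. The following is a minimal sketch using base R helpers that the text above does not mention (head, dim, names, summary); the variable names are only illustrative.

```r
# mtcars ships with base R (package datasets), so nothing needs to be loaded
first_rows <- head(mtcars, 3)   # first three observations only
dims <- dim(mtcars)             # c(number of rows, number of columns)
col_names <- names(mtcars)      # feature names

print(first_rows)
print(dims)
summary(mtcars$mpg)             # min, quartiles, mean and max of one feature
```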

Columns of the dataframe can be selected with $, and arithmetic or comparisons then apply to every element of the column at once:

mtcars$drat
##  [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
## [16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
## [31] 3.54 4.11
mtcars$drat + 1
##  [1] 4.90 4.90 4.85 4.08 4.15 3.76 4.21 4.69 4.92 4.92 4.92 4.07 4.07 4.07 3.93
## [16] 4.00 4.23 5.08 5.93 5.22 4.70 3.76 4.15 4.73 4.08 5.08 5.43 4.77 5.22 4.62
## [31] 4.54 5.11
mtcars$drat > 3.5
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## [25] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
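
A logical vector like the one above can also be counted directly, because R treats TRUE as 1 and FALSE as 0 in arithmetic. The sketch below illustrates this and shows two other base R ways of extracting the same column; the variable names are made up for the example.

```r
high_drat <- mtcars$drat > 3.5   # logical vector, one entry per car

n_high <- sum(high_drat)         # number of TRUE entries
share_high <- mean(high_drat)    # proportion of cars above the threshold

# Two other ways to extract the same column as a plain vector
same_a <- mtcars[["drat"]]
same_b <- mtcars[, "drat"]

n_high
share_high
```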

Filter observations according to the value of a feature. Mind the comma at the end: the condition selects rows, and the empty slot after the comma keeps every column:

mtcars[mtcars$drat > 3.5,]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Camaro Z28     13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
mtcars[mtcars$drat > 3.5 & mtcars$cyl == 6,]
##                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
mtcars[mtcars$drat > 3.5,]$carb
##  [1] 4 4 1 2 2 4 4 1 2 1 1 4 1 2 2 4 6 8 2
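
To see why the comma matters, the sketch below compares the row filter with the same expression written without the comma, and with base R's subset(), which is an alternative not used in the text above. The error is caught with tryCatch so the script keeps running.

```r
# With the trailing comma: the logical vector picks rows, all columns are kept
by_rows <- mtcars[mtcars$drat > 3.5, ]

# subset() evaluates the condition inside the data frame,
# so the mtcars$ prefix is not needed
by_subset <- subset(mtcars, drat > 3.5)

# Without the comma the vector is taken as a column selector;
# it has 32 entries for only 11 columns, so R stops with an error
no_comma <- tryCatch(
  mtcars[mtcars$drat > 3.5],
  error = function(e) conditionMessage(e)
)

nrow(by_rows)
nrow(by_subset)
no_comma
```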

Compute feature statistics:

mean(mtcars$mpg); sd(mtcars$mpg)
## [1] 20.09062
## [1] 6.026948
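
Other summary statistics follow the same pattern. As a small sketch (var, median and sapply are base R functions not used in the text above), the standard deviation can be recovered from the variance, and one statistic can be applied to several columns in a single call:

```r
mpg <- mtcars$mpg

mpg_var <- var(mpg)        # sample variance
mpg_sd <- sqrt(mpg_var)    # same value as sd(mpg)
mpg_median <- median(mpg)

# Apply the same statistic to several columns at once
col_means <- sapply(mtcars[, c("mpg", "hp", "wt")], mean)

mpg_sd
mpg_median
col_means
```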

dplyr

library(dplyr)

Use summarise to create a new dataframe summarising some variables:

mtcars %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
##   mpg_mean   mpg_sd
## 1 20.09062 6.026948
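
The %>% operator passes the object on its left as the first argument of the function on its right. A minimal sketch of that equivalence, assuming dplyr is installed (the names piped and direct are only for illustration):

```r
library(dplyr)

# Piped form, as used in this document
piped <- mtcars %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg))

# The same call written without the pipe
direct <- summarise(mtcars, mpg_mean = mean(mpg), mpg_sd = sd(mpg))

all.equal(piped, direct)
```

The piped form tends to read better once several steps are chained, since the steps appear in the order in which they run.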

Use group_by to create groups according to the values of one or more variables:

mtcars %>% group_by(cyl) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
## # A tibble: 3 x 3
##     cyl mpg_mean mpg_sd
##   <dbl>    <dbl>  <dbl>
## 1     4     26.7   4.51
## 2     6     19.7   1.45
## 3     8     15.1   2.56
mtcars %>% group_by(cyl, am) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
## # A tibble: 6 x 4
## # Groups:   cyl [3]
##     cyl    am mpg_mean mpg_sd
##   <dbl> <dbl>    <dbl>  <dbl>
## 1     4     0     22.9  1.45 
## 2     4     1     28.1  4.48 
## 3     6     0     19.1  1.63 
## 4     6     1     20.6  0.751
## 5     8     0     15.0  2.77 
## 6     8     1     15.4  0.566
mtcars %>% group_by(cyl, carb) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
## # A tibble: 9 x 4
## # Groups:   cyl [3]
##     cyl  carb mpg_mean mpg_sd
##   <dbl> <dbl>    <dbl>  <dbl>
## 1     4     1     27.6   5.55
## 2     4     2     25.9   3.81
## 3     6     1     19.8   2.33
## 4     6     4     19.8   1.55
## 5     6     6     19.7  NA   
## 6     8     2     17.2   2.09
## 7     8     3     16.3   1.05
## 8     8     4     13.2   2.28
## 9     8     8     15    NA

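Some standard deviation values are NA. Why? Because only a single observation has that combination of cyl and carb (for example, 6 and 6), and the sample standard deviation is undefined for one value. Adding n() to the summary shows the group sizes: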
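
The missing values can be reproduced without dplyr. In the sketch below, manual_sd is a hypothetical helper that writes out the usual sample standard deviation formula; the mpg values are copied from the mtcars listing above.

```r
# Sample standard deviation written out by hand:
# sqrt( sum((x - mean(x))^2) / (n - 1) )
manual_sd <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - 1))
}

two_cars <- c(21.4, 18.1)   # mpg of the two 6-cylinder, 1-carburetor cars
one_car <- 19.7             # mpg of the only 6-cylinder, 6-carburetor car

sd(two_cars)          # well defined
manual_sd(two_cars)   # same value as sd()
sd(one_car)           # NA: with one value, n - 1 is 0 and the statistic is undefined
```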

mtcars %>% group_by(cyl, carb) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg), num = n())
## # A tibble: 9 x 5
## # Groups:   cyl [3]
##     cyl  carb mpg_mean mpg_sd   num
##   <dbl> <dbl>    <dbl>  <dbl> <int>
## 1     4     1     27.6   5.55     5
## 2     4     2     25.9   3.81     6
## 3     6     1     19.8   2.33     2
## 4     6     4     19.8   1.55     4
## 5     6     6     19.7  NA        1
## 6     8     2     17.2   2.09     4
## 7     8     3     16.3   1.05     3
## 8     8     4     13.2   2.28     6
## 9     8     8     15    NA        1

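Note that n() is a special dplyr function that only works inside verbs such as summarise and mutate, where it returns the size of the current group. Outside that context, nrow returns the number of observations in a dataframe: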

nrow(mtcars)
## [1] 32
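
As a short sketch of how the two relate, the group sizes returned by n() add up to nrow(mtcars). dplyr also offers count(), which is not used in the text above, as a shortcut for the group_by plus summarise pattern.

```r
library(dplyr)

# n() counts the rows of the current group inside summarise()
with_n <- mtcars %>% group_by(cyl) %>% summarise(num = n())

# count() is a shortcut for the same pattern;
# its result column is called n by default
with_count <- mtcars %>% count(cyl)

with_n
with_count
sum(with_n$num)   # adds up to nrow(mtcars)
```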

ungroup removes the grouping so that later operations act on the whole dataframe. By itself it leaves the values unchanged; only the "# Groups" line disappears from the output:

mtcars %>% group_by(cyl, carb) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg), num = n()) %>% ungroup()
## # A tibble: 9 x 5
##     cyl  carb mpg_mean mpg_sd   num
##   <dbl> <dbl>    <dbl>  <dbl> <int>
## 1     4     1     27.6   5.55     5
## 2     4     2     25.9   3.81     6
## 3     6     1     19.8   2.33     2
## 4     6     4     19.8   1.55     4
## 5     6     6     19.7  NA        1
## 6     8     2     17.2   2.09     4
## 7     8     3     16.3   1.05     3
## 8     8     4     13.2   2.28     6
## 9     8     8     15    NA        1
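
The effect of ungroup becomes visible once another verb follows it. The sketch below assumes dplyr 1.0.0 or later, where summarise accepts a .groups argument and group_vars reports the active grouping; the column name ref is made up for the example.

```r
library(dplyr)

by_group <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(mpg_mean = mean(mpg), .groups = "drop_last")  # still grouped by cyl

group_vars(by_group)             # remaining grouping variable(s)
group_vars(ungroup(by_group))    # none left after ungroup()

# The same mutate() call gives different results with and without grouping
grouped <- by_group %>%
  mutate(ref = mean(mpg_mean))   # one reference value per cyl group
ungrouped <- by_group %>%
  ungroup() %>%
  mutate(ref = mean(mpg_mean))   # one reference value for the whole table

n_distinct(grouped$ref)
n_distinct(ungrouped$ref)
```

Spelling out .groups = "drop_last" keeps the default behaviour seen in the outputs above (the result stays grouped by cyl) while making the choice explicit.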

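mutate adds new features while preserving existing ones. Here mpg_gmean is the unweighted mean of the nine group means, so it differs from the mean of mpg over all 32 cars (19.4 versus 20.09):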

mtcars %>% group_by(cyl, carb) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg), num = n()) %>% ungroup() %>% mutate(mpg_gmean = mean(mpg_mean))
## # A tibble: 9 x 6
##     cyl  carb mpg_mean mpg_sd   num mpg_gmean
##   <dbl> <dbl>    <dbl>  <dbl> <int>     <dbl>
## 1     4     1     27.6   5.55     5      19.4
## 2     4     2     25.9   3.81     6      19.4
## 3     6     1     19.8   2.33     2      19.4
## 4     6     4     19.8   1.55     4      19.4
## 5     6     6     19.7  NA        1      19.4
## 6     8     2     17.2   2.09     4      19.4
## 7     8     3     16.3   1.05     3      19.4
## 8     8     4     13.2   2.28     6      19.4
## 9     8     8     15    NA        1      19.4
mtcars %>% group_by(cyl, carb) %>% summarise(mpg_mean = mean(mpg), mpg_sd = sd(mpg), num = n()) %>% ungroup() %>% mutate(mpg_gmean = mean(mpg_mean), deviation = mpg_mean - mpg_gmean)
## # A tibble: 9 x 7
##     cyl  carb mpg_mean mpg_sd   num mpg_gmean deviation
##   <dbl> <dbl>    <dbl>  <dbl> <int>     <dbl>     <dbl>
## 1     4     1     27.6   5.55     5      19.4     8.22 
## 2     4     2     25.9   3.81     6      19.4     6.54 
## 3     6     1     19.8   2.33     2      19.4     0.386
## 4     6     4     19.8   1.55     4      19.4     0.386
## 5     6     6     19.7  NA        1      19.4     0.336
## 6     8     2     17.2   2.09     4      19.4    -2.21 
## 7     8     3     16.3   1.05     3      19.4    -3.06 
## 8     8     4     13.2   2.28     6      19.4    -6.21 
## 9     8     8     15    NA        1      19.4    -4.36
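
The value 19.4 in mpg_gmean is lower than the overall mean of 20.09 computed earlier because every group counts equally, whatever its size. A minimal sketch of the difference, assuming dplyr 1.0.0 or later for the .groups argument and using base R's weighted.mean (not part of the text above):

```r
library(dplyr)

group_stats <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(mpg_mean = mean(mpg), num = n(), .groups = "drop")

# Every group counts equally
unweighted <- mean(group_stats$mpg_mean)

# Every car counts equally: weight each group mean by its size
weighted <- weighted.mean(group_stats$mpg_mean, group_stats$num)

unweighted
weighted
mean(mtcars$mpg)   # matches the weighted version
```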

Plotting in R

cars
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 11    11   28
## 12    12   14
## 13    12   20
## 14    12   24
## 15    12   28
## 16    13   26
## 17    13   34
## 18    13   34
## 19    13   46
## 20    14   26
## 21    14   36
## 22    14   60
## 23    14   80
## 24    15   20
## 25    15   26
## 26    15   54
## 27    16   32
## 28    16   40
## 29    17   32
## 30    17   40
## 31    17   50
## 32    18   42
## 33    18   56
## 34    18   76
## 35    18   84
## 36    19   36
## 37    19   46
## 38    19   68
## 39    20   32
## 40    20   48
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85
plot(cars)

plot(cars$speed, cars$dist)
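
Both calls above produce the same scatter plot with default labels. As a hedged sketch of a more readable version, the example below adds axis labels, a title and a fitted line; the label text and the use of lm and abline are additions for illustration, with units taken from the dataset's help page (?cars).

```r
# Scatter plot with explicit axis labels and a title.
# Units follow the dataset's help page (?cars): speed in mph, distance in ft.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed")

# Overlay a least-squares line to make the trend easier to see
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red")
```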