Chapter 3 Statistics and R

Descriptive statistics aims to give a useful description of a large volume of data through values such as the mean, median, variance, standard deviation and quartiles, value frequencies and the mode, and correlation and covariance.

These measures describe the data in terms of its values, the distribution of those values, and the correlation between variables.

3.1 Understanding the Data and Descriptive Statistics

3.1.1 Initial data exploration

Meaning of the data, number of rows and columns, and data types.

First rows of a data frame

library(MASS)
help(Cars93)

## starting httpd help server ... done

head(Cars93)

##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
## 1        Acura Integra   Small      12.9  15.9      18.8       25          31
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18          25
## 3         Audi      90 Compact      25.9  29.1      32.3       20          26
## 4         Audi     100 Midsize      30.8  37.7      44.6       19          26
## 5          BMW    535i Midsize      23.7  30.0      36.2       22          30
## 6        Buick Century Midsize      14.2  15.7      17.3       22          31
##              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
## 1               None      Front         4        1.8        140 6300
## 2 Driver & Passenger      Front         6        3.2        200 5500
## 3        Driver only      Front         6        2.8        172 5500
## 4 Driver & Passenger      Front         6        2.8        172 5500
## 5        Driver only       Rear         4        3.5        208 5700
## 6        Driver only      Front         4        2.2        110 5200
##   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1         2890             Yes               13.2          5    177       102
## 2         2335             Yes               18.0          5    195       115
## 3         2280             Yes               16.9          5    180       102
## 4         2535             Yes               21.1          6    193       106
## 5         2545             Yes               21.1          4    186       109
## 6         2565              No               16.4          6    189       105
##   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
## 1    68          37           26.5           11   2705 non-USA Acura Integra
## 2    71          38           30.0           15   3560 non-USA  Acura Legend
## 3    67          37           28.0           14   3375 non-USA       Audi 90
## 4    70          37           31.0           17   3405 non-USA      Audi 100
## 5    69          39           27.0           13   3640 non-USA      BMW 535i
## 6    69          41           28.0           16   2880     USA Buick Century

Number of rows and columns.

nrow(Cars93)

## [1] 93

ncol(Cars93)

## [1] 27

Examining the structure and data types.

class(Cars93)

## [1] "data.frame"

str(Cars93)

## 'data.frame':    93 obs. of  27 variables:
##  $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
##  $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
##  $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
##  $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
##  $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
##  $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
##  $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
##  $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
##  $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
##  $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
##  $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
##  $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
##  $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
##  $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
##  $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
##  $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
##  $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
##  $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
##  $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
##  $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
##  $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
##  $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
##  $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
##  $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
##  $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
##  $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
##  $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

class(Cars93$Model)

## [1] "factor"

class(Cars93$Price)

## [1] "numeric"

names(Cars93)

##  [1] "Manufacturer"       "Model"              "Type"              
##  [4] "Min.Price"          "Price"              "Max.Price"         
##  [7] "MPG.city"           "MPG.highway"        "AirBags"           
## [10] "DriveTrain"         "Cylinders"          "EngineSize"        
## [13] "Horsepower"         "RPM"                "Rev.per.mile"      
## [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"        
## [19] "Length"             "Wheelbase"          "Width"             
## [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"      
## [25] "Weight"             "Origin"             "Make"
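As a compact complement to the calls above, the sketch below gathers the same information in two steps. It assumes only that the MASS package (which provides Cars93) is installed; the names `dims` and `types` are illustrative.

```r
library(MASS)  # provides the Cars93 data frame

# dim() returns the number of rows and columns in a single call
dims = dim(Cars93)
dims

# class of every column, then a count of how many columns have each class
types = sapply(Cars93, class)
table(types)
```

Counting columns per class is a quick way to see how much of the data is categorical (factor) and how much is numeric before choosing which statistics to compute.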

3.1.2 Examining values

Means, maximum and minimum values, and value selection.

3.1.3 Selecting rows

Note: df[rows, columns]

head(Cars93[Cars93$Price < 20,])

##    Manufacturer    Model    Type Min.Price Price Max.Price MPG.city MPG.highway
## 1         Acura  Integra   Small      12.9  15.9      18.8       25          31
## 6         Buick  Century Midsize      14.2  15.7      17.3       22          31
## 12    Chevrolet Cavalier Compact       8.5  13.4      18.3       25          36
## 13    Chevrolet  Corsica Compact      11.4  11.4      11.4       25          34
## 14    Chevrolet   Camaro  Sporty      13.4  15.1      16.8       19          28
## 15    Chevrolet   Lumina Midsize      13.4  15.9      18.4       21          29
##               AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
## 1                None      Front         4        1.8        140 6300
## 6         Driver only      Front         4        2.2        110 5200
## 12               None      Front         4        2.2        110 5200
## 13        Driver only      Front         4        2.2        110 5200
## 14 Driver & Passenger       Rear         6        3.4        160 4600
## 15               None      Front         4        2.2        110 5200
##    Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1          2890             Yes               13.2          5    177       102
## 6          2565              No               16.4          6    189       105
## 12         2380             Yes               15.2          5    182       101
## 13         2665             Yes               15.6          5    184       103
## 14         1805             Yes               15.5          4    193       101
## 15         2595              No               16.5          6    198       108
##    Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
## 1     68          37           26.5           11   2705 non-USA
## 6     69          41           28.0           16   2880     USA
## 12    66          38           25.0           13   2490     USA
## 13    68          39           26.0           14   2785     USA
## 14    74          43           25.0           13   3240     USA
## 15    71          40           28.5           16   3195     USA
##                  Make
## 1       Acura Integra
## 6       Buick Century
## 12 Chevrolet Cavalier
## 13  Chevrolet Corsica
## 14   Chevrolet Camaro
## 15   Chevrolet Lumina

head(Cars93[Cars93$Price < 20 & Cars93$Type == 'Small',])

##    Manufacturer   Model  Type Min.Price Price Max.Price MPG.city MPG.highway
## 1         Acura Integra Small      12.9  15.9      18.8       25          31
## 23        Dodge    Colt Small       7.9   9.2      10.6       29          33
## 24        Dodge  Shadow Small       8.4  11.3      14.2       23          29
## 29        Eagle  Summit Small       7.9  12.2      16.5       29          33
## 31         Ford Festiva Small       6.9   7.4       7.9       31          33
## 32         Ford  Escort Small       8.4  10.1      11.9       23          30
##        AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
## 1         None      Front         4        1.8        140 6300         2890
## 23        None      Front         4        1.5         92 6000         3285
## 24 Driver only      Front         4        2.2         93 4800         2595
## 29        None      Front         4        1.5         92 6000         2505
## 31        None      Front         4        1.3         63 5000         3150
## 32        None      Front         4        1.8        127 6500         2410
##    Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
## 1              Yes               13.2          5    177       102    68
## 23             Yes               13.2          5    174        98    66
## 24             Yes               14.0          5    172        97    67
## 29             Yes               13.2          5    174        98    66
## 31             Yes               10.0          4    141        90    63
## 32             Yes               13.2          5    171        98    67
##    Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
## 1           37           26.5           11   2705 non-USA Acura Integra
## 23          32           26.5           11   2270     USA    Dodge Colt
## 24          38           26.5           13   2670     USA  Dodge Shadow
## 29          36           26.5           11   2295     USA  Eagle Summit
## 31          33           26.0           12   1845     USA  Ford Festiva
## 32          36           28.0           12   2530     USA   Ford Escort

myCars = Cars93[Cars93$Price < 20 & Cars93$Type == 'Small',]
head(myCars)

##    Manufacturer   Model  Type Min.Price Price Max.Price MPG.city MPG.highway
## 1         Acura Integra Small      12.9  15.9      18.8       25          31
## 23        Dodge    Colt Small       7.9   9.2      10.6       29          33
## 24        Dodge  Shadow Small       8.4  11.3      14.2       23          29
## 29        Eagle  Summit Small       7.9  12.2      16.5       29          33
## 31         Ford Festiva Small       6.9   7.4       7.9       31          33
## 32         Ford  Escort Small       8.4  10.1      11.9       23          30
##        AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
## 1         None      Front         4        1.8        140 6300         2890
## 23        None      Front         4        1.5         92 6000         3285
## 24 Driver only      Front         4        2.2         93 4800         2595
## 29        None      Front         4        1.5         92 6000         2505
## 31        None      Front         4        1.3         63 5000         3150
## 32        None      Front         4        1.8        127 6500         2410
##    Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
## 1              Yes               13.2          5    177       102    68
## 23             Yes               13.2          5    174        98    66
## 24             Yes               14.0          5    172        97    67
## 29             Yes               13.2          5    174        98    66
## 31             Yes               10.0          4    141        90    63
## 32             Yes               13.2          5    171        98    67
##    Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
## 1           37           26.5           11   2705 non-USA Acura Integra
## 23          32           26.5           11   2270     USA    Dodge Colt
## 24          38           26.5           13   2670     USA  Dodge Shadow
## 29          36           26.5           11   2295     USA  Eagle Summit
## 31          33           26.0           12   1845     USA  Ford Festiva
## 32          36           28.0           12   2530     USA   Ford Escort
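The row selector used in the examples above is simply a logical vector with one TRUE/FALSE value per row. A minimal sketch of that mechanism follows, using a small made-up data frame (the name `toy` and its values are illustrative only and are not part of Cars93).

```r
# made-up data, for illustration only
toy = data.frame(Price = c(15.9, 33.9, 9.2, 11.3),
                 Type  = c('Small', 'Midsize', 'Small', 'Small'))

# each comparison yields one logical value per row
cheap = toy$Price < 12
small = toy$Type == 'Small'

# & combines the two conditions element by element
mask = cheap & small
mask

# rows where mask is TRUE are kept; sum(mask) counts them
toy[mask, ]
sum(mask)
```

Building the mask in a separate variable makes it easy to inspect or count the matching rows before subsetting.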

3.1.4 Selecting columns

Note: df[rows, columns]

head(Cars93$Price)

## [1] 15.9 33.9 29.1 37.7 30.0 15.7

head(Cars93[,c('Price','Type')])

##   Price    Type
## 1  15.9   Small
## 2  33.9 Midsize
## 3  29.1 Compact
## 4  37.7 Midsize
## 5  30.0 Midsize
## 6  15.7 Midsize

head(Cars93[,c(3,5)])

##      Type Price
## 1   Small  15.9
## 2 Midsize  33.9
## 3 Compact  29.1
## 4 Midsize  37.7
## 5 Midsize  30.0
## 6 Midsize  15.7

myCars = Cars93[,c('Price','Type')]
head(myCars)

##   Price    Type
## 1  15.9   Small
## 2  33.9 Midsize
## 3  29.1 Compact
## 4  37.7 Midsize
## 5  30.0 Midsize
## 6  15.7 Midsize

3.1.5 Selecting rows and columns

Note: df[rows, columns]

myCars = Cars93[ Cars93$Price < 20 & Cars93$Type == 'Small', c('Price','Type','MPG.city')]
head(myCars)

##    Price  Type MPG.city
## 1   15.9 Small       25
## 23   9.2 Small       29
## 24  11.3 Small       23
## 29  12.2 Small       29
## 31   7.4 Small       31
## 32  10.1 Small       23

nrow(myCars)

## [1] 21
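The same selection can also be written with base R's subset(), which some readers find easier to read because column names are used directly. This is only an alternative sketch; `myCars2` is an illustrative name.

```r
library(MASS)  # provides Cars93

# subset() evaluates the condition and the column names inside the data frame
myCars2 = subset(Cars93,
                 Price < 20 & Type == 'Small',
                 select = c(Price, Type, MPG.city))
head(myCars2)
nrow(myCars2)
```

The bracket form df[rows, columns] remains the more general tool and is usually preferred inside functions; subset() is mainly a convenience for interactive work.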

3.1.6 Examining values

min, max, median, etc.

Note: Categorical values require different treatment from numeric values (why?)

min(myCars$Price)

## [1] 7.4

max(myCars$Price)

## [1] 15.9

mean(myCars$Price)

## [1] 10.16667

summary(myCars)

##      Price            Type       MPG.city    
##  Min.   : 7.40   Compact: 0   Min.   :22.00  
##  1st Qu.: 8.60   Large  : 0   1st Qu.:25.00  
##  Median :10.00   Midsize: 0   Median :29.00  
##  Mean   :10.17   Small  :21   Mean   :29.86  
##  3rd Qu.:11.30   Sporty : 0   3rd Qu.:31.00  
##  Max.   :15.90   Van    : 0   Max.   :46.00
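The summary above lists Type levels with a count of 0 because a factor keeps all of its original levels after rows are filtered out. A short sketch of how droplevels() discards the unused levels; `smallCars` is an illustrative name.

```r
library(MASS)  # provides Cars93

smallCars = Cars93[Cars93$Price < 20 & Cars93$Type == 'Small',
                   c('Price', 'Type', 'MPG.city')]

# the factor still remembers all six original levels
levels(smallCars$Type)

# droplevels() discards the levels that no longer occur
smallCars = droplevels(smallCars)
levels(smallCars$Type)
table(smallCars$Type)
```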

Simple plots (we will see more in the next class)

plot(Cars93$Horsepower, Cars93$Price, main='Price vs HP')
abline(h=mean(Cars93$Price),col='red')
abline(v=mean(Cars93$Horsepower),col='red')

3.1.7 Examining the frequency and distribution of the data

You will see much more in the coming classes. Here we only look at simple measures that give information about the frequency and distribution of the data.

3.1.8 Value frequencies

Do not worry about the plots for now. You will learn more about them in the next class.

table(Cars93$Type)

## 
## Compact   Large Midsize   Small  Sporty     Van 
##      16      11      22      21      14       9

table(Cars93$Origin)

## 
##     USA non-USA 
##      48      45

par(mfrow = c(1, 2))
barplot(table(Cars93$Type), main='Number of Vehicles by Type',col='orange')
barplot(table(Cars93$Origin), main='Number of Vehicles by Origin',col=c('green','blue'))
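Absolute counts are often easier to compare as proportions. The sketch below converts the table of origins into relative frequencies with prop.table(); the names `counts` and `props` are illustrative.

```r
library(MASS)  # provides Cars93

counts = table(Cars93$Origin)   # absolute frequencies
props = prop.table(counts)      # relative frequencies, summing to 1
round(100 * props, 1)           # the same values as percentages
```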

3.1.9 Basic statistics

Median, quartiles, variance, standard deviation

summary(Cars93[ , c('Type','Make','Price','Cylinders','Horsepower')])

##       Type               Make        Price        Cylinders    Horsepower   
##  Compact:16   Acura Integra: 1   Min.   : 7.40   3     : 3   Min.   : 55.0  
##  Large  :11   Acura Legend : 1   1st Qu.:12.20   4     :49   1st Qu.:103.0  
##  Midsize:22   Audi 100     : 1   Median :17.70   5     : 2   Median :140.0  
##  Small  :21   Audi 90      : 1   Mean   :19.51   6     :31   Mean   :143.8  
##  Sporty :14   BMW 535i     : 1   3rd Qu.:23.30   8     : 7   3rd Qu.:170.0  
##  Van    : 9   Buick Century: 1   Max.   :61.90   rotary: 1   Max.   :300.0  
##               (Other)      :87

attach(Cars93)
median(Price)

## [1] 17.7

quantile(Horsepower)

##   0%  25%  50%  75% 100% 
##   55  103  140  170  300

var(Price)

## [1] 93.30458

sd(Price)

## [1] 9.65943

# note: the standard deviation is the square root of the variance
sqrt( var(Price) ) == sd(Price)

## [1] TRUE

var(Price) == sd(Price)**2

## [1] TRUE

par(mfrow = c(1, 2))
boxplot(Price,main='Prices', col='yellow')
boxplot(Price ~ Type,data=Cars93,main="Prices by Type",col='lightBlue')

detach(Cars93)
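To make the relation between var() and sd() concrete, the sketch below recomputes the sample variance of Price by hand from its definition. The names `manual_var` and `manual_sd` are illustrative; all.equal() is used instead of == because it compares floating-point results with a small tolerance.

```r
library(MASS)  # provides Cars93

x = Cars93$Price
n = length(x)

# sample variance: squared deviations from the mean, divided by n - 1
manual_var = sum((x - mean(x))^2) / (n - 1)
manual_sd = sqrt(manual_var)

# all.equal() compares with a small numerical tolerance,
# which is safer than == for floating-point results
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```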

3.1.10 Interquartile range and outlier removal

iqr = IQR(Cars93$Price); iqr

## [1] 11.1

Q = quantile(Cars93$Price, probs=c(.25, .75)); Q

##  25%  75% 
## 12.2 23.3

up = Q[2]+1.5*iqr # upper limit: values above it are treated as outliers
low = Q[1]-1.5*iqr # lower limit: values below it are treated as outliers

head(Cars93[Cars93$Price > up,])

##     Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
## 11      Cadillac Seville Midsize      37.5  40.1      42.7       16          25
## 48      Infiniti     Q45 Midsize      45.4  47.9      50.4       17          22
## 59 Mercedes-Benz    300E Midsize      43.8  61.9      80.0       19          25
##               AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
## 11 Driver & Passenger      Front         8        4.6        295 6000
## 48        Driver only       Rear         8        4.5        278 6000
## 59 Driver & Passenger       Rear         6        3.2        217 5500
##    Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 11         1985              No               20.0          5    204       111
## 48         1955              No               22.5          5    200       113
## 59         2220              No               18.5          5    187       110
##    Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
## 11    74          44             31           14   3935     USA
## 48    72          42             29           15   4000 non-USA
## 59    69          37             27           15   3525 non-USA
##                  Make
## 11   Cadillac Seville
## 48       Infiniti Q45
## 59 Mercedes-Benz 300E

nrow(Cars93[Cars93$Price > up,])

## [1] 3

head(Cars93[Cars93$Price < low,])

##  [1] Manufacturer       Model              Type               Min.Price         
##  [5] Price              Max.Price          MPG.city           MPG.highway       
##  [9] AirBags            DriveTrain         Cylinders          EngineSize        
## [13] Horsepower         RPM                Rev.per.mile       Man.trans.avail   
## [17] Fuel.tank.capacity Passengers         Length             Wheelbase         
## [21] Width              Turn.circle        Rear.seat.room     Luggage.room      
## [25] Weight             Origin             Make              
## <0 rows> (or 0-length row.names)

nrow(Cars93[Cars93$Price < low,])

## [1] 0

We thus identify only 3 values as outliers, all of them above the expected range. Look at the makes (manufacturers) of these cars!
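The steps above can be wrapped in a small function so that the same 1.5 x IQR rule can be reused on other columns, and so that the flagged rows can actually be removed. This is a sketch: `is_outlier` and `carsNoOut` are hypothetical names, not part of any package.

```r
library(MASS)  # provides Cars93

# hypothetical helper: TRUE for values outside [Q1 - k*IQR, Q3 + k*IQR]
is_outlier = function(x, k = 1.5) {
  q = quantile(x, probs = c(.25, .75), na.rm = TRUE)
  iqr = q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

out = is_outlier(Cars93$Price)
sum(out)                          # number of flagged rows
Cars93[out, c('Make', 'Price')]   # which cars were flagged
carsNoOut = Cars93[!out, ]        # data without the flagged rows
nrow(carsNoOut)
```

Whether flagged rows should really be dropped depends on the analysis; here the flagged cars are legitimate luxury models rather than data errors.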

3.1.11 Covariance and Correlation Coefficient

There can be many kinds of relationship between two variables. A very important one, between numeric variables, is a linear relationship. This can be measured by the covariance.

The covariance of two variables \(x\) and \(y\) in a data set measures how the two are linearly related. A positive covariance indicates a positive linear relationship between the variables, and a negative covariance indicates the opposite.

The sample covariance is defined in terms of the sample means as:

\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

Similarly, the population covariance is defined in terms of the population means as:

\[ \sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{x})(y_i - \mu_{y}) \]

Note the similarity to the variance of a single variable.

The correlation coefficient, computed with cor(), indicates how linear the relationship is by giving a value in \([-1,1]\). (The coefficient of determination is the square of this value and therefore lies in \([0,1]\).)

cov(Cars93$Price, Cars93$Horsepower)

## [1] 398.7647

cor(Cars93$Price, Cars93$Horsepower)

## [1] 0.7882176
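To connect the two calls above with the sample covariance formula, the sketch below recomputes both values by hand. The names `s_xy` and `r_xy` are illustrative; the correlation is obtained by dividing the covariance by the two standard deviations.

```r
library(MASS)  # provides Cars93

x = Cars93$Price
y = Cars93$Horsepower
n = length(x)

# sample covariance, following the formula with the 1/(n-1) factor
s_xy = sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# the correlation coefficient rescales the covariance by both standard deviations
r_xy = s_xy / (sd(x) * sd(y))

all.equal(s_xy, cov(x, y))
all.equal(r_xy, cor(x, y))
```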

Other relationships may exist in the data, and we can examine them through plots of pairs of variables.

myCars = Cars93[ , c('Price',"MPG.city", 'Weight','Horsepower')]
pairs(myCars)

MPG.city and Weight appear to have a linear relationship and, in fact, we can compute it.

plot( Cars93[ , c("MPG.city", 'Weight')] )
cov(Cars93[ , c("MPG.city", 'Weight')])

##             MPG.city     Weight
## MPG.city    31.58228  -2795.095
## Weight   -2795.09467 347977.893

cor(Cars93[ , c("MPG.city", 'Weight')])

##            MPG.city     Weight
## MPG.city  1.0000000 -0.8431385
## Weight   -0.8431385  1.0000000

abline(lsfit(Cars93$MPG.city, Cars93$Weight),col='red')
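The red line above is a least-squares fit. As a sketch, the same line can be obtained with lm(), which also reports the coefficient of determination; for a straight-line fit with a single predictor, that value equals the squared correlation coefficient. The names `fit`, `r` and `r2` are illustrative.

```r
library(MASS)  # provides Cars93

# regress Weight on MPG.city (same x and y as in the plot)
fit = lm(Weight ~ MPG.city, data = Cars93)
coef(fit)                       # intercept and slope of the fitted line

r = cor(Cars93$MPG.city, Cars93$Weight)
r2 = summary(fit)$r.squared     # coefficient of determination

# with one predictor, R^2 equals the squared correlation
all.equal(r2, r^2)
```

A negative slope together with a correlation near -0.84 says that heavier cars tend to have lower city mileage.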

3.2 Exercises

3.2.1 Solved Exercise

Consider the data set.

df = read.csv('http://meusite.mackenzie.br/rogerio/TIC/mystocksn.csv')
head(df)

##         data   IBOV VALE3 PETR4  DOLAR
## 1 2020-01-02 118573 13.45 16.27 4.0163
## 2 2020-01-03 117707 13.29 15.99 4.0234
## 3 2020-01-06 116878 13.14 16.22 4.0570
## 4 2020-01-07 116662 13.23 16.06 4.0604
## 5 2020-01-08 116247 13.22 15.70 4.0662
## 6 2020-01-09 115947 12.99 15.75 4.0628

Inspect the data. How many records and attributes are there, which attributes, etc.? What are the minimum and maximum values of the dollar in this period?

nrow(df)

## [1] 43

ncol(df)

## [1] 5

names(df)

## [1] "data"  "IBOV"  "VALE3" "PETR4" "DOLAR"

min(df$DOLAR)

## [1] 4.0163

max(df$DOLAR)

## [1] 4.6062
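A natural follow-up is to ask on which dates the extremes occurred. The sketch below uses only the six rows shown by head(df), typed in by hand so that it runs without network access; `stocks` is an illustrative name, and the results therefore refer to those six rows, not to the full period.

```r
# the six rows shown by head(df), typed in by hand so the sketch runs offline
stocks = data.frame(
  data  = c('2020-01-02', '2020-01-03', '2020-01-06',
            '2020-01-07', '2020-01-08', '2020-01-09'),
  DOLAR = c(4.0163, 4.0234, 4.0570, 4.0604, 4.0662, 4.0628)
)

range(stocks$DOLAR)                       # minimum and maximum in one call
stocks$data[which.min(stocks$DOLAR)]      # date of the lowest value
stocks$data[which.max(stocks$DOLAR)]      # date of the highest value
```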

3.2.2 Exercise

Give the main statistics of the data (mean, median, quartiles, variance, standard deviation, etc.) for the values of the indices in the data set.

summary(df)

##          data         IBOV            VALE3           PETR4      
##  2020-01-02: 1   Min.   : 86067   Min.   : 7.97   Min.   : 7.26  
##  2020-01-03: 1   1st Qu.:113766   1st Qu.:11.79   1st Qu.:14.21  
##  2020-01-06: 1   Median :115528   Median :12.05   Median :14.63  
##  2020-01-07: 1   Mean   :113383   Mean   :12.05   Mean   :14.24  
##  2020-01-08: 1   3rd Qu.:116689   3rd Qu.:13.19   3rd Qu.:14.91  
##  2020-01-09: 1   Max.   :119528   Max.   :13.63   Max.   :16.27  
##  (Other)   :37                                                   
##      DOLAR      
##  Min.   :4.016  
##  1st Qu.:4.164  
##  Median :4.242  
##  Mean   :4.265  
##  3rd Qu.:4.355  
##  Max.   :4.606  
##

var(df[,-c(1)])

##                IBOV        VALE3         PETR4         DOLAR
## IBOV  42519429.1351 7820.1173325 10475.7766396 -824.17369291
## VALE3     7820.1173    1.6568144     2.0101713   -0.18469124
## PETR4    10475.7766    2.0101713     2.8063904   -0.22750142
## DOLAR     -824.1737   -0.1846912    -0.2275014    0.02506714
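The diagonal of the matrix returned by var() holds the variance of each column, so the standard deviations follow by taking square roots. A sketch with sapply(), again using only the six rows printed by head(df) and typed in by hand (`stocks` and `sds` are illustrative names):

```r
# first six rows from head(df), typed in by hand so the sketch runs offline
stocks = data.frame(
  IBOV  = c(118573, 117707, 116878, 116662, 116247, 115947),
  VALE3 = c(13.45, 13.29, 13.14, 13.23, 13.22, 12.99),
  PETR4 = c(16.27, 15.99, 16.22, 16.06, 15.70, 15.75),
  DOLAR = c(4.0163, 4.0234, 4.0570, 4.0604, 4.0662, 4.0628)
)

sapply(stocks, mean)            # one mean per column
sds = sapply(stocks, sd)        # one standard deviation per column
sds

# the diagonal of the covariance matrix holds the variances
all.equal(sqrt(diag(var(stocks))), sds)
```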

3.2.3 Solved Exercise

Make a plot showing the relationships between all pairs of financial indices.

pairs(df[, -c(1)])

3.2.4 Exercise

Which indices have the most linear relationship with the dollar in the period? (It is preferable to use cor().)

cor(df$DOLAR,df[,-c(1)])

##            IBOV      VALE3      PETR4 DOLAR
## [1,] -0.7983119 -0.9062686 -0.8577439     1

cov(df$DOLAR,df[,-c(1)])

##           IBOV      VALE3      PETR4      DOLAR
## [1,] -824.1737 -0.1846912 -0.2275014 0.02506714
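One reason to prefer cor() is that the covariance depends on the units of the variables, which is why the values above differ by orders of magnitude, while the correlation does not. A sketch of that effect using Cars93, where Price is recorded in thousands of dollars according to help(Cars93); `price_usd` is an illustrative name.

```r
library(MASS)  # provides Cars93

price = Cars93$Price            # in thousands of dollars, per help(Cars93)
hp = Cars93$Horsepower
price_usd = price * 1000        # the same prices expressed in dollars

cov(price, hp)                  # covariance depends on the units
cov(price_usd, hp)              # 1000 times larger
cor(price, hp)                  # correlation is unit-free
cor(price_usd, hp)              # unchanged
```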

3.2.5 Solved Exercise

What is the mean price (Price) of the Cars93 vehicles by type?

for (t in unique(Cars93$Type)){
  # cat() prints the type name and its mean price on one line
  cat(t, round(mean(Cars93[Cars93$Type == t, ]$Price), 1), '\n')
}

## Small 10.2 
## Midsize 27.2 
## Compact 18.2 
## Large 24.3 
## Sporty 19.4 
## Van 19.1
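Grouped means like these can also be computed without an explicit loop. The sketch below uses base R's tapply() and aggregate(); the names `price_by_type` and `hp_by_origin` are illustrative.

```r
library(MASS)  # provides Cars93

# mean Price within each Type, returned as a named vector
price_by_type = tapply(Cars93$Price, Cars93$Type, mean)
price_by_type

# the same pattern answers other grouped questions,
# for example mean Horsepower by Origin
hp_by_origin = tapply(Cars93$Horsepower, Cars93$Origin, mean)
hp_by_origin

# aggregate() returns the result as a data frame instead of a named vector
aggregate(Price ~ Type, data = Cars93, FUN = mean)
```

tapply() orders the groups by factor level (alphabetically here), whereas the loop follows the order in which the types first appear in the data.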

3.2.6 Exercise

Consider the data set.

df = read.csv('https://meusite.mackenzie.br/rogerio/TIC/Life_Expectancy_Data.csv')
df = na.omit(df)
head(df)

##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2

What are the mean BMI and life expectancy for developing and developed countries?

for (s in unique(df$Status)){
  rows = df[df$Status == s, ]
  # print the group name with both means on one line
  cat(s, ': Life.expectancy = ', round(mean(rows$Life.expectancy), 1), ', BMI = ', round(mean(rows$BMI), 1), '\n', sep='')
}

## Developing: Life.expectancy = 67.7, BMI = 35.7
## Developed: Life.expectancy = 78.7, BMI = 52.3

3.2.7 Exercise

Is there a correlation between BMI and life expectancy for developed countries?

cor(df[df$Status == 'Developed', ]$Life.expectancy, df[df$Status == 'Developed', ]$BMI)

## [1] 0.01079434

3.2.8 Exercise

Are there outliers in BMI and life expectancy across all countries?

par(mfrow = c(1, 2))
boxplot(df$Life.expectancy)
boxplot(df$BMI)

print('Life.expectancy')

## [1] "Life.expectancy"

iqr = IQR(df$Life.expectancy); iqr

## [1] 10.6

Q = quantile(df$Life.expectancy, probs=c(.25, .75)); Q

##  25%  75% 
## 64.4 75.0

up = Q[2]+1.5*iqr # upper limit: values above it are treated as outliers
low = Q[1]-1.5*iqr # lower limit: values below it are treated as outliers

df[df$Life.expectancy > up,]

##  [1] Country                         Year                           
##  [3] Status                          Life.expectancy                
##  [5] Adult.Mortality                 infant.deaths                  
##  [7] Alcohol                         percentage.expenditure         
##  [9] Hepatitis.B                     Measles                        
## [11] BMI                             under.five.deaths              
## [13] Polio                           Total.expenditure              
## [15] Diphtheria                      HIV.AIDS                       
## [17] GDP                             Population                     
## [19] thinness..1.19.years            thinness.5.9.years             
## [21] Income.composition.of.resources Schooling                      
## <0 rows> (or 0-length row.names)

nrow(df[df$Life.expectancy > up,])

## [1] 0

df[df$Life.expectancy < low,]

##           Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 57         Angola 2007 Developing            48.2             375            87
## 348      Botswana 2004 Developing            48.1             652             2
## 349      Botswana 2003 Developing            46.4             693             2
## 350      Botswana 2002 Developing            46.0             699             2
## 351      Botswana 2001 Developing            46.7             679             2
## 352      Botswana 2000 Developing            47.8             647             2
## 1482      Lesotho 2008 Developing            47.8             592             5
## 1483      Lesotho 2007 Developing            46.2             633             4
## 1484      Lesotho 2006 Developing            45.3             654             5
## 1485      Lesotho 2005 Developing            44.5             675             5
## 1486      Lesotho 2004 Developing            44.8             666             5
## 1487      Lesotho 2003 Developing            45.5             648             5
## 1579       Malawi 2007 Developing            48.5             559            37
## 1580       Malawi 2006 Developing            47.1             587            38
## 1581       Malawi 2005 Developing            46.0              66            39
## 1582       Malawi 2004 Developing            45.1             615            40
## 1583       Malawi 2003 Developing            44.6             613            43
## 1584       Malawi 2002 Developing            44.0              67            46
## 2299 Sierra Leone 2014 Developing            48.1             463            23
## 2303 Sierra Leone 2010 Developing            48.1             424            27
## 2304 Sierra Leone 2009 Developing            47.1             433            28
## 2305 Sierra Leone 2008 Developing            46.2             441            29
## 2306 Sierra Leone 2007 Developing            45.3              45            29
## 2499    Swaziland 2006 Developing            47.8             564             3
## 2500    Swaziland 2005 Developing            46.0              63             3
## 2501    Swaziland 2004 Developing            45.6              69             3
## 2502    Swaziland 2003 Developing            45.9               6             3
## 2503    Swaziland 2002 Developing            46.4             587             3
## 2504    Swaziland 2001 Developing            47.1             568             3
## 2505    Swaziland 2000 Developing            48.4             536             3
## 2930     Zimbabwe 2008 Developing            48.2             632            30
## 2931     Zimbabwe 2007 Developing            46.6              67            29
## 2932     Zimbabwe 2006 Developing            45.4               7            28
## 2933     Zimbabwe 2005 Developing            44.6             717            28
## 2934     Zimbabwe 2004 Developing            44.3             723            27
## 2935     Zimbabwe 2003 Developing            44.5             715            26
## 2936     Zimbabwe 2002 Developing            44.8              73            25
## 2937     Zimbabwe 2001 Developing            45.3             686            25
## 2938     Zimbabwe 2000 Developing            46.0             665            24
##      Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 57      6.35             184.821345          73    1014 18.8               138
## 348     4.90             469.582390          91       1 32.2                 4
## 349     5.51             299.367125           9      59 31.6                 4
## 350     6.41               6.330007          88       7 31.1                 4
## 351     5.48             306.952735          87       1  3.5                 4
## 352     5.37             250.891648          86    2672 29.9                 4
## 1482    2.75              91.854328          88       0 28.8                 6
## 1483    2.69               9.184327           9       2 28.3                 6
## 1484    2.61              71.155776          91       1 27.9                 6
## 1485    2.67              57.903698          87       0 27.4                 6
## 1486    1.80              67.913618           6      31 26.9                 7
## 1487    1.99               5.300902          17       1 26.4                 7
## 1579    1.18               4.269511          87     143 16.6                59
## 1580    1.18               6.847034          99       1 16.2                61
## 1581    1.04               5.670640          93     184 15.9                62
## 1582    1.11              58.135833          89    1116 15.5                65
## 1583    1.08               4.375316          84     167 15.2                70
## 1584    1.10               3.885395          64      92 14.8                75
## 2299    0.01               1.443286          83    1006 23.8                32
## 2303    3.84               5.347718          86    1089 21.7                40
## 2304    3.97              49.837127          84      31 21.2                42
## 2305    3.91               5.379606          77      44  2.7                44
## 2306    3.86              45.571089          63       0  2.2                45
## 2499    5.53             437.080244          93       0 28.2                 4
## 2500    5.08             372.165147          95       0 27.8                 4
## 2501    5.78              37.438577          93       0 27.4                 4
## 2502    5.65               2.819124           9     350 27.1                 4
## 2503    5.52             131.042127          88      37 26.7                 4
## 2504    6.72             143.619732          86      49 26.3                 4
## 2505    7.19              25.216833          83      10 25.9                 4
## 2930    3.56              20.843429          75       0 28.6                46
## 2931    3.88              29.814566          72     242 28.2                46
## 2932    4.57              34.262169          68     212 27.9                45
## 2933    4.14               8.717409          65     420 27.5                43
## 2934    4.36               0.000000          68      31 27.1                42
## 2935    4.06               0.000000           7     998 26.7                41
## 2936    4.43               0.000000          73     304 26.3                40
## 2937    1.72               0.000000          76     529 25.9                39
## 2938    1.68               0.000000          79    1483 25.5                39
##      Polio Total.expenditure Diphtheria HIV.AIDS        GDP Population
## 57      75              3.38         73      2.6 2878.83714    2997687
## 348     96              5.56         96     28.4 4896.58384     182933
## 349     96              4.65         96     31.9 4163.65960     184339
## 350     97              6.47         97     34.6  355.61838    1779953
## 351     97              5.73         97     37.2 3128.97793    1754935
## 352     97              4.64         97     38.8 3349.68823     172834
## 1482    86              8.85         88     27.3  934.42856     199993
## 1483    87              8.47         88     30.0  918.43272    1982287
## 1484    88              7.12         89     34.1  915.77575    1965662
## 1485    88              6.30         89     34.8  862.94631    1949543
## 1486    89              6.96          9     34.6  781.51459    1933728
## 1487     9              7.13          9     33.8   63.63628     191897
## 1579    88              9.31         87     19.3   32.22273    1384969
## 1580    99              8.99         99     21.1  297.69712   13429262
## 1581    94              8.20         93     22.4   28.36738    1339711
## 1582    94              7.82         89     23.4  274.22563    1267638
## 1583    85              6.35         84     24.2   26.15252   12336687
## 1584    79              4.82         64     24.7   29.97990    1213711
## 2299    83             11.90         83      0.6   78.43948     779162
## 2303    84              1.32         86      1.6   45.12842     645872
## 2304    81             13.13         84      1.7  394.59324      63126
## 2305    75              1.29         77      1.9   46.37592    6165372
## 2306    63              1.12         64      2.2  358.82747     615417
## 2499    88              6.81         87     43.7 2937.36723     112514
## 2500    88              6.80         86     49.1 2873.86214     115873
## 2501    88              5.88         86     50.3 2529.63356      19553
## 2502    87              5.71         85     50.6   22.99449     187392
## 2503    87              5.16         85     49.9 1324.99623       1893
## 2504    87              5.11         84     48.8 1437.63495     172927
## 2505    87              5.26         84     46.4 1637.45670     161468
## 2930    75              4.96         75     20.5  325.67857   13558469
## 2931    73              4.47         73     23.7  396.99822    1332999
## 2932    71              5.12          7     26.8  414.79623   13124267
## 2933    69              6.44         68     30.3  444.76575     129432
## 2934    67              7.13         65     33.6  454.36665   12777511
## 2935     7              6.52         68     36.7  453.35116   12633897
## 2936    73              6.53         71     39.8   57.34834     125525
## 2937    76              6.16         75     42.1  548.58731   12366165
## 2938    78              7.10         78     43.5  547.35888   12222251
##      thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 57                    9.6                9.6                           0.454
## 348                   1.5                1.4                           0.580
## 349                   1.9                1.8                           0.567
## 350                  11.4               11.3                           0.558
## 351                  11.8               11.8                           0.560
## 352                  12.3               12.2                           0.559
## 1482                  8.0                7.8                           0.447
## 1483                  8.4                8.3                           0.440
## 1484                  8.8                8.7                           0.437
## 1485                  9.3                9.2                           0.437
## 1486                  9.7                9.7                           0.439
## 1487                  1.2                1.1                           0.440
## 1579                  7.1                7.0                           0.387
## 1580                  7.3                7.1                           0.377
## 1581                  7.4                7.2                           0.371
## 1582                  7.5                7.4                           0.366
## 1583                  7.6                7.5                           0.362
## 1584                  7.7                7.6                           0.388
## 2299                  7.5                7.4                           0.426
## 2303                  8.3                8.2                           0.384
## 2304                  8.5                8.4                           0.375
## 2305                  8.7                8.7                           0.367
## 2306                  8.9                8.9                           0.357
## 2499                  6.9                7.1                           0.502
## 2500                  7.3                7.5                           0.495
## 2501                  7.7                7.9                           0.492
## 2502                  8.2                8.4                           0.493
## 2503                  8.6                8.8                           0.502
## 2504                  9.0                9.2                           0.506
## 2505                  9.4                9.6                           0.516
## 2930                  7.8                7.8                           0.421
## 2931                  8.2                8.2                           0.414
## 2932                  8.6                8.6                           0.408
## 2933                  9.0                9.0                           0.406
## 2934                  9.4                9.4                           0.407
## 2935                  9.8                9.9                           0.418
## 2936                  1.2                1.3                           0.427
## 2937                  1.6                1.7                           0.427
## 2938                 11.0               11.2                           0.434
##      Schooling
## 57         7.7
## 348       11.8
## 349       11.8
## 350       11.9
## 351       11.8
## 352       11.7
## 1482      10.7
## 1483      10.6
## 1484      10.7
## 1485      10.7
## 1486      10.7
## 1487      10.5
## 1579       9.7
## 1580       9.6
## 1581       9.7
## 1582      10.0
## 1583      10.3
## 1584      10.4
## 2299       9.5
## 2303       8.7
## 2304       8.5
## 2305       8.3
## 2306       8.2
## 2499       9.9
## 2500       9.7
## 2501       9.4
## 2502       9.1
## 2503       9.2
## 2504       9.3
## 2505       9.4
## 2930       9.7
## 2931       9.6
## 2932       9.5
## 2933       9.3
## 2934       9.2
## 2935       9.5
## 2936      10.0
## 2937       9.8
## 2938       9.8

nrow(df[df$Life.expectancy < low,])

## [1] 39

print('BMI')

## [1] "BMI"

iqr = IQR(df$BMI); iqr

## [1] 36.3

Q = quantile(df$BMI, probs=c(.25, .75)); Q

##  25%  75% 
## 19.5 55.8

up = Q[2]+1.5*iqr  # upper fence: values above it are outliers
low = Q[1]-1.5*iqr # lower fence: values below it are outliers

df[df$BMI > up,]

##  [1] Country                         Year                           
##  [3] Status                          Life.expectancy                
##  [5] Adult.Mortality                 infant.deaths                  
##  [7] Alcohol                         percentage.expenditure         
##  [9] Hepatitis.B                     Measles                        
## [11] BMI                             under.five.deaths              
## [13] Polio                           Total.expenditure              
## [15] Diphtheria                      HIV.AIDS                       
## [17] GDP                             Population                     
## [19] thinness..1.19.years            thinness.5.9.years             
## [21] Income.composition.of.resources Schooling                      
## <0 rows> (or 0-length row.names)

nrow(df[df$BMI > up,])

## [1] 0

df[df$BMI < low,]

##  [1] Country                         Year                           
##  [3] Status                          Life.expectancy                
##  [5] Adult.Mortality                 infant.deaths                  
##  [7] Alcohol                         percentage.expenditure         
##  [9] Hepatitis.B                     Measles                        
## [11] BMI                             under.five.deaths              
## [13] Polio                           Total.expenditure              
## [15] Diphtheria                      HIV.AIDS                       
## [17] GDP                             Population                     
## [19] thinness..1.19.years            thinness.5.9.years             
## [21] Income.composition.of.resources Schooling                      
## <0 rows> (or 0-length row.names)

nrow(df[df$BMI < low,])

## [1] 0
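
The fence computation above is repeated for each column, so it can be wrapped in a small helper. The sketch below is only an illustration: the function name `iqr_fences` and the toy vector `x` are assumptions, not part of the original analysis.

```r
# Hypothetical helper: lower and upper fences (Q1 - k*IQR, Q3 + k*IQR) of a numeric vector.
iqr_fences = function(x, k = 1.5) {
  q = quantile(x, probs = c(.25, .75), names = FALSE)
  iqr = q[2] - q[1]                    # same value as IQR(x)
  c(low = q[1] - k * iqr, up = q[2] + k * iqr)
}

x = c(10, 12, 13, 15, 16, 18, 60)      # toy data; 60 is an obvious outlier
f = iqr_fences(x)
x[x < f["low"] | x > f["up"]]          # values outside the fences
```

With a data frame, the same helper could be called as `iqr_fences(df$BMI)` and the result used to filter rows.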

3.2.9 Exercise

What is the mean life expectancy with and without outliers?

print('Life.expectancy')

## [1] "Life.expectancy"

iqr = IQR(df$Life.expectancy); iqr

## [1] 10.6

Q = quantile(df$Life.expectancy, probs=c(.25, .75)); Q

##  25%  75% 
## 64.4 75.0

up = Q[2]+1.5*iqr  # upper fence: values above it are outliers
low = Q[1]-1.5*iqr # lower fence: values below it are outliers

head( df[df$Life.expectancy > up,] )

##  [1] Country                         Year                           
##  [3] Status                          Life.expectancy                
##  [5] Adult.Mortality                 infant.deaths                  
##  [7] Alcohol                         percentage.expenditure         
##  [9] Hepatitis.B                     Measles                        
## [11] BMI                             under.five.deaths              
## [13] Polio                           Total.expenditure              
## [15] Diphtheria                      HIV.AIDS                       
## [17] GDP                             Population                     
## [19] thinness..1.19.years            thinness.5.9.years             
## [21] Income.composition.of.resources Schooling                      
## <0 rows> (or 0-length row.names)

nrow(df[df$Life.expectancy > up,])

## [1] 0

head( df[df$Life.expectancy < low,] )

##      Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 57    Angola 2007 Developing            48.2             375            87
## 348 Botswana 2004 Developing            48.1             652             2
## 349 Botswana 2003 Developing            46.4             693             2
## 350 Botswana 2002 Developing            46.0             699             2
## 351 Botswana 2001 Developing            46.7             679             2
## 352 Botswana 2000 Developing            47.8             647             2
##     Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 57     6.35             184.821345          73    1014 18.8               138
## 348    4.90             469.582390          91       1 32.2                 4
## 349    5.51             299.367125           9      59 31.6                 4
## 350    6.41               6.330007          88       7 31.1                 4
## 351    5.48             306.952735          87       1  3.5                 4
## 352    5.37             250.891648          86    2672 29.9                 4
##     Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 57     75              3.38         73      2.6 2878.8371    2997687
## 348    96              5.56         96     28.4 4896.5838     182933
## 349    96              4.65         96     31.9 4163.6596     184339
## 350    97              6.47         97     34.6  355.6184    1779953
## 351    97              5.73         97     37.2 3128.9779    1754935
## 352    97              4.64         97     38.8 3349.6882     172834
##     thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 57                   9.6                9.6                           0.454
## 348                  1.5                1.4                           0.580
## 349                  1.9                1.8                           0.567
## 350                 11.4               11.3                           0.558
## 351                 11.8               11.8                           0.560
## 352                 12.3               12.2                           0.559
##     Schooling
## 57        7.7
## 348      11.8
## 349      11.8
## 350      11.9
## 351      11.8
## 352      11.7

nrow(df[df$Life.expectancy < low,])

## [1] 39

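# Keep only the rows at or above the lower fence. Negating the condition with
# a minus sign, as in df[-(df$Life.expectancy < low),], would turn it into
# 0/-1 indices and drop only the first row.
df_noout = df[df$Life.expectancy >= low,]

mean(df$Life.expectancy)

## [1] 69.3023

mean(df_noout$Life.expectancy)

All 39 removed rows lie below the lower fence, and therefore below the overall mean, so the mean without outliers is higher than 69.3023.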
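
When rows are removed by a condition, the condition should be negated with `!` rather than `-`. A minimal sketch with a toy vector (the values are made up for illustration) shows the difference:

```r
x = c(70, 40, 72, 75, 45, 71)   # toy values, not taken from the dataset
is_out = x < 50                 # logical filter: TRUE for the two low values

x[!is_out]                      # logical negation: keeps the four non-outliers
x[-is_out]                      # arithmetic negation: indices are 0 and -1, so only x[1] is dropped

mean(x)                         # mean with outliers
mean(x[!is_out])                # mean without outliers
```

The second form fails silently: it returns a vector of almost the same length, which is why the two means can look nearly identical.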

3.2.10 Exercise

Consider the following dataset.

library(MASS)
help(painters)
painters = na.omit(painters)
head(painters)

##               Composition Drawing Colour Expression School
## Da Udine               10       8     16          3      A
## Da Vinci               15      16      4         14      A
## Del Piombo              8      13     16          7      A
## Del Sarto              12      16      9          8      A
## Fr. Penni               0      15      8          0      A
## Guilio Romano          15      16      4         14      A

How many painter schools are there? (Use unique or table.)

unique(painters$School)

## [1] A B C D E F G H
## Levels: A B C D E F G H

table(painters$School)

## 
##  A  B  C  D  E  F  G  H 
## 10  6  6 10  7  4  7  4
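
`unique` lists the distinct values and `table` counts them. To get the number of schools as a single value, one possible sketch (using the same `painters` data from MASS) is:

```r
library(MASS)                      # provides the painters data frame

length(unique(painters$School))    # number of distinct schools
nlevels(painters$School)           # same count, since School is a factor
length(table(painters$School))     # number of categories in the frequency table
```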

3.2.11 Exercise

In statistics, the mode is the most frequent value in the data. What is the mode of the painters' schools?

table(painters$School)

## 
##  A  B  C  D  E  F  G  H 
## 10  6  6 10  7  4  7  4

Two schools tie for the highest count: 'A' and 'D', with 10 painters each, so the data have two modes.
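
Base R has no built-in function for the statistical mode (`mode()` returns the storage mode of an object). A minimal sketch that reads the mode off the frequency table, and reports every value tied for the highest count, could look like this:

```r
library(MASS)

tab = table(painters$School)
names(tab)[tab == max(tab)]   # all schools tied for the highest frequency
```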

3.2.12 Solved Exercise

How many painters score above the mean in Composition?

painters[painters$Composition >= mean(painters$Composition), ]

##                 Composition Drawing Colour Expression School
## Da Vinci                 15      16      4         14      A
## Del Sarto                12      16      9          8      A
## Guilio Romano            15      16      4         14      A
## Perino del Vaga          15      16      7          6      A
## Raphael                  17      18     12         18      A
## Fr. Salviata             13      15      8          8      B
## Primaticcio              15      14      7         10      B
## T. Zucarro               13      14     10          9      B
## Volterra                 12      15      5          8      B
## Barocci                  14      15      6         10      C
## Cortona                  16      14     12          6      C
## L. Jordaens              13      12      9          6      C
## Vanius                   15      15     12         13      C
## Palma Giovane            12       9     14          6      D
## Tintoretto               15      14     16          4      D
## Titian                   12      15     18          6      D
## Veronese                 15      10     16          3      D
## Albani                   14      14     10          6      E
## Corregio                 13      13     15         12      E
## Domenichino              15      17      9         17      E
## Guercino                 18      10     10          4      E
## Lanfranco                14      13     10          5      E
## The Carraci              15      17     13         13      E
## Otho Venius              13      14     10         10      G
## Rembrandt                15       6     17         12      G
## Rubens                   18      13     17         17      G
## Teniers                  15      12     13          6      G
## Van Dyck                 15      10     17         13      G
## Le Brun                  16      16      8         16      H
## Le Suer                  15      15      4         15      H
## Poussin                  15      17      6         15      H

nrow(painters[painters$Composition >= mean(painters$Composition), ])

## [1] 31
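
Since `TRUE` counts as 1 in arithmetic, the same count can be obtained by summing the logical condition directly, without building the filtered data frame. A short sketch (the variable name `above` is chosen here for illustration):

```r
library(MASS)

above = painters$Composition >= mean(painters$Composition)
sum(above)     # number of painters at or above the mean
mean(above)    # proportion of painters at or above the mean
```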

3.2.13 Exercise

Which painter or painters have the highest total score across all criteria? Not much of a surprise here, is there?

painters$Score = rowSums(painters[, 1:4])  # Composition + Drawing + Colour + Expression
head(painters)

##               Composition Drawing Colour Expression School Score
## Da Udine               10       8     16          3      A    37
## Da Vinci               15      16      4         14      A    49
## Del Piombo              8      13     16          7      A    44
## Del Sarto              12      16      9          8      A    45
## Fr. Penni               0      15      8          0      A    23
## Guilio Romano          15      16      4         14      A    49

painters[painters$Score == max(painters$Score), ]

##         Composition Drawing Colour Expression School Score
## Raphael          17      18     12         18      A    65
## Rubens           18      13     17         17      G    65
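
Comparing against `max` returns every painter tied for the top score, whereas `which.max` returns only the first one. A sketch of the difference, recomputing the total in a local variable `score` (a name chosen here for illustration):

```r
library(MASS)

score = rowSums(painters[, c("Composition", "Drawing", "Colour", "Expression")])

names(score)[which.max(score)]       # only the first painter with the top score
names(score)[score == max(score)]    # every painter tied for the top score
```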

3.2.13.0.0.1 Solved Exercise

This one is not really an exercise (lol). Use the painters' Composition and Colour scores to understand the mean and the median.

par(mfrow = c(1, 2))  # two plots side by side

# Left panel: Colour scores with a fitted normal curve
x = painters$Colour
h = hist(x, col="lightblue", xlab="Colour scores", main="Mean and Median", xlim=c(0,30))
xfit = seq(min(x)-3*sd(x),max(x)+4*sd(x),length=100)
yfit = dnorm(xfit,mean=mean(x),sd=sd(x))
yfit = yfit*diff(h$mids[1:2])*length(x)  # rescale the density to histogram counts
lines(xfit, yfit, col="black", lwd=1)
abline(v=mean(x),col='red',lty = 2, lwd = 2)         # mean
abline(v=median(x),col='darkblue',lty = 2, lwd = 2)  # median

# Right panel: Composition scores, with the mean and median of Composition
x = painters$Composition
h = hist(x, col="lightblue", xlab="Composition scores", main="Mean and Median", xlim=c(0,30))
xfit = seq(min(x)-3*sd(x),max(x)+4*sd(x),length=100)
yfit = dnorm(xfit,mean=mean(x),sd=sd(x))
yfit = yfit*diff(h$mids[1:2])*length(x)  # rescale the density to histogram counts
lines(xfit, yfit, col="black", lwd=1)
abline(v=mean(x),col='red',lty = 2, lwd = 2)         # mean
abline(v=median(x),col='darkblue',lty = 2, lwd = 2)  # median
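
The distance between the mean and median lines reflects how the mean reacts to extreme values while the median does not. A toy example with made-up numbers (not taken from `painters`) illustrates this:

```r
scores = c(8, 9, 10, 11, 12)     # symmetric toy data
c(mean = mean(scores), median = median(scores))    # both equal 10

skewed = c(8, 9, 10, 11, 42)     # same data with one extreme value
c(mean = mean(skewed), median = median(skewed))    # mean rises to 16, median stays at 10
```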