R 笔记：Logistic 回归

Logistic 回归是通过一系列连续或类别型的预测变量来预测二值型结果变量的方法。

如因变量为是否患胃癌，有是或否两种结果。选择两组人，一组为胃癌患者，另一组为正常。采集相关的数据作为自变量，如年龄、性别、饮食等等，可以有多种。自变量可以是连续的，也可以是因子类型的，通过Logistic回归，就可以大致了解哪些因素是胃癌的危险因素。

分析数据集

以 R 包 AER 中的 Affairs 数据集作为案例学习 Logistic 回归，分析可能与出轨有关的因素。

Affairs 数据集包含了 601 个受访者的相关数据，包括出轨次数、性别、年龄、婚龄、是否有孩子、宗教、教育、职业及婚姻评分等 9 个变量。

安装并载入 Affairs 数据集

install.packages('AER')
data(Affairs, package = 'AER')
head(Affairs[,1:5] )
#    affairs gender age yearsmarried children
# 4        0   male  37           10       no
# 5        0 female  27            4       no
# 11       0 female  32           15      yes

描述性统计一下

> summary(Affairs)
#         affairs       gender             age     yearsmarried  children   religiousness       education      occupation          rating     
#  Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000   Min.   : 9.00   Min.   :1.000   Min.   :1.000  
#  1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
#  Median : 0.000                Median :32.00   Median : 7.000             Median :3.000   Median :16.00   Median :5.000   Median :4.000  
#  Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116   Mean   :16.17   Mean   :4.195   Mean   :3.932  
#  3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
#  Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000   Max.   :20.00   Max.   :7.000   Max.   :5.000 

table(Affairs$affairs)

#   0   1   2   3   7  12 
# 451  34  17  19  42  38 

prop.table(table(Affairs$affairs))*100

#         0         1         2         3         7        12 
# 75.041597  5.657238  2.828619  3.161398  6.988353  6.322795 

prop.table(table(Affairs$gender))*100

#   female     male 
# 52.41265 47.58735

只关注是否出轨，新建 isaffair 变量，并转换为因子

Affairs$isaffair[Affairs$affairs > 0] <- 1
Affairs$isaffair[Affairs$affairs == 0] <- 0
table(Affairs$isaffair)

#   0   1 
# 451 150

`glm()` 回归

attach(Affairs)
fit <- glm(isaffair~gender+age+yearsmarried+
             children+religiousness+education+
             occupation+rating, data = Affairs, family = binomial())
summary(fit)

# Call:
# glm(formula = isaffair ~ gender + age + yearsmarried + children + 
#     religiousness + education + occupation + rating, family = binomial(), 
#     data = Affairs)

# Deviance Residuals: 
#     Min       1Q   Median       3Q      Max  
# -1.5713  -0.7499  -0.5690  -0.2539   2.5191  

# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    1.37726    0.88776   1.551 0.120807    
# gendermale     0.28029    0.23909   1.172 0.241083    
# age           -0.04426    0.01825  -2.425 0.015301 *  
# yearsmarried   0.09477    0.03221   2.942 0.003262 ** 
# childrenyes    0.39767    0.29151   1.364 0.172508    
# religiousness -0.32472    0.08975  -3.618 0.000297 ***
# education      0.02105    0.05051   0.417 0.676851    
# occupation     0.03092    0.07178   0.431 0.666630    
# rating        -0.46845    0.09091  -5.153 2.56e-07 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# (Dispersion parameter for binomial family taken to be 1)

#     Null deviance: 675.38  on 600  degrees of freedom
# Residual deviance: 609.51  on 592  degrees of freedom
# AIC: 627.51

# Number of Fisher Scoring iterations: 4

有四个变量显著，其余如性别、职业、教育等不显著，与实际情况类似。
将不显著的变量去除，再次回归

fit_new <- glm(isaffair~age+yearsmarried+
                 religiousness+rating, data = Affairs, family = binomial())
summary(fit_new)

# Call:
# glm(formula = isaffair ~ age + yearsmarried + religiousness + 
#     rating, family = binomial(), data = Affairs)

# Deviance Residuals: 
#     Min       1Q   Median       3Q      Max  
# -1.6278  -0.7550  -0.5701  -0.2624   2.3998  

# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
# (Intercept)    1.93083    0.61032   3.164 0.001558 ** 
# age           -0.03527    0.01736  -2.032 0.042127 *  
# yearsmarried   0.10062    0.02921   3.445 0.000571 ***
# religiousness -0.32902    0.08945  -3.678 0.000235 ***
# rating        -0.46136    0.08884  -5.193 2.06e-07 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# (Dispersion parameter for binomial family taken to be 1)

#     Null deviance: 675.38  on 600  degrees of freedom
# Residual deviance: 615.36  on 596  degrees of freedom
# AIC: 625.36

# Number of Fisher Scoring iterations: 4

可以用 anova() 分析比较一下两个模型
对于广义线性回归，可使用卡方检验

anova(fit, fit_new, test = 'Chisq')
# Analysis of Deviance Table

# Model 1: isaffair ~ gender + age + yearsmarried + children + religiousness + 
#     education + occupation + rating
# Model 2: isaffair ~ age + yearsmarried + religiousness + rating
#   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
# 1       592     609.51                     
# 2       596     615.36 -4  -5.8474   0.2108

卡方检验值不显著，说明两个模型差别不大，去除的变量不会显著提高方程的预测精度。

结果解释

回归系数

1
2
3

coef(fit_new)
  # (Intercept)           age  yearsmarried religiousness        rating 
  #  1.93083017   -0.03527112    0.10062274   -0.32902386   -0.46136144

Logistic 回归中，响应变量是 y=e 的对数优势比，回归系数的含义是，当其他的预测变量不变时，一单位预测变量变化时，可能引起的响应变量对数优势比的变化。
由于使用的是对数优势比，需进行指数运算才能回到正常模式。

1
2
3

exp(coef(fit_new))
  # (Intercept)           age  yearsmarried religiousness        rating 
  #   6.8952321     0.9653437     1.1058594     0.7196258     0.6304248

结果表示，当婚龄增加一年时，出轨的优势比将×1.1058594；相反年龄增加一岁时，出轨的优势比将×0.9653437，即婚龄越大，越容易出轨，年龄越小越容易出轨。预测变量不能为0，所以截距项没有意义。

预测数据

创建新的数据集用来预测，这里选平均值

# 评分变化
test <- data.frame(rating=1:5, age=mean(Affairs$age),
                   yearsmarried=mean(Affairs$yearsmarried),
                   religiousness=mean(Affairs$religiousness))
test
#   rating      age yearsmarried religiousness
# 1      1 32.48752     8.177696      3.116473
# 2      2 32.48752     8.177696      3.116473
# 3      3 32.48752     8.177696      3.116473
# 4      4 32.48752     8.177696      3.116473
# 5      5 32.48752     8.177696      3.116473

test$prob <- predict(fit_new, newdata = test, type='response')
test
#   rating      age yearsmarried religiousness      prob
# 1      1 32.48752     8.177696      3.116473 0.5302296
# 2      2 32.48752     8.177696      3.116473 0.4157377
# 3      3 32.48752     8.177696      3.116473 0.3096712
# 4      4 32.48752     8.177696      3.116473 0.2204547
# 5      5 32.48752     8.177696      3.116473 0.1513079

# 年龄变化
test <- data.frame(rating=mean(Affairs$rating), age=seq(20, 60, 10),
                   yearsmarried=mean(Affairs$yearsmarried),
                   religiousness=mean(Affairs$religiousness))
test
#    rating age yearsmarried religiousness
# 1 3.93178  20     8.177696      3.116473
# 2 3.93178  30     8.177696      3.116473
# 3 3.93178  40     8.177696      3.116473
# 4 3.93178  50     8.177696      3.116473
# 5 3.93178  60     8.177696      3.116473
test$prob <- predict(fit_new, newdata = test, type='response')
test
#    rating age yearsmarried religiousness       prob
# 1 3.93178  20     8.177696      3.116473 0.31193344
# 2 3.93178  30     8.177696      3.116473 0.24162210
# 3 3.93178  40     8.177696      3.116473 0.18294542
# 4 3.93178  50     8.177696      3.116473 0.13596342
# 5 3.93178  60     8.177696      3.116473 0.09957638

可见评分约低，出轨的可能性越大；年龄约小，出轨的可能性越大。