Logistic 回归是通过一系列连续或类别型的预测变量来预测二值型结果变量的方法。

如因变量为是否患胃癌,有是或否两种结果。选择两组人,一组为胃癌患者,另一组为正常。采集相关的数据作为自变量,如年龄、性别、饮食等等,可以有多种。自变量可以是连续的,也可以是因子类型的,通过Logistic回归,就可以大致了解哪些因素是胃癌的危险因素。

分析数据集

以 R 包 AER 中的 Affairs 数据集作为案例学习 Logistic 回归,分析可能与出轨有关的因素。

Affairs 数据集包含了 601 个受访者的相关数据,包括出轨次数、性别、年龄、婚龄、是否有孩子、宗教、教育、职业及婚姻评分等 9 个变量。

安装并载入 Affairs 数据集

1
2
3
4
5
6
7
install.packages('AER')
data(Affairs, package = 'AER')
head(Affairs[,1:5] )
# affairs gender age yearsmarried children
# 4 0 male 37 10 no
# 5 0 female 27 4 no
# 11 0 female 32 15 yes

描述性统计一下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
> summary(Affairs)
# affairs gender age yearsmarried children religiousness education occupation rating
# Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171 Min. :1.000 Min. : 9.00 Min. :1.000 Min. :1.000
# 1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430 1st Qu.:2.000 1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
# Median : 0.000 Median :32.00 Median : 7.000 Median :3.000 Median :16.00 Median :5.000 Median :4.000
# Mean : 1.456 Mean :32.49 Mean : 8.178 Mean :3.116 Mean :16.17 Mean :4.195 Mean :3.932
# 3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
# Max. :12.000 Max. :57.00 Max. :15.000 Max. :5.000 Max. :20.00 Max. :7.000 Max. :5.000

table(Affairs$affairs)

# 0 1 2 3 7 12
# 451 34 17 19 42 38

prop.table(table(Affairs$affairs))*100

# 0 1 2 3 7 12
# 75.041597 5.657238 2.828619 3.161398 6.988353 6.322795

prop.table(table(Affairs$gender))*100

# female male
# 52.41265 47.58735

只关注是否出轨,新建 isaffair 变量,并转换为因子

1
2
3
4
5
6
Affairs$isaffair[Affairs$affairs > 0] <- 1
Affairs$isaffair[Affairs$affairs == 0] <- 0
table(Affairs$isaffair)

# 0 1
# 451 150

glm() 回归

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
attach(Affairs)
fit <- glm(isaffair~gender+age+yearsmarried+
children+religiousness+education+
occupation+rating, data = Affairs, family = binomial())
summary(fit)

# Call:
# glm(formula = isaffair ~ gender + age + yearsmarried + children +
# religiousness + education + occupation + rating, family = binomial(),
# data = Affairs)

# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.5713 -0.7499 -0.5690 -0.2539 2.5191

# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.37726 0.88776 1.551 0.120807
# gendermale 0.28029 0.23909 1.172 0.241083
# age -0.04426 0.01825 -2.425 0.015301 *
# yearsmarried 0.09477 0.03221 2.942 0.003262 **
# childrenyes 0.39767 0.29151 1.364 0.172508
# religiousness -0.32472 0.08975 -3.618 0.000297 ***
# education 0.02105 0.05051 0.417 0.676851
# occupation 0.03092 0.07178 0.431 0.666630
# rating -0.46845 0.09091 -5.153 2.56e-07 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# (Dispersion parameter for binomial family taken to be 1)

# Null deviance: 675.38 on 600 degrees of freedom
# Residual deviance: 609.51 on 592 degrees of freedom
# AIC: 627.51

# Number of Fisher Scoring iterations: 4

有四个变量显著,其余如性别、职业、教育等不显著,与实际情况类似。
将不显著的变量去除,再次回归

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
fit_new <- glm(isaffair~age+yearsmarried+
religiousness+rating, data = Affairs, family = binomial())
summary(fit_new)

# Call:
# glm(formula = isaffair ~ age + yearsmarried + religiousness +
# rating, family = binomial(), data = Affairs)

# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.6278 -0.7550 -0.5701 -0.2624 2.3998

# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 1.93083 0.61032 3.164 0.001558 **
# age -0.03527 0.01736 -2.032 0.042127 *
# yearsmarried 0.10062 0.02921 3.445 0.000571 ***
# religiousness -0.32902 0.08945 -3.678 0.000235 ***
# rating -0.46136 0.08884 -5.193 2.06e-07 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# (Dispersion parameter for binomial family taken to be 1)

# Null deviance: 675.38 on 600 degrees of freedom
# Residual deviance: 615.36 on 596 degrees of freedom
# AIC: 625.36

# Number of Fisher Scoring iterations: 4

可以用 anova() 分析比较一下两个模型
对于广义线性回归,可使用卡方检验

1
2
3
4
5
6
7
8
9
anova(fit, fit_new, test = 'Chisq')
# Analysis of Deviance Table

# Model 1: isaffair ~ gender + age + yearsmarried + children + religiousness +
# education + occupation + rating
# Model 2: isaffair ~ age + yearsmarried + religiousness + rating
# Resid. Df Resid. Dev Df Deviance Pr(>Chi)
# 1 592 609.51
# 2 596 615.36 -4 -5.8474 0.2108

卡方检验值不显著,说明两个模型差别不大,去除的变量不会显著提高方程的预测精度。

结果解释

回归系数

1
2
3
coef(fit_new)
# (Intercept) age yearsmarried religiousness rating
# 1.93083017 -0.03527112 0.10062274 -0.32902386 -0.46136144

Logistic 回归中,响应变量是 y=e 的对数优势比,回归系数的含义是,当其他的预测变量不变时,一单位预测变量变化时,可能引起的响应变量对数优势比的变化。
由于使用的是对数优势比,需进行指数运算才能回到正常模式。

1
2
3
exp(coef(fit_new))
# (Intercept) age yearsmarried religiousness rating
# 6.8952321 0.9653437 1.1058594 0.7196258 0.6304248

结果表示,当婚龄增加一年时,出轨的优势比将×1.1058594;相反年龄增加一岁时,出轨的优势比将×0.9653437,即婚龄越大,越容易出轨,年龄越小越容易出轨。预测变量不能为0,所以截距项没有意义。

预测数据

创建新的数据集用来预测,这里选平均值

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# 评分变化
test <- data.frame(rating=1:5, age=mean(Affairs$age),
yearsmarried=mean(Affairs$yearsmarried),
religiousness=mean(Affairs$religiousness))
test
# rating age yearsmarried religiousness
# 1 1 32.48752 8.177696 3.116473
# 2 2 32.48752 8.177696 3.116473
# 3 3 32.48752 8.177696 3.116473
# 4 4 32.48752 8.177696 3.116473
# 5 5 32.48752 8.177696 3.116473

test$prob <- predict(fit_new, newdata = test, type='response')
test
# rating age yearsmarried religiousness prob
# 1 1 32.48752 8.177696 3.116473 0.5302296
# 2 2 32.48752 8.177696 3.116473 0.4157377
# 3 3 32.48752 8.177696 3.116473 0.3096712
# 4 4 32.48752 8.177696 3.116473 0.2204547
# 5 5 32.48752 8.177696 3.116473 0.1513079

# 年龄变化
test <- data.frame(rating=mean(Affairs$rating), age=seq(20, 60, 10),
yearsmarried=mean(Affairs$yearsmarried),
religiousness=mean(Affairs$religiousness))
test
# rating age yearsmarried religiousness
# 1 3.93178 20 8.177696 3.116473
# 2 3.93178 30 8.177696 3.116473
# 3 3.93178 40 8.177696 3.116473
# 4 3.93178 50 8.177696 3.116473
# 5 3.93178 60 8.177696 3.116473
test$prob <- predict(fit_new, newdata = test, type='response')
test
# rating age yearsmarried religiousness prob
# 1 3.93178 20 8.177696 3.116473 0.31193344
# 2 3.93178 30 8.177696 3.116473 0.24162210
# 3 3.93178 40 8.177696 3.116473 0.18294542
# 4 3.93178 50 8.177696 3.116473 0.13596342
# 5 3.93178 60 8.177696 3.116473 0.09957638

可见评分约低,出轨的可能性越大;年龄约小,出轨的可能性越大。