1. sci666首页
  2. 实用技巧
  3. 生物信息学

如何根据Cox回归Nomogram计算所有患者的风险得分?

1. 背景知识

我们通过Cox回归等方法构建临床预测模型,然后根据模型参数把患者的生存概率可视化为Nomogram。“Nomogra”中文常翻译为诺模图或列线图,其本质就是回归方程的可视化。它根据模型中所有自变量回归系数的大小来制定评分标准,据此给每个自变量的每种取值水平一个评分,这样对每个患者,就可以计算得到一个总分,再通过总分与结局发生概率之间的转换函数来计算每个患者的生存概率。举例来说,我们治疗了一个40岁的男性胰腺癌手术后的患者,术中进行了放疗,肿瘤位于胰腺头部,有腹膜转移,临床分期在IV期。根据上述条件,通过一定的数学模型判断每个变量的得分,比如年龄40岁,得分是10分,男性得分是4分……依次累计各个变量的得分,计算出总分,不同的总分值就对应3月、6月和1年的生存概率。复杂的Cox回归公式变成了直观的图形,临床医生可以很方便的计算每个患者的得分及对应的生存概率,从而告知患者一个相对准确的“算命”结果。

对于从Nomogram中获得单个患者的得分与相应的生存概率并不困难,但这里有两个技术问题一直困扰着很多研究者:

(1) 如何根据Cox回归列线图精确计算某个患者的风险得分与生存概率?

(2) 如何根据Cox回归列线图精确计算数据集中所有患者得分与生存概率?

如何解决这两个问题呢?下面我们就采用笔者发表在Ann Transl Med杂志的文献《In-depth mining of clinical data: the construction of clinical prediction model with R》[1] 中第20页的案例1演示如何解决这两个问题,数据下载地址请参阅本文献。

实际上,列线图只是Cox回归方程的可视化,无论精确计算单个患者的得分还是预测概率,其依据都是回归方程而非列线图本身。

2. R语言代码及解读

载入必要的程辑包,导入格式为.sav的数据,并剔除缺失值,展示前6行数据,查看数据的结构:

breast<-read.spss("BreastCancer.sav")
breast<-as.data.frame(breast)
breast<-na.omit(breast)
head(breast)
##    No    Months Status Age       ER      PgR Margin_status
## 9   9  8.633333 Censor  70 Positive Negative      Nagative
## 11 11 44.033333 Censor  56 Positive Positive      Nagative
## 12 12 48.766667 Censor  54 Positive Positive      Nagative
## 13 13 14.466667 Censor  61 Positive Positive      Nagative
## 14 14 47.900000 Censor  39 Negative Positive      Nagative
## 19 19 39.866667 Censor  50 Positive Positive      Nagative
##    Pathologic_stage HER2_Status Menopause_status
## 9           Stage I    Negative   Post menopause
## 11          Stage I    Negative    Pre menopause
## 12         Stage II    Negative    Pre menopause
## 13         Stage II    Negative   Post menopause
## 14         Stage II    Negative    Pre menopause
## 19         Stage II    Positive   Post menopause
##                 Surgery_method             Histological_type
## 9                   Lumpectomy                         Other
## 11 Modified Radical Mastectomy                         Other
## 12 Modified Radical Mastectomy Infiltrating Ductal Carcinoma
## 13                  Lumpectomy                         Other
## 14                  Lumpectomy Infiltrating Ductal Carcinoma
## 19                  Lumpectomy Infiltrating Ductal Carcinoma
str(breast)
## 'data.frame':    549 obs. of  12 variables:
##  $ No               : num  9 11 12 13 14 19 20 22 23 24 ...
##  $ Months           : num  8.63 44.03 48.77 14.47 47.9 ...
##  $ Status           : Factor w/ 2 levels "Censor","Dead": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Age              : num  70 56 54 61 39 50 67 45 66 36 ...
##  $ ER               : Factor w/ 2 levels "Negative","Positive": 2 2 2 2 1 2 1 2 2 1 ...
##  $ PgR              : Factor w/ 2 levels "Negative","Positive": 1 2 2 2 2 2 1 2 2 1 ...
##  $ Margin_status    : Factor w/ 2 levels "Nagative","Positive": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Pathologic_stage : Factor w/ 4 levels "Stage I","Stage II",..: 1 1 2 2 2 2 2 2 1 3 ...
##  $ HER2_Status      : Factor w/ 2 levels "Negative","Positive": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Menopause_status : Factor w/ 2 levels "Pre menopause",..: 2 1 1 2 1 2 2 1 2 1 ...
##  $ Surgery_method   : Factor w/ 4 levels "Lumpectomy","Modified Radical Mastectomy",..: 1 2 2 1 1 1 2 2 1 1 ...
##  $ Histological_type: Factor w/ 3 levels "Infiltrating Ductal Carcinoma",..: 3 3 1 3 1 1 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 3 4 5 6 7 8 10 15 ...
##   ..- attr(*, "names")= chr  "1" "2" "3" "4" ...

定义终点,将回归模型中的等级变量“Pathologic_stage”及二分类变量“PgR”作为数值型变量处理,或者定义为非负整数型变量,在后续计算得分及预测概率时要求模型中的自变量为非因子型变量,这点很重要,否则无法计算结果。

breast$Status<-ifelse(breast$Status == 'Dead',1,0)
breast$Pathologic_stage<-as.numeric(breast$Pathologic_stage)
breast$PgR <-as.numeric(ifelse(breast$PgR == "Positive", 1,0))
str(breast)
## 'data.frame':    549 obs. of  12 variables:
##  $ No               : num  9 11 12 13 14 19 20 22 23 24 ...
##  $ Months           : num  8.63 44.03 48.77 14.47 47.9 ...
##  $ Status           : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Age              : num  70 56 54 61 39 50 67 45 66 36 ...
##  $ ER               : Factor w/ 2 levels "Negative","Positive": 2 2 2 2 1 2 1 2 2 1 ...
##  $ PgR              : num  0 1 1 1 1 1 0 1 1 0 ...
##  $ Margin_status    : Factor w/ 2 levels "Nagative","Positive": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Pathologic_stage : num  1 1 2 2 2 2 2 2 1 3 ...
##  $ HER2_Status      : Factor w/ 2 levels "Negative","Positive": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Menopause_status : Factor w/ 2 levels "Pre menopause",..: 2 1 1 2 1 2 2 1 2 1 ...
##  $ Surgery_method   : Factor w/ 4 levels "Lumpectomy","Modified Radical Mastectomy",..: 1 2 2 1 1 1 2 2 1 1 ...
##  $ Histological_type: Factor w/ 3 levels "Infiltrating Ductal Carcinoma",..: 3 3 1 3 1 1 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 3 4 5 6 7 8 10 15 ...
##   ..- attr(*, "names")= chr  "1" "2" "3" "4" ...

将数据集breast按照5:5随机划分为训练集与验证集。

set.seed(123) #random number generator
ind <- sample(2, nrow(breast), replace = TRUE, prob = c(0.5, 0.5))
train <- breast[ind==1, ] #the training data set
test <- breast[ind==2, ] #the test data set
str(train)
## 'data.frame':    258 obs. of  12 variables:
##  $ No               : num  11 13 14 20 22 23 25 27 29 31 ...
##  $ Months           : num  44 14.5 47.9 28.4 18.5 ...
##  $ Status           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Age              : num  56 61 39 67 45 66 48 62 39 34 ...
##  $ ER               : Factor w/ 2 levels "Negative","Positive": 2 2 1 1 2 2 1 1 2 2 ...
##  $ PgR              : num  1 1 1 0 1 1 0 0 1 1 ...
##  $ Margin_status    : Factor w/ 2 levels "Nagative","Positive": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Pathologic_stage : num  1 2 2 2 2 1 1 2 2 2 ...
##  $ HER2_Status      : Factor w/ 2 levels "Negative","Positive": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Menopause_status : Factor w/ 2 levels "Pre menopause",..: 1 2 1 2 1 2 2 2 1 1 ...
##  $ Surgery_method   : Factor w/ 4 levels "Lumpectomy","Modified Radical Mastectomy",..: 2 1 1 2 2 1 3 1 3 3 ...
##  $ Histological_type: Factor w/ 3 levels "Infiltrating Ductal Carcinoma",..: 3 3 1 1 1 1 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 3 4 5 6 7 8 10 15 ...
##   ..- attr(*, "names")= chr  "1" "2" "3" "4" ...
str(test)
## 'data.frame':    291 obs. of  12 variables:
##  $ No               : num  9 12 19 24 26 30 32 34 35 49 ...
##  $ Months           : num  8.63 48.77 39.87 18.27 83.3 ...
##  $ Status           : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ Age              : num  70 54 50 36 36 50 53 37 40 76 ...
##  $ ER               : Factor w/ 2 levels "Negative","Positive": 2 2 2 1 2 1 2 2 1 1 ...
##  $ PgR              : num  0 1 1 0 1 0 1 1 0 0 ...
##  $ Margin_status    : Factor w/ 2 levels "Nagative","Positive": 1 1 1 2 1 1 1 2 1 1 ...
##  $ Pathologic_stage : num  1 2 2 3 1 2 2 3 2 2 ...
##  $ HER2_Status      : Factor w/ 2 levels "Negative","Positive": 1 1 2 1 1 2 1 1 1 2 ...
##  $ Menopause_status : Factor w/ 2 levels "Pre menopause",..: 2 1 2 1 1 1 2 1 1 2 ...
##  $ Surgery_method   : Factor w/ 4 levels "Lumpectomy","Modified Radical Mastectomy",..: 1 2 1 1 1 2 3 2 3 1 ...
##  $ Histological_type: Factor w/ 3 levels "Infiltrating Ductal Carcinoma",..: 3 1 1 1 1 3 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 3 4 5 6 7 8 10 15 ...
##   ..- attr(*, "names")= chr  "1" "2" "3" "4" ...

以下代码为构建Cox回归模型,并绘制列线图,此处不再详述,读者可以参阅文献[1]获得更详细的列线图及校正曲线绘制代码。此处在训练集中构建的Cox回归模型“coxm”及列线图对象“nom”是我们后续计算得分及生存概率的依据。

dd <- datadist(train)
option <- options(datadist = "dd")
coxm <- cph(Surv(Months,Status==1) ~ Age+Pathologic_stage+PgR,
            data = train, x = T, y = T, surv = T)
surv <- Survival(coxm)
nom <- nomogram(coxm,fun=list(function(x)surv(12, x), function(x)surv(36, x),
                              function(x)surv(60, x)),
                lp = T,funlabel = c('1-Yeas OS', '3-Year OS','5-YearOS'),
                maxscale = 100, fun.at = c('0.95','0.85','0.80','0.70','0.6','0.5','0.4','0.3','0.2','0.1'))
plot((nom), xfrac = .3)

如何根据Cox回归Nomogram计算所有患者的风险得分?

图1.Cox回归模型构建的列线图

使用nomogramFormula包中formula_lp()函数及points_cal()函数直接提取训练集中的线性预测值linear.predictors,计算列线图得分,展示前6行。但这种方式仅仅能计算训练集中单个患者的列线图得分。

options(option)
results <- formula_lp(nomogram = nom)
points <- points_cal(formula = results$formula, lp = coxm$linear.predictors)
head(points)
##        11        13        14        20        22        23
##  42.32681  82.48704  52.44867 146.00818  60.64095  55.98062

使用nomogramFormula包中formula_rd()函数及points_cal()函数根据训练集、验证或全集中自变量取值计算列线图得分,并将计算的得分做为新的列添加至对应的数据集中。我们推荐使用这种方法来计算。

options(option)
results <- formula_rd(nomogram = nom)
train$points <- points_cal(formula = results$formula,rd=train)
test$points <- points_cal(formula = results$formula,rd=test)
breast$points <- points_cal(formula = results$formula,rd=breast)
head(breast$points)
## [1] 116.77100  42.32681  72.92939  82.48705  52.44867  67.46786

从模型的数学本质上讲,列线图得分是表象,其对应的是通过Cox模型计算线性预测值与某时点的预测概率。所以,这里我们继续介绍如何计算模型的线性预测值与预测概率。

计算训练集、验证集与全集的线性预测值。训练集的线性预测值从模型coxm中直接提取即可,并做为新的列加入到训练集train中。

train$linear.predictors<-coxm$linear.predictors
head(train$linear.predictors)
## [1] -0.8616095 -0.2490837 -0.7072305  0.7197437 -0.5822813 -0.6533610

使用predict()函数分别在验证集test及全集中计算线性预测值,并做为新的列加入到对应的数据集中。

test$linear.predictors<-predict(coxm,type="lp",newdata=test)
breast$linear.predictors<-predict(coxm,type="lp",newdata=breast)
head(breast$linear.predictors)
## [1]  0.2738167 -0.8616095 -0.3948577 -0.2490837 -0.7072305 -0.4781571

下面我们继续使用pec包中predictSurvProb()函数计算训练集、验证集与全集中各患者的12个月的生存概率,并做为新的列加入到对应的数据集中,代码如下:

train$survprob1 <- predictSurvProb(coxm,newdata=train,times=12)
head(train$survprob1)
##           [,1]
## [1,] 0.9970325
## [2,] 0.9945315
## [3,] 0.9965379
## [4,] 0.9856556
## [5,] 0.9960781
## [6,] 0.9963467
test$survprob2 <- predictSurvProb(coxm,newdata=test,times=12)
head(test$survprob2)
##           [,1]
## [1,] 0.9907924
## [2,] 0.9952715
## [3,] 0.9956486
## [4,] 0.9874827
## [5,] 0.9980423
## [6,] 0.9899106
breast$survprob <- predictSurvProb(coxm,newdata=breast,times=12)
head(breast$survprob)
##           [,1]
## [1,] 0.9907924
## [2,] 0.9970325
## [3,] 0.9952715
## [4,] 0.9945315
## [5,] 0.9965379
## [6,] 0.9956486

3. 总结与讨论

以上介绍了如何基于R语言根据Cox回归模型计算数据集中单个患者的列线图的得分、线性预测值及生存概率。Cox回归模型的数学公式是我们计算的依据。

4. 参考文献

[1] Zhou ZR, Wang WW, Li Y, Jin KR, Wang XY, Wang ZW, Chen YS, Wang SJ, Hu J, Zhang HN, Huang P, Zhao GZ, Chen XX, Li B, Zhang TS. In-depth mining of clinical data: the construction of clinical prediction model with R. Ann Transl Med 2019. doi: 10.21037/atm.2019.08.63

这些可能会帮助到你: 问答社区 | 共享百度SVIP | 留言建议

欢迎入群交流:生信分析群: 732179952 · Meta分析群: 797345521 · 医学科研交流群: 797345521

发表评论

登录后才能评论

评论列表(1条)