Keywords Survival analysis, Kaplan–Meier, TCGA, Cancer, Gene expression.
List of Abbreviations: TCGA, The Cancer Genome Atlas; OV, ovarian serous cystadenocarcinoma; HNSC, head and neck squamous cell carcinoma; PRAD, prostate adenocarcinoma; KIRC, kidney renal clear cell carcinoma; expo, Expression Oncology Project; AUC, area under the curve; FPR, false positive rate; TPR, true positive rate; ROC, receiver operating characteristic.
∗ Corresponding author at: Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, Australia.
1 Present address: New York Genome Center, New York, NY, United States.
All large-scale cancer profiling studies consistently share a common feature that supports the notion that cancer is a complex, heterogeneous genetic disease. Collections of genome-wide transcriptional profiling datasets captured using technologies like microarrays and RNA-sequencing (RNA-seq) have improved our insight into the nature of this heterogeneity for many different tumor types. Through efforts such as The Cancer Genome Atlas (TCGA), the Expression Oncology Project (expo), and projects stemming from the International Cancer Genome Consortium (ICGC), gene expression datasets are publicly accessible, with some collections containing thousands of patient-derived tumor samples. From these vast resources, we have come to recognize that despite the heterogeneity that exists in cancer profiles, dominant features can be extracted from the data and offer diagnostic capacity in the form of gene signatures and biomarkers. For patient care, these gene sets are valuable because they can be used to predict key properties of a tumor, such as grade and molecular subtype, or the expected survival time of a patient. Therefore, the most significant goal for bioinformatics, as a field, is to develop clear practices for how best to identify those genes that are predictive markers of the survival status of cancer patients.
Studies of genetic variation in tumors have identified specific genes and their associated lesions that have been used successfully to delineate distinct patient subgroups. Examples include the PTCH1 inactivating mutation in medulloblastoma, MYCN amplification in neuroblastoma, and KRAS mutation status in non-small cell lung cancer. While these lesions have been valuable for identifying genetic regulators of cancer, it is becoming increasingly clear that, given the complex and graded nature in which tumors are controlled by genes, there are advantages to expanding our focus to gene expression-based markers as well. Whereas genetic variants and copy number aberrations represent dichotomous or binary associations, predictors of patient survival based on gene expression may yield markers that capture changes between these two variables with greater sensitivity or at more subtle degrees of detection.
To identify predictive biomarkers, a host of statistical methods have been adapted from survival analysis techniques, including the Kaplan–Meier estimator, the log-rank test, and the Cox regression model. In general, these methods are geared towards binary inputs such as the presence or absence of a mutation, lesion, gene fusion, translocation or other discrete genetic event. Standard workflows exist for handling such binary data, and identification of these biomarkers is considered routine. On the other hand, the equivalent framework for continuous inputs, such as gene expression data, is not so well established. This is because it is often unclear what the optimal choice is for estimating the relationship between a gene's expression profile and survival status in a patient cohort. For instance, a standard survival analysis model could be used to assess the degree of association between gene expression and patient survival time, but general assumptions regarding regularity conditions of the data may not always hold, especially those based on the distribution of the data, such as a Normal distribution. Alternatively, a continuous variable may be transformed into a binary one