A similar example was seen in the kidney cancer data with the well-known FLT1 gene, which plays a role in a number of tumorigenic processes, including angiogenesis, proliferation, and metastasis, and has been associated with survival [15,36]. In the TCGA kidney cancer dataset, although all methods detect this gene as significantly associated with survival, the dichotomization-based methods had the highest P-values (approximately 100-fold higher than those of the other methods).
A more extreme case can be observed in the TCGA ovarian cancer dataset when examining the epigenetic regulator KDM5A. This gene is known to be amplified and overexpressed in a number of different cancer types, including serous ovarian cancer, and has been implicated in EMT, poor survival, and suppression of apoptosis [37,38]. In the ovarian cancer dataset, Cox regression, C-index, and D-index again find this gene to be significantly associated with survival (P-value < 0.01), whereas the dichotomization-based methods do not detect it as associated with survival at an alpha level of 0.05. Similarly, the C-X-C Motif Chemokine Ligand 12 (CXCL12) gene plays a role in the metastasis of prostate carcinomas, which may indicate that overexpression of this gene is associated with poor survival [39,40]. Here again, this gene is not detected as associated with survival by the dichotomization-based methods, yet it is identified by Cox regression, C-index, and D-index. Overall, these results point to the value of moving away from dichotomization-based methods and toward rational strategies for survival analysis of continuous variables such as gene expression data.
While the eight methods were selected to conduct the same task, namely to identify gene expression-based prognostic markers, they vary mechanistically in their approach, and their performance could largely be explained by the degree to which each method made assumptions. The method that made the fewest assumptions, and therefore had the greatest flexibility, was Cox regression, which had the strongest overall performance of the eight methods. Cox regression is commonly adopted as the most general framework in survival analysis. In our study, which focused on identifying prognostic predictors using gene expression, this method is a natural choice for modeling this continuous variable as a single covariate. The k-means method could be considered the second least restrictive because it assumed only the existence of two patient subgroups and made no further assumptions regarding how these groups should be structured.
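As a rough illustration of what modeling expression as a single continuous covariate involves, the sketch below fits a one-covariate Cox model by Newton-Raphson on the partial likelihood (with Breslow handling of tied event times) and reports a Wald P-value. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names `cox_score_info` and `fit_cox_1d` and the toy data are hypothetical and chosen only for illustration.

```python
import math

def cox_score_info(beta, x, times, events):
    """Score (first derivative) and information (negative second derivative)
    of the Cox partial log-likelihood for one covariate, Breslow ties."""
    score, info = 0.0, 0.0
    for i in range(len(x)):
        if not events[i]:
            continue  # censored samples contribute only through risk sets
        # Risk set: everyone still under observation at this event time.
        risk = [j for j in range(len(x)) if times[j] >= times[i]]
        w = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(w)
        s1 = sum(wj * x[j] for wj, j in zip(w, risk))
        s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
        score += x[i] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info

def fit_cox_1d(x, times, events, tol=1e-9, max_iter=50):
    """Return (beta, standard error, two-sided Wald P-value)."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = cox_score_info(beta, x, times, events)
        if info <= 0:
            break  # no usable curvature; stop rather than divide by zero
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    _, info = cox_score_info(beta, x, times, events)
    se = 1.0 / math.sqrt(info)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return beta, se, p_value

# Hypothetical toy data: higher expression tends to go with shorter survival.
toy_expr = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
toy_time = [9, 10, 6, 8, 3, 5, 2, 4]
toy_event = [1, 1, 1, 1, 1, 1, 1, 1]
toy_fit = fit_cox_1d(toy_expr, toy_time, toy_event)
print("beta=%.3f se=%.3f p=%.3f" % toy_fit)
```

No cutoff appears anywhere in this procedure: every sample contributes its actual expression value, which is the flexibility referred to above. A plain Newton-Raphson loop like this one can fail to converge when expression perfectly orders the event times, so production code would normally rely on an established survival library instead.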
The methods based on a dichotomized value, such as the KaplanScan method, were more restrictive than the Cox regression and k-means methods because they assumed that an inherent ordering existed amongst the patients. The KaplanScan method added a set of iterative steps that sought to infer the boundary that would split the population into the two optimal groups, but it retained some flexibility in that this boundary was derived from the data. Moreover, given that the number of hypotheses tested is directly linked to the number of genes considered in the analysis, the expected number of false positives grows linearly with the number of genes. For example, with 100 samples, one would end up running 90 different log-rank tests for a given gene using KaplanScan. This represents a clear limitation of the method.
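The cutoff-scanning idea described above can be sketched as follows: sort patients by expression, try each admissible split point, and run one two-group log-rank test per split. This is a simplified illustration and not the KaplanScan implementation itself. The function names and the `min_group` parameter (the smallest group size allowed on either side of the split) are assumptions, so the number of tests per gene produced here depends on that choice.

```python
import math

def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = sum(1 for ti in times if ti >= t)                # at risk, total
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e and g == 1)
        obs_minus_exp += d1 - d * n1 / n                     # observed - expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))

def scan_cutoffs(expr, times, events, min_group=5):
    """Run one log-rank test per candidate split of the sorted samples.

    Returns a list of (size_of_low_group, p_value) pairs."""
    order = sorted(range(len(expr)), key=lambda i: expr[i])
    results = []
    for k in range(min_group, len(expr) - min_group + 1):
        high = set(order[k:])                                # top n-k samples
        groups = [1 if i in high else 0 for i in range(len(expr))]
        p = chi2_sf_1df(logrank_chi2(times, events, groups))
        results.append((k, p))
    return results

# Hypothetical toy data: 20 patients, survival shortens as expression rises.
toy_expr = list(range(20))
toy_time = [20 - i for i in range(20)]
toy_event = [1] * 20
scan = scan_cutoffs(toy_expr, toy_time, toy_event, min_group=3)
best_k, best_p = min(scan, key=lambda kp: kp[1])
print(len(scan), "tests; smallest P-value at split", best_k)
```

Because every candidate split costs one test per gene, the smallest P-value from such a scan is optimistic unless it is adjusted for the number of splits tried, in addition to the correction already needed across genes.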
Comparing survival analysis methods for cancer RNA-seq data
The distribution-based splitting method used parametric assumptions to define a set of rules for dichotomization based on quantiles that were estimated directly from a gene’s expression distribution in the patient population. While this method allowed for flexibility in that the dichotomization rule would adapt given the shape of the distribution, it also carried an additional layer of assumptions regarding the parametrization of these distributions that may not always hold true for each gene. Finally, the two methods based on dichotomizing using a quantile value were the simplest approaches to implement. However, they represented the most restrictive and least flexible of the eight methods because they did not account for the shape of the distribution, the data structure amongst patients, or the existence of alternative candidates for the breakpoint location. It was therefore unsurprising that the methods from this category had the poorest performance overall in all tests of reliability, accuracy, and robustness.
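To make the rigidity of a fixed-quantile split concrete, the sketch below labels samples as high or low relative to a chosen quantile. The helper names `quantile` and `dichotomize_by_quantile`, the linear-interpolation quantile rule, and the toy expression values are assumptions for illustration, not details taken from the methods evaluated here.

```python
import math

def quantile(values, q):
    """q-th quantile (0 <= q <= 1) with linear interpolation between ranks."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def dichotomize_by_quantile(expr, q=0.5):
    """Label each sample 1 (high) if strictly above the q-th quantile, else 0."""
    cutoff = quantile(expr, q)
    return [1 if v > cutoff else 0 for v in expr]

# Hypothetical expression values with an obvious gap between 6 and 50.
toy_expr = [1, 2, 3, 4, 5, 6, 50, 60]
print(dichotomize_by_quantile(toy_expr, q=0.5))   # median split
print(dichotomize_by_quantile(toy_expr, q=0.75))  # upper-quartile split
```

In this toy example the median split places the values 5 and 6 in the same group as 50 and 60, even though the data suggest a break between 6 and 50. The cutoff is fixed by rank alone, which is the sense in which these methods ignore the shape of the distribution.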