Ovarian cancer is a complex, heterogeneous genetic disease. Because of the high risk of recurrence in high-grade serous ovarian carcinoma (HGS-OvCa), the development of outcome predictors is important for patient stratification. In addition to predicting survival, the potential of prognostic classifiers lies in the ability to recognize categories of patients that are more likely to respond to particular therapies.

The lack of successful treatment strategies for ovarian cancer led the Cancer Genome Atlas (TCGA) researchers to analyze 489 cases of HGS-OvCa using copy number, expression and methylation arrays, and exonic sequencing data. Their work aimed to identify molecular abnormalities that influence pathophysiology, affect outcome and constitute therapeutic targets.

Gene expression profiles have been shown to be associated with overall survival. Several studies have developed prognostic signatures from the TCGA microarray gene expression profiles using univariate Cox regression analysis and validated them on external datasets.

Gene expression is considered to reflect cancer progression driven by mutations and epigenetic modifications. Understanding these gene expression patterns can help to distinguish normal from cancer tissue and to classify cancer subtypes and stages.


The objective of this work is to evaluate machine-learning techniques on the up-to-date harmonized RNA-sequencing (RNA-seq) data from the TCGA-OV project in order to detect prognostic features.


TCGA level 3 normalized RNA-seq data and clinical data were downloaded from the Genomic Data Commons (GDC) portal (https://portal.gdc.cancer.gov/) using the pipeline of the R/Bioconductor package TCGAbiolinks. The functions of this package were used to query, download and import the data into R for further preprocessing and analysis. The RNA-seq data were also normalized using TCGAbiolinks.

Supplemental survival data were downloaded from the standardized dataset known as the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR). We merged the OV survival data from TCGA-CDR with the GDC clinical data. We chose to perform our tests on the overall survival (OS) endpoint. The corresponding TCGA-CDR columns are OS for status and OS.time for time-to-event data. The OS column contains 0 for alive (censored) and 1 for deceased (failure). OS.time contains the number of days from the date of diagnosis to the date of last follow-up if OS is 0, or to the date of death if OS is 1.
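The merge and the OS / OS.time encoding described above can be illustrated with a minimal sketch. It uses plain Python on toy records rather than the R pipeline used in the study. The patient barcodes and the age field are invented for illustration, and the key column names (`bcr_patient_barcode`, `type`, `submitter_id`) are assumptions about the table layouts; only `OS` and `OS.time` are taken from the text.

```python
# Toy stand-ins for the TCGA-CDR survival table and the GDC clinical table.
# All barcodes and values below are invented for illustration.
cdr_rows = [
    {"bcr_patient_barcode": "TCGA-XX-0001", "type": "OV", "OS": 1, "OS.time": 1024},
    {"bcr_patient_barcode": "TCGA-XX-0002", "type": "OV", "OS": 0, "OS.time": 730},
    {"bcr_patient_barcode": "TCGA-YY-0004", "type": "BRCA", "OS": 1, "OS.time": 300},
]
clinical_rows = [
    {"submitter_id": "TCGA-XX-0001", "age_years": 61},
    {"submitter_id": "TCGA-XX-0002", "age_years": 54},
    {"submitter_id": "TCGA-XX-0003", "age_years": 70},  # no survival record
]


def merge_survival(clinical, cdr, cancer_type="OV"):
    """Left-join clinical rows with OS / OS.time for one cancer type."""
    survival = {
        row["bcr_patient_barcode"]: row
        for row in cdr
        if row["type"] == cancer_type
    }
    merged = []
    for row in clinical:
        surv = survival.get(row["submitter_id"], {})
        merged.append({**row, "OS": surv.get("OS"), "OS.time": surv.get("OS.time")})
    return merged


def has_survival(row):
    """True when both the status (0/1) and the time in days are present."""
    return row["OS"] in (0, 1) and row["OS.time"] is not None


merged = merge_survival(clinical_rows, cdr_rows)
complete = [row for row in merged if has_survival(row)]

# (time in days, event) pairs: event 1 = deceased, 0 = alive (censored).
events = [(row["OS.time"], row["OS"]) for row in complete]
```

Cases without a matching survival record keep `None` in both columns after the join, so they can be dropped explicitly before any survival model is fitted.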

A total of 379 RNA-seq samples were obtained for OV (TCGA-OV project), of which 5 were normal tissue samples and 374 were tumor samples. After merging the RNA-seq and clinical data, we obtained 374 cases, from which we discarded 2 cases without survival data. As a result, our complete data set comprised normalized expression values for P = 17401 genes and N = 372 samples.
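The text does not say how tumor and normal samples were told apart. One common way, shown here only as a sketch under that assumption, is to read the two-digit sample type code in the fourth field of the TCGA barcode: codes 01 to 09 denote tumor samples and 10 to 19 denote normal samples. The barcodes below are invented, and the helper names are hypothetical.

```python
def sample_type_code(barcode):
    """Return the two-digit sample type code from a TCGA sample barcode."""
    # Fourth dash-separated field, e.g. "01A": type code "01", vial "A".
    return int(barcode.split("-")[3][:2])


def is_tumor(barcode):
    """Tumor samples carry sample type codes 01 through 09."""
    return 1 <= sample_type_code(barcode) <= 9


def patient_id(barcode):
    """First three fields identify the patient (case), e.g. TCGA-XX-0001."""
    return "-".join(barcode.split("-")[:3])


# Invented barcodes for illustration only.
barcodes = [
    "TCGA-XX-0001-01A-01R-0000-13",  # primary solid tumor
    "TCGA-XX-0002-01A-01R-0000-13",  # primary solid tumor
    "TCGA-XX-0003-11A-01R-0000-13",  # solid tissue normal
]

tumor_barcodes = [b for b in barcodes if is_tumor(b)]
tumor_patients = [patient_id(b) for b in tumor_barcodes]
```

The patient identifiers obtained this way are the keys on which the expression samples can be matched to the clinical and survival records.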
