a Frequency calculated considering the cleavage sites (CS) indicated in NCBI for 107 out of 117 CPGs included in this study (the remaining genes have no description of CS in this database). Each dinucleotide was counted as a unit, including the case of genes with more than one CS sequence and transcripts derived from the ERCC2, RET, and TSC2 genes, which have different CS sequences in their specific isoforms.
In this regard, recent studies identified functional germline variants located within or in the vicinity of CPE sequences in two CPGs (Stacey et al., 2011; Decorsière et al., 2012), underscoring the need to characterize these elements. In the current study, we evaluated the frequency and sequence of PAS hexamers and CS dinucleotides in a set of CPGs, as well as the distance (in bp) between them. In agreement with some previous genomic scale data (Tian et al., 2005; Beaudoing et al., 2000), our analysis revealed that the majority of PAS contained the canonical hexamer AAUAAA, and the AUUAAA variant was the second most frequent. However, we found the “AA” dinucleotide in most CS sequences associated with CPGs, which has not been previously reported. Although the nucleotide sequence of the exact CS is not highly conserved (Sheets et al., 1990), most pre-mRNAs are cleaved downstream of an adenosine residue (in agreement with our data) and “CA” was defined as the optimal CS (Chen et al., 1995; Gehring et al., 2001). Interestingly, the “CA” dinucleotide was only the fourth most frequent CS in our gene set. Since more than 50% of human protein-coding genes harbor multiple mRNA CS (Tian et al., 2005), it appears that a “CA” dinucleotide cannot be an absolute requirement for correct cleavage, a situation analogous to the use of both the canonical AAUAAA PAS and its variants by human genes. A functional study supporting this hypothesis demonstrated that the “CA” dinucleotide is preferred, but not required, for cleavage machinery recognition, and CS usage at position −1 was found to follow the order of preference A > U > C ≫ G (Chen et al., 1995). This same study and other previous analyses (Proudfoot and Brownlee, 1976; Gil and Proudfoot, 1987) indicated that the CS is located no closer than 10 bases, but no further than 30 bases, from the AAUAAA element. Surprisingly, about 11% of CPGs included in our study exhibited a PAS-CS distance greater than 30 bp.
Thus, we can speculate that certain estimates provided by long-standing polyadenylation studies do not apply to all human transcripts. Nonetheless, the results of the current study must be viewed in the context of two main limitations: a characterization targeting a specific group of genes, and a relatively small number of genes analyzed.
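The comparison against the classic 10 to 30 base window can be illustrated with a short script. This is only a minimal sketch and not the procedure used in this study: the function names, the 0-based coordinates, and the convention of measuring the distance from the first base after the hexamer to the cleavage position are assumptions made for illustration, and the three toy records are invented.

```python
from collections import Counter

PAS_LEN = 6                  # length of a PAS hexamer such as AAUAAA
MIN_DIST, MAX_DIST = 10, 30  # window reported by the earlier studies cited above


def pas_cs_distance(pas_start, cs_pos):
    """Return the number of bases between the end of the PAS and the CS.

    pas_start: 0-based index of the first base of the hexamer (assumed convention).
    cs_pos:    0-based index of the last base before the cleavage point.
    """
    pas_end = pas_start + PAS_LEN  # index of the first base after the hexamer
    if cs_pos < pas_end:
        raise ValueError("cleavage site must lie downstream of the PAS")
    return cs_pos - pas_end


def summarize(sites):
    """Tally sites as shorter than, within, or longer than the classic window.

    sites: iterable of (gene, pas_start, cs_pos) tuples.
    """
    counts = Counter()
    for _gene, pas_start, cs_pos in sites:
        distance = pas_cs_distance(pas_start, cs_pos)
        if distance < MIN_DIST:
            counts["shorter"] += 1
        elif distance > MAX_DIST:
            counts["longer"] += 1
        else:
            counts["within"] += 1
    return counts


# Invented coordinates, for illustration only.
toy_sites = [("geneA", 100, 121), ("geneB", 50, 95), ("geneC", 10, 20)]
summary = summarize(toy_sites)
print(dict(summary))
```

Other distance conventions (for example, measuring from the first base of the hexamer) would shift every value by a constant, so the convention should be fixed before comparing results with published windows.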
Strikingly, we did not find any reports of a PAS hexamer in about 18% of the CPGs using a reference database (NCBI). For these genes, a computational analysis was developed in this study to identify 3′-most hexamers (putative PAS) in the full corresponding mRNA sequences, because the predominant mRNA sequence is usually the longest one, generated by the 3′-most poly(A) site (Tian et al., 2005). Although we did not identify putative PAS for all the referred genes, these novel findings reinforce the relevance of establishing updated methods and/or databases to detect this regulatory element of 3′ end processing in human genes. The strategy applied here could be easily employed in similar situations with additional genes outside the CPG context.
Gene 712 (2019) 143943
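A search for the 3′-most hexamer in an mRNA sequence can be sketched as follows. This is a hypothetical illustration rather than the implementation developed in this study, which is not described here: the function name `find_3prime_most_pas`, the optional `window` parameter, and the restriction of the hexamer list to the two variants named in this section (AAUAAA and AUUAAA, written in the DNA alphabet) are all assumptions.

```python
# The two PAS variants named in the text, in DNA alphabet (assumed hexamer set).
HEXAMERS = ("AATAAA", "ATTAAA")


def find_3prime_most_pas(mrna, hexamers=HEXAMERS, window=None):
    """Return (position, hexamer) of the 3'-most hexamer, or None if absent.

    mrna:    mRNA sequence as a string (U is treated as T).
    window:  if given, only the last `window` bases are searched.
    The position is the 0-based index of the first base of the hexamer.
    """
    seq = mrna.upper().replace("U", "T")
    offset = 0
    if window is not None and window < len(seq):
        offset = len(seq) - window
    region = seq[offset:]

    best = None
    for hexamer in hexamers:
        idx = region.rfind(hexamer)  # rightmost occurrence, -1 if not found
        if idx != -1 and (best is None or idx > best[0]):
            best = (idx, hexamer)

    if best is None:
        return None
    return best[0] + offset, best[1]


# Invented toy sequence: AATAAA at index 3 and ATTAAA at index 12.
toy_mrna = "GGCAATAAAGGGATTAAA" + "C" * 16 + "A"
print(find_3prime_most_pas(toy_mrna))
```

A hit found this way is only a putative PAS; restricting the search with `window` mimics the expectation that a functional signal lies close to the 3′ end, at the cost of missing hexamers that lie further upstream.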
In addition, we explored the frequency of functional APA sites among all CPGs studied. Indeed, APA has emerged as a major player in gene regulation (Lutz, 2008) and its pattern in mammals seems to be evolutionarily conserved (Ara et al., 2006) and regulated in a tissue-specific fashion (Beaudoing and Gautheret, 2001; Zhang et al., 2005; Ni et al., 2013). In this sense, we report strong evidence of APA modulation of the PTEN tumor suppressor gene: according to the APASdb database, it contains 61 APA sites that are differentially used across 22 distinct non-tumoral human tissues. For instance, two of these APA sites are preferentially used in PTEN mRNA processing, but their quantified usage was lower in specific tissues, such as kidney and spleen (data not shown). Overall, our analysis using recently released databases (APADB and APASdb) indicated that approximately 90% of selected CPGs have two or more APA sites. In contrast, a previous analysis estimated that about 54% of human genes are alternatively polyadenylated (Tian et al., 2005). Furthermore, in 13 normal human tissues, a polyadenylation sequencing strategy (PA-seq) found that such APA events were present not only in protein-coding genes (38%) but also in non-coding genes (35%) (Ni et al., 2013), while a genome-wide APA site mapping in some cancer types and tumor cell lines identified around 30% of mRNAs containing APA sites, regardless of the cell type (Lin et al., 2012). Taken together, these findings suggest a greater complexity in the regulation of polyadenylation in transcripts specifically derived from CPGs. Importantly, widespread APA-mediated 3′UTR shortening has been identified across the cancer genome (Lin et al., 2012; Mayr and Bartel, 2009; Lai et al., 2015; Erson-Bensan and Can, 2016).
Since shorter 3′UTR isoforms have higher translational efficiency than their longer counterparts due to loss of miRNA regulation, APA events can activate some proto-oncogenes in cancer cells (Mayr and Bartel, 2009; An et al., 2013). More recently, Xiang et al. (2018) conducted a comprehensive APA characterization in clinical samples comprising 17 tumor types and 739 cancer cell lines, and demonstrated that the complexity of APA profiles might affect clinically actionable genes and drug sensitivity.