1.5. Whole transcriptome sequencing
mRNA transcript levels are regulated by lncRNA, small RNA and circRNA. Quantitative analysis of the biomolecular networks and regulatory pathways active in a cell or tissue at a given time and place therefore requires both quantitative and qualitative study of every RNA molecule in the transcriptome. Whole transcriptome sequencing can detect all transcripts in a sample, including mRNA and non-coding RNA (lncRNA, circRNA and miRNA). It differs from conventional RNA-seq mainly in how the libraries are constructed: whole transcriptome sequencing requires either 2 libraries (an mRNA + lncRNA + circRNA library and a miRNA library) or 3 libraries (an mRNA + lncRNA library, a circRNA library and a miRNA library). Whole transcriptome data yield the expression profiles of all transcript types. On that basis, the different RNA molecules can be identified and annotated, their encoded proteins and regulatory functions analyzed, and the regulatory interaction networks among RNA molecules reconstructed, giving a comprehensive and systematic view of the biological characteristics of specific cells at a specific time and place.
1.6. Single-cell transcriptome sequencing
Single-cell RNA-seq (scRNA-seq) combines amplification of complementary DNA (cDNA) from a single cell, by linear in vitro transcription or exponential PCR, with high-throughput sequencing. It studies the entire transcriptome at single-cell resolution. It is used to assess differences in gene expression between individual cells, which avoids the false-negative results caused by mixing cell types and can reveal rare cell populations that go undetected in mixed-cell measurements. Common single-cell sequencing platforms currently include Fluidigm, WaferGen, 10x Genomics, and Illumina/Bio-Rad. Unlike other RNA sequencing technologies, scRNA-seq must first isolate individual cells and capture the complete transcriptome of each one. Single-cell isolation is therefore a key step, and it is mainly achieved through serial dilution, micromanipulation, fluorescence-activated cell sorting (FACS) and microfluidics.
2. Construction of Transcriptome Sequencing Library
For transcriptome sequencing, total RNA is extracted from the sample, rRNA is removed, and the target RNA molecules are enriched to construct a sequencing library. Sequencing libraries are either non-strand-specific or strand-specific. In a non-strand-specific library, RNA is reverse transcribed into double-stranded cDNA and adapters are added without recording which strand the RNA came from. Because the double-stranded cDNA is sequenced, the transcription direction of the mRNA cannot be determined. Strand-specific libraries fall into two categories. The first marks one strand with a chemical modification, for example by treating the RNA with bisulfite, or by incorporating dUTP during second-strand cDNA synthesis and then degrading the U-containing strand. The second attaches different adapters to the 5' and 3' ends of the RNA molecules or of the synthesized cDNA strands, so that the sense and antisense strands can be told apart.
In transcriptome sequencing, knowing the strand of origin of each RNA molecule avoids interference from reads on the antisense strand of a gene and improves the accuracy of transcript identification and quantification. In de novo assembly of transcriptome data, it also helps to delimit transcript boundaries and to determine which strand is the sense strand of each transcript.
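As a rough illustration of how strand information can be recovered during analysis, the sketch below infers the strand of the source transcript for a read from a dUTP library. It assumes the alignments are stored with SAM-style FLAG bits (0x1 paired, 0x10 reverse, 0x80 second in pair), which the text above does not specify, and the function name strand_from_flag is hypothetical. In a dUTP library the sequenced cDNA is the first strand, so read 1 aligns antisense to the transcript and read 2 aligns in the sense direction.

```python
def strand_from_flag(flag: int) -> str:
    """Infer the transcript strand ('+' or '-') for a read from a dUTP library.

    Assumption: `flag` is a SAM-style bitwise FLAG. Only three bits are used:
    0x1 (read is paired), 0x10 (read aligned to the reverse strand) and
    0x80 (read is the second in its pair).
    """
    is_paired = bool(flag & 0x1)
    is_reverse = bool(flag & 0x10)
    is_read2 = is_paired and bool(flag & 0x80)

    if is_read2:
        # Read 2 has the same orientation as the original RNA.
        return "-" if is_reverse else "+"
    # Read 1 (or a single-end read) is antisense to the original RNA.
    return "+" if is_reverse else "-"


if __name__ == "__main__":
    for flag in (83, 163, 99, 147):
        print(flag, strand_from_flag(flag))
```

For other library protocols the orientation rule can be reversed, so the mapping should be checked against the kit documentation before use.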
3. Transcriptome Data Processing
When transcriptome sequencing data are used to compare gene-level or transcript-level expression between groups, the basic analysis workflow includes raw data preprocessing, read alignment, transcript assembly, prediction of new transcripts, and analysis of transcript expression levels. Depending on the purpose of the experiment, further analyses can examine differences in transcript expression between the experimental and control groups, cluster gene expression patterns across samples, and integrate the results with other omics data.
3.1. Raw data preprocessing
After the raw second-generation sequencing data are obtained, their quality must be evaluated and quality control (QC) performed. The evaluation covers data yield, GC content, rRNA content, base quality distribution, and duplicated sequences. Low-quality reads and adapter sequences are then removed, and the resulting clean data are used for subsequent analysis.
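The filtering step can be sketched as follows. This is a minimal illustration rather than a replacement for a dedicated QC tool: it assumes reads are already parsed into (name, sequence, quality) tuples with Phred+33 quality strings, it trims only exact adapter matches, and the thresholds (mean quality 20, minimum length 20) and all function names are illustrative assumptions.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def trim_adapter(seq: str, qual: str, adapter: str):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def clean_reads(records, adapter, min_mean_q=20.0, min_len=20):
    """Yield reads that survive adapter trimming and quality/length filters.

    `records` is an iterable of (name, sequence, quality) tuples.
    The thresholds are illustrative defaults, not recommended values.
    """
    for name, seq, qual in records:
        seq, qual = trim_adapter(seq, qual, adapter)
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            yield name, seq, qual


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter prefix, assumed for this demo
    reads = [
        ("r1", "ACGT" * 6 + adapter, "I" * 37),  # good read with adapter
        ("r2", "ACGT" * 6, "#" * 24),            # low base quality
        ("r3", "ACGT" + adapter, "I" * 17),      # too short after trimming
    ]
    for read in clean_reads(reads, adapter):
        print(read)
```

Real adapters often appear truncated or with mismatches at the 3' end of a read, which an exact-match search like this one will miss.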
3.2. Read alignment
Transcriptome data derive mainly from the exon sequences of the genome. When the sequenced reads are aligned to the genomic sequence, a read that spans an exon-exon junction is split by the intervening intron, so the alignment must allow for such gaps.
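To make the effect of introns on an alignment concrete, the sketch below recovers the genomic blocks covered by one spliced read. It assumes the alignment is described by a 1-based start position and a SAM-style CIGAR string in which the N operation marks a skipped intron. That representation is an assumption for illustration, and the function name exon_blocks is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def exon_blocks(pos: int, cigar: str):
    """Return 1-based inclusive (start, end) genomic blocks covered by a read.

    M, =, X and D consume the reference and extend the current block.
    N (skipped region, i.e. an intron) closes the block and opens a new one.
    I, S, H and P do not consume the reference and are ignored.
    """
    blocks = []
    start = pos
    cursor = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=XD":
            cursor += n
        elif op == "N":
            if cursor > start:
                blocks.append((start, cursor - 1))
            cursor += n
            start = cursor
    if cursor > start:
        blocks.append((start, cursor - 1))
    return blocks


if __name__ == "__main__":
    # A 100 nt read whose two halves lie on exons 1000 bp apart.
    print(exon_blocks(100, "50M1000N50M"))
```

Each pair of adjacent blocks corresponds to one exon-exon junction supported by the read.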
3.3. Transcript assembly
Transcript assembly reconstructs transcripts from the sequencing data. For species with a reference genome, the read alignments show how the exons are connected, and the structure of each transcript is built from those connections. For transcriptome data without a reference genome, the short reads from RNA sequencing must be assembled de novo to obtain complete transcript sequences. Transcriptome studies of species without a reference genome often need more sequencing data to meet the requirements of de novo assembly. The more valid data available for assembly, the greater the number and completeness of the assembled transcripts, and the easier it is to detect transcripts with low expression levels.
3.4. Transcript prediction
Most genes have multiple splice forms and can produce multiple transcripts that encode different proteins, which may give a single gene multiple functions. Assembly of the transcriptome sequencing data yields not only known transcripts but also new transcript sequences. These new transcripts must be identified and annotated, especially ncRNA transcripts, which have been less studied.
For species with a reference genome and reference transcript annotation, transcript structures are determined mainly by aligning the sequenced reads: the reads cover the transcript sequences, and the genome sequence guides their assembly into complete transcripts. For species without a reference genome, the transcript sequences must be assembled de novo. The resulting gene or transcript sequences can be compared with unigene and EST databases of the same or closely related species to assess their reliability. BLAST is commonly used for this comparison because it quickly identifies similarity between sequences. To identify new lncRNAs, transcripts with a total exon length > 200 nt are first extracted from the transcriptome data, based on the characteristics of lncRNA molecules. Open reading frame prediction and comparison with known protein databases are then used to further separate lncRNA from mRNA.
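The length and open reading frame filters for lncRNA candidates can be sketched as below. The 200 nt length cutoff comes from the text. The ORF cutoff of 300 nt, the restriction to the three forward frames, and the function names are assumptions made for illustration. The comparison against protein databases is not covered here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_nt(seq: str) -> int:
    """Length in nt of the longest ORF (ATG through stop codon, inclusive).

    Only the three forward frames are scanned, on the assumption that the
    transcript sequence is already given in the sense orientation.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - orf_start)
                orf_start = None
    return best


def lncrna_candidates(transcripts: dict, min_len: int = 200, max_orf_nt: int = 300):
    """Return IDs of transcripts longer than min_len with no long ORF.

    `transcripts` maps a transcript ID to its spliced (exon-only) sequence.
    max_orf_nt is an illustrative threshold, not a value from the text.
    """
    return [
        tid
        for tid, seq in transcripts.items()
        if len(seq) > min_len and longest_orf_nt(seq) <= max_orf_nt
    ]


if __name__ == "__main__":
    demo = {
        "short": "ACGT" * 10,                      # 40 nt, too short
        "noncoding": "ACGT" * 60,                  # 240 nt, no ORF
        "coding": "ATG" + "GCT" * 120 + "TAA",     # 366 nt ORF
    }
    print(lncrna_candidates(demo))
```

Transcripts that pass these filters are only candidates. The database comparison described above is still needed to exclude transcripts with known protein homology.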
3.5. Analysis of transcript expression levels
After the reads are aligned to their genomic positions, or the transcripts are assembled de novo, the number of reads on each gene or transcript reflects its expression abundance to a certain extent. However, samples may differ significantly in total data yield and in the number of expressed genes, and within a sample genes differ in length and the transcripts of a single gene differ in distribution. Expression levels must therefore be normalized before they are compared between samples.
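As one possible illustration of such normalization, the sketch below converts raw read counts to transcripts per million (TPM), which corrects for both gene length and total data yield. TPM is used here only as a common example and is not named in the text above. The function name and the input layout (two dictionaries keyed by gene ID) are assumptions.

```python
def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Convert raw read counts to transcripts per million (TPM).

    counts:     gene ID -> number of reads assigned to the gene
    lengths_bp: gene ID -> gene (or transcript) length in base pairs

    Step 1 divides each count by the length in kilobases (corrects for gene
    length). Step 2 rescales so that the values in a sample sum to one
    million (corrects for sequencing depth).
    """
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


if __name__ == "__main__":
    lengths = {"geneA": 1000, "geneB": 4000}
    sample1 = {"geneA": 100, "geneB": 200}
    sample2 = {"geneA": 200, "geneB": 400}  # same composition, twice the depth
    print(tpm(sample1, lengths))
    print(tpm(sample2, lengths))
```

In this example both samples give the same TPM values, because the second sample differs only in sequencing depth.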
To be continued in Part III…