Identification of small non-coding RNAs from Rhizobium etli by integrated genome-wide and transcriptome-based methods

Background: Small non-coding RNAs (sRNAs) are regulatory molecules, present in all forms of life, known to regulate various biological processes in response to the different environmental signals. In recent years, deep sequencing and various other computational prediction methods have been employed to identify and analyze sRNAs. Results: In the present study, we have applied an improved sRNA scanner method to predict sRNAs from the genome of Rhizobium etli , based on PWM matrix of conditional sigma factor 32. sRNAs predicted from the genome are integrated with the available stress specific transcriptome data to predict putative conditional specific sRNAs. A total of 271 sRNAs from the genome and 173 sRNAs from the transcriptome are computationally predicted. Of these, 25 sRNAs are found in both genome and transcriptome data. Putative targets for these sRNAs are predicted using TargetRNA2 and these targets are involved in a wide array of cellular functions such as cell division, transport and metabolism of amino acids, carbohydrates, energy production and conversion, translation, cell wall/membrane biogenesis, post-translation modification, protein turnover and chaperones. Predicted targets are functionally classified based on COG analysis and GO annotations. Conclusion: sRNAs predicted from the genome, using PWM matrices for conditional sigma factor 32 could be a better method to identify the conditional specific sRNAs which expand the list of putative sRNAs from the intergenic regions (IgRs) of R. etli and closely related α -proteobacteria. sRNAs identified in this study would be helpful to explore their regulatory role in biological cellular process during the stress.


Background
Small non-coding RNAs are bacterial regulatory molecules, 50-500 nt (bp) in length and contain several stem loops. sRNAs are often located in the intergenic regions, transcribed from their own promoter or promoters of nearby genes and contain rho-independent terminator. sRNAs regulate the gene expression by perfect or imperfect base pairing with complementary sequence stretches, generally located in 5′-UTR regions of transencoded target mRNAs, resulting in altered target mRNA translation and stability [1][2][3]. The regulation of sRNAs are mediated with the help of chaperone Hfq, enhance RNA-RNA interaction, through the preferential binding at single-stranded AU-rich regions of the noncoding RNAs and their target mRNAs [4].
Several sRNAs have been identified by genome-wide profiling and transcriptome-based methods. To date, many computational techniques and experimental methods have been used to predict sRNAs in both gram-negative and grampositive bacteria [5][6][7][8][9][10][11]. sRNAs regulate diverse cellular processes and conditionally expressed during oxidative stress, iron uptake, quorum sensing, virulence and heat shock [5,[12][13][14]. Sigma factors are transcription initiation factors that enable specific binding of RNA polymerase (RNAP) to gene promoters. Bacteria contain different sigma factors, each capable of directing the core polymerase to transcribe a specific set of genes, depending on the environmental or developmental signals [15]. However reports on screening the conditional sRNAs are not yet available for the group of αproteobacteria, except sRNAs predicted from RNA sequence analysis under heat shock and saline shock; and sRNAs predicted from the genome are integrated with the virulence specific transcriptome data [14,16]. In our previous work, conditional sigma factor based-sRNAs were predicted from the genome of Agrobacterium using an improved sRNA scanner. This method was used to identify the sRNAs that are regulated by several conditional sigma factors, such as, 24, 32 and 54. sRNA scanner identified the sRNAs resided in intergenic regions of the genome, based on the transcriptional signals. This bioinformatic tool uses PWM matrices of sRNA promoter and rho-independent terminators signals, through sliding window-based genome scans, using consensus sequences of sigma factor promoter binding sites − 35 and − 10 and rho-independent transcription terminator sequences [16].
Rhizobium etli is a gram-negative bacterium that belongs to Rhizobials of α-proteobacteria, interacts symbiotically with the common beans Phaseolus vulgaris to form nitrogen-fixing root nodules. Inside the nodules, bacteria differentiate into bacteroids that are capable of fixing the atmospheric N 2 into NH 3 . The genome of R. etli consists of one circular chromosome (6,530,228 bp) and six plasmids: p42a (194,229 bp), p42b (184,338 bp), p42c (250,948 bp), p42d (371,254 bp), p42e (505,334 bp) and p42f (642,517 bp) with 6034 protein-coding genes [17]. Two earlier studies were reported on the identification of sRNA candidates in R.etli. Using tiling microarray analysis, 66 novel sRNA candidates comprising 17 putative sRNAs and 49 putative cis-regulatory ncRNAs were computationally predicted and 4 of these were confirmed subsequently by wet-lab experiments [9]. Yet another study, identified 13 differentially expressed ncRNAs under heat shock and 9 under saline shock conditions in R.etli [14]. However, there is scanty information on stress conditional specific sRNAs in Rhizobium.
In the present study, we report the sRNAs predicted from the genome and transcriptome of Rhizobium etli. Further, sRNAs predicted from the genome are integrated with the stress-specific transcriptome to identify putative conditional specific sRNAs. The mRNA targets for these sRNAs were identified and data are presented on the functional categorization and regulatory network analysis for the predicted mRNA targets.

Results
Genome-wide sRNA prediction by improved sRNAscanner Prediction of sRNAs from the nitrogen-fixing Rhizobium was performed by genome-wide computational analysis, based on the PWM matrices of conditional sigma factor 32 (Heat shock sigma factor) using improved sRNA Scanner program [16]. sRNA scanner demarks the transcription units (TUs) using consensus sequences of sigma factor binding sites (− 35 and − 10 (Supplementary file 3)) and rho-independent transcription terminator sequences. An earlier version of sRNA Scanner uses PWM matrix; only for housekeeping sigma factor 70 and rhoindependent transcription termination in which limited numbers of training sequences were used. The total number of sRNAs predicted from each replicons of Rhizobium etli is graphically represented in Fig. 1.
The majority of the sRNA candidates identified varied in length between 50 and 500 nt. GC content for most of the sRNAs of Rhizobium found to have 50 to 70%. A total of 247 sRNAs were predicted from the genome of R. etli known to be conditionally regulated by sigma factor 32. To find the novel putative sRNAs, predicted sRNAs were searched against Rfam database and BSRD database to eliminate the conserved homologs (Table 1). Seventeen and four sRNA candidates have shown homology with already identified sRNAs in Rfam and BSRD database, respectively ( Table 1). The sRNAs predicted from the genome were compared with previously reported sRNAs. Eight sRNA candidates were conserved with the earlier reported sRNAs by Vercruysse et al. 2010 [9] and one sRNA with the López-Leal et al. 2015 [14].

Transcriptome based sRNAs prediction
The high-quality RNAseq reads of R. etli under control, heat and saline shock were aligned to the genome of R. etli using Rockhopper. After alignment, transcripts from the intergenic regions and antisense regions from the complementary strand of the protein-coding genes were identified. The intergenic sRNAs having a length of 50-500 nt were taken for further analysis. A total of 68 transencoded sRNAs under the heat shock and 105 under the saline shock were identified. A relatively larger number of sRNAs were found to be expressed from the chromosomes, has a length of 50 to 150 nt (Fig. 1). Further, predicted sRNA candidates were searched against Rfam and BSRD databases. To find the novel putative sRNA candidates, the above-screened sRNAs were compared with previously reported sRNAs (supplementary file 4). Eight sRNAs from heat shock and fourteen sRNAs from saline shock showed homology with already reported sRNAs by Vercruysse et al. 2010 [9]; five sRNAs from heat shock and two sRNAs from saline shock with sRNAs reported by López-Leal et al. 2015 [14].
sRNA conservation and comparative analysis sRNAs are known to be conserved in nature, in order to study the sRNA conservation in the present study, the sRNA conservation analysis was performed between the rhizobium strains, interestingly the sRNAs of Rhizobium etli were highly conserved with identities ranging from 80 to 100% with Rhizobium leguminosarum. From the analysis, it was found that 21 sRNAs from the genomebased search (18 sRNAs from chromosome and 1 sRNA from 2nd, 6th and 7th replicons) and in the case of the transcriptome (saline shock), 2 sRNAs (chromosomally encoded) were conserved with the sigma factor 32 regulated sRNAs of R. leguminosarum (unpublished data). Three chromosomally encoded sRNAs of R. etli were found to be conserved with one specific sRNA of R. leguminosarum (94-95% identity), which was further selected for the quantification analysis in R. leguminosarum ( Table 2). The identified novel sRNAs candidates of the genome and transcriptome were correlated to identify the common sRNAs between conditional Fig. 1 a The total number of sRNAs predicted from the heat shock, saline shock and sigma factor 32; b Total number of sRNA distribution in the replicons of Rhizobium etli. Sigma factor 32 based genome-wide prediction represented in red color and transcriptome-based heat shock specific sRNAs in blue color and saline shock in green color; c length distribution and d GC% content specific sigma factor 32 derived sRNAs with the stressspecific sRNA transcripts. A total of 271 sRNAs identified from the genome, of which 241 were novel. Similarly, 173 sRNAs from the transcriptome were identified (Supplementary file 1), of which 128 sRNAs were novel. Based on comparative analysis, 25 novel sRNAs were found to be common between the genome-wide and transcriptome data of R.etli.

Target identification
The sRNAs regulate diverse cellular processes by interacting with complementary sequence stretches of transencoded mRNAs. The target mRNAs were predicted for the sRNAs identified from genome-wide and transcriptome data by using TargetRNA2. Based on the thermodynamic interaction energy (kcal/mol) of hybridization between the sRNA and mRNA targets and significant pvalue (< 0.05), 30 sRNA candidates were selected from the transcriptome of heat and saline shock for further analysis (Table 3). Target mRNAs for the 25 common sRNAs predicted from the genome of R. etli are provided in supplementary file 6. Fifty-five sRNAs were taken for further analysis.

Functional categorization of sRNA target genes
To study the role of target mRNAs, selected target mRNAs of thirty sRNA candidates from the transcriptome of heat and saline shock conditions were functionally annotated by COG and GO analysis. Under heat and saline shock conditions, mRNA targets were enriched in COG categories of transport and metabolism of amino acids, carbohydrates, lipids, energy production, and conversion, post-translation modification, protein turn over and chaperons, cell wall/membrane biogenesis and translation (Fig. 2). Enriched GO terms for the target mRNAs were widely distributed about their respective biological cellular processes. The target genes were annotated in 3 classes, viz., biological processes, molecular functions and cellular components. The targets categorized under biological processes include the genes involved in metabolic and cellular processes, molecular function, catalytic activity and binding, such as transporter activity, DNA binding, RNA binding and ion binding (Fig. 3).

GO regulatory network
GO regulatory network (GRN) was constructed for the sRNA target genes for the sRNAs predicted from the transcriptome profiled under conditions of heat and saline. The GO network of mRNA targets of heat shock sRNAs is shown in Fig. 4. The regulation of cell shape was the central node in the GRN. Regulation of cell shape protein MviN (RHE_CH0386) showed interaction with many other GO terms such as regulation of DNA replication, signal transduction, protein folding, cellular amino acid metabolic process, carbohydrate metabolic process, fatty acid biosynthetic process, nitrogen metabolic process, and cell division. In the case of mRNA targets of saline shock sRNAs, phosphorelay signal transduction system was the central node in the network (Fig. 5) governed by feuP (RHE_CH01286) which showed interaction with other GO terms, such as, positive regulation of transcription, regulation of cell shape, metabolic process, transmembrane transport, translation, nucleotide catabolic process, cell wall organization, nitrate assimilation and cell cycle.
Promoter, terminator, secondary structure prediction Promoter and rho-independent terminator sequences were predicted for the identified putative novel sRNAs (Table 4 and Supplementary file 6). Secondary structure was predicted for the selected sRNAs using RNAfold server. The predicted minimum free energy for the majority of the sRNAs ranges from − 20 to − 70 kcal/mol.

Discussion
sRNAs are known to regulate diverse cellular processes in prokaryotes [2,18,19]. To date, many computational based methods have been used to identify small regulatory RNAs in bacteria, but there are only a few reports available on the functional roles of sRNAs in Rhizobium.
In 2016 Borella et al. have reported that the small RNA gene mmgR is controlled by nitrogen source in Sinorhizobium meliloti [20]. Recently, the function and mechanism of Sinorhizobium meliloti trans-sRNA NfeR1 (Nodule formation efficiency RNA) was experimentally studied on the effect of osmoadaptation and symbiotic efficiency in Alfa alfa [21]. In the present study, we have combined genome-wide and transcriptome based computational methods to identify the novel putative sRNA candidates in R. etli. Particularly, we have focused on the identification of sRNAs that are differentially expressed during heat and saline shock and its regulatory role. Genome-wide sigma factor 34 based sRNA predictions provided a total of 271 sRNA candidates. While comparing with the transcriptome based data, many sRNAs were predicted from the genome-wide prediction. A higher number of sRNA candidates were expressed from the chromosome than other replicons. One hundred sixty-nine sRNAs were predicted in the chromosome and 31 sRNAs were found in symbiotic plasmid p42d. A total of 128 novel sRNAs were predicted from the transcriptome data and we found number of sRNAs expressed under saline shock is more than the heat shock condition (  [14]. Our out of interest has led to the identification of more than 100 new sRNAs from RNA sequence data analysis. Besides, we have compared the identified sRNA candidates with previously reported sRNA data of R. etli [14], a total of 9 sRNAs from the sigma 32 genome-based method, 10 from the heat shock and 19 from the saline shock were conserved with the already reported sRNAs. While performing sRNA conservation analysis, 21 sRNAs were found to be overlapped with the R. leguminosarum, in which 3 sRNAs of R. etli were conserved (share 94-95% homology) with a single sRNA candidate and interestingly another chromosomally encoded sRNA (124 nt) share 100% homology with the sRNA of R. leguminosarum. sRNAs regulate the gene expression by perfect or imperfect base pairing with the target mRNA. Single sRNA is known to regulate multiple mRNA targets, either it upregulates or downregulates based on the binding sites of a set of genes. Earlier findings have shown that sRNAs regulate diverse biological and cellular processes, such as energy metabolism, quorum sensing (QS) and biofilm formation, stress responses and adaptation to adverse growth conditions, and pathogenesis [19,22,23]. In the present study, we have identified potential targets of sRNAs and analyzed its role using different computational methods. Target prediction method revealed that 15 sRNAs of heat shock sRNAs have complementary binding sites with heat shock specific genes such as groES, groESch3, groEL, ibpA, serine proteases-degPch1, degPch2 and also with the virulence factor coding gene MviN which codes for a transmembrane protein. Among the 15 selected sRNAs of the saline shock group, a few sRNAs have a significant binding site on serine proteases (degPch1, degPch2) and mviN. Besides, we could infer that the identified sRNAs might regulate several hypothetical proteins. In 2014 López-Leal et al. reported, groESch2, groEL, and ibpA heat shock genes were upregulated in R. etli during heat shock and two serine proteases, viz., degPch1 and degPch2 were significantly over-expressed during saline shock [14]. Based on the results of the present study, we suggest these newly identified sRNAs might regulate the expression of heat and saline shock specific genes. Further, the target mRNAs of these sRNAs were taken for the functional categorization using COG and GO analysis.
In the GO enrichment analysis, most of the target genes were associated with cellular, metabolic and transport processes. COG analysis revealed that most of the target mRNAs of sRNAs of this study were involved in amino acid transport and metabolism, energy production and conversion, post-translational modification, protein turnover, chaperones and cell wall/membrane biogenesis. Particularly, heat shock sRNAs are firmly categorized in post-translational modification, protein turnover and chaperones in COG analysis. Further, we have constructed the GRN of predicted target mRNAs using the biological process GO terms. The transmembrane protein MviN constitutes the central node in the regulatory network in the heat shock condition. It is well   documented that subjecting the cells to heat shock can disrupt the cell membrane integrity. Regulation of cell shape protein MviN was shown to be up-regulated under heat shock condition as compared to control besides the down-regulation of DNA replication proteins dnaA and dnaB [14]. sRNAs identified in the present study have complementary binding sites with these target proteins, which might down or up-regulate the target proteins. Network analysis revealed that many target genes mainly involved in protein folding, cellular amino acid, carbohydrate metabolic processes, signal transduction, cell division, cell cycle and cell wall organization. Under saline shock conditions, many target mRNAs were found to be involved in the metabolic process, transmembrane transport, cell organization, translation and regulation of transcription.

Conclusion
In this study, for the first time, we reported novel sRNAs expressed differentially under stress conditions. The mRNA targets of these sRNAs were identified, functionally classified and found that these sRNAs are involved in different cellular metabolic processes including protein folding. GO network analysis of Rhizobium revealed a new biological role of sRNAs. Several reports are available regarding the sRNA identification but the reports on the biological roles of sRNAs in Rhizobium are quite limited. This work begins to address the new biological insights in sRNAs function and its roles in a bacterial system. It's possible that the above applied genome-wide computational methods can be used to identify the conditional specific sRNAs in other Rhizobium or closely related α-proteobacteria. However, the precise role of sRNAs reported in the preset study needs to be validated experimentally in future studies.

Materials and methods
Genome-wide prediction of sRNAs from Rhizobium etli by using improved sRNAscanner Rhizobium etli complete genome sequence and annotation files were retrieved from the National Centre for Biotechnology Information (NCBI) ftp site. Genome sequences and annotation files were downloaded in Fasta nucleic acid (.fna) and protein data file (.ptt) formats, respectively. Accession numbers of Rhizobium etli with their respective replicons used in our study are listed in the supplementary file 1. In the present study, we employed the improved version of the sRNA scanner to predict conditional sigma factor 32 specific sRNAs. This bioinformatic tool uses PWM matrices of sRNA promoter and rho-independent terminators signals (Supplementary file 2), through sliding window-based genome scans, using consensus sequences of sigma factor promoter binding sites − 35 and − 10 and rho-independent transcription terminator sequences. Sigma factor 32 specific Position weigh matrices were used for identifying sRNAs from the complete bacterial genome using sRNA Scanner [8,16,24]. sRNA Scanner  was used with CSS of 12 and search length with 50-500 nt.
To ensure the non-coding nature of the sRNA, the proteincoding potentials of the transcripts were assessed based on coding potential score (CPS) using the coding potential calculator (http://cpc.cbi.pku.edu.cn/server). Accordingly, CPS score − 1 represents weak non-coding and + 1 means weak coding of the transcript [25]. Transcripts with a true noncoding nature were considered for further annotation of sRNA. Length and GC content of the putative non-coding transcripts were analyzed using customized PERL script.
To refine the data, every sRNA was checked in Rfam database and Bacterial small Small Regulatory RNA Database (BSRD) [26] to identify the already reported sRNAs. The sRNAs were also compared with previous reports to assess and confirm their novelty. Filtered putative non-coding RNAs (sRNAs) were used for further analysis.

Identification of sRNAs from transcriptome
The RNA-seq dataset was obtained from the NCBI Gene Expression Omnibus (GEO) (Accession No: GSM1212456) [14]. The raw reads of R. etli CE3 under three different conditions (control, heat shock, and saline shock) downloaded from the sequence read archive (SRA) database (Accession No.: SRP028924). The SRA tool kit was used for extracting the transcriptome reads from SRA files in FASTQ format [27]. PolyA, polyT and Illumina adapters were removed with cutadapt tool [28]. Sequence quality was analyzed using FastQC. Sequence reads having phred score > 20 were used for further analysis. Trimmed reads were aligned to the genome of R. etli by using Rockhopper tools for transcriptome read counting [29,30]. Based on the alignment data, noncoding transcripts are considered as sRNA. The RPKM (reads per kilobase of transcript per million mapped reads) values of experimental conditions (heat and saline shock) were compared with control for calculating the fold change.
Reads of the coding and non-coding transcripts were separated and aligned to the reference genome. The sRNA sequence was aligned to the genome and visualized using the Integrative genome viewer (IGV). Genomic coordinates of predicted sRNA were extracted from the genome using either Samtools or bedtools. Genomic coordinates of these predicted RNAs are provided in the Rockhopper output file.
Target and secondary structure prediction for sRNAs TargetRNA2 Software was used to predict the mRNA targets for the predicted trans-encoded sRNAs (http://cs. wellesley.edu/~btjaden/TargetRNA2/). TargetRNA2 is a web server that identifies mRNA targets of sRNA regulatory action in bacteria. As input, TargetRNA2 takes the sequence of an sRNA and the name of a sequenced bacterial replicon and it uses a variety of features, including conservation of the sRNA in other bacteria, the secondary structure of the sRNA, the secondary structure of each Table 4 Promoter, terminator and secondary structure of sRNAs identified from the transcriptome data candidate mRNA target and the hybridization energy between the sRNA and mRNA targets [31]. RNAfold web server (http://rna.tbi.univie.ac.at/cgi-bin/ RNAfold.cgi) was used to predict the secondary structure of sRNAs. sRNA FASTA sequences were used for calculating minimum free energy (ΔG) based on the partition function (default parameter) [32].

Functional enrichment analysis of novel putative sRNAs
Novel putative sRNAs were screened by the integration of the sRNAs predicted from the genome and transcriptome. Sigma factor 32 based sRNAs predicted from the genome were blasted against the sRNAs identified from the transcriptome data of shock conditions [14]. Further, selected sRNAs from genome and transcriptome were functionally annotated based on the target of these sRNAs.
Functional categorization of the predicted target mRNAs was done by clusters of orthologous group (COG) analysis using the Eggnog database [R]. Gene ontology (GO) annotations and regulatory relationships among the biological processes were analyzed through the GO regulatory network by using the comparative GO web server [33].