Cucurbitales Genome Research Platform

I. Introduction

What is the Cucurbitales Genome Research Platform (CGRP)?

Cucurbitales, one of the most diverse angiosperm orders, originated in the Cretaceous and contains at least 2,680 species from 128 genera across 8 families. The order includes many edible, medicinal, and ornamental crops of great significance to human survival and ecological civilization, and has therefore attracted numerous scientists to study its evolution and functions. In recent years, Cucurbitales genomes have accumulated rapidly. Here, we developed the Cucurbitales Genome Research Platform (CGRP: https://cucurbitales.cgrpoee.top/) to enable in-depth exploration of Cucurbitales genomics. CGRP integrates 213 genomes, including all published genomes of 34 Cucurbitales plants, our newly deciphered Hemsleya chinensis genome from early-diverging Cucurbitaceae, and 28 representative angiosperm genomes. By analyzing the 29-CGs and the outgroup grape (Vitis vinifera) genome, we annotated 3,761,602 biological function terms and metabolic pathways of 720,404 proteins, and identified 18,367,214 transposable elements (TEs), 86,894 regulatory proteins (RPs), 927,302 duplicated genes (Dupl-types), 606 N6-methyladenosine modifications (m6As), 12,083,671 gene pairs across 1,208,371 syntenic blocks (SBs) from 900 pairwise genome comparisons, 2,700 genomic synteny dotplots, and 270,489 paralogs associated with 7 whole-genome duplications (WGDs). We then implemented a series of user-friendly query, analysis, and visualization tools and interfaces in CGRP to facilitate the exploration of Cucurbitales genomics using these large-scale results. Notably, DotView, SynView, and DecoBrowse provide three new gateways for using the current synteny data and ancestral genomes to reveal paleogenome reshuffling and its consequences during the polyploidizations in Cucurbitales.
In addition, we integrated a species encyclopedia, multi-omics data, ecological resources, cultivation techniques, relevant literature, and links to external databases, and developed a query interface for each of these resources. To support the mining of new data, we developed a ‘one-stop’ comparative genomics toolbox containing 49 window-operated bioinformatics tools, 15 of which were newly developed by us. We also provide interactive statistical charts, user manuals, and submission ports for data and tool resources in CGRP. In short, CGRP is a comprehensive platform with genomic synteny resources as its central gateway, and could become an important community resource for the exploration of Cucurbitales genomics.

 

II. Datasets and Workflow

Data sources

The Cucurbitales Genome Research Platform provides three types of data for each plant: coding sequences (CDS), protein sequences (PEP), and gene annotations (GFF3). Genome assemblies and gene annotations are available for download from the NCBI Assembly database (https://www.ncbi.nlm.nih.gov/assembly/) and/or CuGenDBv2 (http://cucurbitgenomics.org/v2/).

 

Data analysis pipelines

Identification of polyploidy events. To identify polyploidy events, we first performed genome-wide BLASTP searches (E-value < 1e-5, score > 100) within and between the studied genomes using BLAST (Altschul et al., 1990). Then, using CollinearScan (Wang et al., 2006), the best 10 BLASTP matches were selected to infer collinear gene regions (blocks) within or between genomes. The maximum gap was set to 50 intervening genes, and large gene families with more than 50 members were removed from the blocks. The median synonymous nucleotide substitution rate (Ks) of the collinear genes was then used to determine the degree of divergence of each identified block. We calculated the Ks values between collinear gene pairs using the BioPerl statistical module and the Nei-Gojobori method (Nei & Gojobori, 1986). We further plotted the collinear gene pairs as dot plots based on their genomic locations, and used differently colored dots to indicate whether an anchor gene pair was the best BLAST hit within or between genomes. We then identified the orthologous and paralogous genomic regions within and between genomes based on the generated homology dot plots. Between genomes, a region was identified as orthologous if the median Ks of the gene pairs located in that collinear region was approximately equal to the Ks peak associated with the species divergence; within genomes, a region was identified as paralogous if the median Ks of the gene pairs located in that collinear region was approximately equal to the Ks peak associated with a particular polyploidization event. Finally, the history of WGDs can be inferred by examining the ratio of syntenic depths within and between genomes.
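The filtering and block-classification steps described above can be sketched as follows. This is a minimal illustration, not the platform's actual code: it assumes BLAST tabular output (-outfmt 6), assumes that "score" means the bit score in the last column, skips self-hits, and uses hypothetical function names and a hypothetical Ks tolerance of 0.1.

```python
from collections import defaultdict
from statistics import median


def filter_blast_hits(lines, max_evalue=1e-5, min_score=100.0, top_n=10):
    """Keep the best `top_n` hits per query that pass both cutoffs.

    Assumes 12 tab-separated columns (BLAST -outfmt 6): E-value in
    column 11 and bit score in column 12.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, score = float(cols[10]), float(cols[11])
        if query == subject:
            continue  # assumption: self-hits are not informative
        if evalue < max_evalue and score > min_score:
            hits[query].append((score, subject))
    # Sort each query's hits by descending score and keep the top N.
    return {
        query: [subj for _, subj in sorted(found, key=lambda h: -h[0])[:top_n]]
        for query, found in hits.items()
    }


def block_matches_event(ks_values, event_peak, tolerance=0.1):
    """True if the block's median Ks is close to the event's Ks peak.

    `tolerance` is a hypothetical stand-in for "approximately equal".
    """
    return abs(median(ks_values) - event_peak) <= tolerance
```

A block whose gene pairs have Ks values of 0.55, 0.60, and 0.65 would, under this tolerance, be assigned to an event whose Ks peak is 0.62, whereas a block with a median Ks of 1.5 would not.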

Identification of event-related genes and dating of key evolutionary events. Plant genomes evolve at different rates (Cui et al., 2006; Wang et al., 2011), making it difficult to determine the timing of key events in their evolutionary history. Here, we constructed a correction algorithm for re-dating key evolutionary events in monocotyledons. First, based on the orthologous and paralogous regions identified within and between genomes, we extracted the sets of orthologs and paralogs resulting from species divergence and polyploidy events. Second, we determined the evolutionary rates associated with key evolutionary events in monocotyledons by performing kernel function analysis of the Ks values between these orthologs and paralogs. Finally, we performed several rounds of Ks correction for the evolutionary rates of these events, using a different correction basis in each round. The first round required the Ks distribution peaks of the divergence between monocotyledons and grape to have the same value. After the first round, the Ks values of the τ and σ events still differed considerably among the plants sharing them. Therefore, in the same way as in the first round, we performed several more rounds of Ks correction based on the τ and σ events. Details of the correction process can be found in our previous articles (Wang et al., 2017; Wang, J et al., 2018; Wang, J et al., 2019b; Wang et al., 2022), and the computational script of the correction algorithm is available on GitHub (https://github.com/wangjiaqi206/corrected-evolutionary-dating).
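One round of such a correction can be illustrated with the sketch below. It is a simplified stand-in, not the published algorithm: it assumes the Ks peak is located with a Gaussian kernel density estimate and that the correction is a linear rescaling that moves a lineage's observed peak onto a shared reference peak. All function names, the bandwidth, and the grid step are hypothetical; the script in the repository cited above remains the authoritative version.

```python
import math


def ks_peak(ks_values, bandwidth=0.05, step=0.01):
    """Locate the mode of a Ks distribution with a Gaussian kernel.

    Scans a regular grid from 0 to max(ks_values) and returns the grid
    point with the highest (unnormalized) kernel density.
    """
    n_points = int(round(max(ks_values) / step)) + 1
    best_x, best_density = 0.0, -1.0
    for i in range(n_points):
        x = i * step
        density = sum(
            math.exp(-0.5 * ((x - k) / bandwidth) ** 2) for k in ks_values
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def correction_factor(observed_peak, reference_peak):
    """Scaling factor that maps the observed peak onto the reference peak."""
    return reference_peak / observed_peak


def correct_ks(ks_values, factor):
    """Apply the (assumed linear) correction to every Ks value."""
    return [k * factor for k in ks_values]


# Example: a fast-evolving lineage whose divergence peak sits near 1.0
# is rescaled so that the peak matches a reference value of 0.5.
fast_lineage = [0.9, 1.0, 1.0, 1.1]
factor = correction_factor(ks_peak(fast_lineage, bandwidth=0.1), 0.5)
corrected = correct_ks(fast_lineage, factor)
```

Later rounds would repeat the same idea with a different reference, for example the peak of a shared polyploidization event instead of the divergence peak.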

Comparison of genome fractionation. By comparing the rates of gene retention and loss, we can characterize the degree of divergence between the subgenomes produced by different polyploidization events. The gene loss rate was calculated by dividing the number of collinear genes lost in the study species by the total number of genes per chromosome in the reference genome. The gene retention rate was calculated by dividing the number of the most conserved collinear genes (orthologs retained in both reference genomes) in the study species by the number of relatively conserved collinear genes (orthologs retained only in the main reference genome). The degree of divergence between event-produced subgenomes can also be inferred with a statistical method we previously developed, the polyploidy index (P-index) (Wang, J et al., 2019a). Previous studies have shown that a P-index of about 0.3 can be used as a threshold to distinguish auto- from allopolyploidy (Wang, J et al., 2019a). The reason is that known and previously inferred allopolyploids, including Brassica napus, Zea mays, Gossypium hirsutum, and Brassica oleracea, consistently have a P-index > 0.3 (Schnable et al., 2011; Chalhoub et al., 2014; Li et al., 2014; Wang, M et al., 2015; Renny-Byfield et al., 2017), whereas the inferred autopolyploids Glycine max, Populus trichocarpa, and Actinidia chinensis (Murat et al., 2017; Wang et al., 2017; Wang, JP et al., 2018) often have a P-index < 0.3.
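The two ratios and the P-index threshold described above translate into a few lines of arithmetic. The sketch below is illustrative only: the function and parameter names are hypothetical, the P-index itself is taken as a given input (its computation is described in Wang, J et al., 2019a), and a value exactly equal to the threshold is reported as undetermined because the text gives no rule for that case.

```python
def gene_loss_rate(lost_collinear_genes, reference_chromosome_genes):
    """Collinear genes lost in the study species / genes on the reference chromosome."""
    if reference_chromosome_genes <= 0:
        raise ValueError("reference chromosome must contain at least one gene")
    return lost_collinear_genes / reference_chromosome_genes


def gene_retention_rate(retained_in_both_refs, retained_in_main_ref_only):
    """Most conserved collinear genes / relatively conserved collinear genes."""
    if retained_in_main_ref_only <= 0:
        raise ValueError("denominator must be positive")
    return retained_in_both_refs / retained_in_main_ref_only


def classify_polyploidy(p_index, threshold=0.3):
    """Classify an event from its P-index using the ~0.3 threshold."""
    if p_index > threshold:
        return "allopolyploid"
    if p_index < threshold:
        return "autopolyploid"
    return "undetermined"


# Example with made-up counts: 120 of 400 reference genes have no
# collinear counterpart in the study species.
loss = gene_loss_rate(120, 400)
```

With these made-up counts the loss rate is 0.3, and an event with a P-index of 0.45 would be classified as allopolyploid.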

The pipeline for inferring ancestral karyotypes and evolution. The inference of ancestral genome structure and paleogenome remodelling trajectories is divided into 7 main steps.
1) Genome-wide comparison of the species involved, using BLAST (Altschul et al., 1990), to confirm conserved homologous genes between and within genomes.
2) Collinearity analysis of the homology information obtained from BLASTP with CollinearScan (Wang et al., 2006) or MCScanX (Wang et al., 2012) to identify syntenic blocks.
3) Identification of orthologs and paralogs associated with speciation and polyploidy by inter- and intra-genomic comparisons.
4) Identification of conserved ancestral regions (CARs) by combining dotplots and gene collinearity between genomes.
5) Identification of ancient chromosomal rearrangements in conjunction with species trees. For example, if CARs 1 and 2 are adjacent in both study species A and B, it is reasonable to assume that CARs 1 and 2 were fused in the ancestor of A and B. If CARs 1 and 2 are not adjacent in study species B, the ancestral structure of A and B is difficult to determine. A reference species R then needs to be introduced: if CARs 1 and 2 are also adjacent in R, the ancestral structure of A and B is still CAR1-CAR2. The inference of ancestral chromosome rearrangements also needs to consider the effects of duplication, and we have modelled the possible scenarios.
6) Bottom-up inference of the ancestral karyotype of the study species and its composition by identifying and collating all the CAR rearrangements.
7) Identification of the fusion patterns and rearrangement trajectories of the paleochromosomes, once the ancestral genome has been determined, by comparing the CARs in the dotplot between the modern and ancestral genomes. For example, if the two chromosomes in the study species that correspond to the same ancestral chromosome differ structurally, for instance through a translocation, the change should have occurred after the WGD; conversely, a change shared by both chromosomes, such as an end-to-end joining fusion (EEJ) or a nested chromosome fusion (NCF), should have occurred before the WGD.
The actual process of inferring ancestral genomes and paleochromosome remodelling trajectories can be more complex, and requires careful and lengthy verification and validation.
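The adjacency reasoning in step 5 can be written as a small decision rule. The sketch below is a toy model under stated assumptions: each genome is represented as a list of chromosomes, each chromosome as an ordered list of CAR labels; orientation and duplication are ignored; and the function names are hypothetical. It returns True when the rule supports an ancestral CAR1-CAR2 adjacency and None when the rule cannot decide.

```python
def cars_adjacent(genome, car_a, car_b):
    """True if the two CARs are neighbors on any chromosome of `genome`."""
    for chromosome in genome:
        for left, right in zip(chromosome, chromosome[1:]):
            if {left, right} == {car_a, car_b}:
                return True
    return False


def infer_ancestral_adjacency(species_a, species_b, car_a, car_b, reference=None):
    """Apply the two-species rule, falling back to a reference species."""
    in_a = cars_adjacent(species_a, car_a, car_b)
    in_b = cars_adjacent(species_b, car_a, car_b)
    if in_a and in_b:
        return True  # adjacent in both species: assume fused in their ancestor
    if (in_a or in_b) and reference is not None:
        if cars_adjacent(reference, car_a, car_b):
            return True  # the reference species supports the adjacency
    return None  # undetermined under this simple rule


# Toy genomes: CAR1 and CAR2 are neighbors in A and in R, but not in B.
species_a = [["CAR1", "CAR2", "CAR3"]]
species_b = [["CAR1"], ["CAR2", "CAR3"]]
reference_r = [["CAR3", "CAR2", "CAR1"]]
result = infer_ancestral_adjacency(
    species_a, species_b, "CAR1", "CAR2", reference=reference_r
)
```

Here the adjacency is missing in species B, so the two study species alone are inconclusive, and the reference species R supplies the deciding evidence.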

Gene Family Analysis Pipeline. Gene families can be easily identified in IPAP, which offers three sequence matching modes: Blast, Diamond, and Blast match. These three functions match target sequences against known protein sequences and thereby filter out the desired gene families. There is also a domain identification function, which predicts the structural domains of target sequences using the Pfam database. After the gene family sequences are identified, researchers can perform multiple sequence alignments and then construct phylogenetic trees. Codon and CpG island prediction can also be performed in IPAP, and the non-synonymous substitution rate (Ka) and synonymous substitution rate (Ks) can be calculated. In addition, researchers can predict and map motifs and gene structures. Together, these functions cover the main needs of gene family analysis.
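As an illustration of the final step of a Ka/Ks calculation, the sketch below applies the Jukes-Cantor correction used in the Nei-Gojobori method to already-counted differences and sites. It is not IPAP's implementation, which the text does not describe: the per-codon counting of synonymous and non-synonymous sites and differences is omitted, and the function names and example counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion of differing sites `p`."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(syn_diffs, syn_sites, nonsyn_diffs, nonsyn_sites):
    """Return (Ka, Ks, Ka/Ks) from counted differences and sites."""
    ks = jukes_cantor(syn_diffs / syn_sites)
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ratio = ka / ks if ks > 0 else float("nan")
    return ka, ks, ratio


# Made-up counts: 10 synonymous differences over 100 synonymous sites,
# 5 non-synonymous differences over 300 non-synonymous sites.
ka, ks, ratio = ka_ks(10, 100, 5, 300)
```

A ratio below 1, as in this made-up example, is conventionally read as a sign of purifying selection on the gene pair.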

 

III. Browse

Community and collection of resources

Items                    Brief Introduction                                        Records
Syn-Dotplots             Homologous structure dotplots related to Cucurbitales     2,700
Hierarchical alignments  Multi-genome alignment matrix with gene identifiers       225x23,647
Event-related genes      Information on event-related gene pairs                   270,489
Functional genes         Function-related gene family information                  501,617
Regulatory Proteins      Regulatory protein gene family information                86,894
Annotations              Annotated biological function terms                       3,761,602
Pathways                 Detailed information on Cucurbitales-related pathways     720,404
Syn-Orthogroups          Gene family information for orthologous genes             736,523
Duplicated genes         Identification of the duplication types of genes          927,302
M6As                     *                                                         606

 

IV. FAQ

A. How can I download data from the Cucurbitales Genome Research Platform?

All data in CGRP, such as genome data, transcriptome data, pathways, and JBrowse tracks, can be downloaded from the corresponding resource page.

 

B. How to contact us?

If you encounter any problems or find any bugs while using the Cucurbitales Genome Research Platform, please email [email protected], or contact us at:

Address: 21 Bohai Road, Caofeidian, Tangshan 063210, Hebei, China

 

C. Citation

Data files contained in the CGRP are free of all copyright restrictions and made fully and freely available for non-commercial use. Users of the data should cite the following articles:

・CGRP: A high-value platform for exploring the genomics of Cucurbitales

・An Overlooked Paleotetraploidization in Cucurbitaceae

・A common whole-genome paleotetraploidization in Cucurbitales