Cucurbitales Genome Research Platform

I. Introduction

What is Cucurbitales Genome Research Platform (CGRP)?

Cucurbitales, one of the most diverse angiosperm orders, originated in the Cretaceous and contains at least 2,680 species from 128 genera across 8 families. The order includes many edible, medicinal, and ornamental crops of great significance to human survival and ecological civilization, and has therefore attracted numerous scientists to study its evolution and functions. In recent years, Cucurbitales genomes have accumulated rapidly. Here, we developed the Cucurbitales Genome Research Platform (CGRP: https://cucurbitales.cgrpoee.top/) to enable in-depth exploration of Cucurbitales genomics. CGRP integrates 213 genomes, including all published genomes of 34 Cucurbitales plants, our newly deciphered Hemsleya chinensis genome from early-diverging Cucurbitaceae, and 28 representative angiosperm genomes. By analyzing the 29-CGs and the outgroup grape (Vitis vinifera) genome, we annotated 3,761,602 biological function terms and metabolic pathways of 720,404 proteins, and identified 18,367,214 transposable elements (TEs), 86,894 regulatory proteins (RPs), 927,302 duplicated genes (Dupl-types), 606 N6-methyladenosine modifications (m6As), 12,083,671 gene pairs across 1,208,371 syntenic blocks (SBs) from 900 pairwise genome comparisons, 2,700 genomic synteny dotplots, and 270,489 paralogs associated with 7 whole-genome duplications (WGDs). We then implemented a series of user-friendly query, analysis, and visualization tools and interfaces in CGRP to facilitate the exploration of Cucurbitales genomics using these large-scale results. Notably, DotView, SynView, and DecoBrowse provide three new gateways for using the current synteny data and ancestral genomes to reveal paleogenome reshuffling and its consequences during the polyploidizations in Cucurbitales.
In addition, we integrated a species encyclopedia, multi-omics data, ecological resources, cultivation techniques, relevant literature, and links to external databases, and developed a query interface for each of these resources. To support the mining of new data, we developed a 'one-stop' comparative genomics toolbox containing 49 window-operated bioinformatics tools, 15 of which were newly developed by us. We also provide interactive statistical charts, user manuals, and submission portals for the data and tool resources in CGRP. In short, CGRP is a comprehensive platform with genomic synteny resources as its central gateway, and could become an important community resource for the exploration of Cucurbitales genomics.

II. Datasets and Workflow

Data sources

We collected the Latin names of Cucurbitales and associated publications from plaBiPD (Gui et al., 2023), along with the genomes, GFF files, and CDS and PEP sequences. A total of 185 Cucurbitales genomes were collected from NCBI (https://www.ncbi.nlm.nih.gov/), CuGenDBv2 (Yu et al., 2023), and CNCB (https://www.cncb.ac.cn/) (Supplemental Table 1). Details of these datasets can be searched in the Resource portal of CGRP. For species with multiple genome versions, we prioritized the high-quality, latest, or most widely used version. Images and descriptions of plants were mainly obtained from Wikipedia (http://wikipedia.org) and Plants of the World Online (https://powo.science.kew.org/). The literature was retrieved from PubMed (https://pubmed.ncbi.nlm.nih.gov/). Cucurbitales transcriptomes were obtained from the SRA database (https://www.ncbi.nlm.nih.gov/sra/). Metabolomes were sourced from the NPASS database (https://bidd.group/NPASS/index.php).

Gene annotation pipelines

Raw data processing. First, we processed the GFF3 data by extracting each gene's ID, start and end positions, and strand (positive or negative), and wrote a new GFF with the following columns: (1) the chromosome number; (2) the start position of the gene; (3) the end position of the gene; (4) the strand; (5) the original gene ID; (6) the new gene ID, which generally consists of the species-name abbreviation followed by the chromosome number and the gene's order on that chromosome; and (7) the gene's order on the chromosome. We then extracted the corresponding CDS and PEP sequences based on the processed GFF and renamed the gene IDs at the same time.
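
The reformatting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used by CGRP: the function name `simplify_gff3`, the example abbreviation `Hch`, and the new-ID layout (abbreviation + chromosome + `g` + zero-padded order) are hypothetical, and genes are assumed to be the GFF3 features whose type column is `gene`.

```python
from collections import defaultdict


def simplify_gff3(gff3_lines, species_abbr):
    """Return rows of (chrom, start, end, strand, old_id, new_id, order).

    The new-ID layout is an assumption made for this sketch.
    """
    genes = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep gene features only
        # Column 9 holds key=value attributes separated by semicolons.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        genes[cols[0]].append(
            (int(cols[3]), int(cols[4]), cols[6], attrs.get("ID", ""))
        )

    rows = []
    for chrom, records in genes.items():
        # Number genes along each chromosome by start position.
        for order, (start, end, strand, old_id) in enumerate(sorted(records), 1):
            new_id = f"{species_abbr}{chrom}g{order:05d}"
            rows.append((chrom, start, end, strand, old_id, new_id, order))
    return rows
```

The returned rows map each original ID to its new ID, so the same table can be used to rename the records in the CDS and PEP FASTA files.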

Gene annotation. Functional annotation of predicted protein-coding genes was carried out using InterProScan. The tool was executed with default parameters, using the protein sequences (PEP) as input to identify functional domains, families, and sites by searching against the InterPro member databases. This analysis provided annotations including Gene Ontology (GO) terms and InterPro entries.

KEGG annotation. GhostKOALA was used to annotate the KEGG pathways of proteins.

TE annotation. A comprehensive approach combining evidence-based searches and de novo predictions was used to identify the TEs of Cucurbitales. For evidence-based searches, RepeatMasker (v4.1.4) was used with default parameters. For de novo predictions, RepeatModeler (v2.0.3) was applied to generate a consensus library, and DeepTE (-sp P) was used to reclassify unknown TEs, improving classification accuracy. The results from both methods were then integrated using RepeatMasker (v4.1.4).

Regulatory protein annotation. Regulatory proteins (RPs), including transcription factors (TFs) and transcriptional regulators (TRs), were identified using iTAK. The program was run locally with default parameters, using the predicted proteome (PEP file) as input for genome-wide prediction.

Gene duplication annotation. Classification of gene duplication types was performed using the DupGen_finder pipeline. All genes were systematically categorized into five classes: Whole-Genome Duplication (WGD), Tandem Duplication (TD), Proximal Duplication (PD), Transposed Duplication (TRD), and Dispersed Duplication (DSD). The analysis was run using the tool's default parameters.

N6-methyladenosine regulator identification. The m6A regulators include three functional groups: writers, readers, and erasers (Yue et al., 2019), each associated with different Pfam families. Among the writers, four families (MTA70, WTAP, HAKAI, and VIRILIZER) were identified using the Pfam IDs PF05063, PF17098, PF18408, and PF15912. Among the readers and erasers, the YT521-B family was identified using PF04146 and PF13532.

Data analysis pipelines

Identification of polyploidy events. To identify polyploidy events, we first performed genome-wide BLASTP searches (E-value < 1e-5, score > 100) within and between the studied genomes using BLAST (Altschul et al., 1990). Then, using CollinearScan (Wang et al., 2006), the best 10 BLASTP matches were selected to infer collinear regions (blocks) within or between genomes. The maximum gap was set to 50 intervening genes, and large gene families with more than 50 members were removed from the blocks. The median synonymous nucleotide substitution rate (Ks) of the collinear genes was then used to determine the degree of divergence of each identified block. We calculated the Ks values of collinear gene pairs using the BioPerl statistical module and the Nei-Gojobori method (Nei & Gojobori, 1986). We further plotted homologous gene pairs as dot plots based on genomic location, using differently colored dots to distinguish whether an anchor gene pair was the best BLAST hit within or between genomes. We then identified the orthologous and paralogous genomic regions within and between genomes based on the generated homology dot plots. Between genomes, a region was identified as orthologous if the median Ks of the gene pairs in that collinear region was approximately equal to the Ks peak associated with the species divergence; within genomes, a region was identified as paralogous if the median Ks of the gene pairs in that collinear region was approximately equal to the Ks peak associated with a particular polyploidization event. Finally, the history of WGD can be inferred from the ratio of syntenic depths within and between genomes.
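
The hit filtering and the median-Ks test described above can be illustrated with the following minimal Python sketch. It assumes the standard 12-column tabular BLAST output (E-value in column 11, bit score in column 12). The function names, the removal of self-hits, and the `tolerance` used to express "approximately equal" are assumptions made for illustration and are not taken from the CGRP pipeline.

```python
from collections import defaultdict
from statistics import median


def filter_blast_hits(lines, max_evalue=1e-5, min_score=100.0, top_n=10):
    """Keep hits with E-value < max_evalue and score > min_score,
    then return the top_n best-scoring subjects for each query."""
    hits = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a 12-column tabular BLAST record
        query, subject = cols[0], cols[1]
        evalue, score = float(cols[10]), float(cols[11])
        if query == subject:
            continue  # assumption: self-hits are not informative
        if evalue < max_evalue and score > min_score:
            hits[query].append((score, subject))
    return {
        query: [s for _, s in sorted(found, key=lambda h: -h[0])[:top_n]]
        for query, found in hits.items()
    }


def block_matches_event(block_ks, event_peak_ks, tolerance=0.1):
    """True if the median Ks of a collinear block lies within `tolerance`
    of the Ks peak of a divergence or polyploidization event."""
    return abs(median(block_ks) - event_peak_ks) <= tolerance
```

The filtered hit lists would then be passed to a collinearity tool, and `block_matches_event` applied to each resulting block with the Ks peak of the event of interest.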

Identification of event-related genes and dating of key evolutionary events. Plant genomes evolve at different rates (Cui et al., 2006; Wang et al., 2011), making it difficult to determine the timing of key events in their evolutionary history. Here, we constructed a correction algorithm for re-dating key evolutionary events in monocotyledons. First, based on the orthologous and paralogous regions identified within and between genomes, we isolated the sets of orthologs and paralogs resulting from species divergence and polyploidy events. Second, we determined the evolutionary rates of key evolutionary events in monocotyledons by performing kernel function analysis of the Ks values between these orthologs and paralogs. Finally, we performed several rounds of Ks correction of the evolutionary rates of these events using different correction bases. The first round of correction aligned the Ks distribution peaks of the divergence events between monocotyledons and grape to the same value. After the first round of correction, the Ks peaks of the τ and σ events still differed considerably among plants. Therefore, as in the first round, we performed several more rounds of Ks correction based on the τ and σ events. Details of the correction process can be found in our previous articles (Wang et al., 2017; Wang, J et al., 2018; Wang, J et al., 2019b; Wang et al., 2022), and the computational script of the correction algorithm is available on GitHub (https://github.com/wangjiaqi206/corrected-evolutionary-dating).
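
To make the idea of a Ks peak concrete, the sketch below locates the mode of a Ks distribution with a simple Gaussian kernel density estimate. It also computes a single scaling ratio that brings an observed peak onto a reference peak. Both functions are simplified illustrations under stated assumptions: the names, the bandwidth, and the grid step are hypothetical, and the one-step ratio is not the multi-round correction algorithm itself, which is described in the cited articles and the GitHub repository.

```python
import math


def ks_peak(ks_values, bandwidth=0.05, grid_step=0.01):
    """Return the Ks value with the highest Gaussian kernel density.

    Density is evaluated on a regular grid between the smallest and
    largest Ks value; bandwidth and grid_step are illustrative choices.
    """
    lo, hi = min(ks_values), max(ks_values)
    n_steps = int(round((hi - lo) / grid_step)) + 1
    best_x, best_density = lo, -1.0
    for i in range(n_steps):
        x = lo + i * grid_step
        # Unnormalised density: normalisation does not move the peak.
        density = sum(
            math.exp(-0.5 * ((x - k) / bandwidth) ** 2) for k in ks_values
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def correction_factor(observed_peak, reference_peak):
    """Ratio that rescales an observed Ks peak onto a reference peak
    (a simplification of 'making the peaks have the same value')."""
    return reference_peak / observed_peak
```

Multiplying the Ks values of a lineage by such a ratio shifts its divergence peak onto the reference peak; further rounds based on other events would repeat the same idea with different reference peaks.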

Comparison of genome fractionation. By comparing the rates of gene retention and loss, we can characterize the degree of divergence between the subgenomes produced by different polyploidization events. The gene deletion rate was calculated by dividing the number of collinear genes deleted in the study species by the total number of genes on each chromosome of the reference genome. The gene retention rate was calculated by dividing the number of the most conserved collinear genes in the study species (orthologs retained in both reference genomes) by the number of relatively conserved collinear genes (orthologs retained only in the main reference genome). In addition, the degree of divergence between event-produced subgenomes can be inferred with a statistical method we previously developed, the polyploidy index (P-index) (Wang, J et al., 2019a). Previous studies have shown that a P-index of approximately 0.3 can be used as a threshold to distinguish autopolyploidy from allopolyploidy (Wang, J et al., 2019a). The reason is that known and previously inferred allopolyploids, including Brassica napus, Zea mays, Gossypium hirsutum, and Brassica oleracea, consistently have a P-index > 0.3 (Schnable et al., 2011; Chalhoub et al., 2014; Li et al., 2014; Wang, M et al., 2015; Renny-Byfield et al., 2017), whereas the inferred autopolyploids Glycine max, Populus trichocarpa, and Actinidia chinensis (Murat et al., 2017; Wang et al., 2017; Wang, JP et al., 2018) often have a P-index < 0.3.
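
As a small worked illustration of the deletion rate and the P-index threshold, consider the Python sketch below. The function names and input layout are hypothetical, the P-index itself is taken as an already computed number (its calculation is described in Wang, J et al., 2019a), and the handling of a value exactly equal to 0.3 is an arbitrary choice made here.

```python
def deletion_rate_per_chromosome(reference_genes, retained):
    """Fraction of reference genes on each chromosome that have no
    collinear counterpart in the study species.

    reference_genes: dict mapping chromosome -> list of reference gene IDs
    retained: set of reference gene IDs with a collinear ortholog
    """
    rates = {}
    for chrom, genes in reference_genes.items():
        lost = sum(1 for gene in genes if gene not in retained)
        rates[chrom] = lost / len(genes)
    return rates


def classify_polyploidy(p_index, threshold=0.3):
    """Apply the ~0.3 threshold: larger values suggest allopolyploidy,
    smaller values autopolyploidy (the tie case is an assumption)."""
    return "allopolyploid" if p_index > threshold else "autopolyploid"
```

For example, a reference chromosome with four genes of which three have collinear orthologs in the study species gives a deletion rate of 0.25.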

The pipeline for inferring ancestral karyotypes and evolution. The inference of ancestral genome structure and paleogenome remodelling trajectories is divided into 7 main steps.
1) Genome-wide comparison of the species involved using BLAST (Altschul et al., 1990) to confirm conserved homologous genes between and within genomes.
2) The homology information obtained from BLASTP is passed to CollinearScan (Wang et al., 2006) or MCScanX (Wang et al., 2012) for collinearity analysis to identify the syntenic blocks.
3) Identification of orthologs and paralogs associated with speciation and polyploidy by inter- and intra-genomic comparisons.
4) Identification of conserved ancestral regions (CARs) by combining dotplots and gene collinearity between genomes.
5) Identification of ancient chromosomal rearrangements in conjunction with species trees. For example, if CARs 1 and 2 are adjacent in both study species A and B, it is reasonable to assume that CARs 1 and 2 were fused in the ancestor of A and B. If CARs 1 and 2 are not adjacent in study species B, it is difficult to determine the ancestral structure of A and B. A reference species then needs to be introduced: if CARs 1 and 2 are also adjacent in the reference species R, the ancestral structure of A and B would still be CAR1-CAR2. In addition, the inference of ancestral chromosome rearrangements needs to consider the effects of duplication, and we have modelled the possible scenarios.
6) By identifying and collating all the CAR rearrangements, we can infer, from the bottom up, the ancestral karyotype of the study species and its composition.
7) After determining the ancestral genome, we can identify the fusion patterns and rearrangement trajectories of the paleochromosomes by comparing the CARs in the dotplot between the modern and ancestral genomes. For example, if the two chromosomes in the study species that correspond to the same ancestral chromosome differ structurally, for instance by a translocation, the change should have occurred after the WGD; conversely, if both carry the same change, such as an end-to-end joining (EEJ) fusion or a nested chromosome fusion (NCF), it should have occurred before the WGD.
The actual process of inferring ancestral genomes and paleochromosome remodelling trajectories can be more complex, and requires careful and lengthy verification and validation.
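
The adjacency reasoning in step 5 can be written down as a small decision rule. The Python sketch below is illustrative only: the function names and the representation of a genome as a list of chromosomes (each an ordered list of CAR labels) are assumptions, duplication effects are ignored, and cases that the text does not resolve are returned as `None`.

```python
def are_adjacent(chromosomes, car_x, car_y):
    """True if car_x and car_y are neighbours on any chromosome.

    chromosomes: list of chromosomes, each an ordered list of CAR labels.
    """
    for cars in chromosomes:
        for left, right in zip(cars, cars[1:]):
            if {left, right} == {car_x, car_y}:
                return True
    return False


def ancestral_adjacency(adjacent_in_a, adjacent_in_b, adjacent_in_ref=None):
    """Infer whether two CARs were joined in the ancestor of A and B.

    Returns True when the rule in the text supports an ancestral fusion,
    and None when that rule alone does not settle the question.
    """
    if adjacent_in_a and adjacent_in_b:
        return True  # adjacent in both study species
    if (adjacent_in_a or adjacent_in_b) and adjacent_in_ref:
        return True  # adjacent in one species and in the reference R
    return None  # undetermined from these observations
```

Calling `are_adjacent` on species A, species B, and a reference species R, and passing the three results to `ancestral_adjacency`, reproduces the CAR1-CAR2 example given in step 5.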

Gene family analysis pipeline. Gene families can be easily identified in IPAP, which offers three sequence matching modes: Blast, Diamond, and Blast match. These three functions can be used to match target sequences against known protein sequences and thus filter the desired gene families. In addition, a domain identification function allows easy domain prediction for target sequences using the Pfam database. After the gene family sequences are identified, researchers can perform multiple sequence alignments and then construct phylogenetic trees. Codon and CpG island prediction can also be performed in IPAP, and the non-synonymous substitution rate (Ka) and synonymous substitution rate (Ks) can be calculated. Researchers can also predict and map motifs and gene structures. Together, these functions cover the main needs of gene family analysis.

III. Browse

Community and collection of resources

Items	Brief Introduction	Records
Syn-Dotplots	Homologous structure dotplots related to Cucurbitales	2,700
Hierarchical alignments	Multi-genome alignment matrix with gene identifiers	225 × 23,647
Event-related genes	Gene pairs associated with evolutionary events	270,489
Functional genes	Function-related gene family information	501,617
Regulatory Proteins	Regulatory protein gene family information	86,894
Annotations	Annotated biological function terms	3,761,602
Pathways	Detailed information on Cucurbitales-related pathways	720,404
Syn-Orthogroups	Gene family information for orthologous genes	736,523
Duplicated genes	Genes classified by duplication type	927,302
M6As	*	606

IV. FAQ

A. How do I download data from the Cucurbitales Genome Research Platform?

All data in CGRP, such as genome data, transcriptome data, pathways, and JBrowse tracks, can be downloaded from the corresponding resource page.

B. How to contact us?

If you encounter any problems or find any bugs while using the Cucurbitales Genome Research Platform, please email [email protected], or contact us at:

Address: 21 Bohai Road, Caofeidian, Tangshan 063210, Hebei, China

C. Citation

Data files contained in the CGRP are free of all copyright restrictions and made fully and freely available for non-commercial use. Users of the data should cite the following articles:

・CGRP: A high-value platform for exploring the genomics of Cucurbitales

・An Overlooked Paleotetraploidization in Cucurbitaceae

・A common whole-genome paleotetraploidization in Cucurbitales