Cenicaña advances in the construction of the sugarcane genome

The genome is similar to a library where all the genetic information of an organism is stored.

Instead of books, the information is stored in chromosomes, which are written with the four chemical bases (adenine, guanine, thymine and cytosine) that make up DNA.

These bases determine the instructions within DNA and, just as the letters of the alphabet are combined to make words, they are combined to form genes. Genes carry the information that determines our traits, that is, the characteristics that make us what we are and that we inherited from our parents.

Defining a catalog of the genes present in the genome helps to understand the biology of the organism. In this way, it is possible to know its evolutionary process, its relationship with other species and its resistance or susceptibility to different types of diseases, pests or abiotic conditions. In addition, the genome is a frame of reference to understand how variable genes are between organisms of the same species or of different species.

Genetic variability between individuals of a group or a population is observed as changes in one or several of the chemical bases. These variants are known as molecular markers and have applications in genetic improvement.

For example, in the case of crops, a set of molecular markers identified in a population can be used to find correlations with characteristics of agronomic interest. Because the genome of an organism does not change significantly during its life cycle, molecular markers become predictors of the presence or absence of the characteristics of interest, without having to wait for the crop to complete its life cycle. This and other applications make the information present in the genome valuable in plant breeding processes.
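The idea of a marker as a change in one of the chemical bases can be illustrated with a minimal sketch. It assumes two short sequences that are already aligned and of equal length; the function name `find_snps` and the example sequences are hypothetical and are not taken from the Cenicaña work.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumes both sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, smp_base)
        for pos, (ref_base, smp_base) in enumerate(zip(reference, sample))
        if ref_base != smp_base
    ]


# Hypothetical 10-base region from a reference and from one individual.
reference = "ATGCCGTAAC"
sample = "ATGACGTAGC"
print(find_snps(reference, sample))  # [(3, 'C', 'A'), (8, 'A', 'G')]
```

Each reported position is a candidate marker; in practice, variants are called from many reads per individual rather than from a single pair of sequences.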

The puzzle game

The assembly of a genome can be seen as a puzzle game, where the pieces correspond to sequences of organic compounds (nucleotides) and the assembled puzzle corresponds to the genome. There are millions of pieces (sequences) that must be assembled correctly to obtain the original reference.

To assemble a genome it is necessary to extract a sufficient amount of DNA from the plant, which is then subjected to a process called sequencing. This process determines the order of the nucleotides in the genome, but to do so it fragments the DNA into millions of sequences (pieces) whose positions are unknown. Because of the amount of information needed to assemble a puzzle of this size, computational algorithms are required to join the sequences in the right order until the genome is reassembled. However, this procedure does not guarantee the complete reconstruction of the genome: some pieces may remain incomplete because of sequencing errors, and the complexity of the genome means that many almost identical pieces are found, mainly in the regions that are repeated in the genome.
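The puzzle analogy can be sketched with a toy greedy assembler that repeatedly joins the two pieces sharing the longest end-to-start overlap. This is a simplified illustration, not the algorithm used by any production assembler; the function names, the `min_len` threshold and the three example reads are assumptions made for the example, and the reads are taken to be error-free.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads  # whatever is left are the assembled pieces
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three hypothetical reads, given in arbitrary order.
print(greedy_assemble(["TACCTGA", "ATGGCGT", "GCGTACC"]))  # ['ATGGCGTACCTGA']
```

When a region is repeated in the genome, several pieces share the same overlap and a greedy choice like this one can join the wrong pair, which is one reason real assemblies end up split into many separate sequences.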

The procedure has been used successfully to reconstruct the human genome and the genomes of plant species such as the crucifer Arabidopsis thaliana, rice, soybean, maize, bean and cassava. The information present in the human genome contributes to understanding different types of diseases and the mutations linked to them, among many other applications. In the case of crops, it is used in plant improvement programs to support the selection of new varieties.

What we are doing at Cenicaña

At the Research Center we have advanced in the construction of the sugarcane genome based on the DNA extracted from the variety CC 01-1940.

The variety CC 01-1940 is a hybrid obtained by Cenicaña from the cross between CCSP 89-1997 (mother) and CC 91-1583 (father); it was selected in humid environments.

A. To generate the first version of the CC 01-1940 genome, 100.5 Gbp (1 Gbp = 1,000,000,000 bp) of sequence were used, obtained as PacBio reads, a next-generation sequencing (NGS) technology. The reads had an average length of 12 kbp (1 kbp = 1000 bp).
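As a rough back-of-the-envelope check, the two figures above imply the approximate number of reads that went into the assembly. The calculation below is only an estimate, because the average read length is a rounded value; the variable names are chosen for illustration.

```python
# Figures reported in the text.
total_bases = 100_500_000_000  # 100.5 Gbp of PacBio sequence
mean_read_length = 12_000      # average read length of 12 kbp

# Approximate number of reads = total sequenced bases / average read length.
approx_reads = total_bases // mean_read_length
print(f"{approx_reads:,} reads (approximately)")  # 8,375,000 reads (approximately)
```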

B. The assembly process was carried out in three stages using the Canu assembler, a bioinformatics tool. In the first stage, the sequencing errors in each of the PacBio reads were corrected. In the second stage, the nucleotide sequences that were still of poor quality after the first correction were discarded.

C and D. In the third stage, the final reads were used to build an assembly graph and to generate the contigs. The assembly graph provides a mathematical representation of the way nucleotide sequences are aligned by their common regions, while the contigs correspond to the assembly generated from these alignments.
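To make the notion of an assembly graph more concrete, the sketch below builds a very small overlap graph: each read is a node, and an edge from one read to another records how many bases the end of the first shares with the start of the second. This is an illustrative simplification and not Canu's actual implementation; the function names, the `min_len` threshold and the example reads are assumptions.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads: list[str], min_len: int = 3) -> dict[str, dict[str, int]]:
    """Map each read to the reads it overlaps, with the overlap length."""
    graph: dict[str, dict[str, int]] = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph


graph = build_overlap_graph(["ATGGCGT", "GCGTACC", "TACCTGA"])
for read, edges in graph.items():
    print(read, "->", edges)
```

In this toy case the graph is a single path (ATGGCGT to GCGTACC to TACCTGA), and following that path spells out one contig. Branches in the graph, typically caused by repeated regions, are where a contig has to stop.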

In the evaluation of the assembly, the following metrics were taken into account: the number of contigs generated, the N50 (a length-weighted median of contig size), the total assembled size and the error rate. The values obtained help to verify the quality of the assembly in terms of the continuity of the contigs. In general, a low number of contigs indicates a more continuous assembly, and larger contigs mean that fewer of them are needed.
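The N50 can be computed as sketched below, using the common definition: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembled bases. The contig lengths in the example are invented for illustration and are not data from the CC 01-1940 assembly.

```python
from statistics import median


def n50(contig_lengths: list[int]) -> int:
    """Contig length at which the cumulative sum, from longest to shortest,
    first reaches half of the total assembled length."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty positive lengths")


# Hypothetical contig lengths in bp.
lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(lengths))     # 70
print(median(lengths))  # 40
```

The example shows that the N50 weights each contig by its length, so it is usually larger than the plain median of the contig sizes.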

According to these metrics, this first version of the CC 01-1940 genome comprises 75,684 contigs with an N50 of 22,455 bp, a total assembly size of 1224 Mbp and an error rate of 3.5%.

These results indicate that the contigs generated are of high quality, although the N50 value indicates that their continuity needs to be improved.

The overall size of the assembly shows that it was possible to reconstruct a reference close to the monoploid genome of sugarcane, that is, a basic representation of the complete sugarcane genome was achieved.

By flow cytometric analysis, the size of the genome of the variety CC 01-1940 was determined at approximately 1019 Gbp.

The assembly seeks to “put the puzzle together” in such a way that its final size is as close as possible to the real size of the genome.

With this objective in mind, Cenicaña is working on the second version of the sugarcane genome by sequencing a new type of library known as Hi-C with Illumina NGS technology. Hi-C libraries provide information about the proximity of DNA sequences and, consequently, help to join contigs into larger sequences, or scaffolds. Furthermore, reconstructing a genome at the chromosome level often requires genetic maps to guide the assembly with greater precision, especially in complex species such as sugarcane.

The sugarcane genome, variety CC 01-1940, in its second version, will be used in Cenicaña as a reference genome for the identification of molecular markers with potential in the genetic improvement program of the Research Center.


Flow cytometry: single-cell analysis technique that uses laser light and detection devices to infer the number of cells in a sample, their size, shape and other characteristics.

Contig: set of DNA reads, in the form of nucleotide sequences, that align with each other by their common regions and that together represent a consensus region of the genome.

Monoploid genome: it corresponds to the basic representation of a complete genome that is made up of a minimum number of chromosomes.

Assembly graph: It is the mathematical and computational representation of the alignments of the fragments or sequencing reads.

Hi-C: methodology that allows sequencing of DNA regions that lie close to each other within and between chromosomes, in order to improve the continuity of an assembly.

Sequencing read: the nucleotide sequence obtained from a section of a DNA fragment.

NGS: acronym for next-generation sequencing technologies.

Nucleotides: organic molecules formed by the covalent union of a five-carbon monosaccharide, a nitrogenous base and a phosphate group.

Base pairs or bp: unit referring to the pairs of nucleotides linked together by hydrogen bonds, represented by bp (base-pair). A 100 bp contig is made up of 100 nucleotide pairs; 1 kbp (kilo base pairs) = 1000 bp; 1 Mbp (mega base pairs) = 1,000,000 bp.

Scaffolds: regions or fragments of DNA formed by linking and ordering contigs.


JHON HENRY TRUJILLO. Systems and computer engineer, linked to Cenicaña as a doctoral student in Bioinformatics (Engineering with an emphasis in Computer Science, Universidad del Valle)
JHON JAIME RIASCOS. Biologist, Biotechnologist, Ph.D. - Cenicaña.

Information letter 
Year 6 / Number 1 / July 2018