What it does
GapCloser closes or shrinks the gaps in scaffolds generated by SOAPdenovo2 or another assembler, using the abundant paired-end relationships of short reads.
System Requirements
GapCloser is designed for large plant and animal genomes, but it also works well on bacterial and fungal genomes. Its memory usage depends on the number of reads, the number of unique k-mers in the reads, the number of gaps and the scaffold sizes. Its processing time depends on the number of gaps, their sizes and the number of reads. For the assembly of the YH genome, which is about 3 Gb in size, peak memory usage was about 200 GB and GapCloser took about 1 day to process the dataset.
Outputs
Two outputs are produced by GapCloser:
2. A fill file providing information about the gaps in the scaffolds. The first column is the start position of a gap in the output sequence, and the second column is its end position. The third and fourth columns are the lengths of the sequences extended from the left and right boundaries of the gap, respectively. The fifth column gives the status of the gap: the flag is set to 1 if the gap was closed through the overlapping of k-mers, and to 0 otherwise. The sixth column is the length of the gap sequence that has relatively high accuracy. The seventh column is the original gap size, and the eighth column is the final gap size. If the gap was closed (the fifth column is 1), the final gap size is the length of the gap sequence. Otherwise, it equals the value of the seventh column, or is 1 bp longer than that value when the seventh column is 1.
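The eight columns described above can be read with a short script. The following is a minimal sketch, not part of GapCloser itself. It assumes that each gap record is a single line of eight whitespace-separated integers and that any other line (for example a scaffold name line) can be skipped; the exact layout of the fill file is not specified here, so check it against a real file first. The names GapRecord, parse_fill_lines and summarize are hypothetical, and the example values are invented for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class GapRecord(NamedTuple):
    """One gap record, following the eight columns described in the text."""
    start: int           # column 1: gap start position in the output sequence
    end: int             # column 2: gap end position
    left_ext: int        # column 3: length extended from the left boundary
    right_ext: int       # column 4: length extended from the right boundary
    closed: bool         # column 5: True if the flag is 1 (gap closed)
    accurate_len: int    # column 6: length of gap sequence with relatively high accuracy
    original_size: int   # column 7: original gap size
    final_size: int      # column 8: final gap size


def parse_fill_lines(lines: Iterable[str]) -> List[GapRecord]:
    """Parse gap records, skipping lines that are not eight integers (assumption)."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 8:
            continue  # assumed to be a header or other non-record line
        try:
            v = [int(f) for f in fields]
        except ValueError:
            continue  # eight fields, but not all of them integers
        records.append(
            GapRecord(v[0], v[1], v[2], v[3], v[4] == 1, v[5], v[6], v[7])
        )
    return records


def summarize(records: List[GapRecord]) -> Dict[str, int]:
    """Count closed gaps and total gap bases before and after gap filling."""
    return {
        "gaps": len(records),
        "closed": sum(1 for r in records if r.closed),
        "gap_bases_before": sum(r.original_size for r in records),
        "gap_bases_after": sum(r.final_size for r in records),
    }


if __name__ == "__main__":
    # Invented example lines; real fill files may be laid out differently.
    example = [
        ">scaffold1 (hypothetical header)",
        "100 150 20 30 1 50 60 50",
        "400 500 10 5 0 15 100 100",
    ]
    print(summarize(parse_fill_lines(example)))
```

To use it on a real file, pass an open file handle to parse_fill_lines, since iterating over a file yields its lines.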
FAQ
Which paired-end reads are used for gap filling?
GapCloser mainly uses read pairs with short and medium insert sizes, although paired-end reads with long insert sizes (over 2 kb) may also help. It is recommended that the reads be error-corrected before gap filling, to reduce memory usage and improve the accuracy of the gap sequences produced at this stage.
What is the quality of the sequences produced during gap filling?
On average, the quality of the sequences filled into gaps is lower than that of the sequences on both sides of the gaps.