What it does
Corrector is one of two programs used to correct sequencing errors based on the kmer frequency spectrum (KFS). It assumes that most low-frequency kmers are generated by sequencing errors, so the key to error correction is separating low-frequency kmers from high-frequency ones. Larger kmer sizes give better results but are more computationally intensive. To produce a more accurate result, the trimmed length and deletion ratio are balanced against the accuracy level. A practical kmer size should be chosen based on the size of the genome.
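The idea of a kmer frequency spectrum can be illustrated with a minimal Python sketch. This is not Corrector's implementation; the function names and the toy cutoff are hypothetical, and the sketch counts kmers on the forward strand only.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every length-k substring (kmer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def frequency_spectrum(counts):
    """Map each frequency to the number of distinct kmers seen that often."""
    return Counter(counts.values())

def low_frequency_kmers(counts, cutoff):
    """Kmers seen at most `cutoff` times (cutoff is a hypothetical parameter)."""
    return {kmer for kmer, n in counts.items() if n <= cutoff}

# Toy data: the third read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = kmer_counts(reads, 4)
spectrum = frequency_spectrum(counts)
suspect = low_frequency_kmers(counts, 1)
print(sorted(spectrum.items()))
print(suspect)
```

In this toy input, the only kmer seen once is the one that covers the mismatching base, which is the kind of signal a KFS-based corrector relies on.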
Note that 30X data is preferred for calculation of the kmer frequency spectrum.
When the kmer size is 17 bp or less, KmerFreq AR and Corrector AR should be used because they run faster than the HA versions. Memory usage for KFS construction will also be no more than 16 GB (15mer, 1 GB; 16mer, 4 GB; 17mer, 16 GB). In addition, KmerFreq AR supports space-kmer in KFS construction, and Corrector AR supports Duo-kmer (consecutive and space kmer) in the correction process.
When kmer sizes larger than 17 bp are to be processed, the HA versions of KmerFreq and Corrector should be used, since they require less memory for KFS construction.
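The memory figures quoted for the AR versions (15mer, 1G; 16mer, 4G; 17mer, 16G) are consistent with a table holding one byte for each of the 4^k possible kmers. That layout is an inference from the numbers, not something the documentation states, so the sketch below should be read as a rough check only. The function names are hypothetical, and the size rule treats 17 bp as the largest AR size because a 17mer figure is listed for AR and HA is recommended above 17 bp.

```python
def ar_table_bytes(k):
    """Bytes for one byte per possible kmer (assumed layout, 4**k entries)."""
    return 4 ** k

def suggested_version(k):
    """AR for kmer sizes up to 17 bp, HA for larger sizes (reading of the text)."""
    return "AR" if k <= 17 else "HA"

for k in (15, 16, 17):
    gib = ar_table_bytes(k) // 2**30
    print(f"{k}mer: {gib} GiB, use {suggested_version(k)}")
```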
Outputs
Each lane will generate two pair.fq files containing paired-end reads and one single.fq file containing single-end reads. If Corrector HA has been configured not to process single-end reads, the single.fq file will not be present. Finally, one pair.single.stat file containing statistical information will be produced.
For each read file, there is one cor.stat file containing statistical information.
For each read list file, there is one QC.xls file containing quality control information.
Memory usage
Memory usage depends on the number of high-frequency kmer species (those with a frequency greater than the low-frequency cutoff). Peak memory usage can be roughly estimated with this formula: HighFreqKmerSpeciesNumber * 8 bytes.
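As a worked example of the formula, the following Python sketch converts a count of high-frequency kmer species into an approximate peak memory figure. The species count used here is an arbitrary illustrative value, not a measurement, and the function names are hypothetical.

```python
BYTES_PER_KMER_SPECIES = 8  # from the formula above

def estimate_peak_memory_bytes(high_freq_kmer_species):
    """Rough peak memory: HighFreqKmerSpeciesNumber * 8 bytes."""
    return high_freq_kmer_species * BYTES_PER_KMER_SPECIES

def format_gib(n_bytes):
    """Render a byte count in GiB with two decimals."""
    return f"{n_bytes / 2**30:.2f} GiB"

# Illustrative input: 500 million high-frequency kmer species.
estimate = estimate_peak_memory_bytes(500_000_000)
print(format_gib(estimate))  # 3.73 GiB
```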
Further information
When calculating the KFS, 30X data is preferred.
Remember that the default ASCII shift for quality values (Quality_shift, -Q) is 64. You should check your input files and make sure this option is set correctly.
Regions of low-frequency kmers will be interpreted as sequencing errors and will be corrected or removed in the final result. However, whole genome shotgun sequencing generates random reads across the genome, so some regions will have very low coverage. These regions will also be removed. You should consider the effect this may have on your data when interpreting the final assembly.
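One way to check which shift a file uses is to look at the lowest ASCII code among its quality characters. The Python sketch below is a common heuristic rather than part of Corrector. It assumes a plain four-line-per-record FASTQ file, the function names are hypothetical, and the result is only a guess: a file of uniformly high-quality shift-33 reads can look like shift 64.

```python
def guess_quality_shift(quality_lines):
    """Guess the ASCII shift (33 or 64) from quality strings, or None if unclear."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:    # characters this low cannot occur with a shift of 64
        return 33
    if lowest >= 64:   # consistent with a shift of 64
        return 64
    return None        # 59..63 is ambiguous

def quality_lines_from_fastq(path, max_records=10000):
    """Collect the 4th line of each record, assuming 4-line FASTQ records."""
    lines = []
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index // 4 >= max_records:
                break
            if index % 4 == 3:
                lines.append(line.rstrip("\r\n"))
    return lines

print(guess_quality_shift(["IIII#5"]))        # 33
print(guess_quality_shift(["hhhh", "ffBB"]))  # 64
```

If the guess is 33, the default of 64 does not match the file and -Q should be set accordingly.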