What it does
SOAP performs efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The program is designed to handle the short reads generated by parallel sequencing with the new-generation Illumina-Solexa sequencing technology. SOAP is compatible with numerous applications, including single-read or pair-end resequencing, small RNA discovery and mRNA tag sequence mapping. SOAP supports multi-threaded parallel computing and has a batch mode for querying multiple data sets.
Single-end sequencing
SOAP allows a certain number of mismatches or one continuous gap when aligning a read onto a reference sequence. For each read, the best hit is reported: the one with the fewest mismatches or the smallest gap. When a read has multiple equally good best hits, the user can instruct SOAP to report all of them, report a random one, or disregard them all. Since the typical read length is 25-50 bp, hits with too many mismatches are unreliable and hard to distinguish from random matches. By default, the program allows at most two mismatches. Between two haplotype genome sequences, single nucleotide polymorphisms occur much more often than small insertions or deletions, so ungapped hits take precedence over gapped hits. For gapped alignment, only one continuous gap of 1 to 3 bp is accepted, and no mismatches are permitted in the flanking regions, to avoid ambiguous gaps. The gap can be either an insertion or a deletion in the query relative to the reference sequence.
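The gapped-alignment rule above (one continuous gap of 1 to 3 bp, exact flanks) can be illustrated with a small brute-force check. This is only a sketch under stated assumptions, not SOAP's implementation: the function name `single_gap_alignment`, its arguments and its return tuple are hypothetical, and the candidate start position is assumed to be known already.

```python
def single_gap_alignment(read, ref, pos, max_gap=3):
    """Try to align `read` at 0-based `pos` on `ref` with one contiguous gap.

    Returns (kind, gap_size, offset) for the smallest gap that works, or None.
    `offset` is the number of read bases matched before the gap starts.
    Both flanks must match exactly (no mismatches allowed).
    """
    for gap in range(1, max_gap + 1):          # smallest gap first
        for k in range(1, len(read)):          # k = length of the left flank
            if read[:k] != ref[pos:pos + k]:
                break                          # longer left flanks cannot match either
            # Insertion on the read: the read carries `gap` extra bases after k.
            rest = read[k + gap:]
            if rest and rest == ref[pos + k:pos + k + len(rest)]:
                return ("insertion", gap, k)
            # Deletion on the read: the reference carries `gap` extra bases after k.
            rest = read[k:]
            if rest == ref[pos + k + gap:pos + k + gap + len(rest)]:
                return ("deletion", gap, k)
    return None


ref = "ACGTTGCAAGGCTTACCGATG"
read_with_deletion = ref[0:6] + ref[8:14]       # two reference bases missing
read_with_insertion = ref[0:6] + "T" + ref[6:12]  # one extra base in the read
print(single_gap_alignment(read_with_deletion, ref, 0))
print(single_gap_alignment(read_with_insertion, ref, 0))
```

Note that a read which also matches without a gap may still pass this check in repetitive sequence, which is one reason to try ungapped alignment first, as the text describes.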
An intrinsic characteristic of sequencing technology is the accumulation of errors during the sequencing process. Reads always exhibit a higher number of sequencing errors at the 3'-end, which sometimes inhibits them from being aligned onto reference sequences. To deal with this problem, SOAP can iteratively trim several basepairs at the 3'-end and redo the alignment, until hits can be detected or the remaining sequence is too short for specific alignment.
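The iterative trimming described above might look roughly like the following sketch. The names `align_with_trimming` and `exact_hits`, the trimming step and the minimum length are illustrative assumptions; the source does not state the values SOAP uses, and the exact-match search merely stands in for a real aligner.

```python
def exact_hits(ref):
    """Return a search function reporting 1-based exact-match positions on `ref`."""
    def find(seq):
        hits, start = [], ref.find(seq)
        while start != -1:
            hits.append(start + 1)
            start = ref.find(seq, start + 1)
        return hits
    return find


def align_with_trimming(read, find_hits, step=2, min_length=10):
    """Trim `step` bases off the 3'-end and retry until hits are found.

    Stops when the remaining sequence is shorter than `min_length`.
    Returns (aligned_sequence, hits), or (None, []) if nothing aligns.
    """
    current = read
    while len(current) >= min_length:
        hits = find_hits(current)
        if hits:
            return current, hits
        current = current[:-step]   # drop bases from the error-prone 3'-end
    return None, []


ref = "ACGTTGCAAGGCTTACCGATGCATTAGC"
read = ref[2:22] + "GG"            # last two bases simulate 3'-end errors
print(align_with_trimming(read, exact_hits(ref)))
```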
Pair-end sequencing
This methodology involves sequencing both ends of a DNA fragment. A pair of reads always has a fixed relative orientation and an approximate distance between the two reads on the genome. This characteristic can significantly improve the accuracy of resequencing mapping, and is a powerful method for detecting structural variants including copy number variations (CNVs), rearrangements and inversions. SOAP is able to align a pair of reads simultaneously. A pair is aligned when the two reads are mapped with the correct relative orientation and a proper distance. As in single-read alignment, a certain number of mismatches are allowed in one or both reads of the pair. For gapped alignment, a gap is permitted on only one read, and the other read must match exactly.
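As a rough illustration of the pairing test, the sketch below accepts two hits as a pair when they lie on the same reference sequence, on opposite strands facing each other, and within an expected distance range. The `Hit` class, `is_proper_pair` and the way the distance is measured (outer ends of the fragment) are assumptions made for this example; the text above does not specify SOAP's exact criteria.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chrom: str    # reference sequence identifier
    pos: int      # 1-based leftmost position on the reference
    strand: str   # "+" or "-"
    length: int   # aligned read length


def is_proper_pair(a, b, min_insert, max_insert):
    """Check orientation and distance for two single-read hits."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    if fwd.pos > rev.pos:
        return False                      # reads point away from each other
    span = rev.pos + rev.length - fwd.pos  # outer distance of the fragment
    return min_insert <= span <= max_insert


left = Hit("chr1", 100, "+", 35)
right = Hit("chr1", 265, "-", 35)
print(is_proper_pair(left, right, 150, 250))
```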
Output file format
Each line corresponds to an alignment hit between a read and its reference, and has the following columns:
    Column        Description
    ----------    --------------------------------------------------------
     1 Id         An identifier for read sequence.
     2 Seq        Full read sequence. It will be converted to the
                  complementary sequence if mapped on the reverse chain
                  of the reference.
     3 Qual       Sequence quality.
     4 Numhits    Number of best hits. Reads with no hits are ignored.
     5 a/b        A flag used for pair-end alignment that distinguishes
                  which file the read belongs to.
     6 Length     Read length. If aligned after trimming, it will report
                  the length of the trimmed read.
     7 +/-        Denotes whether alignment occurs on the direct(+) or
                  reverse(-) chain of the reference.
     8 Chr        Identifier of the reference sequence.
     9 Location   Location of the first base pair on the reference,
                  counted from 1.
    10 Types      Type of hit associated with read
                  0: exact match
                  1-100 RefAllele->OffsetQueryAlleleQual: number of
                      mismatches, followed by detailed mutation sites
                      and switch of allele types. Offset is relative to
                      the initial location on reference.
                      'OffsetAlleleQual': offset, allele and quality.
                      Example: "2 A->10T30 C->13A32" means there are
                      two mismatches, one on location+10 of reference,
                      and the other on location+13 of reference. The
                      allele on reference is A and C respectively,
                      while query allele type and its quality is T,30
                      and A,32.
                  100+n Offset: n-bp insertion on read.
                      Example: "101 15" means 1-bp insertion on read,
                      start after location+15 on reference.
                  200+n Offset: n-bp deletion on read.
                      Example: "202 16" means 2-bp deletion on query,
                      start after 16bp on reference.
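To make the column layout concrete, here is a minimal parsing sketch. It assumes that fields are separated by whitespace (tabs or spaces) and that qualities in the mismatch details are written as integers, as in the example above; neither is confirmed by this document. The function name `parse_hit` and the dictionary keys are hypothetical.

```python
import re

# One mismatch site, e.g. "A->10T30": ref allele, offset, query allele, quality.
MISMATCH_RE = re.compile(r"([A-Za-z])->(\d+)([A-Za-z])(-?\d+)")


def parse_hit(line):
    """Parse one alignment line into a dictionary (assumed whitespace-delimited)."""
    f = line.split()
    if len(f) < 10:
        raise ValueError("expected at least 10 columns")
    record = {
        "id": f[0], "seq": f[1], "qual": f[2], "num_hits": int(f[3]),
        "pair_flag": f[4], "length": int(f[5]), "strand": f[6],
        "chr": f[7], "location": int(f[8]),
    }
    code = int(f[9])
    if code == 0:
        record["kind"] = "exact"
    elif code <= 100:
        record["kind"] = "mismatch"
        sites = []
        for item in f[10:]:
            m = MISMATCH_RE.fullmatch(item)
            if m is None:
                raise ValueError("unrecognised mismatch field: " + item)
            ref_allele, offset, query_allele, quality = m.groups()
            sites.append((ref_allele, int(offset), query_allele, int(quality)))
        record["mismatches"] = sites
    else:
        # 100+n: n-bp insertion on read; 200+n: n-bp deletion on read.
        record["kind"] = "insertion" if code <= 200 else "deletion"
        record["gap_size"] = code - 100 if code <= 200 else code - 200
        record["offset"] = int(f[10])
    return record


# Hypothetical line built from the example in the table above.
line = "\t".join(["read_7", "ACGT" * 9, "h" * 36, "1", "a", "36", "+",
                  "chr1", "1000", "2", "A->10T30", "C->13A32"])
print(parse_hit(line)["mismatches"])
```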
Algorithm
SOAP1 initially loads reference sequences into memory and then creates hash tables for seed indexing. For each query, a search is performed for seeded hits followed by the alignment of the read to the reference.
1. Load in reference sequences. In contrast to Eland and Maq, which load reads into RAM, SOAP stores the reference sequences in memory. Two bits are used for each base, so one byte can store 4 bp. In theory, a reference with total sequence size L needs L/4 bytes.
2. Suppose a read is split into 4 parts: a, b, c and d. Two mismatches can fall on at most two of the 4 parts, so at least two parts are free of mismatches. If we use each combination of two parts as a seed and check for mismatches in the remaining parts, we will find all hits with up to 2 mismatches. There are six combinations - ab, ac, ad, bc, bd and cd, and essentially 3 types of seeds - ab, ac, ad. So we build 3 index tables. To save memory, we set a skip of 3 bp on the reference. The strategy is essentially the same as that used in the Eland and Maq programs.
3. Lookup table. We use a lookup table to count the mismatches between a reference fragment and a read. For best efficiency, the table is indexed by 3 bytes, so a 12-bp fragment is checked at a time. The table occupies 2^24 bytes = 16 MB of RAM.
4. Search for hits. Identical hits are identified first. If no hits are found then 1-mismatch hits will be picked up followed by 2-mismatch hits. Finally, gapped hits are identified.
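The four steps above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not SOAP's code: all function names are hypothetical, the base-to-bits mapping is an arbitrary choice, the lookup table covers 4 bp (256 entries) instead of the 12 bp (2^24 entries) described in step 3, the search scans every reference position instead of using seed index tables, and the final gapped stage is omitted.

```python
from itertools import combinations

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding
CHUNK = 4   # bases per table lookup here; the text describes 12


def pack(seq):
    """Step 1: pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


# Step 3: MISMATCH_TABLE[x] is the number of non-zero 2-bit groups in x,
# i.e. the number of differing bases when x is the XOR of two packed chunks.
MISMATCH_TABLE = bytes(
    sum(1 for i in range(CHUNK) if (x >> (2 * i)) & 3)
    for x in range(4 ** CHUNK)
)


def count_mismatches(a, b):
    """Count differing bases between two equal-length sequences via the table."""
    diff = pack(a) ^ pack(b)
    mask = 4 ** CHUNK - 1
    total = 0
    while diff:
        total += MISMATCH_TABLE[diff & mask]
        diff >>= 2 * CHUNK
    return total


def clean_seed_pairs(parts_with_mismatch):
    """Step 2: seed pairs (of ab, ac, ad, bc, bd, cd) that contain no mismatch."""
    return ["".join(p) for p in combinations("abcd", 2)
            if not set(p) & set(parts_with_mismatch)]


def tiered_search(read, ref, max_mismatches=2):
    """Step 4: return the lowest mismatch tier with hits (1-based positions)."""
    n = len(read)
    for allowed in range(max_mismatches + 1):
        hits = [pos + 1 for pos in range(len(ref) - n + 1)
                if count_mismatches(read, ref[pos:pos + n]) == allowed]
        if hits:
            return allowed, hits
    return None


ref = "ACGTTGCAAGGCTTACCGATG"
print(clean_seed_pairs("ab"))        # mismatches in parts a and b
print(tiered_search("GCTAGG", ref))  # one base differs from the reference
```

Whichever two parts carry the mismatches, `clean_seed_pairs` always returns at least one pair, which is the reason the seed strategy in step 2 finds every hit with up to two mismatches.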
Important notes
1. Very short sequences should be avoided in the reference genome. The program will break if any reference sequence is shorter than the query reads, so such sequences should be removed.
Citation
If you use SOAP in your research, we would appreciate the citation of its paper:
Ruiqiang Li, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008 24: 713-714.
More information
For test data and more detailed information, click here.