Sequence Alignment and Analysis
Sequence alignment is crucial for any analysis of evolutionary relationships. It is also useful for extracting functional and even tertiary structure information from the amino acid sequence of a protein. Since evolutionarily related proteins are expected to retain a certain percentage of conserved amino acid residues, the simplest way to assess the relationship between two sequences is to count the number of identical and similar amino acids. This is done by sequence alignment. The number of identical or similar residues is then compared to the total number of amino acids in the protein, and the resulting value is called the percentage of sequence identity or sequence similarity, depending on whether we count identical or similar amino acids. Similarity refers to amino acids with similar physicochemical characteristics, such as the positively charged Lys and Arg, or the hydrophobic Leu and Val. Substituting an amino acid with a chemical equivalent often has no dramatic consequences for the 3D structure or function of the protein. For example, Leu and Val will be equally tolerated within a hydrophobic core, provided there is room for the slightly longer side chain of leucine. The same applies to Lys and Arg, which are usually located on the protein surface and primarily interact with solvent or with the acidic side chains of Glu or Asp, and to other amino acids with similar physicochemical characteristics.
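As a rough illustration of the counting described above, the following Python sketch computes the percentage of identity and similarity for two sequences that are already aligned and have equal length. The residue groups, the function names and the example sequences are hypothetical and chosen only for illustration. The sketch also assumes that similarity counts identical positions together with similar ones, which is a common convention but is not stated in the text.

```python
# Hypothetical grouping of residues by physicochemical character.
# This is an illustrative assumption, not a standard substitution matrix.
SIMILAR_GROUPS = [
    set("KR"),    # positively charged: Lys, Arg
    set("DE"),    # acidic: Asp, Glu
    set("LVIM"),  # hydrophobic
]


def are_similar(a, b):
    """Return True if two different residues belong to the same group."""
    return a != b and any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity_similarity(seq1, seq2):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Assumes equal-length, gap-free input. Similarity here includes
    identical positions as well as similar ones (an assumption).
    """
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    identical = sum(a == b for a, b in zip(seq1, seq2))
    similar = sum(are_similar(a, b) for a, b in zip(seq1, seq2))
    total = len(seq1)
    return 100.0 * identical / total, 100.0 * (identical + similar) / total


# Made-up example sequences (one-letter amino acid codes).
identity, similarity = percent_identity_similarity("MKLVDE", "MRLIDQ")
print(f"identity: {identity:.1f}%, similarity: {similarity:.1f}%")
```

In this made-up pair, three of six positions are identical and two more (Lys/Arg and Val/Ile) are similar, so a different choice of residue groups would change the similarity value but not the identity value.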
However, to be able to count the number of identities and similarities, we first need to align the sequences against each other, and we also need some rules describing how this alignment should be done. The computer program that performs the alignment follows a certain algorithm and tries to align the maximum number of identical or similar amino acid residues against each other.
In other words, we need rules that allow us to assess the importance of different replacements, for example, when counting the percentage of sequence similarity. In addition, it is quite common that a sequence, when compared to the sequences of other members of a family, has some extra inserted residues (insertions) or lacks some residues (deletions). This can be seen, for example, when a group of bacterial sequences is compared against a group of eukaryotic sequences. Sometimes even larger segments or a whole domain may be inserted into or deleted from a protein. Depending on how we handle these insertions and deletions, different sequence alignments may be generated. The program that generates the alignment therefore needs some criteria to distinguish between the possible alignments and to choose the best one.
The score of the alignment can be assessed, for example, by a simple expression:
Score S = number of matches - number of mismatches
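The expression above can be sketched in Python as follows. The two aligned sequences are made up for illustration, the function name is hypothetical, and gap columns (marked by a dash) are simply skipped here, since the expression counts only matches and mismatches.

```python
def simple_score(aligned1, aligned2, gap="-"):
    """Return S = matches - mismatches for two aligned sequences.

    Columns containing a gap character are ignored in this sketch.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    mismatches = 0
    for a, b in zip(aligned1, aligned2):
        if a == gap or b == gap:
            continue  # gap columns are not scored by this expression
        if a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mismatches


# Made-up aligned pair with one gap in the first sequence.
print(simple_score("MKL-VDE", "MKLAVDQ"))  # 5 matches - 1 mismatch = 4
```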
Everything looks fine, except that to maximize the number of matches we introduced a gap (marked by a dash in the first sequence). A gap in one of the sequences simply means that one or more amino acid residues have been deleted from that sequence, or, equivalently, that there is an insertion in the second sequence. To account for gaps, the scoring scheme includes additional terms called gap penalties. Each time the program introduces a gap, it triggers a penalty that reduces the total score of the alignment. Introducing a gap is therefore only worthwhile if it raises the score by more than the penalty costs. In this simple way we can limit the number of gaps and increase their significance. The value of the gap penalty is a parameter that can be changed for an alignment, thus controlling the number, length and position of the gaps. On the next page we will continue the discussion of how a sequence alignment is constructed.
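The trade-off between gaps and penalties can be illustrated with a small Python sketch. It assumes the simplest possible scheme, in which every gap column costs the same fixed penalty; the function name, the penalty values and the sequences are hypothetical, and real alignment programs typically use more elaborate schemes.

```python
def score_with_gaps(aligned1, aligned2, gap_penalty=2, gap="-"):
    """Return matches - mismatches - gap_penalty * (number of gap columns).

    Assumes a fixed penalty per gap column (an illustrative simplification).
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(aligned1, aligned2):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mismatches - gap_penalty * gaps


# Two made-up ways of aligning the same pair, MKLVDEG and MKLAVDE.
ungapped = ("MKLVDEG", "MKLAVDE")   # 3 matches, 4 mismatches, no gaps
gapped = ("MKL-VDEG", "MKLAVDE-")   # 6 matches, 0 mismatches, 2 gaps

for penalty in (2, 4):
    print(penalty,
          score_with_gaps(*ungapped, gap_penalty=penalty),
          score_with_gaps(*gapped, gap_penalty=penalty))
```

With a penalty of 2 the gapped alignment scores higher (2 against -1), whereas with a penalty of 4 the ungapped alignment wins (-1 against -2), which shows how the penalty value decides whether introducing gaps is accepted.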