|
Wconsensus (version 5c) |
This program determines consensus patterns in unaligned sequences. The major difference between "wconsensus" and "consensus" is that this program will determine the width of the pattern being sought. The algorithm is based on a matrix representation of a consensus pattern. Each row corresponds to one of the letters of the relevant alphabet---e.g., 4 rows in the case of DNA. Each column corresponds to one of the positions within the pattern. The elements of the matrix are determined by the number of times that the indicated letter occurs at the indicated position based on the words summarized by the pattern.
Matrices are constructed by sequentially adding additional words to previously saved matrices. During each cycle, only the most significant matrices are saved. The maximum number of matrices to save is determined by the "-q" option (see section 1 below). In practice, less matrices are ultimately saved because many of the matrices initially saved are identical to each other.
The program can use 3 different criteria for deciding to stop adding additional words to the saved matrices:
1) Each sequence has contributed exactly one word to the saved matrices (the default).
2) The saved matrices contain a maximum allowable number of words (set with the -n and -N options).
3) The program has completed a designated number of cycles since finding the current most significant alignment (set with the -t option).
* This latter criteria is used in addition to criteria 1 and 2 to terminate the program sooner.
The significance of a matrix is initially measured by its information content. A higher information content indicates a rarer pattern and a more desirable matrix. The information content of alignments having different widths are compared after subtracting from each position the average information and a multiple of the standard deviation expected from an arbitrary alignment of random sequences. The program also estimates for each matrix a p-value, which is the probability of observing the particular information content or higher in an arbitrary alignment of random words having a length equal to the matrix's width. The ultimate statistical significance of a matrix is determined by multiplying the p-value by the approximate number of possible alignments, containing the designated number of sequences and having the observed width. We refer to this product as the expected frequency of the matrix alignment. The expected frequency allows the comparison of matrices summarizing differing numbers of sequences and having differing widths.
To identify an overall best alignment, it is necessary to determine the alignments using various multiples of the standard-deviation correction to the information content (set with the -s option). As the standard-deviation correction is increased, less positions will tend to be in the resulting alignments. The overall best alignment is the one having the lowest expected frequency. We have found standard-deviation corrections of 0.5, 1, 1.5, and 2 to be useful starting values.
The program can print two different lists of matrices. The first list contains the matrices having the highest adjusted information from each cycle, ordered by decreasing statistical significance (i.e., increasing expected frequency). In general, this first list will contain the most interesting alignment. The second list contains the matrices saved after the final cycle of the program, also ordered by decreasing statistical significance. In general, this latter list will be useful when the user wishes each sequence to contribute exactly one word to the final alignment (i.e., when the -n and -N options are not used).
In the program's output, the words contained in each matrix are listed in the order of their occurrence in the input sequences. The order is indicated by "integer|integer". The first integer is simply a sequential count of the words, and the second integer indicates during which cycle the word was added to the matrix. The location of a word is indicated by "integer/integer". The first integer indicates which sequence contains the word, and the second integer indicates where in that sequence the word is located. If the first integer is preceded by a minus sign, then the complementary word is the one included in the matrix.
The output of the program is sent to the standard output. The input files---those containing the actual sequences and those indicated by the "-f", "-a", and "-i" options---can contain comments according to the following convention. The portion of a line following a ';', '%', or '#' is considered a comment and is ignored. Comments can begin anywhere in a line and always end at the end of the line. The one minor exception is that, to avoid ambiguity, comments in the list of sequences (see the "-f" option below) must be preceded by a blank space when not occurring at the beginning of a line.
f) -q integer:
g) -s number:
the maximum number of matrices to save between cycles of the program---i.e., the queue size (default: save 200 matrices). This option can also be changed while the program is running: see section 5 below.
the number of standard deviations to lower the information content at each position before identifying information peaks (required). A range of values should be tried. For example, try values of 0.5, 1, 1.5, and 2. The overall best alignment is the one having the lowest expected frequency.
a) -d option:
use the designated prior probabilities of the letters to override the observed frequencies. By default, the program uses the frequencies observed in your own sequence data for the prior probabilities of the letters. However, if the "-d" option is set, the prior probabilities designated by one of the next 3 options are used. If the "-d" option is not set, the next 3 options are still used to determine the sequence alphabet, but any prior probability information is ignored.
* The next three options are mutually exclusive (default: "-a alphabet").
b) -a filename:
Each line contains a letter (a symbol in the alphabet) followed by an optional normalization number (default: 1.0). The normalization is based on the relative prior probabilities of the letters. For nucleic acids, this might be be the genomic frequency of the bases; however, if the "-d" option is not used, the frequencies observed in your own sequence data are used. In nucleic acid alphabets, a letter and its complement appear on the same line, separated by a colon (a letter can be its own complement, e.g. when using a dimer alphabet). Complementary letters may use the same normalization number. Only the standard 26 letters are permissible; however, when the "-CS" option is used, the alphabet is case sensitive so that a total of 52 different characters are possible.
* POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
* POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
file containing the alphabet and normalization information. [Use "-af" when using the VMS operating system]
* Alphabet modifier indicating whether ascii alphabets are case sensitive---the following option is mutually exclusive with the "-i" option (default: ascii alphabets are case insensitive).
* Options for handling the complement of nucleic acid sequences--- the 3 options in this section are mutually exclusive.
a) -CS: ascii alphabets are case sensitive. [Use "-as" when using the VMS operating system]
b) -c0: ignore the complement (the default option)
c) -c1: include both strands as separate sequences
d) -c2: include both strands as a single sequence (i.e. orientation unknown)
[ONLY IMPLEMENTED FOR WHEN THE "-n" and "-N" OPTIONS ARE NOT USED!!!!]
a) -pr1: save the top progeny matrices regardless of parentage.
b) -pr2:
c) -l (lowercase L):
d) -n integer:
e) -N integer:
f) -m integer:
g) -t integer:
h) -pg0: do NOT permit terminal gaps (the default).
i) -pg1: permit penalized terminal gaps---i.e. terminal deletions.
* The "-q", "-n", and "-t" options can be changed after the program starts by placing the new options in a file called "changes." suffixed with the process identification number---the PID number listed at the beginning of the program's output. For example a file called "changes.10568" might contain "-q10 -n50 -t2". The "-n" option can change the maximum number of words in the alignments even if it was not used at the beginning of the program, although it will not permit a sequence to contribute more than one word to the alignment unless the "-n" or "-N" option was used on the command line. If the "-t" option was not used when the program was started, this option will only keep track of alignments beginning with the cycle during which it is first initiated.
try to save the top progeny matrices for each parental matrix (the default). This option prevents a strong pattern found in only a subset of the sequences from overwhelming the algorithm and eliminating other potential patterns. This undesirable situation can occur when a subset of the sequences share an evolutionary relationship not common to the majority of the sequences. This option corresponds to the original "consensus" algorithm (Stormo and Hartzell, 1989, PNAS, 86:1183-1187; Hertz et al. 1990, CABIOS, 6:81-92).
seed with the first sequence and proceed linearly through the list. This option results in a significant speed up in the program, but the algorithm becomes dependent on the order of the sequence-file names. This option corresponds to the original "consensus" algorithm (Stormo and Hartzell, 1989, PNAS, 86:1183-1187; Hertz et al. 1990, CABIOS, 6:81-92).
repeat the matrix building cycle a maximum of "integer" times and allow each sequence to contribute zero or more words per matrix. [Use "-n1" when using the VMS operating system]
repeat the matrix building cycle a maximum of "integer" times and allow each sequence to contribute one or more words per matrix. [Use "-n2" when using the VMS operating system]
the minimum distance between the starting points of words within the same matrix pattern; must be a positive integer; can only be used when the "-n" or "-N" option is also used. If the integer is a 1, then there is no restriction on the overlap. (default: 1).
terminate the program "integer" cycles after the current most significant alignment is identified (default: terminate only when the maximum number of matrix building cycles is completed).
* -pg1 cannot currently be used with the -n option.
Input sequence can be either of FASTA format or Consensus Format. If FASTA format sequences are given, the program will convert them to Consensus format internally before running the Consensus or Patser program.
Do not explicitly give the complements of nucleic acid sequences. If needed, the complementary sequence is determined by the program. Whitespace, periods, dashes (unless part of an integer when the "-i" option is used), and comments beginning with ';', '%', or '#' are ignored. When using letter characters (i.e., with the "-a" and "-A" alphabet options), integers are also ignored so that the sequence file can contain positional information. When using integer characters (i.e., with the "-i" alphabet option) the integers must be separated by whitespace.
Sequences surrounded by slashes (/) do not contribute to the generation of the patterns; thus, a portion of a sequence can be ignored without disrupting the overall numbering of the sequence. A double slash (//) would indicate a discontinuity in the sequence. A '/' at the beginning or the end of a sequence will cause the sequence to be marked as non-circular even if the sequence's name is marked with a "-c" (see the "-f" option in section 1). The effect of the single slashes can also be created with the "-i" and "-e" modifiers in the file containing the names of the sequences (see the "-f" option in section 1). When slashes and the "-i" and "-e" modifiers are all used, the intersection of permissible positions is analyzed.
Sequences that follow their name in the file indicated by the "-f" option must be enclosed between backslashes (\) (i.e., the actual sequence must be preceded and followed by a backslash). However, if the sequence is contained in a separate file, do NOT use a '\'.