Patser (version 3b)
This program scores the L-mers (subsequences of length L) of the indicated sequences against the indicated alignment or weight matrix. The elements of an alignment matrix are simply the number of times that the indicated letter is observed at the indicated position of a sequence alignment. Such elements must be processed before the matrix can be used to score an L-mer (e.g., Hertz et al., 1990, CABIOS, 6:81-92). A weight matrix is a matrix whose elements are in a form considered appropriate for scoring an L-mer.
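As an illustration only, the scoring of L-mers against a weight matrix might be sketched in Python as follows. The data layout (one dictionary per pattern position) and the function name are assumptions made for this example and are not taken from the program itself.

```python
def score_lmers(sequence, weights):
    """Return a (start, score) pair for every L-mer of `sequence`.

    `weights` is a hypothetical representation of a weight matrix:
    a list with one dict per pattern position, mapping letter -> weight.
    """
    width = len(weights)  # L, the length of the pattern
    results = []
    for start in range(len(sequence) - width + 1):
        lmer = sequence[start:start + width]
        # The score of an L-mer is the sum of the matrix elements
        # selected by its letters, one element per position.
        score = sum(column[letter] for column, letter in zip(weights, lmer))
        results.append((start, score))
    return results
```

For a pattern of width 2, a sequence of length 3 yields two scored L-mers, starting at positions 0 and 1.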
Each element of an alignment matrix is converted to an element of a weight matrix by first adding pseudo-counts in proportion to the a priori probability of the corresponding letter (see option "-b" in section 1 below) and dividing by the total number of sequences plus the total number of pseudo-counts. The resulting frequency is normalized by the a priori probability for the corresponding letter. The final quotient is converted to an element of a weight matrix by taking its natural logarithm. The use of pseudo-counts here differs from previous versions of this program by being proportional to the a priori probability.
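The conversion described above can be sketched as follows. This is a minimal illustration, not the program's own code: the function name, the dictionary layout, and the default of 1.0 for the total number of pseudo-counts are assumptions.

```python
import math

def alignment_to_weights(counts, priors, pseudo_total=1.0):
    """Convert an alignment matrix to a weight matrix.

    counts: dict letter -> list of counts, one per pattern position.
    priors: dict letter -> a priori probability of that letter.
    pseudo_total: total number of pseudo-counts (assumed default).
    """
    letters = list(counts)
    width = len(counts[letters[0]])
    weights = {letter: [] for letter in letters}
    for pos in range(width):
        # Total number of sequences contributing to this position.
        n_seqs = sum(counts[letter][pos] for letter in letters)
        for letter in letters:
            p = priors[letter]
            # Pseudo-counts are added in proportion to the prior.
            freq = (counts[letter][pos] + pseudo_total * p) / (n_seqs + pseudo_total)
            # Normalize by the prior and take the natural logarithm.
            weights[letter].append(math.log(freq / p))
    return weights
```

With five sequences, uniform priors of 0.25, and one pseudo-count in total, a letter observed in all five sequences receives the weight ln(3.5), and a letter never observed receives ln(1/6).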
Version 3 of this program differs from previous versions by also numerically estimating the p-value of the scores. The p-value calculated here is the probability of observing a particular score or higher at a particular sequence position and does NOT account for the total amount of sequence being scored. P-values are estimated by the method described in Staden, 1989, CABIOS, p. 89-96. The relative value for each element of the weight matrix is approximated by integers in a range determined by the "-R" and "-M" options (section 6 below). The p-value is calculated for each possible integer score and the values are stored. The actual scores for the sequences are determined from the true weight matrix. The true scores are converted to their corresponding integer values and their p-values are looked up.
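The idea of tabulating a p-value for every possible integer score can be sketched as below. This is only an illustration of the general approach: it assumes the integer matrix has already been derived from the weight matrix (how the "-R" and "-M" options determine the integers is not shown), and the function name and data layout are hypothetical.

```python
def score_pvalues(int_matrix, priors):
    """Tabulate P(score >= s) for every reachable integer score s.

    int_matrix: list with one dict per pattern position,
                mapping letter -> integer score (assumed given).
    priors: dict letter -> a priori probability of that letter.
    """
    # Distribution of the partial score after 0 positions.
    dist = {0: 1.0}
    for column in int_matrix:
        nxt = {}
        for total, prob in dist.items():
            for letter, value in column.items():
                key = total + value
                nxt[key] = nxt.get(key, 0.0) + prob * priors[letter]
        dist = nxt
    # Accumulate from the highest score downward to get tail sums.
    pvalues = {}
    running = 0.0
    for score in sorted(dist, reverse=True):
        running += dist[score]
        pvalues[score] = running
    return pvalues
```

Once the table is built, looking up the p-value of an L-mer is a single dictionary access on its integer score.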
Matrices can be either horizontal or vertical. In a horizontal matrix, the columns correspond to the positions within the pattern, and the rows correspond to the letters. Each row begins with the corresponding letter (or integer, if the "-i" option is used). In a vertical matrix, the rows correspond to the positions within the pattern, and the columns correspond to the letters. The first row contains the letters (or integers, if the "-i" option is used) corresponding to each column. In both types of matrices, spaces, tabs, and vertical bars (|) are ignored. The output of the "consensus" and "wconsensus" programs consists of horizontal alignment matrices.
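The relationship between the two layouts can be illustrated with a short sketch that turns a vertical matrix into the horizontal form. The input representation (a list of rows already split into fields) and the function name are assumptions for this example.

```python
def vertical_to_horizontal(rows):
    """Convert a vertical matrix to a horizontal one.

    rows[0] holds the letters, one per column; each later row holds
    the values for one pattern position.
    Returns a dict letter -> list of values, one per position.
    """
    letters = rows[0]
    return {
        letter: [row[i] for row in rows[1:]]
        for i, letter in enumerate(letters)
    }
```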
The input files can contain comments according to the following convention. The portion of a line following a ';', '%', or '#' is considered a comment and is ignored. Comments can begin anywhere in a line and always end at the end of the line. The output of this program is sent to the standard output.
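A minimal sketch of the comment convention, assuming each line is handled separately (the function name is hypothetical):

```python
def strip_comment(line):
    """Drop everything from the first ';', '%', or '#' to the end of the line."""
    cut = len(line)
    for marker in (";", "%", "#"):
        idx = line.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return line[:cut]
```

Text before the comment character is returned unchanged, including any trailing whitespace.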
-f filename: this file (default: read from the standard input) contains the names of the sequences. The corresponding sequence may follow its name if the sequence is enclosed between backslashes (\). Otherwise, the sequence is assumed to be in a separate file having the indicated name.
In the sequences, whitespace, slashes (/), periods, dashes (unless part of an integer when the "-i" option is used), and comments beginning with ';', '%', or '#' are ignored. When using letter characters (i.e., with the "-a" or "-A" alphabet option), integers are also ignored so that the sequence file can contain positional information. When using integer characters (i.e., with the "-i" alphabet option) the integers must be separated by whitespace.
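For a letter alphabet, the rules above amount to the following cleanup. This is a sketch for illustration, with a hypothetical function name; it does not cover the integer alphabet, where dashes and digits are significant.

```python
import re

def clean_letter_sequence(text):
    """Return only the sequence letters from raw sequence text."""
    kept = []
    for line in text.splitlines():
        # Discard a comment starting at ';', '%', or '#'.
        line = re.split(r"[;%#]", line, maxsplit=1)[0]
        # Discard whitespace, slashes, periods, dashes, and digits.
        kept.append(re.sub(r"[\s/.\-0-9]", "", line))
    return "".join(kept)
```

Because digits are dropped, position numbers at the start of each line do not disturb the sequence.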
A "-c" preceding the name of a sequence file indicates that the corresponding sequence is circular.
a) -s: when this option is set, a fixed a priori probability of 0.25 is used for each letter. Otherwise, the frequencies observed in the sequences are used.
b) -a filename: file containing the alphabet and normalization information.
c) -i filename
d) -A alphabet_and_normalization_information
* The three alphabet options in this section ("-a", "-i", and "-A") are mutually exclusive (default: "-a alphabet"). The a priori probabilities mentioned below are used when converting an alignment matrix to a weight matrix.
Each line contains a letter (a symbol in the alphabet) followed by an optional normalization number (default: 1.0). The normalization is based on the relative a priori probabilities of the letters. For nucleic acids, this might be the genomic frequency of the bases or the frequencies observed in the data used to generate the alignment. In nucleic acid alphabets, a letter and its complement appear on the same line, separated by a colon (a letter can be its own complement, e.g. when using a dimer alphabet). Complementary letters may use the same normalization number. Only the standard 26 letters are permissible; however, when the "-CS" option is used, the alphabet is case sensitive so that a total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
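As a sketch, one line of an alphabet file could be read as shown below. Only the forms with a letter, an optional complement after a colon, and an optional single normalization number are handled; the function name and the returned tuple are assumptions for this example.

```python
def parse_alphabet_line(line):
    """Parse 'letter[:complement] [normalization]' into a tuple.

    Returns (letter, complement or None, normalization), where the
    normalization defaults to 1.0 when it is omitted.
    """
    fields = line.split()
    symbols = fields[0].split(":")
    letter = symbols[0]
    complement = symbols[1] if len(symbols) > 1 else None
    normalization = float(fields[1]) if len(fields) > 1 else 1.0
    return letter, complement, normalization
```

For example, the line "a:t 3" gives the letter "a", its complement "t", and a normalization of 3.0.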
e) -CS: ascii alphabets are case sensitive.
f) -CM: ascii alphabets are case insensitive, but mark the location of lowercase letters by printing a line containing their locations. This option is useful when lowercase letters indicate a functional landmark such as a transcriptional start in a DNA sequence.
* Alphabet modifiers indicating whether ascii alphabets are case sensitive---the two options in this section are mutually exclusive with each other and with the "-i" option (default: ascii alphabets are case insensitive).
-i filename: same as the "-a" option, except that the symbols of the alphabet are represented by integers rather than by letters. Any integer permitted by the machine is a permissible symbol.
-A alphabet_and_normalization_information: same as the "-a" option, except that the information appears on the command line (e.g., -A a:t 3 c:g 2).
Input sequences can be in either FASTA format or Consensus format. If FASTA-format sequences are given, the program converts them to Consensus format internally before running the Consensus or Patser program.
Do not explicitly give the complements of nucleic acid sequences. If needed, the complementary sequence is determined by the program. Whitespace, periods, dashes (unless part of an integer when the "-i" option is used), and comments beginning with ';', '%', or '#' are ignored. When using letter characters (i.e., with the "-a" and "-A" alphabet options), integers are also ignored so that the sequence file can contain positional information. When using integer characters (i.e., with the "-i" alphabet option) the integers must be separated by whitespace.
Sequences surrounded by slashes (/) do not contribute to the generation of the patterns; thus, a portion of a sequence can be ignored without disrupting the overall numbering of the sequence. A double slash (//) would indicate a discontinuity in the sequence. A '/' at the beginning or the end of a sequence will cause the sequence to be marked as non-circular even if the sequence's name is marked with a "-c" (see the "-f" option in section 1). The effect of the single slashes can also be created with the "-i" and "-e" modifiers in the file containing the names of the sequences (see the "-f" option in section 1). When slashes and the "-i" and "-e" modifiers are all used, the intersection of permissible positions is analyzed.
Sequences that follow their name in the file indicated by the "-f" option must be enclosed between backslashes (\) (i.e., the actual sequence must be preceded and followed by a backslash). However, if the sequence is contained in a separate file, do NOT use a '\'.
Matrix input can be either a single pattern matrix or multiple pattern matrices, each preceded by a line consisting of ">" followed by the name of the matrix. The following are examples of matrix input.
1) Example 1
A | 0 0 0 1 3 2 0 1 1 0 0 0 0 5 1 5
C | 0 0 1 0 1 0 4 0 0 3 0 0 4 0 3 0
G | 0 2 0 3 0 1 1 4 1 1 4 0 1 0 0 0
T | 5 3 4 1 1 2 0 0 3 1 1 5 0 0 1 0
2) Example 2
> Matrix 1
A | 0 0 0 1 3 2 0 1 1 0 0 0 0 5 1 5
C | 0 0 1 0 1 0 4 0 0 3 0 0 4 0 3 0
G | 0 2 0 3 0 1 1 4 1 1 4 0 1 0 0 0
T | 5 3 4 1 1 2 0 0 3 1 1 5 0 0 1 0
> Matrix 2
A | 0 0 0 1 3 2 0 1 1 0 0 0 0 5 1 5
C | 0 0 1 0 1 0 4 0 0 3 0 0 4 0 3 0
G | 0 2 0 3 0 1 1 4 1 1 4 0 1 0 0 0
T | 5 3 4 1 1 2 0 0 3 1 1 5 0 0 1 0
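Input of this kind could be read with a sketch like the following. It assumes horizontal matrices without comments; the function name and the placeholder name "unnamed" for a matrix that has no ">" line are hypothetical.

```python
def parse_matrices(text):
    """Return a dict: matrix name -> {letter: [values per position]}."""
    matrices = {}
    name = "unnamed"  # placeholder used when no '>' line precedes a matrix
    for raw in text.splitlines():
        # Vertical bars are ignored, like spaces and tabs.
        line = raw.replace("|", " ").strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            continue
        fields = line.split()
        # The first field is the letter; the rest are the row's values.
        matrices.setdefault(name, {})[fields[0]] = [float(x) for x in fields[1:]]
    return matrices
```

Applied to Example 2, this would return two entries, "Matrix 1" and "Matrix 2", each with rows for A, C, G, and T.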