Transcription factor binding sites (TFBSs), often
short and degenerate, are computationally difficult to identify without being
overwhelmed by false-positive calls. Since the structure of the genomic sequence
is heterogeneous, we develop a Local Markov Model (LMM) to assign probabilistic
significance to each TFBS candidate with respect to its local sequence
context. We show that the p-value for a TFBS candidate under the LMM can
be computed exactly and efficiently. We applied the LMM to large-scale human binding
site sequences in situ and found that, compared with currently popular methods,
LMM analysis can reduce false-positive errors by more than 50% without compromising
sensitivity.
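
As an illustration of how an exact p-value can be obtained under a Markov background fitted to local context, the following is a minimal sketch and not the authors' implementation. It assumes a first-order chain, an integer-valued position weight matrix, a window containing only the letters A, C, G and T, and an initial distribution taken from the window's base frequencies. The names fit_markov and exact_pvalue, and the pseudocount, are hypothetical choices made for this example.

```python
from collections import defaultdict

BASES = "ACGT"

def fit_markov(window, pseudo=1.0):
    """Estimate first-order Markov parameters from a local sequence window.

    Assumption: `window` contains only A, C, G, T. A pseudocount keeps
    every probability strictly positive.
    """
    init = {b: pseudo for b in BASES}
    trans = {a: {b: pseudo for b in BASES} for a in BASES}
    for ch in window:
        init[ch] += 1
    for a, b in zip(window, window[1:]):
        trans[a][b] += 1
    total = sum(init.values())
    init = {b: c / total for b, c in init.items()}
    for a in BASES:
        row_total = sum(trans[a].values())
        trans[a] = {b: c / row_total for b, c in trans[a].items()}
    return init, trans

def exact_pvalue(pwm, threshold, init, trans):
    """Return P(score >= threshold) for a site drawn from the Markov chain.

    `pwm` is a list of dicts (one per motif position) mapping each base
    to an integer score. The dynamic program tracks the joint distribution
    of (last base, cumulative score), so no sampling is involved.
    """
    dist = defaultdict(float)
    for b in BASES:
        dist[(b, pwm[0][b])] += init[b]
    for col in pwm[1:]:
        nxt = defaultdict(float)
        for (a, s), p in dist.items():
            for b in BASES:
                nxt[(b, s + col[b])] += p * trans[a][b]
        dist = nxt
    return sum(p for (_, s), p in dist.items() if s >= threshold)

if __name__ == "__main__":
    # Toy window and toy two-column matrix, both made up for illustration.
    init, trans = fit_markov("ACGTACGTTTGACA")
    pwm = [{"A": 2, "C": 0, "G": 0, "T": 0},
           {"A": 0, "C": 2, "G": 0, "T": 0}]
    print(exact_pvalue(pwm, 4, init, trans))
```

The number of (base, score) states grows with the range of integer scores rather than with the number of possible sequences, which is what keeps this kind of exact computation tractable for short motifs.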