
GALF-G: Discovering Multiple Realistic TFBS Motifs Based on a Generalized Model
Genetic Algorithm with Local Filtering - Generalized


Overview

  • GALF-G is a method for discovering multiple realistic transcription factor binding site (TFBS) motifs in DNA sequences. It combines a generalized model, which evaluates every width in the input range simultaneously, with a meta-convergence framework, which discovers multiple (possibly overlapping) motifs simultaneously. GALF-G (G for generalized) extends the GA-based GALF-P with these two components.

Download

  • Win32 Version: The program can be run directly in command line mode on Windows platforms after extraction by WinRAR or WinZip.
  • To run: click Start -> Run..., type cmd and click OK to open the command line (cmd). Change to the directory containing GALFG.exe and type GALFG.exe

Usage

  • Click to see the full GALF-G Help

  • Basic Usage: GALFG.exe -i input_file -o output_file [[-w width] [-minw min_width -maxw max_width] [-prior distribution]]

    Compulsory ones (input and output):
      -i input_file   the fasta file to read in
      -o output_file   the file to store the result details
    Motif width(s) [(Motif width(s))...]:
      -w width   the motif width, for cases where it can be specified, e.g. when the user has prior knowledge of the expected width. Giving -w width alone, without -minw and -maxw, means the user has an exact, fixed width in mind.
      -minw min_width -maxw max_width   together specify the width range of interest; they have to be input together. The maximal range to handle is 10.
      -prior distribution   the distribution of the motif widths in the range; choices: unif (uniform; default), pois (Poisson) and cstm (customized)

     

    The above set of arguments (Compulsory ones and Motif width(s)) has to be input in order and before the other optional parameters.
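
    The ordering rule above can be sketched as a small Python helper that assembles the argument list. This is only an illustration: the function name build_galfg_command and the file names are hypothetical, and the checks cover just what this page states (-minw and -maxw go together; the prior is one of unif, pois, cstm).

```python
def build_galfg_command(input_file, output_file, width=None,
                        min_width=None, max_width=None, prior=None,
                        exe="GALFG.exe"):
    """Assemble a GALFG.exe argument list in the documented order.

    Hypothetical helper, not part of GALF-G itself.
    """
    # -minw and -maxw have to be input together.
    if (min_width is None) != (max_width is None):
        raise ValueError("-minw and -maxw have to be given together")
    # Only the three documented prior names are accepted.
    if prior is not None and prior not in ("unif", "pois", "cstm"):
        raise ValueError("prior must be unif, pois or cstm")

    # Compulsory arguments first, then the motif width arguments.
    args = [exe, "-i", input_file, "-o", output_file]
    if width is not None:
        args += ["-w", str(width)]
    if min_width is not None:
        args += ["-minw", str(min_width), "-maxw", str(max_width)]
    if prior is not None:
        args += ["-prior", prior]
    return args


# Placeholder file names, for illustration only.
cmd = build_galfg_command("input.fasta", "output.txt",
                          min_width=8, max_width=10, prior="unif")
print(" ".join(cmd))
# GALFG.exe -i input.fasta -o output.txt -minw 8 -maxw 10 -prior unif
```

    On a machine where GALFG.exe is available, the returned list could be passed to subprocess.run; any further optional parameters would be appended after these.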


  • Motif Config Usage: GALFG.exe -i input_file -o output_file [(Motif width(s))...] [-n Types\-b Beta\-fmode Mode]

      -n Types   number of motif types (default: 5)
      -b Beta   threshold for the similarity test (default: 0.3; can be set towards 1 for highly different motifs (e.g. 0.5), and towards 0 for close motifs (e.g. 0.1))
      -fmode Mode   the assumption of motif distributions; options:
            ANOPS (default; any number of occurrences per sequence);
            OOPS (one occurrence per sequence);
            ZOOPS (zero or one occurrence per sequence);
            MOOPS (more than one occurrence per sequence)

     

    The above optional parameters can be input in any order (after Motif width(s)); if a parameter is repeated, the later value overwrites the earlier one.


  • Repeat Mask Usage: GALFG.exe -i input_file -o output_file [(Motif width(s))...] [-mskact MASKACT -msklen MASKLENGTH -mskoff MASKOFF]

      -mskact MASKACT   the activation switch for repeat pattern (e.g. AAAAAAAA (single repeats), ATATATAT (double repeats)...) removal (0: off, default; 1: on).
    If -mskact is off, MASKLENGTH and MASKOFF will be ignored
      -msklen MASKLENGTH   the length for a repeat pattern to be removed (default: 6);
      -mskoff MASKOFF   the expansion offset flag (0: off; 1: on, default). If MASKOFF is on, the maxw-1 nucleotides before the repeat pattern are removed as well, to avoid any overlap of the candidate instances

     

    The above set of arguments (Compulsory ones and Motif width(s)) has to be input in order and before the other optional parameters.
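
    To make the repeat mask options concrete, the following Python sketch masks single and double repeats in a sequence. It is not GALF-G's own code and rests on several assumptions: "removal" is modelled as replacing nucleotides with N, a repeat counts once it spans at least mask_length nucleotides, and the names mask_repeats and find_repeat_spans are hypothetical.

```python
def find_repeat_spans(seq, period, min_length):
    """Return (start, end) spans where seq repeats with the given period."""
    spans = []
    n = len(seq)
    i = 0
    while i + period < n:
        j = i
        # Extend while each nucleotide equals the one `period` positions later.
        while j + period < n and seq[j] == seq[j + period]:
            j += 1
        run_length = (j - i) + period
        if j > i and run_length >= min_length:
            spans.append((i, i + run_length))
        i = max(j, i + 1)
    return spans


def mask_repeats(seq, mask_length=6, mask_offset=True, max_width=10,
                 mask_char="N"):
    """Mask period-1 (AAAA...) and period-2 (ATAT...) repeats.

    Assumption: masking with `mask_char` stands in for the removal
    described above; the real program may behave differently.
    """
    masked = list(seq)
    upper = seq.upper()
    for period in (1, 2):
        for start, end in find_repeat_spans(upper, period, mask_length):
            if mask_offset:
                # Also cover the max_width-1 nucleotides before the repeat.
                start = max(0, start - (max_width - 1))
            for k in range(start, end):
                masked[k] = mask_char
    return "".join(masked)


print(mask_repeats("CGTACGATATATATGC", mask_length=6, mask_offset=False))
# CGTACGNNNNNNNNGC
print(mask_repeats("CGTACGATATATATGC", mask_length=6, mask_offset=True,
                   max_width=4))
# CGTNNNNNNNNNNNGC
```

    With the offset on, the masked region grows by maxw-1 positions to the left, so that no candidate instance of width up to maxw can start before the repeat and run into it.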


  • Extreme Voting Usage: for extreme datasets and poorly conserved TFBSs (e.g. Tompa et al Benchmark)

      Details   Warning: this is not a normal mode for the program. It serves only to investigate extreme cases for exploratory purposes.


Result Formats

  • First lines include the running command and parameter settings.
  • Motif # shows the different output motifs ranked by their final fitness; (Slot[x] means its ranking before (optional) instance refinement)
  • Fit: the final fitness; ic: information content; instance_num: the number of TFBS instances in the output motif; width: the core width of the motif; offset: the offset of the core width (out of the maximal width from zero)
  • Instance:
    sequence number (from 0) \tab sequence comment in the fasta file \tab position of the TFBS (offset already added) \tab the subsequence of the TFBS (of the core width, which is 6 in the examples below)
    Example:   3   >M26773 |446-451:CAACTG|   146   CAACTG
    Example:   4   >M86232 |447-452:CACTTG|   35    cagttg
    Example:   4   >M86232 |447-452:CACTTG|   147   CACTTG (The same sequence as the previous but a different instance)
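
    The instance format above can be read back with a few lines of Python. This is a minimal sketch that assumes the four fields are separated by literal tab characters in the output file; the names Instance and parse_instance_line are hypothetical.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    seq_index: int   # sequence number, counted from 0
    comment: str     # sequence comment line from the fasta file
    position: int    # position of the TFBS (offset already added)
    site: str        # the TFBS subsequence (core width)


def parse_instance_line(line):
    """Parse one tab-separated instance line into an Instance."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields, got %d" % len(fields))
    seq_index, comment, position, site = (f.strip() for f in fields)
    return Instance(int(seq_index), comment, int(position), site)


example = "3\t>M26773 |446-451:CAACTG|\t146\tCAACTG"
print(parse_instance_line(example))
```

    Two instances from the same sequence, as in the last two examples, simply share the same seq_index and comment.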


Examples

  • Examples with explanations
  • Results for the improved eukaryotic benchmark: 8_10_Mask2.zip (Running parameters: -i input -o output -minw 8 -maxw 10 -fmode ZOOPS -len 4000 -n 1 -mskact 1 -msklen 6 -mskoff 1)

Supplementary Materials


Contact

Email: tmchan at cse dot cuhk dot edu dot hk

Last update: 26/07/2009