University of Göttingen | Faculty of Biology | Inst. of Microbiology and Genetics | Dep. of Bioinformatics | Kmacs Submission Form

kmacs - the k Mismatch Average Common Substring Approach
to alignment-free sequence comparison

Introduction

kmacs is a new approach to alignment-free sequence comparison. While most alignment-free methods rely on exact word matches, kmacs uses a distance measure based on inexact substring matches. To define the distance between two DNA or protein sequences, kmacs estimates, for each position i of the first sequence, the length of the longest substring starting at i that matches a substring of the second sequence with up to k mismatches. It defines the average of these lengths as a measure of similarity between the sequences and turns this into a symmetric distance measure. This can be regarded as a generalization of the average common substring (ACS) approach (Ulitsky et al., 2006). kmacs does not compute exact k-mismatch substrings, since this would be computationally too costly, but approximates them. Details of this heuristic are described in the references cited below.
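
To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not the kmacs heuristic: it computes the exact longest k-mismatch substrings in quadratic time, which is what kmacs avoids. All function names are hypothetical, and the normalization used to turn the average into a distance is an assumption modelled on the ACS approach; it is not specified on this page.

```python
import math


def longest_k_mismatch(s1: str, i: int, s2: str, k: int) -> int:
    """Length of the longest substring of s1 starting at position i that
    matches some substring of s2 with at most k mismatches (brute force)."""
    best = 0
    for j in range(len(s2)):
        mismatches = 0
        length = 0
        while i + length < len(s1) and j + length < len(s2):
            if s1[i + length] != s2[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best


def average_k_mismatch_length(s1: str, s2: str, k: int) -> float:
    """Average of the longest k-mismatch match lengths over all positions of s1."""
    total = sum(longest_k_mismatch(s1, i, s2, k) for i in range(len(s1)))
    return total / len(s1)


def kmacs_like_distance(s1: str, s2: str, k: int) -> float:
    """Symmetric distance derived from the two directed averages.

    Assumption: ACS-style normalization, log|b| / avg(a, b) minus a
    self-comparison correction term 2 * log|a| / |a|.
    """
    def directed(a: str, b: str) -> float:
        avg = average_k_mismatch_length(a, b, k)
        if avg == 0:
            # No character of a occurs in b and k = 0.
            return math.inf
        return math.log(len(b)) / avg - 2 * math.log(len(a)) / len(a)

    return 0.5 * (directed(s1, s2) + directed(s2, s1))
```

Calling `kmacs_like_distance(a, b, 0)` in this sketch corresponds to the exact-match (ACS-like) case, while larger k tolerates more substitutions inside each match.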

Availability

The web server returns a distance matrix for the input sequences.

Usage

Sequence input

Sequences can be uploaded as a (multiple) FASTA sequence file. The sequence file must not exceed 10 MB in size or contain more than 500 sequences.
Both DNA and protein sequences are supported. DNA Example:
>Sequence1
ATGATGAGTAGT
>Sequence2
AAATTGTGGTGTGTC
>Sequence3
CGATCATCGTA
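
Before uploading, an input file can be checked locally against the limits stated above. The following Python sketch parses FASTA text and rejects input that is too large or has too many sequences. The function names are hypothetical, "10MB" is interpreted as 10 * 1024 * 1024 bytes (an assumption), and the server's own validation may differ.

```python
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB
MAX_SEQUENCES = 500


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a dict mapping header name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may span several lines; collect and join them.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_upload_limits(text: str) -> dict:
    """Return the parsed records, or raise ValueError if a limit is exceeded."""
    size = len(text.encode("utf-8"))
    if size > MAX_BYTES:
        raise ValueError(f"file is {size} bytes, limit is {MAX_BYTES}")
    records = parse_fasta(text)
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"{len(records)} sequences, limit is {MAX_SEQUENCES}")
    return records
```

For the DNA example above, `check_upload_limits` would return a dictionary with the three entries Sequence1, Sequence2 and Sequence3.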

Mismatches

The parameter k is the maximum number of mismatches allowed in a substring match. Systematic test runs on real and simulated sequence sets indicate significant improvements for values of k between 4 and 10. Larger values might lead to further improvements, however. Note that for the special case k=0, kmacs reduces exactly to the average common substring approach (Ulitsky et al., 2006).

Program Output

The output of the program is a distance matrix in Phylip format.
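
A Phylip distance matrix is commonly laid out as a first line giving the number of sequences, followed by one row per sequence with its name and its distances to all sequences. The Python sketch below reads such a matrix for further processing. It assumes a full square matrix with whitespace-separated fields and names without spaces; the exact layout produced by the server may differ, and the function name is hypothetical.

```python
def parse_phylip_matrix(text: str):
    """Parse a square Phylip distance matrix into (names, matrix).

    Assumes whitespace-separated fields and sequence names without spaces.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    n = int(lines[0].split()[0])
    names = []
    matrix = []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        matrix.append([float(x) for x in fields[1:]])
    if len(names) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix dimensions do not match the declared size")
    return names, matrix


def distance(names, matrix, a: str, b: str) -> float:
    """Look up the distance between two named sequences."""
    return matrix[names.index(a)][names.index(b)]
```

Once parsed, the matrix can be passed to tree-building or clustering tools that accept a list of names and a square list of distances.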

Example

A simple test example is given here

Alternative approach to alignment-free sequence comparison

Another approach for alignment-free sequence comparison using spaced k-mers can be found here.

Contact

For comments, or if you encounter any technical issues, please send an email to: lhahn(at)biologie.uni-goettingen.de or chris.leimeister(at)stud.uni-goettingen.de

Reference

Scientific publications using kmacs should cite: