Unit cost edit distance between a haploid and a reference guided recombination of two diploids
To evaluate variation calling methods one can start with a reference genome and a set of mutations. From these one can create an artificial donor diploid genome:
Variation calling can then be applied, and the result compared to the actual variations chosen when the diploid was generated. But this comparison has to deal with the problem of invariance:
Here both predictions are actually the same genome but the prediction 1 appears to be closer to the truth if one looks only at the variations. One way to avoid this problem is to look at the actual genome strings. This reveals that both the predictions are the same, and that they align perfectly with the donor diploid if we allow recombinations (here indicated by lowercase):
This kind of alignment to a best possible recombination of the diploids is what DAlign does. DAlign gives the unit cost edit distance between the haploid and the best recombination of the diploids to use as an evaluation metric.
Reference guided recombination
To create a reference P guided recombination of A and B, one first needs a mappings from A to P and from B to P.
The mapping for A would be 0112456 which tells that A contains an extra character in position 1 and that the character in position 3 in P is missing from A. The mapping for B would be 012345. If we have the knowledge that reference P is 7 characters long, we can tell that B is missing the last character from P. Note that this mapping does not tell whether the characters in A or B are actually the same as in P, just that they map to certain positions.
A reference guided recombination of A and B would now be a following kind of a string. If to a certain position in P there is a mapping from only A or B, then the character to that position is taken from the string that had the mapping. If both strings have a mapping to P, then the character can be taken from either one of the strings.
The tool dalign is writen in C and uses no external dependencies (excluding GNU make). Running 'make dalign' should build an executable 'dalign' which takes the following parameters in the given order:
- diploid_1.txt - A path to a file containing the first diploid genome. The file should not contain any extra headers or characters in addition to the actual genome string. The allowed alphabet of the string is ASCII.
- diploid_2.txt - A path to a file containing the second diploid genome.
- map_1.txt - A path to a file containing the mapping between the first diploid and the reference genome. This file should contain positive integers, one per line.
- map_2.txt - A path to a file containing the mapping between the second diploid and the reference genome.
- haploid.txt - A path to a file containing the haploid genome. The format of the file is the same as with the diploids.
- reference_length - The length of the reference genome. This should be a positive integer. The maximum value in both map files should not exceed this.
To run the test suite of the tool, issue 'make test && ./test'. This requires that the C test framework Check is installed.
To run the benchmark used in the paper, issue 'make test && ./benchmark input.txt n' where input.txt contains the test data of your choice, in the same format as the genome files for dalign. The program will report the average time over n iterations of the algorithm with the same input.