Generating the databases for Yaminabe2
Two databases are used for Yaminabe2:
- A database with metainformation about packages, versions, checksums, download locations, origin, and so on. It is created with the database creation script of the Binary Analysis Tool (http://www.binaryanalysis.org/).
- A database with exploded Git information. It is created with the gittlsh.py program. Before running this script, make sure that the information in the file sourceverify.config is correct, especially the locations of the databases, the Git URLs and repository locations, and the priorities/importance, which can differ per person.
The gittlsh.py script is then invoked as follows:
$ python gittlsh.py -c /path/to/configuration/file
To update the database, make sure that the Git repositories are up to date (git pull) and rerun the same command.
For the Linux kernel the first run can take a long time (5 to 6 hours). Because the script is very I/O intensive, it is strongly recommended to store the Git repositories on a ramdisk.
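The update step described above can be sketched in Python as follows. This is only an illustration under assumptions that are not part of this archive: the helper names (find_git_repos, update_and_rebuild) are hypothetical, and the Git repositories are assumed to be non-bare clones stored as direct subdirectories of a single directory.

```python
import os
import subprocess


def find_git_repos(root):
    """Return the direct subdirectories of `root` that look like Git clones.

    Assumption: repositories are non-bare clones, so each has a `.git` directory.
    """
    repos = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isdir(os.path.join(path, ".git")):
            repos.append(path)
    return repos


def update_and_rebuild(root, config, script="gittlsh.py"):
    """Run `git pull` in every repository, then rerun gittlsh.py."""
    for repo in find_git_repos(root):
        # `git -C <dir>` runs the command as if started inside <dir>
        subprocess.check_call(["git", "-C", repo, "pull"])
    # Same invocation as shown above: python gittlsh.py -c <config>
    subprocess.check_call(["python", script, "-c", config])
```

Calling update_and_rebuild("/path/to/git/repositories", "/path/to/configuration/file") would then pull every clone and regenerate the database in one go.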
Running the TLSH compare scripts
There are two scripts that can compute TLSH checksums:
- The first script, gittreecompare.py, compares tags from two Git repositories and computes the TLSH score.
- The second script, sourceverifier.py, compares a directory of source code to all data in all branches of many Git trees (a Git "forest").
The tag file used for gittreecompare.py consists of several rows of tab-separated data. The first row has the Git URLs of the two Git repositories; each subsequent row has a pair of Git tags, one from each repository.
The script checks how far the Git tag in the first column is removed from the tag in the second column. The comparison is not symmetric, so depending on the situation it can be useful to also run it in reverse, for example if the second repository contains many files that are not in the first repository. The "results" directory contains the results of a few test runs. These tests were done both ways and yield different scores.
Running the gittreecompare.py script is simple:
$ python gittreecompare.py -c /path/to/configuration/file -t /path/to/tag/file
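The tag file layout described above (a first row with two Git URLs, then rows of tag pairs, all tab separated) could be read roughly as in the following sketch. The function names are hypothetical and this is not the parser used by gittreecompare.py. The second helper swaps the columns, which is one way to prepare the reverse comparison.

```python
def parse_tag_file(text):
    """Split tag file contents into (urls, tag_pairs).

    Assumption: every non-empty row has exactly two tab-separated columns.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("tag file is empty")
    for row in rows:
        if len(row) != 2:
            raise ValueError("expected two tab-separated columns: %r" % (row,))
    urls = tuple(rows[0])                        # first row: the two Git URLs
    tag_pairs = [tuple(row) for row in rows[1:]]  # remaining rows: tag pairs
    return urls, tag_pairs


def reverse_tag_file(text):
    """Return tag file contents with the two columns swapped."""
    urls, tag_pairs = parse_tag_file(text)
    lines = ["\t".join(reversed(urls))]
    lines += ["\t".join((second, first)) for first, second in tag_pairs]
    return "\n".join(lines) + "\n"
```

For example, with made-up URLs and tags, reverse_tag_file("https://example.org/a.git\thttps://example.org/b.git\nv1.0\tv1.0-vendor\n") returns the same two rows with the columns exchanged, ready to be saved and passed with -t.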
The sourceverifier.py script compares a directory of source code to a directory containing data from a Git forest (all tags from possibly multiple Git trees) that has been stored out of band.
With this method time is no longer a factor; instead, the closest revision of each file across multiple Git repositories is searched for.
The TLSH score is then computed. For identical files the score is 0. For each file that cannot be found or is too small, the maximum score of 400 is added to the total.
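The scoring rule above can be illustrated with a small sketch. It assumes that the overall score is the sum of per-file scores, that each file's score is the lowest TLSH distance to any candidate revision, and that per-file scores are capped at 400. These are readings of the description, not a copy of what sourceverifier.py does, and the function names are hypothetical. The distances themselves would come from a TLSH implementation and are passed in as plain numbers here.

```python
MAX_SCORE = 400  # score named in the text for missing or too-small files


def file_score(distances):
    """Return the score for one file.

    `distances` holds the TLSH distances between the file and each candidate
    revision. It is empty when the file was not found or was too small to hash.
    """
    if not distances:
        return MAX_SCORE
    # Closest revision wins; the cap at MAX_SCORE is an assumption.
    return min(min(distances), MAX_SCORE)


def total_score(per_file_distances):
    """Sum the per-file scores for a whole source directory."""
    return sum(file_score(d) for d in per_file_distances.values())
```

With the made-up input {"a.c": [0], "b.c": [35, 7], "c.c": []}, the first file is identical to a known revision (0), the second is closest to a revision at distance 7, and the third is missing (400), giving a total of 407.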
This archive contains the following scripts, as well as README, LICENSE and other files:
- gittlsh.py : script to explode Git repositories and store metadata like SHA256 and TLSH checksums out of band
- gittreecompare.py : script to compare two tags in Git repositories and compute a TLSH score
- sourceverifier.py : script for both the Yaminabe and Yaminabe2 projects
- sourceverify.config : configuration file used for the Python scripts
- ELC2016 presentations