Generating the databases for Yaminabe2
Two databases are used for Yaminabe2:
- A database with meta information about packages, versions, checksums, download locations, origin, and so on. It is created using the database creation script of the Binary Analysis Tool (http://www.binaryanalysis.org/).
- A database with exploded Git information. It is created using the gittlsh.py program. Before running this script, make sure that the information in the file sourceverify.config is correct, especially the locations of the databases, the Git URLs and repository locations, and the priorities/importance, which can differ per person.
The gittlsh.py script is then invoked as follows:
$ python gittlsh.py -c /path/to/configuration/file
To update the database, make sure that the Git repositories are up to date (git pull) and rerun the same command.
For the Linux kernel the first run can take a long time (5 or 6 hours). Storing the Git repositories on a ramdisk is strongly recommended because the script is very I/O intensive.
Running the TLSH compare scripts
There are two scripts that can compute TLSH checksums:
- The first script, gittreecompare.py, compares tags from two Git repositories and computes the TLSH score.
- The second script, sourceverifier.py, compares a directory of source code to all data in all branches of many Git trees (a Git "forest").
The tag file used for gittreecompare.py consists of several rows of tab-separated data. The first row has the Git URLs of the two Git repositories; each subsequent row has Git tags from those repositories.
The script checks how far the Git tag in the first column is removed from the tag in the second column. The problem is not symmetric, so depending on the situation it can be useful to also look at the reverse, for example if the second repository contains many files that are not in the first repository. The "results" directory contains the results of a few test runs. These tests have been done both ways and yield different scores.
Running the gittreecompare.py script is simple:
$ python gittreecompare.py -c /path/to/configuration/file -t /path/to/tag/file
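The tag file layout described above can be illustrated with a small parser. This is a minimal sketch, assuming Python 3 and exactly two columns per row; the function names parse_tag_file and reverse_tag_file and the example URLs and tags are hypothetical and are not taken from gittreecompare.py.

```python
import csv
import io


def parse_tag_file(handle):
    """Return ((url_a, url_b), [(tag_a, tag_b), ...]) from a tab-separated tag file."""
    # Skip completely empty lines; every other row must have two columns.
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        raise ValueError("tag file is empty")
    for row in rows:
        if len(row) != 2:
            raise ValueError("expected two tab-separated columns, got %r" % (row,))
    urls = tuple(rows[0])                       # first row: the two Git URLs
    tag_pairs = [tuple(row) for row in rows[1:]]  # remaining rows: tag pairs
    return urls, tag_pairs


def reverse_tag_file(urls, tag_pairs):
    """Swap both columns so the comparison can also be run in the other direction."""
    return (urls[1], urls[0]), [(b, a) for (a, b) in tag_pairs]


if __name__ == "__main__":
    # Placeholder content, not real repositories or tags.
    example = (
        "git://example.org/first.git\tgit://example.org/second.git\n"
        "v1.0\tvendor-1.0\n"
        "v1.1\tvendor-1.1\n"
    )
    urls, tag_pairs = parse_tag_file(io.StringIO(example))
    print(urls)
    print(tag_pairs)
    print(reverse_tag_file(urls, tag_pairs))
```

Swapping the columns, as reverse_tag_file does, is one way to prepare a second tag file for the reverse comparison mentioned above.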
The sourceverifier.py script compares a directory of source code to a directory containing data from a Git forest (all tags from possibly multiple Git trees) that has been stored out of band.
With this method the factor 'time' no longer plays a role: instead, the closest revision of each file across multiple Git repositories is searched for.
The TLSH score is then computed. For identical files the score is 0; for files that cannot be found or that are too small, a maximum score of 400 is added to the total score.
Example output:

SCANNING 41434 files
1084 FILES NOT FOUND IN DATABASE
COMPUTING AND COMPARING TLSH OF FILES NOT FOUND IN DATABASE
CLOSEST REVISION FOR drivers/watchdog/gpio_wdt.c IS 8a7b76be691fa30c7650b8e08aae8a7990c93779 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 21
CLOSEST REVISION FOR drivers/clk/shmobile/clk-mstp.c IS 752b5ed5f6998e118626feea7375782c4cf5aad6 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 27
CLOSEST REVISION FOR sound/soc/sh/rcar/rsnd.h IS b4c83b171557815a0b31a36805900cc9f21c9ee4 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 50
CLOSEST REVISION FOR drivers/gpu/drm/rcar-du/rcar_du_drv.c IS 6e0c6e1895b9fff3cdb6ef746ee3d8dd4e852f40 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 214
CLOSEST REVISION FOR sound/soc/sh/rcar/ctu.c IS 76ca9970322118610681af5f929aba62f346082b FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 318
CLOSEST REVISION FOR sound/soc/sh/rcar/ssi.c IS e7d850dd10f4e61b728495a87ce096509843315f FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 85
(snip)
NO MATCH FOR drivers/thermal/rcar_gen3_thermal.c
NO MATCH FOR drivers/soc/renesas/rcar_pm_sysc.c
NO MATCH FOR include/linux/soc/renesas/rcar_pm_sysc.h
NO MATCH FOR include/dt-bindings/clock/r8a7795-clock.h
NO MATCH FOR drivers/media/platform/vsp1/vsp1_dl.c
(snip)
=======================================
SUMMARY
=======================================
FILES SCANNED: 41434
FILES FOUND IN UPSTREAM RELEASE: 40350
FILES NOT FOUND IN UPSTREAM RELEASE: 1084
TOTAL DISTANCE: 19323
IDENTICAL FILES IN GIT: 921
NOT MATCHED IN GIT: 27
UNDETERMINED IN GIT: 1
0-60: 94
61-150: 22
over 150: 19
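The scoring rule described above (0 for identical files, 400 added for files that cannot be matched) and the distance ranges in the summary can be sketched as follows. This is a minimal illustration, not the code in sourceverifier.py: the function name summarize, the dictionary layout, and the use of None for "no match" are assumptions, and the "UNDETERMINED" category is left out because its meaning is not described here. Identical files are counted separately from the 0-60 range, which is consistent with the sample summary, where 921 + 27 + 1 + 94 + 22 + 19 adds up to the 1084 files not found in the upstream release.

```python
MAX_SCORE = 400  # penalty for files that cannot be found or are too small


def summarize(distances):
    """Aggregate per-file TLSH distances.

    distances maps a file path to its distance to the closest revision,
    or to None when no match was found (an assumption of this sketch).
    """
    summary = {"total": 0, "identical": 0, "not_matched": 0,
               "0-60": 0, "61-150": 0, "over 150": 0}
    for distance in distances.values():
        if distance is None:
            # Unmatched files contribute the maximum score to the total.
            summary["not_matched"] += 1
            summary["total"] += MAX_SCORE
            continue
        summary["total"] += distance
        if distance == 0:
            summary["identical"] += 1
        elif distance <= 60:
            summary["0-60"] += 1
        elif distance <= 150:
            summary["61-150"] += 1
        else:
            summary["over 150"] += 1
    return summary


if __name__ == "__main__":
    # Distances copied from the sample output; None marks a "NO MATCH" file.
    sample = {
        "drivers/watchdog/gpio_wdt.c": 21,
        "drivers/clk/shmobile/clk-mstp.c": 27,
        "sound/soc/sh/rcar/rsnd.h": 50,
        "drivers/gpu/drm/rcar-du/rcar_du_drv.c": 214,
        "sound/soc/sh/rcar/ctu.c": 318,
        "sound/soc/sh/rcar/ssi.c": 85,
        "drivers/thermal/rcar_gen3_thermal.c": None,
    }
    for key, value in summarize(sample).items():
        print("%s: %d" % (key, value))
```

A lower total means the scanned source tree is closer to what is stored in the Git forest; the ranges give a quick view of how many files have drifted only slightly and how many differ substantially.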
This archive contains the following scripts, as well as README, LICENSE and other files:
- gittlsh.py : script to explode Git repositories and store metadata like SHA256 and TLSH checksums out of band
- gittreecompare.py : script to compare two tags in Git repositories and compute a TLSH score
- sourceverifier.py : script for both the Yaminabe and Yaminabe2 projects
- sourceverify.config : configuration file used for the Python scripts
- ELC2016 presentations