Difference between revisions of "Yaminabe2"

From eLinux.org

Latest revision as of 10:48, 5 April 2016

Introduction

Installation

TLSH install

yaminabe2 execution

Generating the databases for Yaminabe2

Two databases are used for Yaminabe2:

  • A database with meta-information about packages, versions, checksums, download locations, origin, and so on. It is created with the database creation script of the Binary Analysis Tool (http://www.binaryanalysis.org/).
  • A database with exploded Git information. It is created with the gittlsh.py program. Before running this script, make sure that the information in the file sourceverify.config is correct, especially the locations of the databases, the Git URLs and repository locations, and the priorities/importance, which can differ per person.

The gittlsh.py script is then invoked as follows:

$ python gittlsh.py -c /path/to/configuration/file

To update the database, make sure that the Git repositories are up to date (git pull) and rerun the same command.

For the Linux kernel the first run can take 5 or 6 hours. Storing the Git repositories on a ramdisk is strongly recommended because the script is very I/O intensive.
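
The update cycle described above can be scripted. The following is a minimal sketch, assuming the repositories are already cloned into local directories. The helper names build_update_commands and run_update and the example paths are hypothetical; only the gittlsh.py -c invocation and the use of git pull come from the text above.

```python
import subprocess


def build_update_commands(repo_dirs, config_path):
    """Return the commands for one update cycle: pull every repository, then rerun gittlsh.py."""
    # "git -C <dir> pull" runs the pull inside the given repository directory.
    commands = [["git", "-C", repo_dir, "pull"] for repo_dir in repo_dirs]
    # Same invocation as in the text; the config path is supplied by the caller.
    commands.append(["python", "gittlsh.py", "-c", config_path])
    return commands


def run_update(repo_dirs, config_path):
    """Run the commands in order and stop at the first failure."""
    for command in build_update_commands(repo_dirs, config_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical paths: use the repository locations listed in sourceverify.config.
    for command in build_update_commands(["/ramdisk/linux"], "/path/to/configuration/file"):
        print(" ".join(command))
```

Building the command list separately from running it makes it easy to review what will be executed before starting a run that may take hours.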

Running the TLSH compare scripts

There are two scripts that can compute TLSH checksums:

  1. gittreecompare.py
  2. sourceverifier.py
  • The first script compares tags from two Git repositories and computes the TLSH score.
  • The second script compares a directory of source code to all data in all branches of many Git trees (a Git "forest").
  • The tag file used for gittreecompare.py consists of several rows of tab-separated data. The first row holds the Git URLs of the two Git repositories; each subsequent row holds one Git tag from each repository.
  • The script checks how far the Git tag in the first column is removed from the tag in the second column. The comparison is not symmetric, so it can be useful to also run it in the reverse direction, for example if the second repository contains many files that are not in the first repository. The "results" directory stores the results of a few test runs. These tests were run in both directions and yield different scores.
  • Running the gittreecompare.py script is simple:
$ python gittreecompare.py -c /path/to/configuration/file -t /path/to/tag/file
  • The sourceverifier.py script compares a directory of source code to a directory containing data from a Git forest (all tags from possibly multiple Git trees) that has been stored out of band.
  • With this method the time factor is taken out of the picture: instead, the closest revision of each file across multiple Git repositories is searched for.
  • The TLSH score is then computed. For identical files the score is 0; for files that cannot be found or that are too small, the maximum score of 400 is added to the total score.
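
The tag file layout described above (a first row with two Git URLs, followed by rows of tag pairs, all tab separated) can be read and reversed with a few lines of Python. This is a sketch only: the function names parse_tag_file and reversed_tag_file are hypothetical, and the strict check for exactly two columns per row is an assumption based on the description, not on the gittreecompare.py source.

```python
import csv
import io


def parse_tag_file(text):
    """Parse a tab-separated tag file: first row = two Git URLs, later rows = tag pairs."""
    # Blank lines come back from csv.reader as empty lists and are skipped.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows or len(rows[0]) != 2:
        raise ValueError("first row must hold exactly two Git URLs")
    urls = tuple(rows[0])
    tag_pairs = []
    for row in rows[1:]:
        # Assumption: every tag row has one tag per repository.
        if len(row) != 2:
            raise ValueError("each tag row must hold exactly two tags: %r" % (row,))
        tag_pairs.append(tuple(row))
    return urls, tag_pairs


def reversed_tag_file(text):
    """Swap both columns so the comparison can also be run in the other direction."""
    urls, tag_pairs = parse_tag_file(text)
    lines = ["\t".join(reversed(urls))]
    lines.extend("\t".join(reversed(pair)) for pair in tag_pairs)
    return "\n".join(lines) + "\n"
```

Because the comparison is not symmetric, writing the output of reversed_tag_file to a second tag file and passing it with -t gives the score for the opposite direction.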

Comparison output

SCANNING 41434 files
1084 FILES NOT FOUND IN DATABASE
COMPUTING AND COMPARING TLSH OF FILES NOT FOUND IN DATABASE

CLOSEST REVISION FOR drivers/watchdog/gpio_wdt.c IS 8a7b76be691fa30c7650b8e08aae8a7990c93779 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 21

CLOSEST REVISION FOR drivers/clk/shmobile/clk-mstp.c IS 752b5ed5f6998e118626feea7375782c4cf5aad6 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 27

CLOSEST REVISION FOR sound/soc/sh/rcar/rsnd.h IS b4c83b171557815a0b31a36805900cc9f21c9ee4 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 50

CLOSEST REVISION FOR drivers/gpu/drm/rcar-du/rcar_du_drv.c IS 6e0c6e1895b9fff3cdb6ef746ee3d8dd4e852f40 FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 214

CLOSEST REVISION FOR sound/soc/sh/rcar/ctu.c IS 76ca9970322118610681af5f929aba62f346082b FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 318

CLOSEST REVISION FOR sound/soc/sh/rcar/ssi.c IS e7d850dd10f4e61b728495a87ce096509843315f FROM git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git WITH DISTANCE 85

(snip)

NO MATCH FOR drivers/thermal/rcar_gen3_thermal.c
NO MATCH FOR drivers/soc/renesas/rcar_pm_sysc.c
NO MATCH FOR include/linux/soc/renesas/rcar_pm_sysc.h
NO MATCH FOR include/dt-bindings/clock/r8a7795-clock.h
NO MATCH FOR drivers/media/platform/vsp1/vsp1_dl.c

(snip)

=======================================
               SUMMARY
=======================================
FILES SCANNED: 41434
FILES FOUND IN UPSTREAM RELEASE: 40350
FILES NOT FOUND IN UPSTREAM RELEASE: 1084
TOTAL DISTANCE: 19323
IDENTICAL FILES IN GIT: 921
NOT MATCHED IN GIT: 27
UNDETERMINED IN GIT: 1
0-60: 94
61-150: 22
over 150: 19
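
The per-file lines of the output can be post-processed to rebuild the distance buckets shown in the summary. The sketch below assumes the line formats shown in the listing above; the function name summarize is hypothetical. Adding 400 for every NO MATCH line follows the scoring rule stated earlier, but whether sourceverifier.py computes TOTAL DISTANCE in exactly this way is an assumption. Files with distance 0 land in the 0-60 bucket here, whereas the tool reports identical files separately.

```python
import re

# Line formats taken from the output listing above.
CLOSEST = re.compile(
    r"^CLOSEST REVISION FOR (\S+) IS ([0-9a-f]+) FROM (\S+) WITH DISTANCE (\d+)$"
)
NO_MATCH = re.compile(r"^NO MATCH FOR (\S+)$")


def summarize(log_text, missing_penalty=400):
    """Bucket the reported distances and total them, adding a penalty per unmatched file."""
    buckets = {"0-60": 0, "61-150": 0, "over 150": 0}
    total = 0
    unmatched = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        closest = CLOSEST.match(line)
        if closest:
            distance = int(closest.group(4))
            total += distance
            if distance <= 60:
                buckets["0-60"] += 1
            elif distance <= 150:
                buckets["61-150"] += 1
            else:
                buckets["over 150"] += 1
            continue
        missing = NO_MATCH.match(line)
        if missing:
            unmatched.append(missing.group(1))
            # Assumption: an unmatched file contributes the maximum score.
            total += missing_penalty
    return {"total_distance": total, "unmatched": unmatched, "buckets": buckets}
```

Lines that match neither pattern, such as the banner and "(snip)" lines, are ignored.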

Resources

archives

- File:Yaminabe2-0.2.tar.gz
- File:Yaminabe2-0.3.tar.gz (md5=98cffc0b5d5325e0d37ee6184a94ad9c)

- This archive contains the following scripts, as well as README, LICENSE and other files

  • gittlsh.py : script to explode Git repositories and store metadata like SHA256 and TLSH checksums out of band
  • gittreecompare.py : script to compare two tags in Git repositories and compute a TLSH score
  • sourceverifier.py : script for both the Yaminabe and Yaminabe2 projects
  • sourceverify.config : configuration file used for the Python scripts

prebuilt databases

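- kerneldb.sqlite3.xz [https://dl.dropboxusercontent.com/u/35792169/yb2/kernelgit.sqlite3.xz] (md5:af8096d2a702a186212b97b3017bf698)
- kernelgit.sqlite3.xz [https://dl.dropboxusercontent.com/u/35792169/yb2/kerneldb.sqlite3.xz] (md5:a6f03a790a6a7a85ec11eef4a9ac4146)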

Presentations

  • ELC2016 presentation [http://events.linuxfoundation.org/sites/events/files/slides/elc2016_munakata.pdf]