apatf - Levenshtein distance in cyber security

Levenshtein distance what?

If you work in cyber security, or are just interested in it, you have most likely run into situations where you wanted to automatically measure how much two files or strings differ, expressed as a percentage. A good example is a login or error page check. Imagine you are fuzzing a web application and need to tell a successful injection apart from an error page.

You can compare the size of the response body across all the responses you get. That gives you an indication of whether the responses returned during your fuzzing run are similar. But what if the content changes while the size stays the same? You could also generate a hash or checksum of each response body and compare those, but that fails in the opposite direction: if there’s a timestamp, a CSRF token or anything else in the page that changes on every request or periodically, you will wrongly classify the same reply as different.
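To make both failure modes concrete, here is a minimal Python sketch. The two page bodies are made up for illustration (they are not real responses), and MD5 is used only because it is a convenient checksum for this kind of comparison:

```python
import hashlib

# Two hypothetical response bodies that differ only in a per-request token.
page_a = '<html><meta name="csrf-token" content="Rm42oFw3RWNQ" /><h1>Sign in</h1></html>'
page_b = '<html><meta name="csrf-token" content="FzdSE5RHw5HE" /><h1>Sign in</h1></html>'


def fingerprint(body: str) -> str:
    """Return the MD5 hex digest of a response body."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


# The size check says "identical", the hash check says "different".
same_size = len(page_a) == len(page_b)
same_hash = fingerprint(page_a) == fingerprint(page_b)
print(same_size, same_hash)  # True False
```

Neither answer is useful on its own: the size check cannot see the changed token at all, and the hash check cannot tell a changed token from a completely different page.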

What can we do about that?

There are different approaches to this problem. The one I started using more than a decade ago is the Levenshtein distance. I won’t explain how it works, as you can read about that on Wikipedia. But I’ll share my implementation with you and show it in action.
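For readers who want something to experiment with right away, the following is a small pure-Python sketch of the textbook dynamic-programming algorithm. It is not the apatf code (which is written in C), and the helper name relative_distance is my own; dividing by the length of the longer string is one common way to turn the raw edit count into a ratio, assumed here for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3
```

The sketch keeps only two rows of the matrix in memory, but the running time still grows with the product of the two string lengths.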

A closer look at the problem

Let’s have a look at an example. Assume we want to verify whether https://gitlab.com/users/sign_in is stable. That is, we want to check whether the response we get back from this page is always the same when we request it with the same parameters (none in this case):

e-axe@kaylee:/tmp$ for i in {1..5}; do curl https://gitlab.com/users/sign_in > tmp$i ; sleep 1; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18183    0 18183    0     0  18183      0 --:--:-- --:--:-- --:--:-- 30818
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18183    0 18183    0     0  18183      0 --:--:-- --:--:-- --:--:-- 31296
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18183    0 18183    0     0  18183      0 --:--:-- --:--:-- --:--:-- 31622
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18183    0 18183    0     0  18183      0 --:--:-- --:--:-- --:--:-- 34050
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18183    0 18183    0     0  18183      0 --:--:-- --:--:-- --:--:-- 32068
e-axe@kaylee:/tmp$ ls -al tmp*
-rw-rw-r-- 1 e-axe e-axe 18183 Mar  3 17:43 tmp1
-rw-rw-r-- 1 e-axe e-axe 18183 Mar  3 17:43 tmp2
-rw-rw-r-- 1 e-axe e-axe 18183 Mar  3 17:43 tmp3
-rw-rw-r-- 1 e-axe e-axe 18183 Mar  3 17:43 tmp4
-rw-rw-r-- 1 e-axe e-axe 18183 Mar  3 17:43 tmp5
e-axe@kaylee:/tmp$ md5sum tmp*
111d4922c91cb421ed52fdad5ff16b51  tmp1
108deafe433f75eeb688733bd3d12b75  tmp2
31f017b05ac897a4c57b4fbd0ad1082c  tmp3
1390ad6224eed4049d2b6cf88514f30d  tmp4
c7d07614cd0e5b9064cf02060bd82fdc  tmp5

We request the same URL five times and write the response (without headers) into files (tmp{1..5}). You can see that the files all have the exact same size (18183 bytes) but the md5sums are different. Why’s that?

e-axe@kaylee:/tmp$ diff tmp1 tmp2
42c42
< <meta name="csrf-token" content="Rm42oFw3RWNQiXwnlZBWyal+TMXGE0kj0SCy1JVyCmbGzVixET2sgDkNIaJvzkcUKaJfmfYW8rp8eU9VSk0THg==" />
---
> <meta name="csrf-token" content="FzdSE5RHw5HEMwpzttMG0glo/MFJDFAR49jKyvZUVCaHSINXXtrVDkg6KxV0E0Emhm1srZDzSvBfWbvrUbxvzQ==" />
131c131
< <form class="new_user gl-show-field-errors" aria-live="assertive" id="new_user" action="/users/sign_in" accept-charset="UTF-8" method="post"><input name="utf8" type="hidden" value="&#x2713;" /><input type="hidden" name="authenticity_token" value="PvKMxibgf913fzQhD7+Vk0npiEEHiRzlEHPnYgVR42++UeLXa+qWPh77aaT14YROyTWbHTeMp3y9Khrj2m76Fw==" /><div class="form-group">
---
> <form class="new_user gl-show-field-errors" aria-live="assertive" id="new_user" action="/users/sign_in" accept-charset="UTF-8" method="post"><input name="utf8" type="hidden" value="&#x2713;" /><input type="hidden" name="authenticity_token" value="2kVqbM1JoBZ6O+NId8yquTHyOB+YQ1evinN49Ms+seRKOrsoB9S2ifYywi61DO1Nvveoc0G8TU428gnVbNaKDw==" /><div class="form-group">
158c158
< <form class="new_new_user gl-show-field-errors" aria-live="assertive" id="new_new_user" action="/users" accept-charset="UTF-8" method="post"><input name="utf8" type="hidden" value="&#x2713;" /><input type="hidden" name="authenticity_token" value="lVNqJaQ+RKcuB9sh8/E2igiMZoJawOP6h4UZcwlmNSUV8AQ06TStREeDhqQJrydXiFB13mrFWGMq3OTy1lksXQ==" /><div class="devise-errors">
---
> <form class="new_new_user gl-show-field-errors" aria-live="assertive" id="new_new_user" action="/users" accept-charset="UTF-8" method="post"><input name="utf8" type="hidden" value="&#x2713;" /><input type="hidden" name="authenticity_token" value="Ww4l5mF+Xnazwx1xpChvyr9v5x4OXk7ucA/bnvffaIzLcfSiq+NI6T/KPBdm6Cg+MGp3ctehVA/Mjqq/UDdTZw==" /><div class="devise-errors">

As you can see in the diff output for two of the responses, the page includes a CSRF token as well as tokens called authenticity_token. These tokens change every time the response is generated by the server.

apatf

I wrote the first version of apatf around 2007 during a web application penetration test project which had over 1,000 applications in scope for automatic testing. apatf became one piece of the overall framework we developed back then on the job. Please don’t ask me what the name stands for - I just can’t remember anymore ;)

Let’s see what apatf makes of it:

e-axe@kaylee:~/tmp$ ./apatf.rb
USAGE: SCRIPT OPTION ARGUMENT1 [ARGUMENT2] [PARAMTER=VALUE]
  Options:
    compare - call it with two resource uris (http/s|file) and get the distance
    stable  - call it with one resource uri (http/s|file) see if its stable
  Paramters:
    threshold  - any value from 0.0 to 1.0; default: 0.10; means 10% diff is fine
    count      - the amount responses to fetch

e-axe@kaylee:~/tmp$ ./apatf.rb compare file:///tmp/tmp1 file:///tmp/tmp2
VERBOSE: fetched 18182 bytes
VERBOSE: fetched 18182 bytes
0.013639863601363987

e-axe@kaylee:~/tmp$ ./apatf.rb stable https://gitlab.com/users/sign_in
VERBOSE: fetched 18183 bytes
VERBOSE: fetched 18183 bytes
VERBOSE: response 1 has lev(0.013804102733322334)
VERBOSE: fetched 18183 bytes
VERBOSE: response 2 has lev(0.013804102733322334)
VERBOSE: fetched 18183 bytes
VERBOSE: response 3 has lev(0.013749106308089974)
VERBOSE: fetched 18183 bytes
VERBOSE: response 4 has lev(0.013694109882857615)
VERBOSE: fetched 18183 bytes
VERBOSE: response 5 has lev(0.013694109882857615)
STABLE (all responses were within the threshold [0.1])

apatf says the URL is stable because every response stays within the threshold of 0.1 (that’s the default), meaning a difference of up to 10% is accepted. The listing also shows a run of apatf against two of the files (tmp1 and tmp2) which hold the responses we fetched before. Because the algorithm can take some time on strings larger than a few bytes, I implemented it in C. It is compiled as a Ruby extension and is also available as a Perl module with inline C. Both can be pulled from my legacy GitHub account over here: https://github.com/mytty/apatf
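The idea behind the stable check can be sketched in a few lines of Python. This is an illustration under assumptions, not a port of apatf: the function name is_stable, the fake_fetch stand-in and the choice to compare every response against the first one are mine. The ratios in the listing do look consistent with the raw edit distance divided by the response length (for example 251 / 18183 ≈ 0.0138041), so the sketch normalizes the same way:

```python
import itertools
from typing import Callable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Textbook two-row dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                   # deletion
                               current[j - 1] + 1,                # insertion
                               previous[j - 1] + (ch_a != ch_b)))  # substitution
        previous = current
    return previous[-1]


def is_stable(fetch: Callable[[], str], count: int = 5,
              threshold: float = 0.10) -> Tuple[bool, List[float]]:
    """Fetch a baseline, then `count` more responses, and report whether
    every response stays within `threshold` of the baseline."""
    baseline = fetch()
    ratios = []
    for _ in range(count):
        body = fetch()
        longest = max(len(baseline), len(body)) or 1  # avoid division by zero
        ratios.append(levenshtein(baseline, body) / longest)
    return all(r <= threshold for r in ratios), ratios


# Hypothetical stand-in for an HTTP request: a page whose token changes each time.
counter = itertools.count()


def fake_fetch() -> str:
    token = f"{next(counter):08d}"
    return (f'<html><input name="authenticity_token" value="{token}" />'
            + "<p>Sign in</p>" * 20 + "</html>")


stable, ratios = is_stable(fake_fetch)
print(stable)  # True
```

In pure Python this becomes very slow for 18 KB responses, since the work grows with the product of the two lengths; that fits the reason given above for doing the heavy lifting in C.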

Let me know if you have any questions, feedback or general comments in the respective Twitter thread over here:
https://twitter.com/mytty_project/status/1103668705005879297