Your Levenshtein distance module told you something was amiss. Now what?

21 Oct

in Blog, Perspectives, Technology

by Karthik Krishnan

The widely used Levenshtein distance string metric, named after the Russian scientist Vladimir Levenshtein, who introduced the concept in 1965, has some interesting applications in cyber security.


Simply put, the Levenshtein distance is the minimum number of single-character edits (i.e., insertions, deletions or substitutions) needed to morph one word into another. The Wikipedia entry for Levenshtein distance has a great example, reproduced below, illustrating how “kitten” is changed to “sitting”. The distance value is three, meaning the morphing cannot occur in fewer than three changes:

Kitten → Sitten (substitution of "s" for "k")
sittEn → sittIn (substitution of "i" for "e")
sittin → sittinG (insertion of "g" at the end)
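
For readers who want to see the calculation itself, here is a minimal sketch in Python of one common way to compute the distance (a row-by-row dynamic programming table). The function name `levenshtein` is my own choice for illustration; it is not taken from any particular product or library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b. Start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or free match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The sketch keeps only two rows of the table in memory at a time, which is plenty for short strings such as words or domain names.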

You may be wondering how this is relevant to cyber security. Well, consider that cyber criminals can use HTTP or email domains that closely resemble those of popular sites, so that the HTTP transaction or email looks normal at a cursory glance. The limited amount of attention most users pay to the URL of a transaction or the source domain of an email is enough of an opening for an attacker.

For example, let us assume AcmeGizmo, a fictitious company, uses Okta for its authentication service. An attacker knows about this and spearphishes employee “Bob” via a targeted email from a lookalike domain in which Okta is misspelled. The email recommends that Bob change his password and provides a URL from which to start the process. Bob, not noticing that Okta is spelled incorrectly (i.e., with an extra “o”), assumes that this is a legitimate request and clicks on the URL. The result: Bob gets spearphished, and the cyber criminal potentially downloads a malicious piece of software onto Bob’s machine and gains control of it. This could lead to the attacker posing as the employee, and eventually gaining additional access privileges to steal sensitive corporate information from elsewhere within AcmeGizmo (by the way, read my previous blog to learn more about the different stages in a typical multi-stage attack).

Now back to the Levenshtein distance and how it pertains to cyber security. A security analytics product could use the Levenshtein distance in machine learning-based analytics to perform a distance analysis. It would flag that an email came from a domain made to look like the legitimate one, indicating a potential attack. How? The analytics would have picked up on the fact that the Levenshtein distance between the sender's domain (a spurious site) and the domain it imitates (a legitimate site) was one, and that the email could therefore be malicious. Great. A security analyst would then receive an alert about suspicious activity associated with user “Bob”, triggered by a calculated Levenshtein distance of one for the sender's domain.
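
To make the idea concrete, here is a rough sketch in Python of what such a check might look like. Everything in it is an assumption made for illustration: the allow-list `TRUSTED_DOMAINS`, the placeholder domains `login.example` and `mail.example`, the helper names, and the threshold of one. A real product would combine a signal like this with many others.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


# Hypothetical allow-list of domains the company actually uses.
TRUSTED_DOMAINS = frozenset({"login.example", "mail.example"})


def sender_domain(address: str) -> str:
    """Return the lower-cased part of an email address after the last '@'."""
    return address.rsplit("@", 1)[-1].lower()


def lookalike_of(domain: str, trusted=TRUSTED_DOMAINS,
                 max_distance: int = 1) -> Optional[str]:
    """Return the trusted domain that `domain` appears to imitate, if any.

    An exact match is legitimate and is never flagged. A domain within
    `max_distance` edits of a trusted one is treated as suspicious.
    """
    domain = domain.lower()
    if domain in trusted:
        return None
    for candidate in sorted(trusted):
        if levenshtein(domain, candidate) <= max_distance:
            return candidate
    return None


# Hypothetical phishing sender: one extra letter in the domain.
suspect = sender_domain("it-support@loginn.example")
print(suspect, "imitates", lookalike_of(suspect))
```

Note that a small threshold is a deliberate trade-off in this sketch: raising it catches more creative misspellings but also starts flagging unrelated domains that merely happen to be short.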

If you were the analyst, the first thing you’d do (after having Googled “Levenshtein distance” to learn more about it) would be to go through your investigation process, which would roughly be as follows:

  • What email did Bob receive – and from whom?
  • Did Bob click on the link?
  • What was the result of clicking on the link?
  • How many other employees did Bob forward the link to?
  • How many of these employees clicked on the link?
  • What was the result of this infection in terms of the actual damage or intellectual property theft?

As you can see, an analyst has to quickly move from the detection of the attack to a full-fledged investigation that requires access to a lot of forensic information. This is often one of the biggest challenges of deploying a security analytics product, because detecting an anomaly is often only a small part of what a security team expects. It is vital to tie detection to forensic capabilities that let analysts pivot into answering all of the questions detailed above. This requires solutions to maintain lengthy, efficiently indexed historical records of employee activity so that these questions can be answered rapidly. And by rapid, I mean at most hours, not weeks. Platforms that blend analytics with forensics, allowing analysts to pivot quickly from detection to efficiently investigating and triaging the incident, are the ones that will win in the market. Without this type of platform, your analysts will continue swimming in a sea of alerts without the necessary context to take action.
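
As a toy illustration of why indexing matters, the sketch below (in Python) groups activity records by URL so that the investigation questions listed earlier become simple lookups. The `Event` record, its fields, the action names, and the sample data are all hypothetical; real platforms store far richer records in purpose-built data stores.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user: str       # employee the event belongs to
    action: str     # "received", "clicked", or "forwarded"
    url: str        # link contained in the email
    peer: str = ""  # sender for "received", recipient for "forwarded"


def build_index(events):
    """Group events by URL, then by action, so lookups avoid a full scan."""
    index = defaultdict(lambda: defaultdict(list))
    for event in events:
        index[event.url][event.action].append(event)
    return index


def summarize(index, url):
    """Answer: who received the link, who clicked it, who forwarded it to whom?"""
    by_action = index.get(url, {})
    return {
        "recipients": sorted({e.user for e in by_action.get("received", [])}),
        "clicked": sorted({e.user for e in by_action.get("clicked", [])}),
        "forwards": sorted({(e.user, e.peer)
                            for e in by_action.get("forwarded", [])}),
    }


# Hypothetical activity log for a single suspicious link.
BAD_URL = "https://loginn.example/reset"
log = [
    Event("bob", "received", BAD_URL, peer="it-support@loginn.example"),
    Event("bob", "clicked", BAD_URL),
    Event("bob", "forwarded", BAD_URL, peer="alice"),
    Event("alice", "received", BAD_URL, peer="bob"),
]

print(summarize(build_index(log), BAD_URL))
```

With the index built ahead of time, each question about a given link costs a dictionary lookup rather than a pass over the entire history, which is the difference between answering in hours and answering in weeks once the history covers months of activity.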

