Detecting the Bad from the Good…

UPDATE 4/3/12: I worked with Joel Elser on the Snort-sigs mailing list to develop the signature below. However, there has been some concern about the system resources needed to regex every GET request to the internet. I’m thinking I might have to adjust the rule to exempt the .com and .net TLDs. Less effective, I know, but at least it won’t kill the sensor. This technique is probably better suited to offline static analysis of logs than to real-time IDS. Damballa has two good papers on their work on detecting DGAs (domain generation algorithms) and how they haven’t gone away now that Conficker is out of the news.  Links
alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:"WEB-MISC http header with 9 or more consonants"; flow:to_server,established; content:"GET"; http_method; content:"Host: "; http_header; pcre:"/^Host:\s[tnrshdlfcmgpwbvkxjyqz0-9]{9,}$/Hi"; metadata:service http; classtype:bad-unknown;)
UPDATE 10/28/12: The above sig does not match unless there are NO vowels in the host header.  Please see an updated post here
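The anchoring problem is easy to reproduce outside Snort. Here is a minimal sketch using Python's `re` module rather than Snort's PCRE engine, so treat it as an approximation. The host names are made up, and the unanchored variant is just one possible fix, not necessarily the one in the updated post.

```python
import re

# Same character class as the rule: consonants (including y) and digits.
CLASS = "[tnrshdlfcmgpwbvkxjyqz0-9]"

# As written in the rule: the consonant run must fill the entire header value.
anchored = re.compile(r"^Host:\s" + CLASS + r"{9,}$", re.IGNORECASE)

# Assumed alternative: let the run appear anywhere after "Host: ".
unanchored = re.compile(r"^Host:\s.*" + CLASS + r"{9,}", re.IGNORECASE)

headers = [
    "Host: qwrtpsdfgh",          # hypothetical: no vowels and no dots at all
    "Host: qwrtpsdfgh.example",  # hypothetical: random label plus a normal suffix
    "Host: www.example.com",     # ordinary host
]

for header in headers:
    print(header, bool(anchored.search(header)), bool(unanchored.search(header)))
```

Only the first header satisfies the anchored pattern, because the dot and the vowels in any real suffix break the run before `$` is reached.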
In security monitoring, it’s your job to use your creativity to design rules and dashboards that surface evidence of malicious activity. In general there are two strategic ways to do this: detect the bad (blacklisting) or detect the good (whitelisting). It doesn’t take a genius to realize that whitelisting is the more effective strategy overall, although it’s much harder to implement. Even so, blacklisting is still useful for adding defense in depth.
Speaking of getting creative to detect the bad guys, a recent thought I had is to look for one of the tactics they use to avoid being taken offline: registering and using a large number of domains so that content filtering and whitehat organizations can’t keep up. You could see this during the Conficker worm battles, where early versions were programmed to connect to 250 domains a day. When the Conficker Cabal launched an effort to pre-register those names, later worm versions came out in direct response with algorithms for 50,000 domains per day.
The flaw I see in the “I can register more domains, faster than you can” tactic is often the makeup of those domains. By their nature they are frequently a collection of pseudo-random letters, not valid words in your language strung together like a normal domain.
For Example: (consonants in a row in red)
This makes them fairly easy to spot when browsing through your event stream. However, it is obviously impossible to manually watch HTTP logs all day looking for these non-conforming domains. So it made sense to me to look for what doesn’t conform to the majority of good URLs. If you are looking for the proverbial needle in a haystack, you need to apply a tactic to everything that will separate out what doesn’t follow the rules of being hay. You could start a fire to burn the haystack (needle and all), which would leave the needle relatively unharmed. Or you could spread the haystack over a magnetic strip that only the needle would stick to. An example of this in infosec defense would be detecting host names in the HTTP header or web logs that do not conform to the rules of your language, which is American English in my world. I’ve been considering how to do this recently, and what made sense to me is “number of consonants without a vowel between them”.
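To make that measure concrete, here is a minimal sketch in Python (not the .php used for my script). The function name is hypothetical, the host names are made up, and counting digits as part of a run is an assumption borrowed from the character class in the Snort rule.

```python
VOWELS = set("aeiou")  # assumption: "y" counts as a consonant here; tune to taste


def longest_consonant_run(host: str) -> int:
    """Return the longest run of consecutive consonants or digits in host."""
    longest = current = 0
    for ch in host.lower():
        if ch.isascii() and ch.isalnum() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # vowels, dots, and hyphens all break a run
    return longest


print(longest_consonant_run("www.example.com"))     # 3
print(longest_consonant_run("xkqzjvbnrt.example"))  # 10
```

A host would then be flagged when the returned value reaches whatever threshold you settle on.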
To do this you need the rules of your language, which you can get through the study of phonotactics, which deals with “restrictions in a language on the permissible combinations of phonemes” (wiki here). In particular I’m looking for consonant clusters, which some languages don’t even allow. In American English, very few words have more than 5 consonants in a row. Now, domains often string words and even numbers together, as I do in my own domain. So to test and fine-tune my theory, I created a .php script that loops through a file of domains, one per line, and compares them against a Perl Compatible Regex meant to find a string of x consonants and/or numbers in a row. To do this, I collected 1.215 million domains from:
1. (proven malicious domains)
2. (recently expired domains could be good or bad)
3. (top 1 million websites on the web)
As you can see, the file is made up mainly of “good domains” from Quantcast, because this filter is meant to be an “if it hits, you need to investigate” kind of technique, so the false positive rate needs to be near zero. The table below gives an idea of the true positive/false positive ratio. I recommend you play with the variables, as YMMV. My best results came from excluding the Y at 8 or fewer consonants in a row.
Domains.txt contained 1,215,000 total domain names.
NOTE:  Y is sometimes a vowel in American English so how you handle it depends on your false positive (FP) tolerance.
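For anyone who wants to try both treatments of Y at different thresholds, here is a rough sketch of the kind of loop involved. It is written in Python, not the .php I actually used. The function names, the `y_is_consonant` flag, and the sample host names are all hypothetical.

```python
import re
from typing import Iterable

# Consonants (without y) and digits, in the same order as the Snort rule.
BASE = "tnrshdlfcmgpwbvkxjqz0-9"


def build_pattern(min_run: int, y_is_consonant: bool) -> re.Pattern:
    """Compile a regex matching min_run or more consonants/digits in a row."""
    chars = BASE + ("y" if y_is_consonant else "")
    return re.compile(f"[{chars}]{{{min_run},}}", re.IGNORECASE)


def count_matches(domains: Iterable[str], min_run: int, y_is_consonant: bool) -> int:
    """Count how many domains contain a long enough consonant/digit run."""
    pattern = build_pattern(min_run, y_is_consonant)
    return sum(1 for d in domains if pattern.search(d.strip()))


# Made-up sample; in practice pass an open file with one domain per line.
sample = ["example.com", "rhythmsync.net", "xkqzjvbnrt.info"]

for threshold in (6, 7, 8, 9):
    as_consonant = count_matches(sample, threshold, y_is_consonant=True)
    as_vowel = count_matches(sample, threshold, y_is_consonant=False)
    print(threshold, as_consonant, as_vowel)
```

In the sample, `rhythmsync.net` only hits when Y is treated as a consonant, which is exactly the kind of name where your FP tolerance decides the setting.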
Run length    Without “Y” as a vowel               With “Y” as a vowel
6 or more     12,827 matches total (some FPs)      23,653 matches total (some FPs)
7 or more     3,979 matches total (some FPs)       7,128 matches total (some FPs)
8 or more     1,615 matches total (some FPs)       2,461 matches total (some FPs)
9 or more     846 matches total (some FPs that     1,127 matches total (some FPs)
              could be “bad”, but legit English)
Often in infosec, after creating a control to mitigate risk, you immediately have to create exceptions for the real world. In this case the exceptions would be for domains that use random letter strings in the host part of the name, because that part ends up in the Host header of the HTTP request. The FPs I’ve found are mostly cloud providers.
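One way to apply such an exception list before alerting is a simple suffix check. This is a minimal Python sketch and not part of my script. The suffixes are placeholders, so substitute whatever providers show up in your own logs.

```python
# Hypothetical exception suffixes; replace with providers seen in your own logs.
EXCEPTION_SUFFIXES = ("cdn.example.net", "storage.example.com")


def is_excepted(host: str, suffixes=EXCEPTION_SUFFIXES) -> bool:
    """Return True if host equals, or is a subdomain of, an excepted suffix."""
    # Drop an optional :port from the Host value (IPv6 literals are not handled).
    host = host.split(":", 1)[0].lower().rstrip(".")
    return any(host == s or host.endswith("." + s) for s in suffixes)


print(is_excepted("xkqzjvbnrt.cdn.example.net"))  # True
print(is_excepted("evilcdn.example.net"))         # False
```

Matching on a leading dot keeps a look-alike such as `evilcdn.example.net` from riding on the `cdn.example.net` exception.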
So in closing, there will be false positives you can’t account for in an exception list, and it will not detect all bad domains, but maybe it’s another tool in the box for the defenders. Could the bad guys quickly rewrite their domain registration scripts to register domains made of dictionary words strung together to beat the regex? Yep. But a major goal in the defense of anything is forcing your adversary to change tactics in the eternal game of cat and mouse, while raising the bar required to successfully defeat you specifically. To paraphrase an old joke, it doesn’t matter whether you outrun the bear or the other guy; either way you survive.
I have the .php script and domains.txt linked below, if anyone is interested. Suggestions on my script welcome.
PS. The order of the letters in the regex looks random, but in fact they are arranged by what several sites suggested is the most accepted letter frequency in average English text. I don’t know if that will help the script’s efficiency, but I figured it couldn’t hurt.
