Using Big Data to Fight Phishing
How Data Mining Can Pinpoint the Spam in the Haystack

Spear-phishing campaigns are becoming localized, small and capable of slipping past spam filters, and current efforts to mitigate risk aren't doing the job, says Gary Warner, director of research for computer forensics at the University of Alabama at Birmingham.

One big problem: Conventional phishing prevention practices assume that DMARC - the Domain-based Message Authentication, Reporting and Conformance initiative that aims to standardize how e-mail receivers perform e-mail authentication - will be the ultimate answer, say Warner and Greg Coticchia, CEO of Malcovery Security, an anti-phishing technology company recently spun off from the university.

"The whole premise behind DMARC is that if I sign my outbound e-mail and someone receives an e-mail that's from my domain but hasn't been signed by me, they will know that they should reject it," Coticchia says in an interview with Information Security Media Group [transcript below]. "The problem is, if I send you an e-mail and I say it's from [a look-alike domain], the consumer doesn't know that [it and the bank's real domain] are two different places. In fact, some of these banks already do business from 150 different domain names."

Warner says another key issue is the siloing of systems and departments at most organizations.

"The fraud analyst group is one place," he explains. "The network defenders are in another place; the perimeter defenders are in another place." As a result, these different departments aren't coming together to facilitate enterprise security intelligence to identify trends.

Enterprise security intelligence, Warner says, requires the use of big data and the application of data-mining principles to find that proverbial "needle in a haystack."

Right now, "the larger the organization, the greater chance there is that there are silos," especially within and among security and fraud response teams, he says. As a result, those departments cannot adequately consume intelligence.

During this interview, Warner and Coticchia discuss:

  • Why standard countermeasures have proven ineffective when it comes to mitigating spear-phishing risks;
  • Why DMARC will never be a silver bullet; and
  • The role of big data in fighting phishing.

At UAB, Warner focuses on the problems faced by cybercrime investigators in law enforcement and elsewhere. He also serves as chief technologist at Malcovery. Earlier, Warner was IT director for a publicly traded energy company. For the past six years, he has been active in the FBI's InfraGard program. He also has served on the national board of the Energy ISAC and currently serves as a Microsoft Security MVP.

Coticchia has more than 25 years of experience in high-tech products and services. He previously served as CEO and co-founder of eBillingHub, now part of Thomson Reuters. He teaches business-to-business marketing and entrepreneurial leadership at the University of Pittsburgh Katz School of Business.

Spin-Off Company

TRACY KITTEN: Can you give us a brief overview of Malcovery and what it does?

GREG COTICCHIA: Gary's work at the University of Alabama at Birmingham was really renowned in the areas of phishing, spam and malware, and close to $3 million of the research had been put into the technology ... that helped identify the source and nature of cyber-attacks. It's a very valuable thing.

In today's world, as you know, we have a perimeter. We have multiple layers of defense that are really starting to crumble as a variety of new technologies are brought into the office, between tablets and mobile and just the structure of data. As a result of that, we have to be much smarter in our security technology. Malcovery is based upon all the technology that Gary and his team developed at UAB so that we could actually identify the source and nature of those cyberthreats. In today's world, in many cases you're playing "whack-a-mole," just dealing with the symptoms, and we're dedicated to the idea that if you can find the root source, the root cause, you can be much more effective in today's world.

Biggest Phishing Mistakes

KITTEN: What is the biggest mistake the online world is making, where phishing prevention is concerned?

GARY WARNER: I think there are a few things. Going back to the idea of this eroding perimeter, people really are trying to rely on old ways of dealing with the problems - end-user education, training, or some type of web-filtering black list, or worse yet, still relying on ineffective ... mitigation services in which the end-user organization, whether it's a financial company, retail company or government organization, calls an organization outside their business to help them take down a phish. That has proven to be ineffective over and over again because they're really just dealing with the symptoms and playing "whack-a-mole" with the problems of phishing. Whereas if you can build the right countermeasures in by identifying the source and nature, you can actually stop or prevent future attacks.

COTICCHIA: The important thing to realize is that the average attacker is going to keep coming back to attack the same institution until that institution has put in an effective countermeasure. Ignoring the fact that the same criminal has hit you 100, 200 or even 500 times is just silly. We aren't learning from the attacks that we've experienced in the past. We're treating every attack as if it's the first time this has ever been seen, and that's really what's at the core of our research program and in our products at Malcovery. How do we learn from the past incidents to help build a more effective countermeasure moving forward?

Using Big Data

KITTEN: How is big data being used to help attack phishing schemes?

WARNER: We sometimes say at the university that what we're dealing with is the intersection of criminal justice and big data. The idea is in our database of phishing sites - we call it our phishing intelligence system, or Phish IQ - we have 550,000 documented confirmed phishing sites. We've studied those. We have machine-learning algorithms that have gone through and figured out what do we know about these past attacks to the point that when a new phishing site comes up, we can say, "Here's the history of all the sites that this attack is related to. Here are some characteristics about that attacker and here are some indicators that, if you're the financial institution, would help you to tie a particular withdrawal from a particular customer's account to a specific criminal activity or even a specific actor." That ability to tie together the financial loss to the particular phishing site, or even to say that this whole string of phishing sites is the same criminal, is really a key to how we learn.
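As an illustration of the linking idea Warner describes, the sketch below groups phishing sites that share at least one indicator. It is a minimal example under stated assumptions, not the university's system: the site names, the indicator strings and the function `cluster_sites` are all hypothetical, and a plain union-find grouping stands in for the machine-learning algorithms mentioned in the interview.

```python
from collections import defaultdict

def cluster_sites(sites):
    """Group sites that share any indicator, directly or through a chain.

    `sites` maps a site name to a set of indicator strings. Both the
    names and the indicator format are hypothetical, for illustration.
    """
    parent = {site: site for site in sites}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index sites by indicator, then merge every site sharing an indicator.
    by_indicator = defaultdict(list)
    for site, indicators in sites.items():
        for indicator in indicators:
            by_indicator[indicator].append(site)
    for linked in by_indicator.values():
        for other in linked[1:]:
            union(linked[0], other)

    groups = defaultdict(set)
    for site in sites:
        groups[find(site)].add(site)
    return list(groups.values())

# Hypothetical evidence: two sites share a drop e-mail address.
observed = {
    "site-a": {"kit:abc", "drop:x@example.com"},
    "site-b": {"drop:x@example.com"},
    "site-c": {"kit:zzz"},
}
clusters = sorted(sorted(group) for group in cluster_sites(observed))
```

Here `site-a` and `site-b` land in one group because they share a drop address, while `site-c` stays on its own. A real system would weigh indicators rather than treat any single match as proof of a common actor.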

We do the same thing with malware. It's not that there's a new computer virus and we need a signature for that virus. If you understand the infrastructure of how that new virus performs with relation to the command-and-control servers and the malware drop sites, where are the points that we could mitigate an entire family or entire generation of the same malware rather than dealing with each incident as if it's the first time we've ever seen it and writing a fresh signature from scratch? There are better ways to address these things, but you have to have the history and the learning that goes with that history in order to address them in a new way.

DMARC Initiative

KITTEN: How or why does the DMARC initiative fall short?

COTICCHIA: First, let me say we're big fans of DMARC. We wish them well. We hope that it will be widely adopted. But it's a little bit naïve to just say DMARC is the answer to phishing. The problem is that consumers really don't know or care where e-mail comes from. The whole premise behind DMARC is that if I sign my outbound e-mail and someone receives an e-mail that's from my domain but hasn't been signed by me, they will know that they should reject it. The problem is if I send you an e-mail and I say it's from [a look-alike domain], the consumer doesn't know that [it and the bank's real domain] are two different places. In fact, some of these banks already do business from 150 different domain names, so it fails.

A good friend of ours at LinkedIn calls these "cousin domains," the domains that aren't addressed by DMARC, because if the consumer believes that this may be a valid domain name and yet there's not a corresponding DMARC record, it's going to be delivered because DMARC only blocks if there's a corresponding DMARC record and the signature doesn't match. If there's no record, the e-mail is delivered. So all I have to do is slightly alter your domain name or put something that to a consumer would be realistic, and the e-mail slides right through.
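The delivery behavior described in this answer can be sketched in a few lines. This is a deliberately simplified model of the interview's description, not the DMARC specification: the set `DMARC_PROTECTED`, the domain names and both function names are hypothetical, and the look-alike check uses generic string similarity from Python's standard `difflib` module as one possible heuristic.

```python
from difflib import SequenceMatcher

# Hypothetical: domains assumed to publish a DMARC record.
DMARC_PROTECTED = {"examplebank.com"}

def simplified_dmarc_decision(from_domain, signature_valid):
    """Model the behavior described above: enforce only when a record exists."""
    if from_domain not in DMARC_PROTECTED:
        # No record for this domain, so there is nothing to enforce.
        return "deliver"
    return "deliver" if signature_valid else "reject"

def looks_like_cousin(candidate, legit, threshold=0.8):
    """Flag a domain that is not the real one but is textually close to it."""
    if candidate == legit:
        return False
    return SequenceMatcher(None, candidate, legit).ratio() >= threshold

# A spoof of the protected domain is rejected...
spoof_result = simplified_dmarc_decision("examplebank.com", signature_valid=False)
# ...but a slightly altered "cousin" domain has no record and is delivered.
cousin_result = simplified_dmarc_decision("examp1ebank.com", signature_valid=False)
cousin_flagged = looks_like_cousin("examp1ebank.com", "examplebank.com")
```

The point of the sketch is the gap between the two results: the unsigned message from the protected domain is rejected, while the same message from the altered domain is delivered. A similarity check like `looks_like_cousin` is one way a receiver might flag such domains, though the threshold here is arbitrary.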

Enterprise Security Intelligence

KITTEN: What role is enterprise security intelligence playing and can you define it?

COTICCHIA: What we're really talking about here is the intersection of big data defined in terms of terabytes, or more, of ... evidentiary data that we can take a look at, so that you can apply data-mining principles to those evidentiary data sources and provide that, combined with unique analytics, to be able to get results and to be able to find that needle in a haystack.

Now, the second part of that is, it's one thing to have the intelligence; it's one thing to be able to use big data and these techniques against that data to be able to find the cause of a particular security threat that shows up as phishing, malware or spam. But it's another thing to make it actionable.

... There are a lot of organizations that say, "Give me the countermeasures that I can stop this or prevent this from happening." Other organizations want to work with law enforcement, and they want to put together the information to help go after those criminals, and we can actually at Malcovery help in both cases. We can help them give the network operations and security professionals the information to stop or prevent that. Or for major financial institutions, payment processors or e-tailers, we can help them take that next step and create the case for law enforcement to go after the bad guys.

Phishing, Malware Research

KITTEN: Gary, what kind of research is your team doing at the university?

WARNER: One of the research areas that we're really trying to address is on the computer science side: How do we look at malware in new ways so that you don't have to rely on signatures? One of the ideas is basically taking the malware apart and looking for common components that are part of the malware that would help us to indicate that something is a new threat - not really a signature but a map, if you will, of how that malware functions internally and being able to recognize similar data structures in new executable files that may come across. Whether anyone has a signature for it or not, we should be able to recognize the structure of the file as a potential threat.
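One generic way to compare files by shared internal structure rather than by exact signature is to measure how many short byte sequences they have in common. The sketch below uses Jaccard similarity over byte n-grams. That technique is a stand-in chosen for illustration, not the method Warner's team uses, and the function names and sample byte strings are hypothetical.

```python
def byte_ngrams(data, n=4):
    """Return the set of all n-byte windows in `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def structural_similarity(sample, known, n=4):
    """Score from 0.0 to 1.0: share of n-grams the two byte strings have in common."""
    return jaccard(byte_ngrams(sample, n), byte_ngrams(known, n))

# Hypothetical byte strings standing in for executable contents.
known_family = b"ABCDEFGHIJKLMNOP"
new_variant = b"ABCDEFGHIJKLMNOX"   # differs only in the final byte
unrelated = b"0123456789zyxwvu"

variant_score = structural_similarity(new_variant, known_family)
unrelated_score = structural_similarity(unrelated, known_family)
```

An exact-match signature would miss `new_variant` because one byte changed, while the overlap score stays high. Real malware analysis works on disassembled code and data structures, so this only shows the principle.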

We also do research on the criminal justice side. It's almost like the economics of cybercrime. A lot of big things that we have seen published by the analyst community are really based on very poor data sources. If we really want to know the impact of malware, if we really want to know what industries are targeted by malware or phishing attacks, we need to take the log data and do proper analysis of that. ... If there's a phishing site, how many people visited that phishing site? Where did those people visit from? How soon after the beginning of that attack did the phishing site visitors begin to show up and give their credentials? How many people who came into the site actually completed the phisher's questionnaires and sent their data out? All of that stuff can be easily analyzed by looking at the log files, and yet we almost never do that.
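The questions Warner lists map directly onto a small log-analysis script. The sketch below is illustrative only: the log format, the sample entries, the `/submit.php` path and the function `summarize` are all hypothetical, and a real web server log would need its own parser.

```python
from datetime import datetime

# Hypothetical, simplified log format: "ISO-timestamp client-ip METHOD path"
SAMPLE_LOG = """\
2013-01-10T09:00:00 198.51.100.7 GET /login.html
2013-01-10T09:05:00 198.51.100.7 POST /submit.php
2013-01-10T09:30:00 203.0.113.9 GET /login.html
2013-01-10T10:15:00 192.0.2.44 GET /login.html
2013-01-10T10:16:00 192.0.2.44 POST /submit.php
"""

def summarize(log_text, attack_start, submit_path="/submit.php"):
    """Count visitors, completed submissions and the delay to the first visit."""
    visitors, submitters = set(), set()
    first_visit = None
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        timestamp, ip, method, path = parts
        when = datetime.fromisoformat(timestamp)
        visitors.add(ip)
        if first_visit is None or when < first_visit:
            first_visit = when
        # Treat a POST to the form handler as a completed questionnaire.
        if method == "POST" and path == submit_path:
            submitters.add(ip)
    return {
        "unique_visitors": len(visitors),
        "completed": len(submitters),
        "completion_rate": len(submitters) / len(visitors) if visitors else 0.0,
        "minutes_to_first_visit": (
            (first_visit - attack_start).total_seconds() / 60
            if first_visit else None
        ),
    }

# Assume the lure e-mails went out at 08:30 on the same day.
report = summarize(SAMPLE_LOG, attack_start=datetime(2013, 1, 10, 8, 30))
```

With the sample data, `report` shows three unique visitors, two of whom submitted the form, and a first visit 30 minutes after the assumed start of the campaign. Counting by IP address is a rough proxy for counting people, since several victims can share one address.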

We end up publishing analyst reports and talking about the rate at which money is stolen out of phishing sites; but it's almost done in surveys. Cisco did a fantastic study last year, "E-mail Security: This Time it's Personal." There were statements in there about the impact of malicious e-mail campaigns, but the source of that data was they asked a bunch of CISOs questions about malware. I would rather go to the raw log data and get truth rather than opinions of highly ranked and very intelligent people. Sometimes asking managers survey questions is not the right way to really learn the true impact of malware and phishing attacks.

Evolution of Phishing Attacks

KITTEN: How have you seen some of these phishing schemes and malware attacks evolve over the last 12 to 18 months?

COTICCHIA: There are always the prime targets in the financial sector, in retail, online or e-tailing, and government organizations. Those tend to be, along with ISPs, the four primary segments of the marketplace that are being attacked. APWG, the Anti-Phishing Working Group, recently announced an increase in the attacks, the capability of attacks and sophistication of the attacks. There hasn't been a lessening of anything. Criminals are becoming more sophisticated. This is what's so important about our working relationship with Gary. This is at the intersection of criminal justice and network security - putting those two things together to be able to understand how a phisher is going to use e-mail to start "casing your joint" and looking at behavior, and then actually unleash a particular phish so it can be more successful. Those are the kinds of things that we study so that we can make sure that we can see that early on and help organizations actually prevent that before the damage is done.

WARNER: I would just add that the big trend that we're seeing right now is the attacks are smaller in volume and more targeted. This is because the adversary knows more about their targets. Where in the past we may have seen spam campaigns with millions of messages being sent out for phishing lures, what we may see today is a campaign that only goes out to 500 people, but all of them are customers of yours. The fact that the criminals often now already know who your customers are, what their balances are and what their potential range of balances are, and they can target the high-wealth individuals who bank at [a certain bank] - that targeting has a couple of problems associated with it. A lot of malware signature writers base their prioritization of what signatures they create on the volume of the malware that's received. If there's a very low-volume malware, it doesn't get a high-priority response for writing a new signature, in the same way the phishing response is often based on public spam traps. If we see the phishing e-mail in a wide distribution, it's going to show up in everyone's spam traps and we'll be able to begin mitigation.

If your targeted customers are the only ones who see the e-mail, you need to rely even more heavily on the intelligence from your consumer base. When they report, "I got this funny e-mail," we need to make sure that the second that hits your inbox, it's processed, evaluated and determined whether or not it's a phish, whether it's part of a major campaign or if it's a one-off, and begin the mitigation process or the intelligence-gathering process immediately.

Biggest Challenges

KITTEN: Are there any additional thoughts you would like to share?

WARNER: One of the biggest challenges we have as an intelligence provider is corporations saying, "I can't use that intelligence because we're so fragmented within our security or response organization." The fraud analyst group is in one place; the network defenders are in another place; the perimeter defenders are in another place. And yet to fully take advantage of enterprise security intelligence, all of those groups need the real-time ability to integrate data from the other groups or from an external party such as ourselves and then share data back. How did that impact us? Was that an effective countermeasure?

Right now, unfortunately, the larger the organization, the greater chance there is that there are silos within the security and fraud response area that prevent them from being able to consume intelligence. I think that's the thing that we need to really focus on. If enterprise security intelligence is the answer, we have to appropriately restructure our security response in the corporations to be able to consume and take advantage of that intelligence.

COTICCHIA: It's much like what we saw in our government organizations about sharing data post 9/11, where there were these large, siloed organizations and a lack of cooperation among the government organizations. You see that replicated. It's somewhat human nature. ... That ability to share that data and share it in the right way is going to really benefit an individual corporation and its efforts to take the intelligence and make it actionable, put in the right countermeasures and take the right steps.
