Tuesday, March 24, 2015

100 Days of Malware

It's now been a little over 100 days since I started running malware samples in PANDA and making the executions publicly available. In that time, we've analyzed 10,794 pieces of malware, which generated:
  • 10,794 record/replay logs, representing 226,163,195,948,195 instructions executed
  • 10,794 packet captures, totaling 26GB of data and 33,968,944 packets
  • 10,794 movies, which are interesting enough that I'll give them their own section
  • 10,794 VirusTotal reports, indicating what level of detection they had when they were run by malrec
  • 107 torrents, containing downloads of the above
I've been pleased by the interest malrec has generated. We've had visitors from over 6000 unique IPs, in 89 different countries:

The Movies

There's a lot of great stuff in these ~10K movies. An easy way to get an idea of what's in there is to sort by filesize; because of the way MP4 encoding works, larger files in general mean that there's more going on on-screen (though only up to a point – the largest ones seemed to be mostly command prompts scrolling, which wasn't very interesting). I took it upon myself to watch the ones between ~300KB and 1MB, and found some fun videos:

Several games:


Anime-like game

"Wild Spirit"

Some fake antivirus:

Extortion attempts were really popular:

Though they didn't always invest in fancy art design:

Extortion (low-rent)

Download managers and trojaned installers were very popular:

Broken trojaned installer

Download manager

Some used legitimate-looking documents to disguise nefarious intentions:

Some maritime history

German Britfilms

Chinese newspaper

Finally, there's the weird and random:

Trust Me, I'm a Doctor


Not pictured: there was even one sample that played a porn video (NSFW) while it infected you; I guess it was intended as a distraction?

Antivirus Results

After malrec runs a sample in the PANDA sandbox, it checks to see what VirusTotal thinks about the file, and saves the result in VirusTotal's JSON format. From this, we can find out what the most popular families in our corpus are. For example, here's what McAfee thought about our samples:

And Symantec:

In the graphs above, "None" includes both cases where the AV didn't detect anything and cases where the sample hadn't been submitted to VirusTotal yet. You therefore should probably not use this data to try and draw any conclusions about the relative efficacy of Symantec vs McAfee. I'd like to go back and see how the detection rate has changed, but unfortunately my VirusTotal API key is currently rate-limited, so running all 10,794 samples would be a pain.

Bugs in PANDA

Making recordings on such a large scale has exposed bugs in PANDA, some of which we have fixed, others of which need further investigation:
  • One sample, 5309206b-e76f-417a-a27e-05e7f20c3c9d, ran a tight loop of rdtsc queries without interrupts. Because PANDA's replay queue would continue filling until it saw the next interrupt, this meant that the queue could grow very large and exhaust physical memory. This was fixed by limiting the queue to 65,536 entries.
  • Our support for 64-bit Windows is much less reliable than I would like. Of the 153 64-bit samples, 45 failed to replay (29.4%). We clearly need to do better here!
  • We see some sporadic replay failures in 32-bit recordings as well, but they are much more rare: out of the 10,641 32-bit recordings we have, only 14 failed to replay. I suspect that some of these are due to a known bug involving the recording of port I/O.
  • One sample, MD5 1285d1893937c3be98dcbc15b88d9433, we have not even been able to record, because it causes QEMU to run out of memory; if you'd like to play with it, you can download it here.


With our current disk usage (1.2TB used out of 2TB), I'm anticipating that we'll be able to run for another 50 days or so; hopefully by then I'll be able to arrange to get more storage.

Meanwhile, I've started to do some deeper research on the corpus, including visualization and mining memory for printable strings. I'm looking forward to getting a clearer picture of what's in our corpus as it grows!