Reverse engineering a JavaScript obfuscated dropper
Nowadays, one of the most common techniques for spreading malware on Windows systems is the JavaScript (js) dropper. In most attack scenarios, a js dropper is the first stage of a malware infection.
This works because Windows systems allow the execution of various scripting languages through the Windows Script Host (WScript). This means that system calls to the underlying operating system can be executed through JavaScript.
Using a js dropper as the first stage of a malware infection is a method that allows malware authors to bypass NIDS, HIDS and endpoint anti-malware more easily than dropping a single binary containing the whole malicious logic. At the same time, JavaScript obfuscation is much easier to implement than binary obfuscation. Furthermore, it is harder to detect an obfuscation pattern to create a signature for the malicious dropper due to the dynamicity of the scripting language.
The same js obfuscation techniques are also used to obfuscate browser exploits and to prevent code theft and reuse. In this article, I will describe the different obfuscation techniques and walk through a real, practical example. Knowing how to reverse an obfuscated js dropper is important because most of the time droppers are obfuscated with a chain of one or more techniques.
JavaScript code obfuscation techniques
Obfuscating JavaScript code will complicate the static analysis of the malicious code. The main aim of the obfuscation is to make the understanding of the code logic harder while leaving the behavior of the code unchanged.
Through dynamic analysis you can observe the malicious behavior of the code, but only if certain conditions are met. If those conditions are never met, dynamic analysis will not reveal the malicious behavior. One such condition is a check that the execution environment is not virtualized; if the check fails, the code will not execute. This happened with the js dropper I analyzed, as described in more detail later.
Manual reversing becomes significant when dynamic analysis cannot help the analyst.
According to the research in [1], the basic JavaScript obfuscation techniques can be categorized as follows:
- Randomization Obfuscation
In this type of obfuscation, some elements of the JavaScript code are inserted or changed without changing the semantics of the code.
Common techniques in this category are whitespace randomization, comment randomization, and variable and function name randomization.
- Data Obfuscation
The main aim of these techniques is to convert variable or constant values into the computational results of one or several variables or constants.
Two main techniques belong to this category: string splitting and keyword substitution.
- String splitting consists of converting a string into a concatenation of several substrings.
- Keyword substitution consists of placing a JavaScript keyword in a variable and then using that variable instead of the keyword.
- Encoding Obfuscation
Normally, there are three ways to encode the original code. The first is to convert the code into escaped ASCII characters, Unicode or hexadecimal representations. The second uses customized encoding functions: attackers use an encoding function to create the obfuscated code and attach a decoding function that decodes it during execution.
The third relies on standard encryption and decryption methods. For example, JScript.Encode is a method created by Microsoft to encode JavaScript code. It can be used to protect source code as well as to evade detection.
- Logic Obfuscation
This type of obfuscation manipulates the execution flow of the JavaScript code by changing its logic structure without affecting the original semantics. There are two ways to implement it. One is to insert instructions that are independent of the functionality. The other is to add or change conditional branches and loops, such as if...else, switch...case, for, and while.
The power of these techniques shows when they are combined. Deobfuscating each technique separately can be easy for an analyst. When they are chained together, and the chain also depends on dynamic parameters, the code can be hard to analyze.
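As an illustration of the string splitting technique described above, the following is a minimal Python sketch that reverses it by merging adjacent single-quoted literals joined with "+". The function name, the regex, and the sample string are assumptions made for this example and are not taken from the analyzed dropper. It is a plain regex pass, not a JavaScript parser, so it can be fooled by quotes or "+" characters inside strings.

```python
import re

# Two single-quoted literals joined by '+', with optional whitespace.
# Literals containing quotes or backslashes are deliberately skipped.
CONCAT_RE = re.compile(r"'([^'\\]*)'\s*\+\s*'([^'\\]*)'")


def join_split_strings(source):
    """Merge 'a' + 'b' into 'ab' repeatedly until nothing changes."""
    previous = None
    while previous != source:
        previous = source
        source = CONCAT_RE.sub(
            lambda m: "'" + m.group(1) + m.group(2) + "'", source
        )
    return source


# Hypothetical input, invented for illustration.
print(join_split_strings("var a = 'WScr' + 'ipt.' + 'Shell';"))
# prints: var a = 'WScript.Shell';
```

The loop is needed because a single substitution pass only merges non-overlapping pairs, so a chain of three or more substrings takes several passes to collapse.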
Real case scenario
Recently I had the opportunity to analyze a js dropper that used a custom obfuscation function to hide a common string splitting obfuscation technique (String.fromCharCode), which in turn needed a dynamic parameter to execute properly. It also used encoding obfuscation for the variable names, and the custom obfuscation function contained a lot of junk code.
For the deobfuscator script, you can choose any scripting language that supports regular expressions; in this case, I used Python.
The dropper was delivered in a .rar archive that contained the .jse dropper.
The SHA256 hash of the .rar is:
0d72340c876292dcdc8dfa5b3b1cc7b6010902a2d28b5b15c8c35a3a284e7d35
The SHA256 hash of the .jse is:
652566914671a9d5fb5ad0b75b6c9023fa8c9cff2c2d2254daad78ba40c14e0b
Step 1: Decoding of the script
Opening the .jse dropper, I quickly recognized that it was encoded with the JScript.Encode function provided by Microsoft, as the following example code shows:
I used a binary found online [2] to decode the script into the .js format, and I obtained the decoded, but still obfuscated, dropper:
Step 2: Deobfuscating the encoding obfuscation
To figure out which obfuscation technique the dropper used, I first had to beautify the code with an online beautifier [3], because the original dropper contained half a million characters on a single line. A small piece of the result follows:
Note that all the deobfuscation steps are done on the original js dropper and not on the beautified code. The beautified code is used just to have better readability.
As you can see, the first obfuscation technique used is encoding obfuscation with the Unicode notation.
To decode it in a more readable format I used the following regex to parse the characters:
(\\u0\d{3})+
I defined a callback function to replace each occurrence. I used a callback because the replacement had to be computed dynamically from the parsed data. The Python code follows:
Every match is replaced by the UTF-8 encoding of the characters it represents.
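A minimal sketch of such a callback-based replacement is shown here. It is not the original script: the function names are hypothetical, and the pattern is a variant chosen for this example that accepts any four hexadecimal digits after the "\u" prefix, so that escapes such as \u006C are decoded too.

```python
import re

# One or more consecutive \uXXXX escapes (assumed pattern, see lead-in).
UNICODE_RUN = re.compile(r"(?:\\u[0-9a-fA-F]{4})+")


def decode_run(match):
    """Callback for re.sub: turn a run of \\uXXXX escapes into characters."""
    escapes = match.group(0)
    # Splitting on the literal "\u" leaves an empty first item, then the
    # four-digit hexadecimal code of every escape in the run.
    codes = escapes.split("\\u")[1:]
    return "".join(chr(int(code, 16)) for code in codes)


def decode_unicode_escapes(source):
    """Replace every run of Unicode escapes in the source text."""
    return UNICODE_RUN.sub(decode_run, source)


# Hypothetical input, invented for illustration.
print(decode_unicode_escapes(r"var \u0061\u0062\u0063 = 1;"))
# prints: var abc = 1;
```

Passing a function instead of a string to re.sub is what makes the replacement depend on the matched text.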
Note that encoding is not obfuscation; it is used as an obfuscation technique just to make variable names harder to understand.
Step 3: Deobfuscating the main obfuscation pattern
Next, I started to look around and tried to identify how the basic obfuscation techniques were chained. It was not an easy task because the .js contained more than 16 thousand lines of code. The following is an example of the code:
When dealing with this huge obfuscated dropper, it is impossible to identify each obfuscation technique and replace it manually.
So, to deobfuscate it, I needed to understand the logic of the obfuscation pattern and write a script that would automate the deobfuscation.
The chaining of one or more obfuscation techniques creates an obfuscation pattern.
The obfuscation techniques used in the obfuscation patterns are the basic techniques that I explained in the section on JavaScript code obfuscation techniques.
After some time spent debugging the code above, I found that the functions used are an obfuscated version of the "return String.fromCharCode(staticParam + dynamicParam)" function.
This is an implementation of the string splitting data obfuscation technique.
After that, I focused on finding the obfuscation pattern to automate the deobfuscation process through a Python script. In the end, I found the following obfuscation pattern:
So just to give an example, a 'c' character is represented by this code:
As you can see, this obfuscation pattern has different layers of obfuscation.
Identifying the obfuscation pattern is a crucial step in reversing obfuscated JavaScript code, because once you have identified it, you have done 80% of the deobfuscation task. It is the most time-intensive activity, but it is worth it for the automation it enables.
After that, I just needed to implement a script to parse the obfuscation pattern and replace code where needed.
So, from this point on it is just a matter of writing a working regular expression that can parse the obfuscation pattern I identified.
It is not a hard task, but it requires some time and testing. In my case, I used a text editor (Notepad++ [4]) to debug the regex before running it in the Python script.
Finally, I had the following regex to parse the obfuscation pattern:
(\(|\x27\x27\+)\{\w+:.*?,[A-Za-z0-9]+:[A-Za-z0-9]+,[A-Za-z0-9]+:function\(.*?\)\{.*?\}.*?\}\[.*?\]\(.*?\x27Code\x27,\d{1,3}\)\]
Looking at the obfuscation pattern, you can see that the obfuscated function data_obfuscation.string_splitting(staticParam1 + dynamicParam1) has two parameters that I need to parse to deobfuscate the data. Both parameters are numbers: one is static and the other is dynamic. The static one appears near the function's return statement (96 in the code above), and the dynamic one is passed as an argument in the function call (3 in the code above). 96 + 3 = 99, which is the ASCII decimal code for the character 'c'.
To parse the static parameter, I used the following regex:
return\sString.*\((\d{1,3})\+pi\);
Instead, to parse the dynamic parameter I used:
\].*?,\x27Code\x27,(\d{1,3})\)
To parse them correctly I used the capture group functionality in Python, with the following code:
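A minimal sketch of this capture-group extraction is shown here. It is not the original script: the sample block is a hypothetical, heavily simplified stand-in for the real obfuscation pattern, the function name is invented, and the two regexes are the expressions quoted above with their escaping written out as an assumption.

```python
import re

# Hypothetical obfuscated block, invented for this example only; the real
# dropper's blocks are longer and padded with junk code.
SAMPLE_BLOCK = (
    "({k:0,a:b,f:function(pi){return String.fromCharCode(96+pi);}})"
    "['f']('x','Code',3)"
)

# Group 1 captures the static number next to the return statement.
STATIC_RE = re.compile(r"return\sString.*\((\d{1,3})\+pi\);")
# Group 1 captures the dynamic number passed in the function call.
DYNAMIC_RE = re.compile(r"\].*?,\x27Code\x27,(\d{1,3})\)")


def decode_block(block):
    """Return the character a block encodes, or None if parsing fails."""
    static_match = STATIC_RE.search(block)
    dynamic_match = DYNAMIC_RE.search(block)
    if static_match is None or dynamic_match is None:
        return None
    # static + dynamic is the decimal character code, e.g. 96 + 3 = 99.
    return chr(int(static_match.group(1)) + int(dynamic_match.group(1)))


print(decode_block(SAMPLE_BLOCK))
# prints: c
```

Returning None on a failed match keeps unparsed blocks visible instead of silently replacing them with a wrong character.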
Step 4: Tuning of the script
As every programmer knows, every written script must be tuned to work as you expect.
In fact, running the above script did not deobfuscate the data with the readability I wanted as an analyst.
This happened because the regex matched only the average case of the block.
While debugging the script, I noticed that the code block of the obfuscation pattern was used in different places (e.g. inside if statements, try/catch blocks, while loops, and variable definitions). Each occurrence ended with a different closing parenthesis or punctuation mark, and some less common endings that I had not noticed broke the regex match. So, to tune my script I just needed to improve the original regex and restore the ending character of every matched case.
The new regex to parse all the occurrences of the obfuscation pattern became:
(\(|\x27\x27\+)\{\w+:.*?,[A-Za-z0-9]+:[A-Za-z0-9]+,[A-Za-z0-9]+:function\(.*?\)\{.*?\}.*?\}\[.*?\]\(.*?\x27Code\x27,\d{1,3}\)(=|!|\)|,|\*|\]|\+|\s|;)
Moreover, I also needed to change the callback function of the replacement routine to cover all the ending cases. I ended up with the following script:
Finally, running the tuned script, I got a well deobfuscated and readable version of the dropper containing 241 lines of code. The following image shows a piece of the code:
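The idea of restoring the ending character can be sketched as follows. This is not the original script: the block pattern is a deliberately simplified, hypothetical version of the obfuscation pattern, and only its last group, which lists the possible ending characters, follows the tuned regex described above.

```python
import re

# Simplified, hypothetical block pattern. The last group captures the
# character that terminates the block so the callback can put it back.
BLOCK_RE = re.compile(
    r"\{f:function\(pi\)\{return\sString\.fromCharCode\((\d{1,3})\+pi\);\}\}"
    r"\[\x27f\x27\]\((\d{1,3})\)"
    r"(=|!|\)|,|\*|\]|\+|\s|;)"
)


def restore(match):
    """Replace a block with its decoded character and keep the ending."""
    static, dynamic, ending = match.groups()
    char = chr(int(static) + int(dynamic))
    return "'" + char + "'" + ending


# Hypothetical input with two blocks that end in '+' and ';'.
SAMPLE = (
    "var s={f:function(pi){return String.fromCharCode(96+pi);}}['f'](3)+"
    "{f:function(pi){return String.fromCharCode(100+pi);}}['f'](1);"
)

print(BLOCK_RE.sub(restore, SAMPLE))
# prints: var s='c'+'e';
```

Because the ending character is consumed by the match, dropping it in the callback would corrupt the surrounding statement, which is why it is appended to the replacement.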
Step 5: Analyzing the sample
In this step I analyzed the clear, readable code; it was just a matter of understanding JavaScript.
I focused on identifying potential malicious functions to extract meaningful IOCs.
In the deobfuscated version I quickly spotted some useful IOCs that can help prevent this threat from spreading (and also detect it once it has run), for example the two files, .exe and .gop, dropped on disk in the %TEMP% folder:
Another is the file "saymyname.txt" in the %TEMP% directory, which stores the names of all files with a specific extension found on every available disk:
Finally, there is a c2 server from which the main payload is downloaded:
These represent a good set of IOCs to prevent the threat from spreading across your internal network.
However, I wanted to go deeper and understand two things. First, why did the sample not run in my VM? Second, is the algorithm the c2 server uses to serve the main payload dynamic, or could I simply download the .exe?
Looking through the code, I saw a suspicious "if" with interesting conditions, shown in the following code:
From the above snippet of code, I saw that the dropper checks for specific names of running processes and for specific process owners.
First, it enumerates the running processes and stores some information in an array: the process name, executable path, owner domain, and owner name.
After that, it checks whether the array contains strings suggesting that the execution environment is a VM, a sandbox, or a reverse engineering machine.
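The logic of this check can be sketched in Python as follows. This only models the behavior described above and is not the dropper's code: the record type, the function name, and above all the marker strings are hypothetical and were not taken from the sample.

```python
from typing import Iterable, NamedTuple


class ProcessInfo(NamedTuple):
    """The four fields the dropper is described as collecting."""
    name: str
    path: str
    owner_domain: str
    owner_name: str


# Hypothetical marker strings, NOT extracted from the analyzed sample.
SUSPICIOUS_MARKERS = ("vbox", "vmware", "sandbox", "wireshark")


def looks_like_analysis_env(
    processes: Iterable[ProcessInfo], markers=SUSPICIOUS_MARKERS
) -> bool:
    """Return True if any collected field contains a suspicious marker."""
    for proc in processes:
        haystack = " ".join(proc).lower()
        if any(marker in haystack for marker in markers):
            return True
    return False


# Hypothetical process lists, invented for illustration.
clean = [ProcessInfo("explorer.exe", r"C:\Windows\explorer.exe", "PC", "bob")]
lab = clean + [ProcessInfo("VBoxService.exe", r"C:\Windows\System32", "NT", "SYSTEM")]
print(looks_like_analysis_env(clean), looks_like_analysis_env(lab))
# prints: False True
```

A check of this shape explains why a sample stays silent in a lab: a single matching process name or owner is enough for it to skip the malicious branch.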
That is the reason this dropper did not run in my VM.
To answer my second question, I tried to visit the c2 server page I had found but, predictably, the .exe was not served that easily.
Then I decided to look for the HTTP connection call in the code. I noticed that the URL contained some dynamic parameters generated through "for" loop tricks. The whole code follows:
The final URL must be formatted as above; otherwise the c2 server will not serve the binary payload. The variable "hashere", needed as the third parameter of the PHP page, is generated dynamically using a "for" loop trick to complicate analysis. If this parameter is wrong, the payload served will not unpack and execute, because it would lack the parameter for its unpacking routine. Some other parameters are static (e.g. "param_1") and others are random (e.g. "lastParam").
Finally, by commenting out the code for the HTTP connection and adding a debug print function, I obtained a URL that allows the download of a runnable binary payload:
Running this script in cscript will print the URL:
Navigating to that page, I downloaded the main binary payload, which was encoded in base64 (SHA256 hash: 1572b8cc6dc1af0403bf91e24b20f9c39f6722258329b0bafa89f300989393f5).
The binary downloaded seems to be a variant of the banking Trojan Zusy (also known as Tinba).
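For a payload served as base64, decoding and hashing can be done with the Python standard library, as in the minimal sketch below. The function name and the sample bytes are hypothetical. The sketch reports the SHA256 of both the encoded and the decoded form, because the text does not say which of the two the hash above was computed on.

```python
import base64
import hashlib


def decode_payload(encoded: bytes):
    """Decode a base64 payload; return it with both SHA256 digests."""
    raw = base64.b64decode(encoded)
    return (
        raw,
        hashlib.sha256(encoded).hexdigest(),  # hash of the file as served
        hashlib.sha256(raw).hexdigest(),      # hash of the decoded binary
    )


# Hypothetical stand-in for the downloaded data: base64 of b"abc".
raw, encoded_hash, decoded_hash = decode_payload(b"YWJj")
print(raw, decoded_hash)
```

Recording both digests makes the resulting IOC usable whether a detection tool sees the payload in transit or after decoding.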
Conclusion
In conclusion, I would say that knowing how to deal with an obfuscated JavaScript dropper can help a security analyst provide effective IOCs to prevent or detect threats. It requires just a good knowledge of regular expressions, a good knowledge of a scripting language and, of course, knowledge of JavaScript; then it is just a matter of identifying the obfuscation pattern and automating the deobfuscation process.
As you may have noticed, there are not many js deobfuscators available online. That is because it is really hard to write a general deobfuscator that can deobfuscate everything, due to the dynamicity of the language.
On the other hand, as an analyst, you cannot rely only on dynamic analysis, because sometimes it cannot help.
As future work, I would like to develop a semi-automated js deobfuscator that recognizes an obfuscation pattern and guides the analyst through the deobfuscation process.