Reverse-Engineering Arrays
Introduction
Whenever we would like to reverse-engineer a function, we need to know exactly how the function is being called: its calling convention, number of parameters, parameter types, parameter values, etc.
After Ida analyzes the program, it adds comments for known parameters that are passed to known functions. It also preserves the names of those functions instead of assigning automatically generated ones. An example of such a function is GetCurrentDirectoryA, whose call we can see in the picture below:
We can see that the address of the GetCurrentDirectoryA function is moved into register edi. Then the hexadecimal value 0x104 (260 in decimal, the value of MAX_PATH) is moved into register esi. Let's ignore the jump instruction for now, since it's not important at the moment. Next, an address that is currently unknown is loaded into register eax and pushed to the stack as the lpBuffer parameter. After that, register esi is pushed to the stack as the nBufferLength parameter and the GetCurrentDirectoryA function is called. If we go to the MSDN website and take a look at the GetCurrentDirectoryA function prototype, we will see the following:
We can see that Ida has correctly identified the names of the parameters that we're pushing on the stack right before calling the GetCurrentDirectory function. If we take a look at the explanation of the function, we'll figure out that the nBufferLength parameter specifies the length of the buffer for the current directory string, including the null character. The lpBuffer parameter holds a pointer to the buffer that receives the current directory string. If the function succeeds, the return value specifies the number of characters that are written to the buffer, not including the terminating null character. If the function fails, the return value is zero. To get extended error information, call GetLastError [1].
We saw that Ida automatically recognized the parameters that were passed to the GetCurrentDirectoryA function, which can be a great help when reverse-engineering a binary. But we must also mention that Ida doesn't always know how to identify the parameters being passed to known functions, so from time to time we'll have to rely on our own knowledge to identify those parameters.
In the next part of the tutorial we'll present a few basic C++ programs that use arrays in different situations, compile them into their binary form, and then reverse-engineer them in Ida to show how the higher-level C++ code is compiled into lower-level assembly code.
Global Arrays
We know that arrays are contiguous blocks of memory, but we must differentiate between the locations where the arrays are stored. The arrays can be stored in a global scope of the program, on the stack, or on the heap.
The first program that stores the array in a global scope of the program, written in C++, is presented below:
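[cpp]
#include <iostream>

int a[10];  // global array, stored in the .bss section

int main(int argc, char **argv) {
    // assign each element its own index
    for (int i = 0; i < 10; i++) {
        a[i] = i;
    }
    return 0;
}
[/cpp]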
We can see that the program is very simple: first we create an array in the global scope of the program, then we iterate over it in the main function and assign each element its corresponding index. At the end the array will look like this: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 (the index starts at 0). If we compile and run the program right now, it won't print anything, since we're not printing the array on the screen. An example of this can be seen in the picture below, where we first compiled the program into the array1 executable and then ran it on the system:
If we open the array1.exe executable in Ida now, the program will be analyzed and Ida will present the start function, which is shown below:
We can see that the start function initializes the stack and then calls the sub_401000 function to do its work. In order to determine what the executable actually does, we need to examine that function. It helps to first look at the graph of how the functions are called in the current executable; the xrefs graph can be seen in the picture below:
In the version of the program analyzed here, the number 11 is assigned to the last element of the array, a[9], so it is a good idea to search for the immediate value 0xB (the number 11 in hexadecimal representation) using the Search - Search immediate functionality. The following window will be presented to us:
In the "Value to search" box we entered the value 0xB and checked the "Find all occurrences" option, then pressed the OK button. Ida will look for all 0xB immediate constants throughout the program and will display a view listing them. The view looks like the picture below:
We can see that many immediate 0xB constants were found, but we're looking for such a constant in the .text section of the program, so only the first five results are really relevant. Most probably, the fifth result isn't the one we're looking for, since it compares the 0xB constant to the value stored in register eax, and we're looking for an assignment of the constant 0xB. Thus, we only need to inspect the four results that are left; we'll quickly find that the fourth one is the one we're looking for, located at the virtual address 0x004013C3. The whole code of the function that contains the 0x004013C3 location is presented in the picture below:
The function's name is sub_40138C and it is being called from the function sub_401000 (notice the cross reference). Actually, this is the function that we're looking for, because the code presented above loops through the array and assigns the appropriate values to each element of the array. The graphical representation of the code above is presented on the picture below:
At loc_4013B7, we can see that we're comparing the value stored at [esp+20+var_4] to the constant 9, which corresponds to the condition of the for loop in the C++ code. As long as the loop condition still holds, program execution is redirected to loc_4013A4. That block increments the value by 1 and continues the looping process. When the loop is done, we jump to the last block in the picture above, where the 0xB constant is stored in the eax register and overwrites the last entry in the array. Then the offset of cout is stored at [esp+20+var_20] and the last entry of the array is printed. At the end we return 0 and quit the program.
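The control flow described above can be mirrored in C++ with labels and jumps. This is only a sketch: the function name run_loop is hypothetical, the label names are borrowed from the Ida locations, and the conditional jump is assumed to be taken while the counter is at most 9 (the usual compilation of i < 10).

```cpp
#include <array>

// Hypothetical rendering of the disassembled loop; not the compiler's output.
// The returned array stands in for the global dword_405020.
std::array<int, 10> run_loop() {
    std::array<int, 10> a{};   // zero-initialized, like data in .bss
    int i = 0;                 // the counter kept at [esp+20+var_4]
    goto loc_4013B7;           // enter the loop at its test

loc_4013A4:                    // loop body
    a[i] = i;                  // store the index into the current element
    i = i + 1;                 // increment the counter by 1

loc_4013B7:                    // loop test: compare the counter to 9
    if (i <= 9) goto loc_4013A4;   // assumed condition of the jump

    a[9] = 0xB;                // overwrite the last entry with 11
    return a;
}
```

Calling run_loop() yields an array whose first nine elements equal their indices and whose last element is 11, which matches the description of the final block.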
If we right click on the dword_405020 variable and select "Jump in a new window," the disassembly of that virtual address will open in a new window as we can see in the picture below:
We can see that we're referencing the dword_405020 variable, which is part of the .bss section, whose space is reserved at compile-time rather than run-time. The .bss section contains the uninitialized global variables that are declared outside any function. After the function is done executing, the memory will look as we can see in the two pictures below; the first picture presents the hexadecimal view and the second picture presents the disassembly view of the memory in question:
We can see that both pictures present an array of integers whose values increase with the index and are saved into contiguous memory locations. But why are three out of four bytes marked as zero? It's because each integer is 4 bytes in size, but values this small fit into the lowest byte, which x86 stores first (little-endian order), so the other three bytes stay at zero. If we take a look at the instruction that writes the number to the specified memory locations, we can see that it's using "dword_405020[eax*4]" to index the right element in the array. The usage of eax*4 indicates that each element in the array is 4 bytes long.
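The indexing and the byte layout can be checked with a little arithmetic. The following is a minimal sketch; the helper names element_address and little_endian_bytes are made up for illustration, and only the addresses 0x405020 and 0x405044 come from the disassembly.

```cpp
#include <array>
#include <cstdint>

// Address of element `index` in an array that starts at `base` and whose
// elements are `elem_size` bytes each: base + index * elem_size.
constexpr std::uint32_t element_address(std::uint32_t base,
                                        std::uint32_t index,
                                        std::uint32_t elem_size = 4) {
    return base + index * elem_size;
}

// The four bytes of a 32-bit value, lowest byte first, which is the order
// in which x86 stores it in memory.
std::array<std::uint8_t, 4> little_endian_bytes(std::uint32_t value) {
    return {static_cast<std::uint8_t>(value & 0xFF),
            static_cast<std::uint8_t>((value >> 8) & 0xFF),
            static_cast<std::uint8_t>((value >> 16) & 0xFF),
            static_cast<std::uint8_t>((value >> 24) & 0xFF)};
}

// a[0] lives at dword_405020, so a[9] lands on dword_405044.
static_assert(element_address(0x405020, 9) == 0x405044, "address of a[9]");
```

For example, little_endian_bytes(0xB) gives the bytes 0B 00 00 00, which is the pattern of one meaningful byte followed by three zero bytes seen in the hexadecimal view.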
Local Arrays
In the previous example we saw how the array is being accessed and written to when using global arrays where the compiler knows its address at compile-time; but if we try to use a local array, the virtual address to the start of the array is not known in advance, only at run-time.
Let's present the same program we used in the previous example, but move the array inside the main function, so the whole program will look like this:
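[cpp]
#include <iostream>

int main(int argc, char **argv) {
    int a[10];  // local array, stored on the stack
    for (int i = 0; i < 10; i++) {
        a[i] = i;
    }
    a[9] = 11;  // overwrite the last element; this is the value the program prints
    std::cout << a[9] << std::endl;
    return 0;
}
[/cpp]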
If we compile and run the program again, it will look like the picture below:
We can see that we compiled the program with the g++ compiler and that, when we ran it, the program output the number 11 as it should. If we now disassemble the program in Ida and find the function that declares and initializes the array, we will find something like the picture below:
Notice the difference in assigning the values to the array. In the previous example, the assignment operations were as follows:
[plain]
mov ds:dword_405020[eax*4], edx
mov ds:dword_405044, 0Bh
[/plain]
The current assignment operations are the following:
[plain]
mov [esp+eax*4+40h+var_2C], edx
mov [esp+40h+var_8], 0Bh
[/plain]
Previously we used a global variable whose space was reserved at compile-time: the dword_405020 variable. A local array has no such fixed address. Instead, its space is reserved on the stack at run-time. In the latter example the element address is calculated with the [esp+eax*4+40h+var_2C] expression, which is a clear indication that the ESP register defines the memory location, so the array must be local and allocated on the stack. When the global array was used, dword_405020 supplied the virtual address of the memory location, but here it's the ESP register. The var_2C name is not a variable that exists at run-time but a symbolic constant that Ida assigns to a stack offset: it stands for -0x2C, while var_8 stands for -0x8. When the 0xB constant is assigned to the last element of the array, a[9], var_8 is used to form the address of that element.
The [esp+eax*4+40h+var_2C] expression evaluates to [esp+14h], [esp+18h], [esp+1Ch], etc., as eax takes the values 0, 1, 2, etc., while [esp+40h+var_8] evaluates to [esp+38h]. This exposes the addresses of all the elements of the local array: the first element a[0] is located at [esp+14h], the second element a[1] is located at [esp+18h], and so on up to the last element a[9] at [esp+38h], which is 14h + 9*4.
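The offset arithmetic can be verified at compile time. In this sketch var_2C and var_8 are modeled as the constants Ida displays, and the function name local_element_offset is hypothetical.

```cpp
#include <cstdint>

// Ida's symbolic names are constant stack offsets, not run-time variables.
constexpr std::int32_t var_2C = -0x2C;  // start of the array within the frame
constexpr std::int32_t var_8  = -0x8;   // a[9] within the frame

// Offset of a[index] from esp, following [esp+eax*4+40h+var_2C].
constexpr std::int32_t local_element_offset(std::int32_t index) {
    return index * 4 + 0x40 + var_2C;
}

static_assert(local_element_offset(0) == 0x14, "a[0] is at [esp+14h]");
static_assert(local_element_offset(1) == 0x18, "a[1] is at [esp+18h]");
// Both addressing forms reach the same slot for the last element.
static_assert(local_element_offset(9) == 0x40 + var_8, "a[9] is at [esp+38h]");
```

The last assertion shows why the compiler can address a[9] directly through var_8: the index is known at compile time, so the multiplication by 4 is folded into the constant offset.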
Heap Arrays
There's one more place where we can allocate arrays: on the heap. To do that, we must introduce the new keyword into the C++ program. If we rewrite the program so it will use the heap for storing the array, the actual code will look like the one below:
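[cpp]
#include <iostream>

int main(int argc, char **argv) {
    int *a = new int[10];  // array of 10 ints on the heap (new int(10) would allocate a single int)
    for (int i = 0; i < 10; i++) {
        a[i] = i;
    }
    a[9] = 11;  // overwrite the last element; this is the value the program prints
    std::cout << a[9] << std::endl;
    return 0;
}
[/cpp]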
Notice that the new keyword reserves space on the heap at run-time. The picture below shows compiling and running the program the same way as in the previous two examples.
When we load the array3.exe executable into Ida, we can quickly locate the relevant function inside it, since the program is basically the same as before. The graphical overview of the relevant code is presented in the picture below:
In the first block above, we're initializing the stack and then moving the value 0x28 to the stack, which is the first and only parameter to the function _Znaj, the mangled name of operator new[](unsigned int), the function that the new keyword calls to allocate an array. The value 0x28 is exactly 40 bytes: 10*4=40 (10 elements of 4 bytes each). After the allocation, the returned virtual address is in the eax register, which is then saved into the [esp+20h+var_8] variable on the stack. The [esp+20h+var_4] value holds the index into the array, which is first initialized to 0 and then increased up to 9. The "mov [eax], edx" instruction saves the current index value at an address inside the block returned by the new operation, which is a memory region on the heap.
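The size passed to the allocator and the array form of new can be sketched as follows. The function names allocation_bytes and heap_array_last are hypothetical, and the 0x28 figure assumes a 4-byte int, as on the 32-bit target analyzed here.

```cpp
#include <cstddef>

// Number of bytes requested from operator new[] for `count` ints.
// With a 4-byte int, 10 elements give 40 bytes, which is 0x28.
constexpr std::size_t allocation_bytes(std::size_t count) {
    return count * sizeof(int);
}

// Allocates ten ints on the heap, fills them with their indices,
// overwrites the last one with 11 and returns it.
int heap_array_last() {
    int *a = new int[10];   // array form; new int(10) would be one int set to 10
    for (int i = 0; i < 10; i++) {
        a[i] = i;
    }
    a[9] = 11;
    int last = a[9];
    delete[] a;             // the array form of delete matches new[]
    return last;
}
```

Unlike the global and local cases, the base address here is only known after the allocation returns, which is why the disassembly keeps the pointer in a stack slot and reloads it before each store.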
Conclusion
We've seen different uses of arrays in assembly language. The easiest way to figure out we're dealing with an array is noticing a scaled index such as eax*4, which multiplies the array index by the element size, so each iteration moves 4 bytes further into the array.
References:
[1] GetCurrentDirectoryA function, available at http://msdn.microsoft.com/en-us/library/windows/desktop/aa364934(v=vs.85).aspx.
[2] Chris Eagle, The IDA Pro Book: The unofficial guide to the world's most popular disassembler.