The Apriorit network security team is always looking for more efficient ways to process data, monitor system workflows, and analyze shellcode to detect suspicious activity. We refine existing approaches and create new ones to improve the detection of zero-day attacks while avoiding unnecessary false positives.
In this article, we talk about approaches you can use to improve runtime algorithms for zero-day threat detection, focusing on our own solution – the Diana Dasm disassembler, developed by the lead of our network security team, Victor A. Milokum.
We describe the main pros and cons of using Diana Dasm and Diana Processor for shellcode analysis and suspicious activity detection with the help of partial emulation of functions at runtime.
We also describe two methods of implementing breakpoints in memory to monitor read, write, and execute access to user-mode memory and provide a practical comparison of these methods.
Function hooking is a popular technique for monitoring the execution of code or particular functions without changing program source code. There are several well-known libraries that provide basic APIs for function interception, including the open source Mhook library and Microsoft Detours.
While it’s possible to use function hooks to influence the course of program execution, this technique is nearly useless when it comes to tracking access to certain memory regions. Function hooks can’t be used for tracking memory access from an injected DLL or shellcode.
Also, when monitoring and analyzing shellcode it’s crucial to keep the impact on performance as low as possible. There are two things that help us achieve this goal:
- Partial emulation of function execution
- The use of a lightweight virtual CPU such as Diana Processor
Diana Processor is a lightweight open source emulator of processor commands that can help us better understand and analyze the nature of zero-day threats. Using both Diana Processor and partial execution of functions, we can improve the well-known zero-day threat detection algorithms that are included in the Enhanced Mitigation Experience Toolkit (EMET) developed by Microsoft.
We’ve created our own solution, Memory Access Monitor (MAM), which tracks access to particular memory regions and restricts access to certain regions of process memory from other processes or from the Windows kernel. MAM is based on the breakpoints in memory approach and plays a significant part in detecting suspicious code execution at runtime and preventing exploitation.
Breakpoints in memory are an effective tool for monitoring read, write, and execute access requests to specific memory regions. They provide you with a notification mechanism and the ability to track and control any memory access requests at runtime.
Breakpoints in memory can be used to
- monitor or restrict access to certain memory regions;
- restrict other processes from accessing process memory;
- monitor and log modules or functions that access process memory;
- analyze program execution threads and monitor memory changes without a debugger;
- virtualize memory access.
There are two common types of breakpoints in memory:
- Hardware breakpoints
- Software breakpoints
Let’s look closer at each of these two types.
Some CPUs offer hardware breakpoints in memory so that the CPU itself monitors memory access and reports on it. CPUs that support this functionality contain eight special debug registers (DR0 to DR7) to control these breakpoints.
This method was used in an early version of Microsoft EMET to implement Export Address Filtering. Hardware breakpoints in memory have a number of constraints that can be considered both advantages and disadvantages depending on the task at hand.
When it comes to Memory Access Monitor, hardware breakpoints have three main limitations:
- Only four debug registers (DR0 to DR3) hold addresses, so only four breakpoints can be used simultaneously.
- The size of the memory region to be monitored is limited to either 1, 2, 4, or 8 bytes.
- Each thread of the process has its own set of DR0–DR7 registers, so you need to set a breakpoint in memory for each thread separately. This involves implementing:
  - a mechanism for tracking the creation of new threads;
  - a mechanism for setting the values of the DR0–DR7 registers for running threads.
Meanwhile, this method provides us with two important advantages:
- Hardware breakpoints in memory function at the CPU level, so the overhead while working with memory is minimal.
- Hardware breakpoints in memory are thread-safe, apart from the delay between the creation of a thread and setting the values of the DR0–DR7 registers. All threads can access memory simultaneously, and at the same time each of these memory access requests will be monitored.
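To make the four-breakpoint and 1/2/4/8-byte limits concrete, here is a minimal sketch of how the DR7 control register encodes one breakpoint slot on x86. The helper names (`hwbp_dr7_enable`, `hwbp_len_bits`) and the enum are hypothetical and only illustrate the bit layout; the watched address itself would go into DR0–DR3.

```c
#include <assert.h>
#include <stdint.h>

/* Access types encoded in the R/W field of DR7 (x86 debug registers). */
enum hwbp_type { HWBP_EXECUTE = 0x0, HWBP_WRITE = 0x1, HWBP_READWRITE = 0x3 };

/* Map a watch length in bytes to the 2-bit LEN field (8 bytes needs x64). */
static uint64_t hwbp_len_bits(unsigned len_bytes)
{
    switch (len_bytes) {
    case 1: return 0x0;
    case 2: return 0x1;
    case 4: return 0x3;
    case 8: return 0x2;
    default: assert(!"length must be 1, 2, 4, or 8 bytes"); return 0;
    }
}

/* Hypothetical helper: return dr7 with breakpoint `slot` (0..3) enabled
 * locally for the given access type and length. */
uint64_t hwbp_dr7_enable(uint64_t dr7, unsigned slot,
                         enum hwbp_type type, unsigned len_bytes)
{
    assert(slot < 4);                            /* only DR0..DR3 hold addresses */
    unsigned ctl_shift = 16 + slot * 4;          /* R/Wn at +0, LENn at +2 */
    dr7 &= ~((uint64_t)0xF << ctl_shift);        /* clear old R/W and LEN  */
    dr7 |= (uint64_t)type << ctl_shift;
    dr7 |= hwbp_len_bits(len_bytes) << (ctl_shift + 2);
    dr7 |= (uint64_t)1 << (slot * 2);            /* Ln: local enable bit   */
    return dr7;
}
```

On Windows, a value built this way would typically be written to the `Dr7` field of a thread's `CONTEXT` with `SetThreadContext`, once for every thread, which is exactly the per-thread bookkeeping described above.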
You can see an example of implementing hardware breakpoints in memory here.
While this method is suitable for debugging purposes, it can’t be used for threat detection algorithms because it’s limited to four breakpoints and each breakpoint covers at most 8 bytes. Besides, it can’t restrict access to process memory from other processes.
When implementing software breakpoints, you need to ensure that you can
- get notifications about access requests to protected memory (before the memory is actually accessed);
- allow execution of memory operations once all required checks have been passed.
In Windows, there’s a special page attribute, PAGE_GUARD, that can help you accomplish both of these tasks. You can use this attribute along with vectored exception handling to get notifications about all access requests to protected memory.
The main advantage of using the PAGE_GUARD attribute is that once the attribute is cleared, access to the memory proceeds without any further exceptions or overhead.
On the other hand, this method has several drawbacks:
- The PAGE_GUARD attribute can be applied only to a whole memory page. So when you need to monitor access to, say, a 10-byte structure, you’ll need to monitor an entire page with the PAGE_SIZE of 4096 bytes. And if these 10 bytes you need to monitor are located on two adjacent memory pages, both of these pages, with a combined PAGE_SIZE of 8192 bytes, will need to be monitored.
- Notifications are received only once, after the first attempt to access an address within a guarded memory page. After that, the system clears the PAGE_GUARD modifier and lifts the guarded status from the monitored page.
- You have to execute only one current instruction and then immediately restore the PAGE_GUARD attribute.
- Until the PAGE_GUARD attribute is restored, other threads can access the protected memory region without any restrictions.
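The page-granularity drawback is easy to quantify. The sketch below computes which whole pages have to be guarded to cover an arbitrary region, assuming the 4096-byte page size used in the text. The `guarded_span` structure and the `guarded_span_for` function are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed page size, as in the text */

/* Describes the whole pages that must be guarded to cover a region. */
struct guarded_span {
    uintptr_t first_page;   /* address of the first guarded page */
    size_t    page_count;   /* number of guarded pages           */
    size_t    total_bytes;  /* page_count * PAGE_SIZE_BYTES      */
};

struct guarded_span guarded_span_for(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t mask  = ~(uintptr_t)(PAGE_SIZE_BYTES - 1);
    uintptr_t first = addr & mask;
    uintptr_t last  = (addr + size - 1) & mask;  /* page holding the last byte */
    struct guarded_span span;
    span.first_page  = first;
    span.page_count  = (size_t)((last - first) / PAGE_SIZE_BYTES) + 1;
    span.total_bytes = span.page_count * PAGE_SIZE_BYTES;
    return span;
}
```

For a 10-byte structure that starts 6 bytes before a page boundary, `guarded_span_for` reports two pages and 8192 guarded bytes, so every unrelated access to either page also triggers the exception handler.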
The one-shot nature of PAGE_GUARD can be handled with the help of a trap flag. A trap flag lets the CPU execute just the current instruction and then raises an EXCEPTION_SINGLE_STEP exception. This exception is intercepted by the same vectored exception handler, which in turn restores the PAGE_GUARD attribute for the memory page.
Figure 1 below illustrates how PAGE_GUARD and trap flag work together.
Here are the main steps of the PAGE_GUARD and trap flag workflow:
1. One of the threads requests read, write, or execute access to protected memory.
2. As the protected page has the PAGE_GUARD attribute, the CPU generates a memory access exception.
3. The system processes this exception and calls all registered vectored exception handlers with the EXCEPTION_GUARD_PAGE status.
4. Our registered vectored exception handler checks whether this page belongs to our protected memory pages. If it doesn’t, the handler returns EXCEPTION_CONTINUE_SEARCH (the exception isn’t handled) and lets the system continue searching for another handler.
5. The vectored exception handler calls the registered callback, notifying external code about the memory access event.
6. With this callback, the external code can perform certain checks, for instance analyze the call stack or the thread context.
7. After these checks, the callback gives control back to the vectored exception handler.
8. The vectored exception handler saves the original page attributes so that PAGE_GUARD can be added back to them later.
9. The vectored exception handler sets the trap flag so that only the current processor instruction, the one requesting access to memory, is executed.
10. The vectored exception handler returns EXCEPTION_CONTINUE_EXECUTION (the exception is processed and the system can continue execution).
11. The system applies all changes to the current thread context and resumes thread execution starting from the same instruction that triggered EXCEPTION_GUARD_PAGE.
12. Since memory access is temporarily allowed for everyone, the CPU has no obstacles and processes the instruction successfully.
13. Thanks to the trap flag, the CPU generates an exception after processing one instruction. The system calls the vectored exception handler with the EXCEPTION_SINGLE_STEP status.
14. The vectored exception handler restores the PAGE_GUARD attribute using the information that was saved in step 8.
15. The vectored exception handler returns EXCEPTION_CONTINUE_EXECUTION, which means the exception has been processed and the system can continue execution.
16. The program continues its normal execution.
This scheme describes the processing of one processor instruction that attempts to access a protected memory region.
During the execution of steps 3 through 14, the protected memory page does not have the PAGE_GUARD attribute. This gives an opportunity for other threads to get uncontrolled access to this memory.
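The window of uncontrolled access can be illustrated with a small model. The code below is not a Windows implementation: it is a portable toy state machine, with hypothetical names, that mimics the one-shot behavior of PAGE_GUARD and the re-arming done on the single-step exception, and counts how many accesses the monitor actually sees.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one guarded page; no real memory protection is involved. */
struct guarded_page {
    bool guard_set;        /* models the one-shot PAGE_GUARD attribute  */
    int  stepping_thread;  /* thread with the trap flag set, or -1      */
    int  notified;         /* accesses reported to the monitor callback */
    int  unobserved;       /* accesses that slipped through unguarded   */
};

void page_init(struct guarded_page *p)
{
    p->guard_set = true;
    p->stepping_thread = -1;
    p->notified = 0;
    p->unobserved = 0;
}

/* A thread touches the page. */
void page_access(struct guarded_page *p, int thread_id)
{
    if (p->guard_set) {
        p->guard_set = false;           /* the system clears PAGE_GUARD  */
        p->notified++;                  /* handler runs the callback     */
        p->stepping_thread = thread_id; /* handler sets the trap flag    */
    } else {
        p->unobserved++;                /* page is open: nobody notices  */
    }
}

/* The single-step exception arrives for thread_id. */
void page_single_step(struct guarded_page *p, int thread_id)
{
    if (p->stepping_thread == thread_id) {
        p->guard_set = true;            /* handler restores PAGE_GUARD   */
        p->stepping_thread = -1;
    }
}
```

If thread 1 faults on the page and thread 2 touches the same page before thread 1 receives its single-step exception, the model records one notification and one unobserved access. That unobserved access is the gap the rest of this article sets out to close.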
This approach is implemented in the latest version of EMET for the Export Address Filtering and Export Address Filtering Plus protections.
In the next section, we describe how to implement fast breakpoints in memory with the help of a virtual CPU. The following approach is our own solution and, depending on the researcher’s purposes, can be used for both memory access monitoring and emulating suspicious actions in an isolated environment based on shellcode analysis.
To address the issue of uncontrolled memory access, we need to drop the PAGE_GUARD attribute and use another attribute instead, such as PAGE_NOACCESS. In this case, every access from any thread raises an EXCEPTION_ACCESS_VIOLATION exception, which can be handled by the same vectored exception handler.
Now the main question is how we can allow actual memory access after passing all the checks. If we decide to restore the original page attributes, we’ll get the same issue with uncontrolled memory access from the parallel threads. If we decide to pause all other threads except the current one, we’ll get a significant deterioration in performance.
A possible solution is to create a shadow memory page containing all the attributes of the original page. Let’s look closer at this method.
Creating shadow memory pages
A shadow memory page is basically a duplicate of an original memory page (see Figure 2). Shadow memory pages are created when a memory region is added to the monitoring list. This has to happen before the PAGE_NOACCESS attribute is set for the original pages, while their contents can still be read.
Likewise, when a region is removed from the monitoring list (right after PAGE_NOACCESS is cleared), the original page data is restored from the shadow pages.
In this way, the shadow page stores all original page data including page attributes, but the program will refer to this data via the address of the original memory page. All the vectored exception handler has to do is substitute the original page address with the shadow page address, but only for processing one instruction.
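The address substitution itself is simple arithmetic. Here is a minimal sketch, assuming each monitored region keeps its original base address, its size, and a pointer to its shadow copy. The `shadow_region` layout and the `shadow_translate` function are hypothetical and are not taken from the MAM sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One monitored region and its shadow copy (hypothetical layout). */
struct shadow_region {
    uintptr_t orig_base;    /* start of the PAGE_NOACCESS region */
    size_t    size;         /* region size in bytes              */
    uint8_t  *shadow_base;  /* start of the readable shadow copy */
};

/* Translate an address inside the original region to the matching
 * address in the shadow copy. Returns NULL when [addr, addr + len)
 * is not fully inside the region, so the caller can fall back to a
 * normal access. */
uint8_t *shadow_translate(const struct shadow_region *r,
                          uintptr_t addr, size_t len)
{
    if (addr < r->orig_base)
        return NULL;
    size_t offset = (size_t)(addr - r->orig_base);
    if (offset >= r->size || len > r->size - offset)
        return NULL;
    return r->shadow_base + offset;
}
```

The hard part is not this lookup but applying it to exactly one instruction, whatever registers or memory operands that instruction happens to use.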
However, using a trap flag to redirect access to the shadow page is challenging, as you need a suitable disassembler to find the register or memory operand that holds the original page address.
In the next section, we describe how to use a virtual processor, Diana Processor, for this purpose.
Diana Dasm and Diana Processor
Diana Dasm is a small and fast disassembler that can be used by Windows kernel developers. It’s a lightweight C disassembler with a flexible architecture that has its own full processor instruction emulator called Diana Processor and supports emulation of both x86 and x64 instructions.
The execution of one processor instruction looks pretty simple:
Here, the DianaProcessor_Init function is called only once, when Diana Processor is initialized. DianaProcessor_ExecOnce is called to emulate one processor instruction: the one that Diana Processor’s virtual RIP/EIP register points to.
Diana Processor has its own set of virtual processor registers where all real processor instructions are executed. All read and write requests, including the reading of the current instruction that the virtual RIP/EIP register points to, are processed via the DianaRandomReadWriteStream interface:
Therefore, when processing the DianaProcessor_ExecOnce function, all memory access requests from Diana Processor are bound to be handled only via functions of the DianaRandomReadWriteStream interface.
Here are the three main steps required to emulate the execution of one processor instruction from the vectored exception handler:
- Load the current thread context (the CPU registers at the moment the exception was generated) into Diana Processor. This information is passed to the vectored exception handler through its IN OUT ExceptionInfo parameter.
- Emulate the execution of one current processor instruction via DianaProcessor_ExecOnce.
- Apply the execution results by rewriting the CPU registers in ExceptionInfo->ContextRecord with the ones received from Diana Processor.
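The load, emulate, and apply pattern can be sketched with a deliberately tiny virtual CPU. Everything below is a hypothetical stand-in: `toy_context` is not the Windows CONTEXT structure, the three-byte instruction format is invented for the example, and none of the functions belong to the Diana Processor API. The point is only to show that data accesses made during emulation are served from the shadow copy and that the thread context is updated afterward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: a 2-register context and a toy instruction
 * set, NOT the Windows CONTEXT structure or the Diana Processor API. */
struct toy_context { uint64_t ip; uint64_t reg[2]; };

enum { TOY_LOAD = 1, TOY_STORE = 2 };  /* [opcode][reg][offset] = 3 bytes */

struct toy_cpu {
    struct toy_context ctx;  /* virtual registers                        */
    const uint8_t *code;     /* instruction bytes, indexed by ctx.ip     */
    uint8_t *shadow;         /* data accesses are served from the shadow */
    size_t shadow_size;
};

/* Emulate exactly one instruction; returns 0 on success, -1 otherwise. */
static int toy_exec_once(struct toy_cpu *cpu)
{
    const uint8_t *insn = cpu->code + cpu->ctx.ip;
    uint8_t reg = insn[1], off = insn[2];
    if (reg > 1 || (size_t)off + 8 > cpu->shadow_size)
        return -1;
    switch (insn[0]) {
    case TOY_LOAD:  memcpy(&cpu->ctx.reg[reg], cpu->shadow + off, 8); break;
    case TOY_STORE: memcpy(cpu->shadow + off, &cpu->ctx.reg[reg], 8); break;
    default:        return -1;
    }
    cpu->ctx.ip += 3;        /* advance past the emulated instruction */
    return 0;
}

/* The three steps: load context, emulate one instruction, write back. */
int toy_handle_fault(struct toy_context *thread_ctx, const uint8_t *code,
                     uint8_t *shadow, size_t shadow_size)
{
    struct toy_cpu cpu;
    cpu.ctx = *thread_ctx;                 /* 1. load the thread context */
    cpu.code = code;
    cpu.shadow = shadow;
    cpu.shadow_size = shadow_size;
    if (toy_exec_once(&cpu) != 0)          /* 2. emulate one instruction */
        return -1;
    *thread_ctx = cpu.ctx;                 /* 3. apply the results       */
    return 0;
}
```

Because the instruction pointer in the written-back context already points past the emulated instruction, the thread resumes at the next instruction and the original page never has to be unprotected.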
The entire process seems quite simple. Now let’s see whether Diana Processor can give us a thread-safe implementation of breakpoints in memory in practice. Figure 3 below shows a basic scheme for this process.
The first seven steps of the process are similar to the regular PAGE_GUARD and trap flag workflow described in one of the previous sections. So we’ll start from the eighth step of the workflow:
8. Load the current thread context into Diana Processor, execute one processor instruction, and apply the execution results by changing the thread context (i.e. by changing the IN OUT ExceptionInfo parameter of the vectored exception handler).
9. The vectored exception handler returns the EXCEPTION_CONTINUE_EXECUTION status, which means the exception has been processed successfully and the program can continue executing.
10. The operating system applies the modified thread context and passes control to the address in RIP/EIP.
11. The program continues execution starting from the next instruction, as the previous instruction has been successfully processed via Diana Processor by the vectored exception handler.
As you can see, this approach provides us with exactly what we were looking for: a thread-safe implementation of breakpoints in memory. In addition, all protected memory now has the PAGE_NOACCESS attribute, making it inaccessible from the Windows kernel and from other processes.
Of course, you can still change the page attributes either from kernel mode or via VirtualProtectEx, but the page will contain irrelevant data as all changes are applied only to the shadow page.
Limitations of the approach
Implementing Memory Access Monitor with the help of Diana Processor has a number of limitations. For instance, while the approach with the PAGE_NOACCESS attribute and Diana Processor works fine with common debuggers, there are two caveats:
- The debugger has to be attached before memory pages come under protection.
- The protected pages will be inaccessible when viewed via the Memory window.
Also, the current implementation can monitor access to executable memory, but only for code that isn’t used in the vectored exception handler. Therefore, if we decide to set protection for a page that contains the EnterCriticalSection function, we’ll get infinite recursion.
Diana Processor can emulate only one processor instruction per call by the vectored exception handler. Otherwise, it could make too many changes to the stack pointer (RSP/ESP) during emulation and data could be overwritten, as both the emulator and the emulated code run on the same thread stack.
When VirtualQuery is called for an address that belongs to the protected memory region, it returns the PAGE_NOACCESS attribute, although it would be more appropriate to return the original page attributes. So far, this feature hasn’t been implemented.
When VirtualProtect is called for an address that belongs to the protected memory region, the new attributes are applied to the shadow page instead of the original, but only if the NtProtectVirtualMemory hook is set and control is passed to mam::hooks::NtProtectVirtualMemory.
Finally, when an access violation exception is generated for the shadow page, the reported address points to the shadow page instead of the original page that the code actually referred to.
Now let’s look at some test results to see which approach works more efficiently.
We ran benchmark speed tests for both implementations: PAGE_GUARD with a trap flag, and PAGE_NOACCESS with Diana Processor. Performance measurements were carried out on Windows 10 (64-bit) with an Intel i5 6400 CPU (4 cores) for 1 MB of protected memory.
First, we tested the average speed of memory access of one thread for both approaches (see Table 1).
Table 1. Average speed of memory access of one thread within five minutes
| Approach | read memcpy | read 8 bytes | write memcpy | write 8 bytes |
|---|---|---|---|---|
| PAGE_NOACCESS and Diana Processor | 5732 KB/s | 2815 KB/s | 5601 KB/s | 2806 KB/s |
| PAGE_GUARD and trap flag | 2805 KB/s | 1414 KB/s | 2799 KB/s | 1416 KB/s |
As you can see, read and write operations run at roughly the same speed within each approach. However, the approach with Diana Processor is about twice as fast because it processes only one exception per instruction instead of two.
Now let’s see how fast these approaches can handle two threads (see Table 2).
Table 2. Average speed of memory access of two threads within five minutes
| Approach | read memcpy | read 8 bytes |
|---|---|---|
| PAGE_NOACCESS and Diana Processor | 9453 KB/s | 4584 KB/s |
| PAGE_GUARD and trap flag | 9,009,056 KB/s | 6,181,597 KB/s |
As we mentioned earlier, the approach with the PAGE_GUARD attribute and trap flag is prone to uncontrolled access when several threads request memory access at once. Our tests confirmed this: the measured read speed rises by several orders of magnitude because most accesses reach the page while it is unguarded and never pass through the exception handler. The approach with PAGE_NOACCESS and Diana Processor, on the other hand, is thread-safe and allows you to control memory access at any moment. However, when multiple threads are involved, memory read speed also suffers because of Diana Processor synchronization.
By employing a virtual CPU, you can significantly improve and at the same time simplify your runtime algorithms. Diana Processor is an efficient tool for partial emulation and analysis of code execution.
While the approach we’ve introduced using shadow pages and implementing Diana Processor and the PAGE_NOACCESS attribute still has a number of limitations, it shows better results when compared to a more common implementation of the PAGE_GUARD attribute and a trap flag.
At Apriorit, we have a team of passionate cybersecurity experts and kernel developers who can assist you in building truly secure solutions and solving the most challenging tasks.