We once needed to investigate how Samsung cellular phones
work; we required some information from them that is not
documented (and surely never will be). This article covers the
interesting points our reverser encountered while working with
Samsung cellular phone firmware.
Reversing Firmware for ARM-based Mobile Phones
I have managed to study firmware from every Samsung
generation, including CDMA (smartphones are the only exception).
Every Samsung phone uses an ARM-compatible processor with the
ARM7TDMI instruction set. The firmware is built on one of three
operating systems (RTCX, RTK or Nucleus) and compiled with
different compilers. I have seen firmware compiled with ADS (SDT)
and IAR.
On forums people name Samsung's generations in different ways:
some divide them into Gumi/Suvon (two cities in Korea), others
use code names such as "Sysol", "Agere", "VLSI", "Conexant" and
"Ancient". I have concluded that it is more correct to divide
them by phone processor.
| Processor | Models |
| OM6357 (aka Sysol) | E100, E700, E720, E800, E820, S50x, X100, X460, X60x |
| M46 (aka Conexant) | A100, A110, A200, A300, A400, M100, T208 |
| SkyWorks (aka Conexant) | C100, C108, C110, P510, P518 |
| ONE-C (aka VLSI) | R2XX, Nxxx, Txxx (except for T208) |
| Trident (aka Agere) | Dxxx, Qxxx, Sxxx (except for S50x), Vxxx, C200, E105, E310, E400, E600, E710, E810, X105, X400, X42x, X450 etc. |
| MSMxxxx | all CDMA |
I hope I haven't made any mistakes in this list. :))
As the list suggests, firmware images within the same
generation are very similar and, to be honest, sometimes they are
practically twins (with extremely slight changes). For example,
the X100 firmware shows obvious traces of the E100/E700/X600: why
else would it contain code for a second display, a camera and
IrDA, none of which the phone ever had?
Naturally, the OS is the same for the whole generation:
OM6357 - RTK
M46 - RTCX OSE
SkyWorks - RTCX OSE
ONE-C - RTK
Trident - Nucleus
MSMxxxx - I don't know exactly; it might be any OS from Qualcomm.
It is only clear that they are built with ADS/SDT.
If you are going to investigate the low level, the SDK for
the corresponding OS will come in handy. Another helpful thing is
the symbol information that can be found in some firmware
archives. Sometimes you come across firmware with .lst, .sym,
.map and .out files, which contain information that is extremely
useful for research. In particular, such files occur in almost
all C100 and S500 firmware. For the other models the situation is
worse, and you have to content yourself with symbol signatures
made from firmware of the same generation. For example, for M46 I
managed to find just one firmware image with symbols, and it was
from the A110. But signatures made from it apply perfectly to the
A200, A300 etc.
Interpretation of the symbol information
MAP Format
.map files contain information about the modules included in
the firmware and look like this:
Base Size Type RO? Name
0 20 CODE RO AAA_vectors from object file obj/isr.o
20 38e8 CODE RO C$$code from object file ../../src/t9latin.o
3908 30 CODE RO C$$code from object file obj/mmi_date.o
3938 5a4 CODE RO C$$code from object file hw_slow.o
3edc 874 CODE RO C$$code from object file rtkgo.o
etc.
where
Base - offset in the firmware file.
Size - length.
Type - region type.
RO? - region access type.
Name - name of the original file, part of which was included in
the firmware.
How can all this be interpreted? For example, like this:
starting at offset 20 there is a block of code (CODE) of length
38e8, and the block is read-only. The fact that a block has the
CODE attribute by no means implies that the WHOLE area is filled
with code. In fact it is code plus data; likewise, if a block has
the DATA type, it does not mean that all of it has to be defined
as data in IDA.
Without a names/symbols file this information is useful only
for determining the size of the firmware code (i.e. so as not to
run into the graphics). So we had better examine the SYM
format.
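The size estimate mentioned above can be obtained mechanically. The following is a minimal Python sketch, assuming the five-column layout of the sample listing; the names `parse_map` and `code_end` and the regular expression are illustrative and do not come from any toolchain.

```python
import re

# One line of the sample: "<base> <size> <type> <access> <name> from object file <path>"
MAP_LINE = re.compile(
    r"^\s*([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+(\w+)\s+(\w+)\s+(\S+)"
    r"\s+from object file\s*(\S+)"
)

def parse_map(text):
    """Return a list of region dicts from .map text; unrecognised lines are skipped."""
    regions = []
    for line in text.splitlines():
        m = MAP_LINE.match(line)
        if m:
            base, size, kind, access, name, obj = m.groups()
            regions.append({
                "base": int(base, 16), "size": int(size, 16),
                "type": kind, "access": access, "name": name, "object": obj,
            })
    return regions

def code_end(regions):
    """Offset just past the last CODE region (0 if there is none)."""
    return max((r["base"] + r["size"] for r in regions if r["type"] == "CODE"),
               default=0)

sample = """\
Base Size Type RO? Name
0 20 CODE RO AAA_vectors from object file obj/isr.o
20 38e8 CODE RO C$$code from object file ../../src/t9latin.o
3908 30 CODE RO C$$code from object file obj/mmi_date.o
"""
regions = parse_map(sample)
print(hex(code_end(regions)))
```

Everything past the offset returned by `code_end` is then a candidate for graphics and other resources rather than code.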
SYM Format
.sym files are a gold mine of information. They look like this:
Symbol Table
AAA_vectors$$Base 000000
AAA_vectors$$Limit 000020
VectorMap$$Base 1006a3c
VectorMap$$Limit 1006a60
isr$$Base 12774c
isr$$Limit 127bb0
gl_MaskIT 1000078
Rtk_RegionCount 100564c
rtk_WorthItSched 10056a0
Rtk11_Schedule 11f5c8
etc.
Things are a little easier here, because there is a direct
name-to-address correspondence. But the addresses hold some
secrets: there is a set of names that contain the $ sign and have
a special status. Symbols ending in $$Base mark the beginning of
an area of the virtual address space, and $$Limit marks its end.
In other words, this is segment information. From it we can build
a memory map of these segments and see how parts of the binary
code are moved to different addresses.
Building the memory map should start with these symbols:
Image$$RO$$Base 000000
Image$$RO$$Limit 1afef4
Image$$RW$$Base 1000000
Image$$RW$$Limit 107dad4
Image$$ZI$$Base 1006a60
Image$$ZI$$Limit 107dad4
RO - Read Only; these are the code addresses.
RW - Read/Write, i.e. RAM.
ZI - Zero Initialized: RAM that is filled with zeros when the
phone is turned on.
Thus segments can easily be created at these addresses. Now we
look further:
AAA_vectors$$Base 000000
AAA_vectors$$Limit 000020
C$$code$$Base 000020
C$$code$$Limit 127310
C$$code$$__call_via$$Base 127310
C$$code$$__call_via$$Limit 127320
Example$$Base 127320
Example$$Limit 127324
HAL_boot$$Base 127324
HAL_boot$$Limit 12735c
RtkCode$$Base 12735c
RtkCode$$Limit 127408
SysSupportCode$$Base 127408
SysSupportCode$$Limit 12744c
boot$$Base 12744c
boot$$Limit 127654
clib$$Base 127654
clib$$Limit 12774c
isr$$Base 12774c
isr$$Limit 127bb0
C$$constdata$$Base 127bb0
C$$constdata$$Limit 1afef4
C$$data$$Base 1000000
C$$data$$Limit 1005a38
Stacks$$Base 1005a38
Stacks$$Limit 1006a3c
VectorMap$$Base 1006a3c
VectorMap$$Limit 1006a60
C$$zidata$$Base 1006a60
C$$zidata$$Limit 107dad4
So they follow one another in this interesting way. If you
wish, you can split them into segments at the corresponding
addresses, but this is a purely logical division. Moreover, in
the .sym file these lines are badly scattered. Sooner or later a
question arises: why is the code size 1afef4 if the firmware file
is 1b6950 long? Where do the remaining bytes (about 6a60 of them)
go? Let's look at the initial memory map again:
Image$$RW$$Base 1000000
Image$$RW$$Limit 107dad4
Image$$ZI$$Base 1006a60
Image$$ZI$$Limit 107dad4
RAM ends at address 107dad4 and the block 1006a60-107dad4 is
zero initialized, so the question is: what initializes the block
1000000-1006a60, whose size is exactly 6a60? Quite right, those
extra bytes. If you analyse the OS startup code, you will find
exactly this copying in the RAM initialization procedure.
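The reasoning above can be checked with a few lines of Python. This is only a sketch: `parse_sym` and `segments` are illustrative names, and it assumes that the initial values for the RW area are stored in the file directly after the RO area, as the text suggests.

```python
def parse_sym(text):
    """Map symbol name -> address from 'name hexaddr' lines; other lines are skipped."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            try:
                symbols[parts[0]] = int(parts[1], 16)
            except ValueError:
                pass  # e.g. the "Symbol Table" header
    return symbols

def segments(symbols):
    """Pair every X$$Base with X$$Limit; return (start, end, name) sorted by start."""
    segs = []
    for name, start in symbols.items():
        if name.endswith("$$Base"):
            stem = name[:-len("$$Base")]
            end = symbols.get(stem + "$$Limit")
            if end is not None:
                segs.append((start, end, stem))
    return sorted(segs)

sample = """\
Symbol Table
Image$$RO$$Base 000000
Image$$RO$$Limit 1afef4
Image$$RW$$Base 1000000
Image$$RW$$Limit 107dad4
Image$$ZI$$Base 1006a60
Image$$ZI$$Limit 107dad4
AAA_vectors$$Base 000000
AAA_vectors$$Limit 000020
C$$code$$Base 000020
C$$code$$Limit 127310
"""
symbols = parse_sym(sample)
segs = segments(symbols)

# RAM that needs initial values: from the start of RW up to the start of ZI.
init_size = symbols["Image$$ZI$$Base"] - symbols["Image$$RW$$Base"]
# Assumption: those values follow the RO area in the file.
init_offset = symbols["Image$$RO$$Limit"]
expected_length = init_offset + init_size
print(hex(init_size), hex(init_offset), hex(expected_length))
```

With the figures quoted in this article the sum comes to 0x1b6954, four bytes more than the file length of 1b6950 given above, so one of the quoted numbers may be slightly off; the principle is not affected.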
In newer firmware you may come across entries like
these:
Load$$IRAM$$Base 639a74
Image$$IRAM$$Base 2010000
Image$$IRAM$$Length 0015a4
They should be read as follows: 15a4 bytes of data are
loaded from file offset 639a74 to address
2010000.
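A short Python sketch of how such triples could be collected and applied. The helper names `load_regions` and `extract` are made up for illustration, and the sketch assumes that every Load$$X$$Base symbol has matching Image$$X$$Base and Image$$X$$Length symbols, which is only known here for IRAM.

```python
def load_regions(symbols):
    """Return (region, file_offset, address, length) for each complete Load$$X triple."""
    found = []
    for name, file_offset in sorted(symbols.items()):
        if name.startswith("Load$$") and name.endswith("$$Base"):
            region = name[len("Load$$"):-len("$$Base")]
            address = symbols.get(f"Image$${region}$$Base")
            length = symbols.get(f"Image$${region}$$Length")
            if address is not None and length is not None:
                found.append((region, file_offset, address, length))
    return found

def extract(firmware, file_offset, length):
    """Bytes that the startup code would copy for one region."""
    data = firmware[file_offset:file_offset + length]
    if len(data) != length:
        raise ValueError("region lies outside the firmware file")
    return data

symbols = {
    "Load$$IRAM$$Base": 0x639A74,
    "Image$$IRAM$$Base": 0x2010000,
    "Image$$IRAM$$Length": 0x15A4,
}
for region, file_offset, address, length in load_regions(symbols):
    print(region, hex(file_offset), "->", hex(address), hex(length))
```

`extract` can then be called on the raw firmware bytes to obtain the contents of a separate segment at the load address.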
We continue the analysis of symbols with the $ sign:
x$litpool$ - Literal Pool, pieces of data belonging to functions.
Many functions are followed by indexes, strings and constants,
and x$litpool$ marks the beginning of such
constants.
x$litpool_e$ - end of the Literal Pool.
$T - only for the debugger. It marks the addresses where the
PC register changes, so the branch instructions BL/BEQ/B/BX etc.
are located at these addresses.
$$ - addresses where the ARM/THUMB state changes.
There are also C$$code symbols, but I have not found out what
they are.
The other names, without a $ sign, are the names of constants
and functions. They can be used freely.
If the firmware archive contains both MAP and SYM, that is
the ideal case: when you set a name taken from SYM, you can use
the data from MAP to check whether it lies in a code area. If it
does, we can safely mark it as code without fear that IDA will
determine code/data
incorrectly.
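The check described above amounts to a range lookup. Here is a minimal Python sketch; the region tuples are taken from the MAP sample earlier in the article, while `region_for` and the test addresses are illustrative.

```python
from bisect import bisect_right

def region_for(address, regions):
    """regions: (base, size, type) tuples sorted by base.
    Return the type of the region containing `address`, or None."""
    bases = [base for base, _, _ in regions]
    i = bisect_right(bases, address) - 1
    if i >= 0:
        base, size, kind = regions[i]
        if address < base + size:
            return kind
    return None

# Regions from the MAP sample (base, size, type).
regions = [
    (0x0000, 0x0020, "CODE"),
    (0x0020, 0x38E8, "CODE"),
    (0x3908, 0x0030, "CODE"),
    (0x3938, 0x05A4, "CODE"),
    (0x3EDC, 0x0874, "CODE"),
]

# Hypothetical addresses: one inside a CODE region, one in RAM.
for address in (0x3940, 0x1000078):
    print(hex(address), region_for(address, regions))
```

A name whose address yields "CODE" is a reasonable candidate for defining as code; keep in mind the earlier remark that a CODE region can still contain data.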
LST Format
This is a real paradise for a reverser: these files contain
everything at once. They consist of five parts:
Image Symbol Table - symbols whose meaning I have not
figured out yet
Local Symbols - everything is clear from the name
Global Symbols - analogue of the .sym file
Memory Map of the image - the memory map! All at once!
Image component sizes - analogue of the .map file
The information is so detailed that even the processor mode
of each function is specified.
OUT Format
I have met it only in Nucleus-based firmware. There can be
tlink.out and tsymb.out files:
tsymb.out - an ordinary SYM
tlink.out - a MAP file with almost useless linker
information added.
Now that we are armed with the symbol information, we can
load the firmware into IDA.
What to do if there are no symbols at all
"When there is no toothbrush at hand..." Yes, we take IDA,
an emulating debugger and our brains in hand. IDA is a "must
have". A suitable emulating debugger for ARM is Trace32.
First of all, we load the firmware into IDA at address 0, i.e.
the whole image is loaded at the default addresses. Then we look
at what is at address 0.
BOOT:00000000 B ResetHandler
BOOT:00000004 B loc_3B4
BOOT:00000008 B loc_410
BOOT:0000000C B loc_42C
BOOT:00000010 B loc_488
etc.
The code always begins at address 0. In all Samsungs (and, I
guess, not only in Samsungs) the firmware begins with the
interrupt vectors: eight B instructions in ARM state, i.e. 8
vectors. Address 0 holds the vector for interrupt zero, i.e.
firmware start/restart. This interrupt simply starts the phone,
so its handler leads to the system loader:
BOOT:00000048 ResetHandler; CODE XREF: BOOT:loc_0 ↑j
BOOT:00000048 MRS R0,CPSR
BOOT:0000004C BIC R0,R0,#0x1F
BOOT:00000050 ORR R0,R0,#0x13
BOOT:00000054 ORR R0,R0,#0xC0
BOOT:00000058 MSR CPSR_cxsf,R0
BOOT:0000005C LDR R3,=(InitialHWConfig+1)
BOOT:00000060 MOV LR,PC
BOOT:00000064 BX R3
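The eight vectors can also be decoded without IDA. The sketch below follows the standard ARM encoding of the B instruction (condition in bits 31..28, 1010 in bits 27..24, a signed 24-bit word offset relative to the instruction address plus 8). It assumes a little-endian image; the function names and the sample words are illustrative.

```python
import struct

def decode_arm_branch(word, address):
    """Target of an ARM B instruction located at `address`, or None if not a B."""
    if (word >> 24) & 0x0F != 0x0A:      # bits 27..24 are 1010 for B (1011 is BL)
        return None
    offset = word & 0x00FFFFFF
    if offset & 0x00800000:              # sign-extend the 24-bit field
        offset -= 0x01000000
    return (address + 8 + (offset << 2)) & 0xFFFFFFFF

def vector_targets(firmware):
    """Decode the eight little-endian words at the start of the image."""
    words = struct.unpack_from("<8I", firmware, 0)
    return [decode_arm_branch(word, i * 4) for i, word in enumerate(words)]

# Hypothetical image start: B 0x48, B 0x3B4, then six branches to themselves.
image = struct.pack("<8I", 0xEA000010, 0xEA0000EA, *([0xEAFFFFFE] * 6))
targets = vector_targets(image)
print([hex(t) for t in targets if t is not None])

# Targets beyond the end of the file hint that part of the image
# is mapped at a different base address.
outside = [t for t in targets if t is not None and t >= len(image)]
```

On a real firmware file, a non-empty `outside` list is a signal that the image is not meant to be loaded entirely at address 0.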
If the jump from address zero goes to a non-existent address,
it means that the rest of the code is mapped to some other
addresses. It's easy to determine which ones. For example,
suppose the image begins like this:
BOOT:00000000 B 0x4003CE
And there is no code at address 4003CE. We look at offset 3CE
and see ARM code. This means that the rest of the firmware is
offset by 0x400000. So we have to copy the piece of the firmware
containing the interrupt handlers, load it at address zero, and
then load the firmware at address 400000. Now our code is in the
right place. Let's go further. We need to find out where the RAM
and the I/O port area are. The ports are usually either at the
end of the address space (addresses from about e0000000 and
higher) or at the beginning (up to 0x200000), depending on where
the firmware is loaded. There can be several RAM areas. First of
all, we look at the port initialization:
BOOT:00000588 MOV R1,#1
BOOT:0000058A LDR R0,=0xE0006000
BOOT:0000058C LSL R1,R1,#0x1B
BOOT:0000058E STR R1,[R0]
BOOT:00000590 STR R1,[R0,#0x10]
BOOT:00000592 STR R1,[R0,#0x20]
BOOT:00000594 LDR R1,=loc_20102
BOOT:00000596 LDR R0,=0xE0003040
BOOT:00000598 STR R1,[R0, #4]
BOOT:0000059A LDR R1,=0x20003
BOOT:0000059C STR R1,[R0, #8]
BOOT:0000059E LDR R0,=0xE0003000
BOOT:000005A0 MOV R1,#0xC
BOOT:000005A2 STR R1,[R0,#0x24]
So the I/O port area starts around E0000000. Its size doesn't
exceed the size of one segment, so we can create a segment of
size 0x10000 for it. Moving on: every firmware has a RAM area
that is initialized with zeros and an area that is filled with
initial values taken from the firmware. We are looking for the
copy loops, so we need the debugger.
Here we see copying:
BOOT:000000D4 LDR R0,=0x63B018
BOOT:000000D8 LDR R1,=0x1000000
BOOT:000000DC LDR R3,=0x1045B38
BOOT:000000E0 CMP R1,R3
BOOT:000000E4 BEQ loc_F8
BOOT:000000E8
BOOT:000000E8 loc_E8; CODE XREF: BOOT:000000F4 ↓j
BOOT:000000E8 CMP R1,R3
BOOT:000000EC LDRCC R2,[R0],#4
BOOT:000000F0 STRCC R2,[R1],#4
BOOT:000000F4 BCC loc_E8
The block is copied from address 63B018 in the firmware to
address 1000000. Its length is 45B38.
This is the first RAM area. Now we look for the second one,
whose zero initialization should be nearby:
BOOT:000000F8 LDR R1,=0x11ED9E4
BOOT:000000FC MOV R2,#0
BOOT:00000100
BOOT:00000100 loc_100; CODE XREF: BOOT:00000108 ↓j
BOOT:00000100 CMP R3,R1
BOOT:00000104 STRCC R2,[R3],#4
BOOT:00000108 BCC loc_100
Indeed, the area from 1045B38 to 11ED9E4 is filled with
zeros, so here is the second part. If there are any other areas,
there will certainly be zero or copy initialization for them
too. Other pieces of memory can only be found
analytically, but we already have the basis.
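The effect of the two loops can be reproduced offline to obtain the contents of the RAM segments. This is a minimal Python sketch, assuming the firmware file is loaded at address 0 so that the source address equals the file offset; `build_ram` and the small test image are illustrative.

```python
def build_ram(firmware, src, dst, copy_end, zero_end):
    """Reproduce the boot loops: copy [dst, copy_end) from file offset `src`,
    then zero-fill [copy_end, zero_end). Returns (address, bytes) pairs."""
    copy_len = copy_end - dst
    initialized = firmware[src:src + copy_len]
    if len(initialized) != copy_len:
        raise ValueError("firmware image is too short for the copy loop")
    zeroed = bytes(zero_end - copy_end)
    return [(dst, initialized), (copy_end, zeroed)]

# For the listings above the call would be:
#   build_ram(firmware, 0x63B018, 0x1000000, 0x1045B38, 0x11ED9E4)
# A tiny synthetic image is used here so that the example runs on its own.
firmware = bytes(range(16))
for address, data in build_ram(firmware, src=8, dst=0x100,
                               copy_end=0x104, zero_end=0x10C):
    print(hex(address), data.hex())
```

Each returned pair corresponds to one RAM segment: its start address and the bytes it holds after startup.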
Further research depends on whether symbols/signatures are
available. If they are, everything comes down to looking for the
necessary function in the list of names. What if they are not?
First of all, we need to determine the approximate bounds of the
code and, if possible, find functions in it. The most primitive
and effective way is to search for the push instruction, with
which about 60% of the firmware code begins. Firmware code is
usually about 90% Thumb, so we should look for the B5 byte (push)
and try to define it as code in IDA. The code usually takes less
than 50% of the whole firmware; the rest is graphics and language
resources. I can also add that very often the code ends with
copyright strings, something like
"Samsung corp. 199x-200x ARM ADS 1.2".
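The search for the B5 byte is easy to script. In Thumb, PUSH with LR in the register list is encoded as the halfword 0xB5xx, so in a little-endian image B5 is the second byte of an aligned pair. The sketch below assumes a little-endian image; `find_thumb_push` is an illustrative name, and the hits are only candidates that still have to be confirmed in IDA.

```python
def find_thumb_push(firmware, start=0, end=None):
    """Yield even offsets whose little-endian halfword is 0xB5xx (PUSH {..., LR})."""
    end = len(firmware) if end is None else min(end, len(firmware))
    start += start & 1                     # Thumb instructions are halfword aligned
    for offset in range(start, end - 1, 2):
        if firmware[offset + 1] == 0xB5:   # the high byte is stored second
            yield offset

# Synthetic bytes: two PUSH candidates, plus a B5 in the low byte that must be ignored.
data = bytes([0x00, 0x20, 0xF0, 0xB5, 0xB5, 0x00, 0x10, 0xB5])
print([hex(offset) for offset in find_thumb_push(data)])
```

Restricting `start` and `end` to the estimated code bounds keeps false hits from graphics and language resources out of the result.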
Now some code has been recovered, although around 20% of it
was mangled by IDA itself, because it often cannot cope with
THUMB/ARM transitions. Next we pick up whatever was left lying
around, i.e. what the programmers left behind. And what did they
leave? Trace and Assert. And no trace or assert goes without
sprintf/printf. We have to find it. That's easy: just search for
the "%s" string. We need one that obviously contains an error
message pattern. Using xrefs we find where this string is used,
and that will be exactly sprintf, followed by Trace or Assert.
Now, based on the error messages, we can name functions. I.e.
by following xrefs to the Trace/Assert function, we can find the
output of more than half of the errors. Further functions can be
named by searching for the following
words:
Bad
Fail
Incorrect
Invalid
Error
Memory
File
Null
No
Critical
Abnormal
etc.
This way we will find some more error output functions. Thus
we gradually gather information, relying on nothing but the
firmware itself.
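The string search described above can be sketched in Python as follows. It assumes the messages are stored as plain ASCII; the function names are illustrative and the keyword list is the one given in the text. Short keywords such as "No" will produce many false hits, so the result is a list of candidates to review, not a final answer.

```python
import re

KEYWORDS = ["Bad", "Fail", "Incorrect", "Invalid", "Error", "Memory",
            "File", "Null", "No", "Critical", "Abnormal"]

def find_strings(firmware, min_len=4):
    """Yield (offset, text) for runs of printable ASCII of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    for match in re.finditer(pattern, firmware):
        yield match.start(), match.group().decode("ascii")

def error_messages(firmware):
    """Strings that contain the %s specifier or one of the keywords."""
    return [(offset, text) for offset, text in find_strings(firmware)
            if "%s" in text or any(word in text for word in KEYWORDS)]

# Synthetic bytes standing in for a firmware image.
blob = b"\x00\x01Invalid handle %s\x00\xff\xfeabcd\x00Memory alloc Fail\x00"
for offset, text in error_messages(blob):
    print(hex(offset), text)
```

The offsets returned are the addresses to look up in IDA; the xrefs to them lead to the functions that print the messages.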