This article describes a sample parser of reg files using the Boost Spirit Parser Framework. The main aim of this article is to share the experience, and that's why here, I'll show not so much the solution for the concrete task or the usage of the concrete technology as the way the development process goes. In particular, we'll discuss why we use the curtain libraries and make one or another solution.
I would like to thank the people who developed the following projects - they made the implementation of this project easier:
Background and history of this task
There was a project in which I took part and we needed to test the working of a parser for Windows hive Registry files. These files are stored in binary representation and the structure of such a file is not documented by Microsoft. But, by means of research, my colleagues managed to clear out this structure, and after that, the question of verifying the parser work appeared.
To perform testing, I decided to use the functionality of exporting of Registry in two formats: hive and reg. Thus, I could obtain two different files for the same Registry key and after that check the working of the Windows hive Registry file parser.
The structure of the Registry file - I'll give an example below - is very similar to the structure of the ini file, so you can use standard Windows functions for reading values in this file. But the problem is that, functions work very slow for big files, and that is why this parser was developed - a parser for reg files where I use the Boost Spirit Parser Framework. The reasons why standard Windows functions are slow will be considered below in this article.
What is a reg file?
Let's consider the general view of a reg file structure first, and some special complicated cases will be considered as necessary.
I've taken the following material from here http://en.wikipedia.org/wiki/Windows_Registry.
.REG files (also known as Registration entries) are text-based human-readable files for storing portions of the Registry. On Windows 2000 and later NT-based Operating Systems, they contain the string Windows Registry Editor Version 5.00 at the beginning and are Unicode-based. On Windows 9x and NT 4.0 systems, they contain the string REGEDIT4 and are ANSI-based. Windows 9x format .REG files are compatible with Windows 2000 and later NT-based systems. The Registry Editor on Windows on these systems also supports exporting .REG files in Windows 9x/NT format. Data is stored in .REG files in the following syntax:
[<:Hive Name>\<:Key Name>\<Subkey Name>] "Value Name"=<Value type>:<Value data>
Example 1 (different types):
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft] "Value A"="<String value data>" "Value B"=hex:<Binary data> "Value C"=dword:<DWORD value integer> "Value D"=hex(7):<Multi-string value data> "Value E"=hex(2):<Expandable string value data>
Example 2 (real):
[HKEY_CURRENT_USER\Key] "Value string"="B" "Value dword"=dword:00000001 "Value hard"=hex(1000800c):53,00,65,00,72,00,76,00,69,00,63,00,65,00,53,00,\ 74,00,61,00,72,00,74,00,54,00,79,00,70,00,65,00,00,00,4d,00,61,00,78,00,44
Making a little digression, I want to stress that the number in the line " hex(1000800c) " is the type identifier and it can be anything. It's often used as the data in the security branch [HKEY_LOCAL_MACHINE\SAM].
And now, let's try to extend the information about the possible contents of the reg file. Here, I represent some facts obtained during our research process:
1) Key Name may consist of alphabetical symbols and " , \ , [ , ]. 2) Number of values of one key can be from 0 to infinite 3) Value Name can be: - symbol '@' - it means default - "text" - any symbols can be in this "text" even these ones: \n , " , \ , [ , ] but it always ends with "\n symbols in sequence 4) Value data can be: - "text" - any symbols can be in this "text" but it always ends with " and \n in sequence - binary: - dword:XX - hex:XX - hex(N):XX Comments: XX can be the pairs of number symbols separated by comas and it can end with '\' symbol that means that data continue in the next line Example: dword:72,...,00,\ 00,..,20
Two approaches and their comparison
As it was mentioned, the structures of reg files and ini files are quite the same, so I started to search for methods for working with ini files. I found the standard Windows functions.
Using the Windows API functions
Windows gives a lot of functions to work with INI files; we are interested in two of them for our task:
GetPrivateProfileSectionNames: Retrieves the names of all the sections in an initialization file.
GetPrivateProfileSection: Retrieves all the keys and values for the specified section of an initialization file.
So, we should call
GetPrivateProfileSectionNames one time to obtain the list of keys and then call
GetPrivateProfileSection to obtain the values inside the keys.
The problem is that if the file is quite big, i.e., there are a lot of keys in it, then we should call
GetPrivateProfileSection several times to read from the file. Here is some row test data: file size: 30 MB, file includes about 30,000 keys, the parsing of this file takes about 20 minutes. And, I should say that reg files are often bigger than 100 MB.
So, the problem:
Unjustified number of readings from the file.
It's necessary to load the file to the memory at one time or in some parts and then parse its content using your own tools.
Using a custom parser
When parsing a reg file content using your our own tools, it's good idea to use already developed work. Thus, I came across the article about the ini file reader using the Spirit library. Using the example of the ini file parser, I developed my reg file parser.
Why boost::spirit ?
The Spirit Parser Framework is an object oriented recursive descent parser generator framework implemented using template metaprogramming techniques. Expression templates allow users to approximate the syntax of Extended Backus Naur Form (EBNF) completely in C++.
I was also attracted by these factors:
- There are no intermediate conversions to some code needed, and also no external applications except for the compiler.
- You need to include only two header files and no libraries to use the Spirit Parser Framework.
Fleeting glance on boost::spirit
The main idea of using boost::spirit is in using the rules. Usually, several basic rules are defined, and then other rules are defined by means of overridden operators as a combinations of basic rules. The following example shows the creation of rules using "AND" and "NOT" operators:
RuleType simpleRule = ~ch_t('A') & ~ch_t('B');
This rule works for any symbol except for 'A' and 'B'.
Below in this article, each rule will be described in detail, but now, I want to give some quick information about possible operators - to let you imagine what the possible operations with the rules are.
Set operators: a | b Union Match a or b. Also referred to as alternative a & b Intersection Match a and b a - b Difference Match a but not b. If both match and b's matched text is shorter than a's matched text, a successful match is made a ^ b XOR Match a or b, but not both Sequencing Operators: a >> b Sequence Match a and b in sequence a && b Sequential-and Sequential-and. Same as above, match a and b in sequence a || b Sequential-or Match a or b in sequence Optional and Loops: *a - Match a zero (0) or more times +a - Match a one (1) or more times !a - Match a zero (0) or one (1) time a % b - Match a list of one or more repetitions of a separated by occurrences of b. This is the same as a >> *(b >> a). Note that a must not also match b Single character parsers: anychar_p Matches any single character (including the null terminator: '\0') alnum_p Matches alpha-numeric characters alpha_p Matches alphabetic characters blank_p Matches spaces or tabs cntrl_p Matches control characters digit_p Matches numeric digits graph_p Matches non-space printing characters lower_p Matches lower case letters print_p Matches printable characters punct_p Matches punctuation symbols space_p Matches spaces, tabs, returns, and newlines upper_p Matches upper case letters xdigit_p Matches hexadecimal digits Other comments: negation ~ Example: ~ch_t('x') - matches any character except 'x'
Introduction to the parser development
Further in this article, I'll try to keep the high level of clearness so I'm sorry in advance if you think that my descriptions are too detailed.
The description starts from the heart of the parser - its algorithm. The algorithm is described in parts, and then the wrapper for this algorithm is described, and then other auxiliary classes. At the end, we'll consider a concrete usage example and test.
The algorithm is represented by a function and interface:
It follows from the function names that the algorithm will call the
OnKeyFound function for each key name found,
OnValueNameFound for each value name found, and
OnValueDataFound for the value content.
An observant reader can ask a question: "Why is the processing of value separated into two functions,
OnValueDataFound, after all it it one single entity?" The answer is simple: "The implementation of this processing inside the algorithm would be hard and so the algorithm entrusts calling inside it the processing of these two parts".
At this stage, we will prepare the environment for the convenient work. In particular, we should use namespaces, shorten the line for rule creation, and define frequently used rules.
The first rule is to omit the blanks and tabs. Asterisk means that the rule can work several times in a row or not work at all.
The second rule is to make sure that the current symbol is not the name separator.
The third rule is to omit empty data. The operator >> means that the rule at the left side of the operator and the rule at the right side should be performed in sequence, one after another.
The forth rule is assigned to free data not part of the separators. Any symbol matches this rule except for name separators, @ - the identifier of value by default, and the symbol " that signals the end of the line. This rule can work either several times or never.
Rule for the key name
This rule can be illustrated on the scheme:
ident_kname_continue is an auxiliary for
ident_key_name. First of all,
ident_key_name is the recursive rule, and after the process meets symbol ], it tries to define if it is the end of the name or the part of the name. If symbol ] is the part of the name, i.e., there is no symbol of line end after it, and then this rule starts itself again and continues parsing.
Let's consider the following case for example:
In this example, the symbol ] is the part of the name and that's why it's necessary to check if there is the symbol of line end after it. You can see such names in the Registry.
Rule for the Value name
This rule can be represented on the scheme:
Rule for the value content
This rule can be represented on the scheme like this:
Rule that describes the line with the key name
This rule requires the execution of several rules in sequence. The new idea here is to use the operator . In this case, it means that when the rule is activated, the functor transferred as the parameter should be called. And, the range where this rule worked will be transferred to this functor via two parameters.
Rule that describes the line with the Value
This rule is much like the previous one. At this moment, we can return to the question stated during the interface discussion: "Why is the processing of Value separated into two functions
OnValueDataFound, after all it is one single entity?". Now, we can give a more detailed answer. As it's seen from the code above, it's hard to support an interface corresponding to the reg file structure by means of boost::spirit - a function call with 4 parameters for the whole value.
Gathering everything together
So, we finished the algorithm of reg file parsing. The rules for the key name are set, the value is aggregated, and the file parsing starts.
The whole function looks like the following:
It's time to consider the wrapper class for this algorithm. This class assumes the processing of the value name and the value content. In other words, this class supports the following intrface:
And the class itself:
Before considering the observers created for testing, we should describe the usage of the boost Pool library.
What is Pool?
Pool allocation is a memory allocation scheme that is very fast, but limited in its usage.
The Boost Pool library is a header file library. That means there is no .lib, .dll, or .so to build; just add the Boost directory to your compiler's include file path, and you should be good to go!
How do I use Pool?
To make it convenient, I created the following classes that have the functionality of the standard STL containers but use the boost Pool library:
The only disadvantage - as you could notice - is no full support for original constructors. I actually think that it's not a big problem, you can just insert the necessary constructor if it's needed.
Some observers were created for testing purposes:
CRegStatusObserver- assigned to print the status of the parsing process on the screen in percentage.
CRegCountObserver- assigned to print the number of keys and the values on the screen.
CRegPrintObserver- assigned to print all the keys and values on the screen.
CRegFullObserver- assigned to store all the keys in
std::map, where the name of the Registry key is key and the vector of all the values of this Registry key is the std::vector values.
CRegObserversPool- assigned to make it possible to use several observers simultaneously.
It's important to mention that these observers print information on the screen, but they can be used to print to the file or some other stream. For example,
And some examples of its usage:
The usage of other observers is the same.
All auto-test are represented in the one project, RegFileParserAutoTest.
The test is a console application developed with the Boost Testing Framework. Five complicated cases were chosen to be the test data: one for key name, two for value name, and two for value content.
There is one more case with the typical content of a reg file. So, we have six cases, and there are two versions of reg file format - ANSI and UNICODE (Regedit4 and Regedit5 correspondingly); each of our cases duplicates for two formats. As a result, we have 12 test files.
To control the results of parsing, we use a comparison of the original content and the content parsed and saved in memory using
Description for manual testing
Manual tests are represented in one project, RegFileParserTestCmd. The test is a console application implemented using Boost Program Options that has these parameters:
C:\RegFileParser\bin\Debug>RegFileParserTestCmd.exe --help Allowed options: --help produce help message --reg_file arg source reg file --print Enable printing to the screen --count Enable counting parsed keys and values --status Enable printing status of parsing
It's a pity that I don't have enough time to add options to save the parsed content to a file, and using the save option like it is in autotest doesn't seem to be aesthetic to me.
Registry export using Regedit
This article is a special piece of knowledge, so may be it's not as systematic as it should be.
It was really interesting to learn boost spirit, and I managed to get the pleasure of my work with it and some aesthetic satisfaction - not just get my task done. After all, this discovered to be very effective; for example, the calculation of number of keys and values in a Registry file of 250 MB now takes a couple of seconds only.
So learn something new, and good luck to you in your development!
Continue reading our Dev Blog with the File system driver tutorial.