Hashed Lookup Genarator for Static Keysets


Node:Top, Next:, Previous:(dir), Up:(dir)

hashgen: Hashed Lookup Genarator for Static Keysets


Node:Copying, Next:, Previous:Top, Up:Top

GNU GENERAL PUBLIC LICENSE

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public Licencse as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later option.


Node:Overview, Next:, Previous:Copying, Up:Top

Overview

The hashgen utility generates the source of a lookup function for a given static keyset (reserved words of a programming language, assembler mnemonics, XML-based language elements, etc.) Every keyset's entry contains a key (either string or binary) and arbitrary user-defined atributes associated with that key. The keys must be unique.

The generated function will find any keyword in constant time using exactly one comparison. A typical lookup may take as little as 100ns on an average desktop machine.

The source code can be generated in C++, ANSI C, or K&R C.

The hashgen utility does basically the same thing and uses the same basic algorithm as the gperf utility by Douglas C. Schmidt (see Acknowledgements).

While gperf is a great piece of software, I found it a bit difficult to use. The major points addressed by hashgen are these:


Node:Invoking, Next:, Previous:Overview, Up:Top

Invocation

Synopsis:

hashgen [options]... [input-file]

options are expected in either standard short or GNU-style long format.

input-file contains the keylist, options, and C declarations. If input-file is not specified, the standard input is read. Only one input file can be handled at a time.

The generated code is written to the standard output or to the files specified with -o, -c, and -h options.


Node:Specifying Options, Next:, Previous:Invoking, Up:Invoking

Specifying Options

An option can be specified in either the [options] section of the input file or as a command-line option. The latter may be specified in either short or GNU-style long form. The long form of a command-line option has the same name as in the [options] section of the input file with two dashes prepended.

Example. A fragment of an input file like this

[options]
size=3
header=keylookup.h
...

is equivalent to the following command line fragment:

./hashgen --size=3 --header=keylookup.h ...

or, in the short form,

./hashgen -N3 -h keylookup.h ...

A Boolean option foo can be turned on and off like this

foo=yes
foo=no

or like this

foo=1
foo=0

or just

foo
no-foo

All of the above forms can be used on the command-line with two preceding dashes, e.g. --no-foo.

For a short form of a Boolean option, say i, -i and -i+ turn i on, -i- turns it off.

Short Boolean options can be combined; the last option in a cluster of options can have non-Boolean argument, e.g. -zi-l C++, will turn z on, i off, and l will have the argument C++. The blank between a short non-Boolean command-line option and its argument is not required.

Options specified in the input file override the options specified on the command-line. That is, if in the [options] section of the input file you specified output=foo, but on the command-line --output=bar, the output will still go to foo, no warnings will be issued.


Node:Options, Next:, Previous:Specifying Options, Up:Invoking

Options

This section list all program's options. All Boolean options are given in their non-default form. E.g., zero-term is on by default, so it's listed here as ---no-zero-term or -z-.


--output=FILE
-o FILE
The name of the output source file. By default, all the output goes to stdout. If the output option is specified, the generated lookup code goes to the specified file.

In order to separate the header and the implementation, use header and source options instead.

--header=HEADER
--source=SOURCE
-h HEADER
-c SOURCE
If you want to separate the lookup interface and its implementation, you must specify the name of the header file with -h option, and the name of the implementation file with -c option.

If you specify, say, only -h, the implementation source will go to the file specified with -o if it's given, or to the standard output if it's not.

--language=LANG
-l LANG
Specifies the lookup source language. Valid values are "ANSI C" (or just "C"), "K&R C" (or "KR"), and "C++". The default value is "ANSI C".
--namespace-name=NAME
-n NAME
The name of the namespace the generated lookup class will be placed in. If the value is empty (default), the global namespace is used. The value of the option is ignored for languages other than "C++".
--lookup-func-name=NAME
-f NAME
Name of the lookup function. The default value is "lookup". The lookup function has the following signature:
struct entry* lookup(const char* key, size_t len);

--entry-struct-name=NAME
-e NAME
The name of the key table entry struct. The default value is "entry".
--key-name=NAME
-k NAME
The key field name. The default value is "name".
--lookup-struct-name=NAME
-s NAME
The name of the lookup structure (class). The default value is "lookup_struct".

Even if the selected language is C, all the lookup tables are bundled into a struct with this name.

--lookup-data-name=NAME
-d NAME
The name of the lookup data (instance). The default value is "lookup_data".
--key-table-name=NAME
The name of the key table. The default value is "key". The key table is an array of keys and attributes.
--weight-table-name=NAME
The name of the weight table. The default value is "weight".
--hash-table-name=NAME
The name of the hash table. The default value is "hash".
--length-table-name=NAME
The name of the key length table. The default value is "length". The length table is generated only for binary keys (unless all the keys are of the same length).
--collision-table-name=NAME
The name of the collision table. The default value is "collision". The collision table is generated only if the hash function is not perfect (there are two or more keys with the same hash value.)
--test=FILE
-T FILE
Generate a test source file with the given name. An executable compiled from the test source will both test the generated lookup function, and mesure the average lookup time.

If the header option was specified, the generated test source will include the lookup header (and you will have to link the test program with the lookup implementation). Otherwise, it will contain a copy of the lookup source, and the test executable may be compiled from the test source alone.

--inline-lookup-func
-I
Make the lookup function inline.

The inline prefix will be #ifdefed, it will expand to inline for C++, __inline__ for GNU, or, if INLINE macro is defined (define it in the [declarations] section), to INLINE.

--static-lookup-data
-t
Hide the lookup tables inside the lookup function. If the selected language is C++, the option is ignored. For C++ the lookup tables are protected members of the lookup class.
--no-include-string-h
-i-
By default the generated code will include <string.h> (or <cstring> for C++), which is needed to declare strcmp() (or memcmp()), and NULL (for C.)

If the option is disabled, it becomes you responsibility.

--size=SIZE
-N SIZE
SIZE must be between 0 and 9, the default value is 0. Affects the size of the generated hash table, the larger is the value, the larger will be the hash table (this is true only for large keysets).

Larger hash tables will decrease time necessary to decide that a key does not belong to the set (there will be more empty slots in the hash table; when we hit an empty slot there is no need to compare keys). At the same time, very large hash tables may increase time of successful searches a negligible amount because of worse locality (more CPU cache misses.)

--min-expected-char=MINCHARCODE
--max-expected-char=MAXCHARCODE
-m MINCHARCODE
-M MAXCHARCODE
M??CHARCODE is decimal, hexadecimal (0xnn), or octal (0nnn) character code. The default values are 0x01 and 0xff respectively.

More likely than not the keys you are going to look up are not random byte arrays but rather the result of some kind of parsing. E.g., a C keyword is going to consist of nothing but lower case letters.

Narrowing the range of expected characters will result in smaller lookup tables. But be warned that if the character range is violated at runtime, the results may be disastrous.

If min-expected-char is 0, the zero-term option is turned off (a zero terminated string cannot contain zeroes).

--no-zero-term
-z-
By default hashgen assumes that the input keys are standard zero terminated C strings and the generated code will use strcmp() for key comparison.

If the option is disabled, memcmp() will be used instead, the keys at runtime don't have to be zero terminated and may contain NUL characters.

The option is automattically disabled when a NUL character is encountered in one of the input keys, or the min-expected-char option is set to 0.

--no-length-switch
-S-
The program tries to make the output code that calculates the hash values of the keys of varying length as efficient as possible. Some compilers (including GCC) can compile switch statements that branch on consequtive values into very efficient jump tables.

If you know that the target compiler cannot do that, disable the option, in this case if/goto pairs will be generated instead.

--optimize-hits
-H
By default, hashgen will try to minimize the time of both successful searches and failures. When the option is enabled, successful searches may become a couple of cycles faster at the cost of failures, which will become a bit slower.

This option affects only the generated lookup code, not the algorithm (the generated hash table will be the same.)

You may want to enable this option if you know that the lookup function is going to be called much more often for valid keys (in this case it also makes sense to try to generate a smaller hash table).

--random
-r
Use pseudo-random initial values. For large keysets the hash table will usually grow in size, but hashgen execution time will decrease.
--randimize
-R
By default, if the program is invoked twice with the same keyset and the same options, the output will be the same. If this option is enabled, the output will become non-deterministic. Implies --random option.
--debug
--D
Enables verbose output useful only for debugging.
--version
Show the program's version.
--help
Prints a very short usage summary.


Node:Input, Next:, Previous:Options, Up:Top

The Input File Format

The input file format is similar to that of MS Windows (R) .ini files. An input file is divided into sections. Every section starts with a section header line (e.g. [options]) and terminates at the beginning of the next section or end of file.

There are three kinds of sections: [options], [declarations], and [entries]. The order of sections does not matter.

Empty lines, and lines starting with "#" are ignored (except for the [definitions] section, which contains C/C++ declarations, and is inserted into the generated source verbatim.)

All sections except [entries] are optional. The [entries] section header may be omitted if this section is first in the file. Thus, a minimal valid input file contains just a list of keywords, each starting on a new line.


Node:[options], Next:, Previous:Input, Up:Input

The [options] Section

The [options] section contains all the program's options. This section is optional, and if it's omitted, all the options will get their default values, or values specified on the command line.

Each option starts on a new line, and has name=value format. Empty lines and lines starting with the hash sign ("#") are ignored.

For detailed description of the program's options see the Invoking Chapter (Invoking.)


Node:[declarations], Next:, Previous:[options], Up:Input

The [declarations] Section

The [declaration] section contains arbitrary C/C++ declarations that are copied verbatim into the generated code.

Within the [declarations] section, lines starting with "#" and empty lines are not ignored. If you need to comment something, use C/C++ comments instead. No line may start with the "[" character.

One thing that must be present in the [definitions] section is declaration of the structure of the key table entry. It must be a struct with the same name as specified by the entry-struct-name option (the default value is entry). The first field of the struct must be name (or the name specified with the key-name option), and be of type char* (const char*) or char[].

Instead of defining the entry structure in the input file, you can do it in some other file and just #include it in the [declarations] section.

If the [declarations] section is empty, the following entry declaration will be generated:

struct entry {
  char* name;
};

(the names entry and name may be different, as specified by entry-struct-name and key-name options.)

If the [declarations] section is not empty, the entry structure will not be defined automatically.

Example:

[options]
entry-struct-name = TokenEntry

[declarations]
struct TokenEntry {
  const char* name;
  int token;
  int flags;
};

[entries]
"for",   1, 0
"while", 2, 0
"do",    3, 0

Also see the Example Chapter (Example).


Node:[entries], Next:, Previous:[declarations], Up:Input

The [entries] Section

Each line of the [entries] section contains a key and attributes associated with that key separated with commas. The first field is the key itself, the attributes are optional. Empty lines and lines starting with "#" are ignored.

The syntax of each line must be suitable for C structure initializer (without the enclosing braces). It will be used to initialize an entry in the key table, the structure of the entry is defined in the [declarations] section (see [declarations]).

The [entries] section is assumed by default when the parsing begins, so a minimal valid input file may contain just a list of keywords, each starting on a new line.

Example:

[declarations]
struct entry {
  char* name;
  int id;
};

[entries]
"foo", 1
"bar", 2
"baz", 3

Within the key field C backslash escapes are allowed. If the key does not contain double quotes, commas, spaces, backslash escapes, and does not start with "[", the surrounding double quotes may be omitted:

[entries]
foo, 1
bar, 2
baz, 3

The attributes (after the first comma and up to the following newline character) are not parsed at all, just included in the generated initializer as is. It's your responsibility to ensure their syntax is correct.

For instance, for entries like these

key1, 1 /* one */, 2, something
key2
key3, 3
...

the following initializers will be generated

{ "key1", 1 /* one */, 2, something },
{ "key2" },
{ "key3", 3 },
...


Node:Example, Next:, Previous:[entries], Up:Top

Example

The following sections show an example of the input file and lookup sources generated from it.


Node:Example Input, Next:, Previous:Example, Up:Example

Example Input



# Hashgen input file example:
# Lookup class for C++ reserved words.

# The [entries] section: lists keys and associated data.
# The [entries] section header is implied by default.
# Double quotes around the key values are omitted.

asm,        TOKEN_ASM
auto,       TOKEN_AUTO
break,      TOKEN_BREAK
case,       TOKEN_CASE
catch,      TOKEN_CATCH
char,       TOKEN_CHAR
class,      TOKEN_CLASS
const,      TOKEN_CONST
continue,   TOKEN_CONTINUE
default,    TOKEN_DEFAULT
delete,     TOKEN_DELETE
do,         TOKEN_DO
double,     TOKEN_DOUBLE
else,       TOKEN_ELSE
enum,       TOKEN_ENUM
extern,     TOKEN_EXTERN
float,      TOKEN_FLOAT
for,        TOKEN_FOR
friend,     TOKEN_FRIEND
goto,       TOKEN_GOTO
if,         TOKEN_IF
inline,     TOKEN_INLINE
int,        TOKEN_INT
long,       TOKEN_LONG
new,        TOKEN_NEW
operator,   TOKEN_OPERATOR
overload,   TOKEN_OVERLOAD
private,    TOKEN_PRIVATE
protected,  TOKEN_PROTECTED
public,     TOKEN_PUBLIC
register,   TOKEN_REGISTER
return,     TOKEN_RETURN
short,      TOKEN_SHORT
signed,     TOKEN_SIGNED
sizeof,     TOKEN_SIZEOF
static,     TOKEN_STATIC
struct,     TOKEN_STRUCT
switch,     TOKEN_SWITCH
template,   TOKEN_TEMPLATE
this,       TOKEN_THIS
typedef,    TOKEN_TYPEDEF
union,      TOKEN_UNION
unsigned,   TOKEN_UNSIGNED
virtual,    TOKEN_VIRTUAL
void,       TOKEN_VOID
volatile,   TOKEN_VOLATILE
while,      TOKEN_WHILE


[options]

# All the options set explicitly for demonstration purposes
# (for most the default values would be okay.)

language = C++
header = cxx_keywords.h
source = cxx_keywords.cxx
test = cxx_keywords_test.cxx
namespace-name = Scanner
entry-struct-name = Token
key-name = text
lookup-struct-name = TokenHash
lookup-data-name = tokenHash
lookup-func-name = lookup
key-table-name = tokens
weight-table-name = charWeights
length-table-name = keyLengths
hash-table-name = hashTable
collision-table-name = hashCollisions

optimize-hits = yes
length-switch = yes
#inline-lookup-func = yes
static-lookup-data = no
min-expected-char = 0x30
max-expected-char = 0x7a


# The [declarations] sections: defines the structure
# of a key table entry (`Token' in this example).
# Also may contain any additional declarations,
# everything is copied verbatim into the generated header.

[declarations]

struct Token {
  const char* text;
  int token;
};

enum {
  TOKEN_ASM,
  TOKEN_AUTO,
  TOKEN_BREAK,
  TOKEN_CASE,
  TOKEN_CATCH,
  TOKEN_CHAR,
  TOKEN_CLASS,
  TOKEN_CONST,
  TOKEN_CONTINUE,
  TOKEN_DEFAULT,
  TOKEN_DELETE,
  TOKEN_DO,
  TOKEN_DOUBLE,
  TOKEN_ELSE,
  TOKEN_ENUM,
  TOKEN_EXTERN,
  TOKEN_FLOAT,
  TOKEN_FOR,
  TOKEN_FRIEND,
  TOKEN_GOTO,
  TOKEN_IF,
  TOKEN_INLINE,
  TOKEN_INT,
  TOKEN_LONG,
  TOKEN_NEW,
  TOKEN_OPERATOR,
  TOKEN_OVERLOAD,
  TOKEN_PRIVATE,
  TOKEN_PROTECTED,
  TOKEN_PUBLIC,
  TOKEN_REGISTER,
  TOKEN_RETURN,
  TOKEN_SHORT,
  TOKEN_SIGNED,
  TOKEN_SIZEOF,
  TOKEN_STATIC,
  TOKEN_STRUCT,
  TOKEN_SWITCH,
  TOKEN_TEMPLATE,
  TOKEN_THIS,
  TOKEN_TYPEDEF,
  TOKEN_UNION,
  TOKEN_UNSIGNED,
  TOKEN_VIRTUAL,
  TOKEN_VOID,
  TOKEN_VOLATILE,
  TOKEN_WHILE
};


Node:Generated Header, Next:, Previous:Example Input, Up:Example

Generated Header



// Generated by hashgen 1.0b (Dec  8 2002)
// language: C++
// 47 keys, hash table size 68 (145%), 0 collision(s)
// expected char range: 0x30-0x7a
// Mon Dec  9 14:50:21 2002

#ifndef TokenHash_INCLUDED
#define TokenHash_INCLUDED

#include <cstring>

namespace Scanner {

  struct Token {
    const char* text;
    int token;
  };

  enum {
    TOKEN_ASM,
    TOKEN_AUTO,
    TOKEN_BREAK,
    TOKEN_CASE,
    TOKEN_CATCH,
    TOKEN_CHAR,
    TOKEN_CLASS,
    TOKEN_CONST,
    TOKEN_CONTINUE,
    TOKEN_DEFAULT,
    TOKEN_DELETE,
    TOKEN_DO,
    TOKEN_DOUBLE,
    TOKEN_ELSE,
    TOKEN_ENUM,
    TOKEN_EXTERN,
    TOKEN_FLOAT,
    TOKEN_FOR,
    TOKEN_FRIEND,
    TOKEN_GOTO,
    TOKEN_IF,
    TOKEN_INLINE,
    TOKEN_INT,
    TOKEN_LONG,
    TOKEN_NEW,
    TOKEN_OPERATOR,
    TOKEN_OVERLOAD,
    TOKEN_PRIVATE,
    TOKEN_PROTECTED,
    TOKEN_PUBLIC,
    TOKEN_REGISTER,
    TOKEN_RETURN,
    TOKEN_SHORT,
    TOKEN_SIGNED,
    TOKEN_SIZEOF,
    TOKEN_STATIC,
    TOKEN_STRUCT,
    TOKEN_SWITCH,
    TOKEN_TEMPLATE,
    TOKEN_THIS,
    TOKEN_TYPEDEF,
    TOKEN_UNION,
    TOKEN_UNSIGNED,
    TOKEN_VIRTUAL,
    TOKEN_VOID,
    TOKEN_VOLATILE,
    TOKEN_WHILE
  };

  class TokenHash {
  public:
    static size_t size() {
      return 47;
    }
    static const Token* at(size_t index) {
      return tokens + index;
    }
    static const Token* lookup(const char* key) {
      return lookup(key, std::strlen(key));
    }
    static Token* lookup(const char* key, size_t len);
  protected:
    static unsigned char hashTable[68];
    static unsigned char charWeights[75];
    static Token tokens[47];
  };

  extern TokenHash tokenHash;

}

#endif


Node:Generated Source, Next:, Previous:Generated Header, Up:Example

Generated Source



#include "cxx_keywords.h"

namespace Scanner {

  TokenHash tokenHash;

  unsigned char TokenHash::hashTable[68] = {
    // Hash table. Each element either points to a key table entry,
    // or is null (47) if there is no key with this hash value.
    /*  0 */  47, 47, 47,  0, 47, 47, 35, 40,  1, 47,
    /* 10 */  36,  3,  7, 13, 34,  8, 31, 38, 19,  4,
    /* 20 */  15,  6, 30, 25, 28, 11, 17, 23, 29, 10,
    /* 30 */   9, 39, 37, 20, 24, 18, 14, 16, 32,  2,
    /* 40 */  41, 45, 22, 47,  5, 33, 47, 26, 43, 47,
    /* 50 */  47, 42, 47, 21, 47, 27, 47, 12, 47, 47,
    /* 60 */  44, 47, 47, 47, 47, 47, 47, 46
  };

  unsigned char TokenHash::charWeights[75] = {
    // Weight table (one character hash function).
    /*  40 */                                          255, 255,
    /*  50 */  255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    /*  60 */  255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    /*  70 */  255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    /*  80 */  255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    /*  90 */  255, 255, 255, 255, 255, 255, 255,   0,  28,   7,
    /* 100 */   23,   0,  23,  14,  27,   8, 255, 255,   9,   1,
    /* 110 */   31,   0,   9, 255,   6,   0,   0,   4,  33,  26,
    /* 120 */   14,   0, 255
  };

  Token TokenHash::tokens[47] = {
    // Key table. Holds keys and associated data (order preserved.)
    /*  0 */  { "asm",         TOKEN_ASM },
    /*  1 */  { "auto",        TOKEN_AUTO },
    /*  2 */  { "break",       TOKEN_BREAK },
    /*  3 */  { "case",        TOKEN_CASE },
    /*  4 */  { "catch",       TOKEN_CATCH },
    /*  5 */  { "char",        TOKEN_CHAR },
    /*  6 */  { "class",       TOKEN_CLASS },
    /*  7 */  { "const",       TOKEN_CONST },
    /*  8 */  { "continue",    TOKEN_CONTINUE },
    /*  9 */  { "default",     TOKEN_DEFAULT },
    /* 10 */  { "delete",      TOKEN_DELETE },
    /* 11 */  { "do",          TOKEN_DO },
    /* 12 */  { "double",      TOKEN_DOUBLE },
    /* 13 */  { "else",        TOKEN_ELSE },
    /* 14 */  { "enum",        TOKEN_ENUM },
    /* 15 */  { "extern",      TOKEN_EXTERN },
    /* 16 */  { "float",       TOKEN_FLOAT },
    /* 17 */  { "for",         TOKEN_FOR },
    /* 18 */  { "friend",      TOKEN_FRIEND },
    /* 19 */  { "goto",        TOKEN_GOTO },
    /* 20 */  { "if",          TOKEN_IF },
    /* 21 */  { "inline",      TOKEN_INLINE },
    /* 22 */  { "int",         TOKEN_INT },
    /* 23 */  { "long",        TOKEN_LONG },
    /* 24 */  { "new",         TOKEN_NEW },
    /* 25 */  { "operator",    TOKEN_OPERATOR },
    /* 26 */  { "overload",    TOKEN_OVERLOAD },
    /* 27 */  { "private",     TOKEN_PRIVATE },
    /* 28 */  { "protected",   TOKEN_PROTECTED },
    /* 29 */  { "public",      TOKEN_PUBLIC },
    /* 30 */  { "register",    TOKEN_REGISTER },
    /* 31 */  { "return",      TOKEN_RETURN },
    /* 32 */  { "short",       TOKEN_SHORT },
    /* 33 */  { "signed",      TOKEN_SIGNED },
    /* 34 */  { "sizeof",      TOKEN_SIZEOF },
    /* 35 */  { "static",      TOKEN_STATIC },
    /* 36 */  { "struct",      TOKEN_STRUCT },
    /* 37 */  { "switch",      TOKEN_SWITCH },
    /* 38 */  { "template",    TOKEN_TEMPLATE },
    /* 39 */  { "this",        TOKEN_THIS },
    /* 40 */  { "typedef",     TOKEN_TYPEDEF },
    /* 41 */  { "union",       TOKEN_UNION },
    /* 42 */  { "unsigned",    TOKEN_UNSIGNED },
    /* 43 */  { "virtual",     TOKEN_VIRTUAL },
    /* 44 */  { "void",        TOKEN_VOID },
    /* 45 */  { "volatile",    TOKEN_VOLATILE },
    /* 46 */  { "while",       TOKEN_WHILE }
  };

  // Key characters must be in 0x30-0x7a range.

  Token* TokenHash::lookup(const char* key, size_t len) {
    using namespace std;
    register struct Token* p;
    register unsigned h = len;
    switch (h) {
      default: return 0;
      case 9:
      case 8:
      case 7:
      case 6:
      case 5:
      case 4:
        h += tokenHash.charWeights[(unsigned char)key[3]-48];
      case 3:
      case 2:
        h += tokenHash.charWeights[(unsigned char)key[1]-48];
      case 1:
        h += tokenHash.charWeights[(unsigned char)key[0]-48];
      case 0: ;
    }
    if (h > 67)
      return 0;
    h = tokenHash.hashTable[h];
    if (h == 47)
      return 0;
    p = &tokenHash.tokens[h];
    if (*key == *p->text && !strcmp(key, p->text))
      return p;
    return 0;
  }

  // Please notice that in expression `a[x-c]', where `a' is a static
  // array and `c' is a constant, `a-c' is a constant sub-expression,
  // so the offset should not incur additional runtime overhead.

}


Node:Generated Test Source, Next:, Previous:Generated Source, Up:Example

Generated Test Source



// Test for hashed lookup code generated by hashgen 1.0b.
// Measures the average time of both successful and unsuccessful lookups.

#include "cxx_keywords.h"
#include <ctime>
#include <iostream>
#include <iomanip>

using namespace std;
using namespace Scanner;

#ifndef CLK_TCK
#define CLK_TCK CLOCKS_PER_SEC
#endif

int main() {
  clock_t tick;
  double time;
  unsigned count;
  unsigned i;
  // Hit test. If/goto overhead is negligible, so it's okay to test for
  // both correctness and speed at the same time.  There is no sense in
  // checking the key pointed to  by the lookup function's return value
  // after a successful search, since  the only place where  a non-null
  // pointer is returned is right after successful strcmp() or memcmp().
  count = 0;
  tick = clock();
  do {
    for (i = 0; i < 21276; i++) {
      if (TokenHash::lookup("asm", 3) == 0) goto err_miss;
      if (TokenHash::lookup("auto", 4) == 0) goto err_miss;
      if (TokenHash::lookup("break", 5) == 0) goto err_miss;
      if (TokenHash::lookup("case", 4) == 0) goto err_miss;
      if (TokenHash::lookup("catch", 5) == 0) goto err_miss;
      if (TokenHash::lookup("char", 4) == 0) goto err_miss;
      if (TokenHash::lookup("class", 5) == 0) goto err_miss;
      if (TokenHash::lookup("const", 5) == 0) goto err_miss;
      if (TokenHash::lookup("continue", 8) == 0) goto err_miss;
      if (TokenHash::lookup("default", 7) == 0) goto err_miss;
      if (TokenHash::lookup("delete", 6) == 0) goto err_miss;
      if (TokenHash::lookup("do", 2) == 0) goto err_miss;
      if (TokenHash::lookup("double", 6) == 0) goto err_miss;
      if (TokenHash::lookup("else", 4) == 0) goto err_miss;
      if (TokenHash::lookup("enum", 4) == 0) goto err_miss;
      if (TokenHash::lookup("extern", 6) == 0) goto err_miss;
      if (TokenHash::lookup("float", 5) == 0) goto err_miss;
      if (TokenHash::lookup("for", 3) == 0) goto err_miss;
      if (TokenHash::lookup("friend", 6) == 0) goto err_miss;
      if (TokenHash::lookup("goto", 4) == 0) goto err_miss;
      if (TokenHash::lookup("if", 2) == 0) goto err_miss;
      if (TokenHash::lookup("inline", 6) == 0) goto err_miss;
      if (TokenHash::lookup("int", 3) == 0) goto err_miss;
      if (TokenHash::lookup("long", 4) == 0) goto err_miss;
      if (TokenHash::lookup("new", 3) == 0) goto err_miss;
      if (TokenHash::lookup("operator", 8) == 0) goto err_miss;
      if (TokenHash::lookup("overload", 8) == 0) goto err_miss;
      if (TokenHash::lookup("private", 7) == 0) goto err_miss;
      if (TokenHash::lookup("protected", 9) == 0) goto err_miss;
      if (TokenHash::lookup("public", 6) == 0) goto err_miss;
      if (TokenHash::lookup("register", 8) == 0) goto err_miss;
      if (TokenHash::lookup("return", 6) == 0) goto err_miss;
      if (TokenHash::lookup("short", 5) == 0) goto err_miss;
      if (TokenHash::lookup("signed", 6) == 0) goto err_miss;
      if (TokenHash::lookup("sizeof", 6) == 0) goto err_miss;
      if (TokenHash::lookup("static", 6) == 0) goto err_miss;
      if (TokenHash::lookup("struct", 6) == 0) goto err_miss;
      if (TokenHash::lookup("switch", 6) == 0) goto err_miss;
      if (TokenHash::lookup("template", 8) == 0) goto err_miss;
      if (TokenHash::lookup("this", 4) == 0) goto err_miss;
      if (TokenHash::lookup("typedef", 7) == 0) goto err_miss;
      if (TokenHash::lookup("union", 5) == 0) goto err_miss;
      if (TokenHash::lookup("unsigned", 8) == 0) goto err_miss;
      if (TokenHash::lookup("virtual", 7) == 0) goto err_miss;
      if (TokenHash::lookup("void", 4) == 0) goto err_miss;
      if (TokenHash::lookup("volatile", 8) == 0) goto err_miss;
      if (TokenHash::lookup("while", 5) == 0) goto err_miss;
    }
    count++;
    time = (clock() - tick) / (double)CLK_TCK;
  } while ((time < 0.5) && (count < 1000000));
  cout
  << "successful search time "
  << setw(6)
  << int(time * 1e9 / 21276 / 47 / count)
  << "ns (" << setw(0)
  << int(999972 * count / time)
  << " searches/sec)\n";
  // Miss test.  Characters of the test keys are random within range
  // from min-expected-char to max-expected-char, their length from 0
  // to a bit more than maximum length of the existing keys.
  // The results should not be taken too seriously, the keys are too
  // random, so strcmp() or memcmp() is not called most of the time
  // because the calculated hash values are out of range.
  count = 0;
  tick = clock();
  do {
    for (i = 0; i < 21276; i++) {
      if (TokenHash::lookup("<T", 2)) goto err_hit;
      if (TokenHash::lookup("hPHv]", 5)) goto err_hit;
      if (TokenHash::lookup("UlkW", 4)) goto err_hit;
      if (TokenHash::lookup("", 0)) goto err_hit;
      if (TokenHash::lookup("h16QrP@", 7)) goto err_hit;
      if (TokenHash::lookup("=Wk", 3)) goto err_hit;
      if (TokenHash::lookup("`JN4?", 5)) goto err_hit;
      if (TokenHash::lookup("0", 1)) goto err_hit;
      if (TokenHash::lookup("d1[GsilcF5", 10)) goto err_hit;
      if (TokenHash::lookup("3jx>", 4)) goto err_hit;
      if (TokenHash::lookup("77EDG5i", 7)) goto err_hit;
      if (TokenHash::lookup("O<1GP", 5)) goto err_hit;
      if (TokenHash::lookup("w9P@9H", 6)) goto err_hit;
      if (TokenHash::lookup("v0Edr", 5)) goto err_hit;
      if (TokenHash::lookup("", 0)) goto err_hit;
      if (TokenHash::lookup("", 0)) goto err_hit;
      if (TokenHash::lookup("??wFTtGZb", 9)) goto err_hit;
      if (TokenHash::lookup("bWycwd_jmp", 10)) goto err_hit;
      if (TokenHash::lookup(":", 1)) goto err_hit;
      if (TokenHash::lookup("@:mteniZg", 9)) goto err_hit;
      if (TokenHash::lookup("?fRm2e=", 7)) goto err_hit;
      if (TokenHash::lookup("6", 1)) goto err_hit;
      if (TokenHash::lookup("Ei", 2)) goto err_hit;
      if (TokenHash::lookup("b6mT_8G<H", 9)) goto err_hit;
      if (TokenHash::lookup("yB", 2)) goto err_hit;
      if (TokenHash::lookup("Vd", 2)) goto err_hit;
      if (TokenHash::lookup("BbGxnm0Xdz", 10)) goto err_hit;
      if (TokenHash::lookup("7x5n_pJl", 8)) goto err_hit;
      if (TokenHash::lookup("R8JjBHe7", 8)) goto err_hit;
      if (TokenHash::lookup("O?", 2)) goto err_hit;
      if (TokenHash::lookup("6?gt", 4)) goto err_hit;
      if (TokenHash::lookup("", 0)) goto err_hit;
      if (TokenHash::lookup(";O8R", 4)) goto err_hit;
      if (TokenHash::lookup("6X2e6L@ZW", 9)) goto err_hit;
      if (TokenHash::lookup("tGDve", 5)) goto err_hit;
      if (TokenHash::lookup("R9CutRJ", 7)) goto err_hit;
      if (TokenHash::lookup("=SaE[9TaJ@", 10)) goto err_hit;
      if (TokenHash::lookup("", 0)) goto err_hit;
      if (TokenHash::lookup("EEcmFFmCu", 9)) goto err_hit;
      if (TokenHash::lookup("", 0)) goto err_hit;
      if (TokenHash::lookup("5", 1)) goto err_hit;
      if (TokenHash::lookup("Z", 1)) goto err_hit;
      if (TokenHash::lookup("CfI", 3)) goto err_hit;
      if (TokenHash::lookup("\\lP", 3)) goto err_hit;
      if (TokenHash::lookup("MYKg\\Dl1Z6", 10)) goto err_hit;
      if (TokenHash::lookup("L6crI]N", 7)) goto err_hit;
      if (TokenHash::lookup("LQtKd", 5)) goto err_hit;
    }
    count++;
    time = (clock() - tick) / (double)CLK_TCK;
  } while ((time < 0.5) && (count < 1000000));
  cout
  << "search failure time    "
  << setw(6)
  << int(time * 1e9 / 21276 / 47 / count)
  << "ns (" << setw(0)
  << int(999972 * count / time)
  << " searches/sec)\n";
  return 0;
err_hit:
  cerr << "cxx_keywords.cxx: non-existing key found\n";
  exit(1);
err_miss:
  cerr << "cxx_keywords.cxx: key not found\n";
  exit(1);
}


Node:Machine Code, Next:, Previous:Generated Test Source, Up:Example

Generated Machine Code

Bellow is a frament of the assembly code generated by GCC 2.95.3 (i386, FreeBSD 4.4) for the following line of the test source:

  if (TokenHash::lookup("catch", 5) == 0) goto err_miss;

The lookup function is inlined and called with constant arguments. GCC did not generate the code to check length of the key at all since it's known at compile time. .LC2 contains "catch", it's long enough to include all 3 hash positions, for shorter strings the compliler generated access only to included positions (lines marked with "*").

	movzbl .LC2+3,%eax
	movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%edx    *
	movzbl .LC2+1,%eax
	movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax    *
	leal 5(%eax,%edx),%edx
	movzbl .LC2,%eax
	movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax    *
	addl %eax,%edx
	cmpl $67,%edx
	ja .L540
	movzbl _Q27Scanner9TokenHash$hashTable(%edx),%edx
	cmpl $47,%edx
	je .L540
	leal 0(,%edx,8),%eax
	leal _Q27Scanner9TokenHash$tokens(%eax),%ebx
	movl _Q27Scanner9TokenHash$tokens(%eax),%edx
	movb (%edx),%al
	cmpb %al,.LC2
	jne .L577
	addl $-8,%esp
	pushl %edx
	pushl $.LC2
	call strcmp
	movl %eax,%edx
	addl $16,%esp
	movl %ebx,%eax
	testl %edx,%edx
	je .L567
.L577:
	xorl %eax,%eax
.L567:
	testl %eax,%eax
	je .L540

If the lookup function is not inline and the invocation context is not known, a jump table is generated for the switch on the key's length (jmp *.L21(,%edx,4)):

lookup__Q27Scanner9TokenHashPCcUi:
	pushl %ebp
	movl %esp,%ebp
	subl $20,%esp
	pushl %ebx
	movl 8(%ebp),%ecx
	movl 12(%ebp),%edx
	cmpl $9,%edx
	ja .L24
	jmp *.L21(,%edx,4)
	.p2align 2,0x90
	.section	.rodata
	.p2align 2
	.p2align 2
.L21:
	.long .L9
	.long .L19
	.long .L18
	.long .L18
	.long .L16
	.long .L16
	.long .L16
	.long .L16
	.long .L16
	.long .L16
.text
	.p2align 2,0x90
.L16:
	movzbl 3(%ecx),%eax
	movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax
	addl %eax,%edx
.L18:
	movzbl 1(%ecx),%eax
	movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax
	addl %eax,%edx
.L19:
	movzbl (%ecx),%eax
	movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax
	addl %eax,%edx
.L9:
	cmpl $67,%edx
	ja .L24
	movzbl _Q27Scanner9TokenHash$hashTable(%edx),%edx
	cmpl $47,%edx
	je .L24
	leal 0(,%edx,8),%eax
	leal _Q27Scanner9TokenHash$tokens(%eax),%ebx
	movl _Q27Scanner9TokenHash$tokens(%eax),%edx
	movb (%edx),%al
	cmpb %al,(%ecx)
	jne .L24
	addl $-8,%esp
	pushl %edx
	pushl %ecx
	call strcmp
	testl %eax,%eax
	jne .L24
	movl %ebx,%eax
	jmp .L26
	.p2align 2,0x90
.L24:
	xorl %eax,%eax
.L26:
	movl -24(%ebp),%ebx
	leave
	ret


Node:Tips, Next:, Previous:Machine Code, Up:Top

Tips


Node:Bugs, Next:, Previous:Tips, Up:Top

Known Bugs and Limitations of hashgen

The keys may be of any reasonable length, but only the first 32 characters are used for hashing, they must be unique, otherwise the generated hash function will not be perfect.

hashgen consumes lots of RAM when the keyset is large (~20Mb for 3,000 keys, for larger keysets it switches to a slower and less memory-hungry mode (~4Mb for 10,000 keys)). This estimation is very inaccurate, the actual figures depend on the average length of a key and its assotiated data, character code range, etc.

There are many misspellings, misplaced definite articles, and other mistakes and inaccuracies in this manual. I would be grateful to anyone willing to help me to clean this mess up.


Node:Acknowledgements, Previous:Bugs, Up:Top

Acknowledgements

The hashgen utility was written by Vladimir Shiryaev.

The program does basically the same thing and uses the same basic algorithm (with a few improvements) as the gperf utility by Douglas C. Schmidt. The differencies in functionality are briefly outlined in the Overview Chapter (Overview).

As Doug says in the gperf manual, the general idea for the perfect hash function generator was inspired by Keith Bostic's algorithm, created at the University of California, Irvine.