WAAVScript – Part 3, What’s an interned name?

technology3D PrintingManufactoryArtificial Intelligence

Sep 4

There are two places where this association matters. At startup time, the keywords need to be associated with the appropriate function that implements the behavior. Then, when scanning user input, the names need to be isolated, and used as a key to lookup the behavior associated with the name. Perhaps the easiest way to do this is to simply use std::string as a key into a dictionary. Just create std::unorderd_map<std::string, behavior> structure, where ‘behavior’ is some sort of pointer to a function. In the first iteration of the WAAVScript interpreter, that’s exactly what I did. It’s super fast to code it up. You don’t worry about memory allocations, and magic just happens. But, it’s pretty darned slow in execution speed. It’s not the creation of the string objects themselves, its how they are used in the dictionary lookup that makes things slow.

In the second iteration, I used OctetCursor as the ‘key’ instead of std::string. That’s actually faster, because I don’t have to re-allocate, because the span points to permanent memory (while the program is loaded). You still have to implement a hash function, and the spans might be compared by content when multiple things end up in the same bucket. This method is faster than std::string, but still not quite as fast as it can be. So, where’s the hangup really?

Let’s disect the typically dictionary lookup.

std::unordered_map<key, value>

Typically implemented as some sort of hash table. That means there are ‘buckets’, which contain things that are similar on the surface (landing them in the same bucket). Then, once you find the right bucket, you need to sift through the entries in the bucket, looking for an exact match. So, doing something as simple as finding whether a key even exists in a map, the machinery will:

Take the key, and create a hash value for it
Use that hashed key to find a bucket where the key might live (typically hashValue % numBuckets)
Once on the bucket, if there’s only one value, compare the content to see if the key matches
- If the key matched the entry, return ‘true’, value found
- If not, and there are more entries, repeat 3) until found

It’s step 3 there, when you’re comparing the values that things slow down. It’s the content comparison that kills it. Wouldn’t it be great if you could just compare pointer values, and be done with it? Yes, but where can we get these reliable pointers for comparison?

In steps the ‘interned string’. What we want is a system where once a ‘name’ is identified, we can just use a universal pointer that represents the string for comparisons. We don’t want to do content comparisons if we can avoid it. The ability to represent a string in this way is typically known as “string interning”, and it’s a fairly common technique, used by most modern language interpreters.

For example, I’d like to do the following:

const char *a = INTERN("hello");

const char *b = INTERN("hello");

printf("A == B: %d", a == b);

// : 1

The imaginary “INTERN” function, takes a string value, and creates a reliable pointer that represents that sequence of bytes. If you then do the same INTERN thing, somewhere else in your program, you will get the exact same pointer value, so you can compare the two pointers for equivalence. This is great for dictionaries, because you can use a pointer as the key, rather than a semantic ‘string’.

The implementation details are where the magic happens of course. For this, I go with a straight up C implementation, relying on nothing from the C++ world. The goal is to have a nametable mechanism which can be used at a global application level to provide this interning service. It’s a fairly simple requirement. When I intern a string, I want to get the same ‘char *’ pointer value whenever I use the same sequence of bytes. The interface looks like this.

#ifdef __cplusplus

extern "C" {

#endif

typedef struct WSNameTable WSNameTable;

// Convenience function to iterate over all names in the global name table.

// Return 0 to continue; any non-zero value aborts the walk and is returned by foreach.

typedef int (*ws_name_iter_fn)(const char* name, size_t len, void* user);

// Instance API

static inline int ws_name_table_init(WSNameTable* t, size_t capacity);

static inline WSNameTable* ws_name_table_create(void);

static inline void ws_name_table_destroy(WSNameTable* t);

static inline void ws_name_table_clear(WSNameTable* t);

static inline int ws_name_table_reserve(WSNameTable* t, size_t expected_count);

static inline const char* ws_name_table_lookup_len(const WSNameTable* t, const char* ptr, size_t len);

static inline size_t ws_name_table_capacity(const WSNameTable* t);

static inline size_t ws_name_table_size(const WSNameTable* t);

static inline double ws_name_table_load(const WSNameTable* t);

// Intern

static inline const char* ws_name_table_intern_len(WSNameTable* t, const char* ptr, size_t len);

static inline const char* ws_name_table_intern(WSNameTable* t, const char* cstr);

static inline int ws_name_table_intern_many(WSNameTable* t, const char* const* names, size_t count);

// Global singleton convenience

static inline const char* ws_name_intern_len(const char* ptr, size_t len);

static inline const char* ws_name_intern(const char* cstr);

static inline int ws_name_table_foreach(const WSNameTable* t, ws_name_iter_fn fn, void* user);

static inline void ws_name_table_global_shutdown(void);

#ifdef __cplusplus

} // extern "C"

#endif

The real working end of this API is the ‘ws_name_intern’ function. Everything else can pretty much be ignored. With this interface, we can perform some simple tests, like the above pseudo code.

// Using the globally accessible singleton name table

const char* literal1 = ws_name_intern("literal1");

const char* literal2 = ws_name_intern("literal2");

const char* literal3 = ws_name_intern("literal3");

const char* literal11 = ws_name_intern("literal1");

const char* literal21 = ws_name_intern("literal2");

const char* literal31 = ws_name_intern("literal3");

printf(" literal1: %p\n", literal1);

printf("literal11: %p\n", literal11);

printf(" literal2: %p\n", literal2);

printf("literal21: %p\n", literal21);

printf(" literal3: %p\n", literal3);

printf("literal31: %p\n", literal31);

When you run that code, all the expected pointers will be the same.

If you want to bulk load a name table, you could do the following.

int init_ops_table(WSNameTable *nt, const char * const names[], size_t n)

{

if (ws_name_table_init(nt, 0) != 0) return -1;

return ws_name_table_intern_many(nt, names, n);

}

static int dump_name(const char* name, size_t len, void* user) {

(void)user;

printf("%.*s\n", (int)len, name);

return 0; // keep going

}

// A function that uses the ws_name_table_foreach to visit each name

// and print its value.

void print_all(const WSNameTable* t) {

(void)ws_name_table_foreach(t, dump_name, NULL);

}

// Create a bulk name table

WSNameTable g_ops;

static const char* const k_builtin_names[] = {

"moveto",

"lineto",

"curveto",

"closepath",

"stroke",

"fill",

"show",

"matrix",

"scale",

"translate"

};

init_ops_table(&g_ops, k_builtin_names, sizeof(k_builtin_names) / sizeof(k_builtin_names[0]));

print_all(&g_ops);

That shows how you can create a temporary name table on the stack, and fill it with names, then execute some function against each of the names in the table. Keep in mind, this is just the name table. This is not the dictionary or any other higher level construct. We’re just creating a reliable pointer from a sequence of bytes, and that’s it. The real value is shown when we use these reliable pointers as keys in a dictionary.

So, what’s going on under the hood here?

Let’s look at the data structures and some helper functions.

100

#ifndef WS_NAME_TABLE_INITIAL_CAPACITY

#define WS_NAME_TABLE_INITIAL_CAPACITY 1024u // power of two

#endif

#define WS_NAME_TABLE_GROW_THRESHOLD(cap) ((cap) - ((cap) >> 2)) // 0.75

// A single entry stores the pointer to a copy of the

// original value, as well as the length.

// It also stores the hash value generated for that sequence of bytes

typedef struct WSNT_Entry {

const char* str; // NULL == empty

size_t len;

uint64_t hash;

} WSNT_Entry;

// Structure for the name table itself.

struct WSNameTable {

WSNT_Entry* buckets;

size_t capacity;

size_t size;

};

// 64-bit FNV-1a hash function for strings.

// This particular function has a good spread of bits

// for typical string input

static inline uint64_t wsnt_hash(const void* data, size_t len)

{

const unsigned char* p = (const unsigned char*)data;

uint64_t h = 1469598103934665603ULL;

for (size_t i = 0; i < len; i++)

{

h ^= (uint64_t)p[i]; h *= 1099511628211ULL;

}

return h;

}

// wsnt_next_pow2

//

// figure out next power of two >= n

// we use this when calculating the next size up

// when growing the table.

static inline size_t wsnt_next_pow2(size_t n)

{

if (n < 2) return 2;

n--; n |= n >> 1; n |= n >> 2; n |= n >> 4; n |= n >> 8; n |= n >> 16;

#if SIZE_MAX > 0xFFFFFFFFu

n |= n >> 32;

#endif

n++; return n;

}

// wsnt_strndup

//

// Allocate a new string, copy n bytes from s, and null-terminate it.

static inline char* wsnt_strndup(const char* s, size_t n)

{

char* p = (char*)malloc(n + 1);

if (!p) return NULL;

if (n) memcpy(p, s, n);

p[n] = '\0';

return p;

}

// wsnt_rehash

//

// Rehash the table to a new capacity, copying existing entries.

static inline int wsnt_rehash(WSNameTable* t, size_t new_cap)

{

size_t cap = wsnt_next_pow2(new_cap);

WSNT_Entry* nb = (WSNT_Entry*)calloc(cap, sizeof(WSNT_Entry));

if (!nb)

return 0;

for (size_t i = 0; i < t->capacity; i++) {

const char* s = t->buckets[i].str;

if (!s) continue;

uint64_t h = t->buckets[i].hash;

size_t idx = (size_t)h & (cap - 1);

while (nb[idx].str) idx = (idx + 1) & (cap - 1);

nb[idx] = t->buckets[i];

}

free(t->buckets);

t->buckets = nb;

t->capacity = cap;

return 1;

}

// wsnt_maybe_grow

//

// grow the table if needed, doubling capacity if size exceeds threshold.

static inline int wsnt_maybe_grow(WSNameTable* t)

{

if (t->size >= WS_NAME_TABLE_GROW_THRESHOLD(t->capacity)) {

size_t nc = t->capacity << 1;

if (!wsnt_rehash(t, nc))

return 0;

}

return 1;

}

A name table is a very simple collection (a set) of WSNT_Entry structures. It’s held in memory as a contiguous array of such structures. Each entry is 24 bytes long, which isn’t ideal for cache friendliness, but it’s fairly minimal. We just store a pointer, length, and a hash value, which we’ll look at. This may change in the future, either to get down to 16 bytes, or grow to 32 bytes. Either way would be ok.

The WSNameTable structure itself is just a pointer to the contiguous memory that contains a bunch of these entries. It also contains the capacity (how many entries can be in the table), and the size (actual number of entries). There is a hash function, next_pow2, and strndup. We’ll look at the other two functions in a moment.

Given all this, how do you insert a name into the table?

Let’s first look at the ws_name _table_init function.

static inline int ws_name_table_init(WSNameTable* t, size_t capacity)

{

if (!t)

return -1; // ERROR

// Initialize to safe values in case of failure.

t->buckets = NULL;

t->capacity = 0;

t->size = 0;

// Ensure the capacity is a power of two, at least 2.

// if '0' is passed, use the default initial capacity.

size_t cap;

if (capacity == 0) {

cap = WS_NAME_TABLE_INITIAL_CAPACITY;

}

else {

cap = wsnt_next_pow2(capacity);

if (cap < 2) cap = 2;

}

t->buckets = (WSNT_Entry*)calloc(cap, sizeof(WSNT_Entry));

if (!t->buckets)

return -1; // WSResultCode::WS_ERROR;

t->capacity = cap;

t->size = 0;

return 0; // WSResultCode::WS_SUCCESS;

}

To initialize a table, we do an initial allocation of space. The capacity is set, and the size is zero.

Now we come to the heart of the table, inserting a value

// ws_name_table_intern_len

//

// Intern a string in the name table, returning a stable pointer to it.

// If the string already exists, returns the existing pointer.

// If the string does not exist, it is added to the table.

//

static inline const char* ws_name_table_intern_len(WSNameTable* t, const char* ptr, size_t len)

{

if (!t) return NULL;

if (!ptr) { ptr = ""; len = 0; }

// get a hash value for the string as a starting point

uint64_t h = wsnt_hash(ptr, len);

// use that hash as a rough starting point to figure out where to look

size_t idx = (size_t)h & (t->capacity - 1);

// probe for existing

for (;;) {

WSNT_Entry* e = &t->buckets[idx];

if (!e->str) break;

if ((e->hash == h) && (e->len == len) && (memcmp(e->str, ptr, len) == 0))

return e->str;

idx = (idx + 1) & (t->capacity - 1);

}

// insert

if (!wsnt_maybe_grow(t))

return NULL;

idx = (size_t)h & (t->capacity - 1);

while (t->buckets[idx].str)

idx = (idx + 1) & (t->capacity - 1);

// duplicate the string for permanent storage

char* dup = wsnt_strndup(ptr, len);

if (!dup)

return NULL;

t->buckets[idx].str = dup;

t->buckets[idx].len = len;

t->buckets[idx].hash = h;

t->size++;

return dup;

}

This is a so called “open addressing, linear probe” algorithm. The general idea is, get a hash value for the input key. The particular hash function here needs to be very fast, and have a fairly low probability of collisions. The hash function used here is a variant of FNV-1a, which has these exact properties that we want. We then use that hashed value to do a quick lookup to see if the value is already stored. But, we don’t have 64-bits worth of buckets, we actually limit ourselves to a more manageable size which was applied when the table was constructed. So, to find the right ‘bucket’, we do a little transformation

size_t idx = (size_t)h & (t->capacity - 1);

// probe for existing

for (;;) {

WSNT_Entry* e = &t->buckets[idx];

if (!e->str) break;

if ((e->hash == h) && (e->len == len) && (memcmp(e->str, ptr, len) == 0))

return e->str;

idx = (idx + 1) & (t->capacity - 1);

}

‘idx’ will be in the range 0..capacity-1. So, we go to that entry. If we find something there, we compare the hash value, and if that matches, we then compare the actual contents just to be doubly sure we’re talking about the same thing. If all that is true, we then return the stable pointer that represents that sequence of bytes.

If we did not find anything there, then we need to make an entry.

// insert

if (!wsnt_maybe_grow(t))

return NULL;

idx = (size_t)h & (t->capacity - 1);

while (t->buckets[idx].str)

idx = (idx + 1) & (t->capacity - 1);

// duplicate the string for permanent storage

char* dup = wsnt_strndup(ptr, len);

if (!dup)

return NULL;

t->buckets[idx].str = dup;

t->buckets[idx].len = len;

t->buckets[idx].hash = h;

t->size++;

return dup;

First, we check to see if the table should grow. We use a 70% capacity rule for this. If the table is 70% full when an insert wants to happen, the table will double in size, strings will be spread around to new buckets, and we can then add the new entry.

This time around, we search for a bucket. If we find a bucket, but there’s already a value in that spot, we just roll forward, looking for the next available spot. This is the ‘linear probe’ portion of the algorithm.

Once we find a spot, we alter the entry there, by first creating a duplicate string (this becomes the permanent pointer), setting the length, and the hash value.

Finally, we return the new stable pointer, and we’re done.

The lookup is pretty straight forward, as it just performs half the work of the insert.

// ws_name_table_lookup_len

//

// Lookup a name in the table by its string content (ptr+len).

// Returns NULL if not found.

static inline const char* ws_name_table_lookup_len(const WSNameTable* t, const char* ptr, size_t len)

{

if (!t)

return NULL;

if (!ptr)

{

ptr = "";

len = 0;

}

uint64_t h = wsnt_hash(ptr, len);

size_t idx = (size_t)h & (t->capacity - 1);

for (;;) {

WSNT_Entry* e = &t->buckets[idx];

if (!e->str) return NULL;

if (e->hash == h && e->len == len && memcmp(e->str, ptr, len) == 0)

return e->str;

idx = (idx + 1) & (t->capacity - 1);

}

Basically, perform the same hash, go to the expected bucket. If it’s not found there, or not found in subsequent linear probing, return NULL. Otherwise, return the stable pointer value.

You might be thinking, “but you’re doing a memcmp every time you insert or search”. That’s true. So let’s look back on the purpose of this thing.

The purpose of the name table is to turn a sequence of bytes into a stable pointer which can be used for pointer value comparisons later. The insert and search operations will typically only occur at program setup time. The vast majority of usage is going to be in simple pointer comparisons in the context of a dictionary structure.
So yes, a bit slower on insert, but worth the initial cost because we can now do simple pointer comparisons everywhere else with the knowledge that if two pointers are equivalent, then they represent the exact same sequence of bytes. This can be extended to cover any data structures, as it’s just pointer and length. No null terminated strings implied.
And that’s it for string interning. Next time around, we can look at some more core data structures, and how these interned strings are used in the context of a dictionary, which is another key data structure.

William Adams

WAAVScript – Part 3, What’s an interned name?

WAAVScript – Part2, Scanning for Tokens