2 - Representing Complex Data Structures in Tcl

This chapter describes common approaches taken to represent complex data structures in Tcl, their costs and tradeoffs, and begins to describe some of Speed Tables' capabilities.

Tcl is not well-known for its ability to represent complex data structures. Yes, it has lists and associative arrays and, in Tcl 8.5, dicts. Yes, object-oriented extensions such as Incr Tcl provide ways to plug objects together to represent fairly complex data structures and yes, the BLT toolkit, among others, has provided certain more efficient ways to represent data (a vector data type, for instance) than available by default and, yes, you can abuse upvar and namespaces as part of expressing the structure of, and methods of access for, your data.

There are, however, three typical problems with this approach:

Ad-hoc construction of complex data structures sucks

Using a combination of upvar and namespaces and lists and arrays to represent a complex structure yields relatively opaque and inflexible ways of expressing and manipulating that structure, twisting the code and typically replicating little pieces of weird structure access code strewn throughout the application, making the code hard to follow, teach, fix, enhance, and hand off.

Speed tables reads a structure definition and emits C code to create and manipulate tables of rows of that structure. This allows operations to be performed on the table at native code speeds, using a custom extension that manages rows of fields as native C structs and emit subroutines for manipulating those rows in an efficient manner.

Memory efficiency is high because we have low per-row storage overhead beyond the size of the struct itself, and fields are stored in native formats such as short integer, integer, float, double, bit, etc.

Computational efficiency is high because we are reasonably clever about storing and fetching those values, particularly when populating from lines of tab-separated data as well as PostgreSQL database query results, inserting into them by reading rows from a Tcl channel containing tab-separated data, writing them tab-separated, locating them, updating them, and counting them, as well as importing and exporting by other means.

Speed tables avoids executing Tcl code on a per row basis when a lot of rows need to be looked at. In particular when bulk inserting and bulk processing via search, Tcl essentially configures an execution engine that can operate on millions of rows of data without the Tcl interpreter's per-row involvement except, perhaps, for example, executing scripted code only on the few rows that match your search criteria.

Speed tables also maintains a "null value" bit per field, unless told not to, and provide an out-of-band way to distinguish between null values and non-null values, as is present in SQL databases... providing a ready bridge between those databases and speed tables.

Example Application

Speed tables is used as the realtime database for a monitoring system that polls millions of devices every few minutes. Device status and performance data is kept in speed tables. Information about the status of devices is continually "swept" to the SQL database at a sustainable rate. The loss of even a sizable number of scan results in the event of a crash is not a serious problem, as within a few minutes of starting up, the system will have obtained fresh data by newly polling the devices.

Speed tables supports defining skip list-based indexes on one or more fields in a row, providing multi-hundred-fold speed improvements for many searches. Fields that are not declared to be indexable do not have any code generated to check for the existence of indexes, etc, when they are changed, one of a number of optimizations performed to make speed tables fast.

[1] It is common to see ten or twenty times the space consumed by the data itself used up by the Tcl objects, lists, arrays, etc, used to hold them. Even on a modern machine, using 20 gigabytes of memory to store a gigabyte of data is at a minimum kind of gross and, at worst, renders the solution unusable.)