shsql, a simple sorta-SQL database

shsql is a "sorta-SQL" database management package. Features:
• data tables are flat text files and can be anywhere on your file system
• flexible isam indexing
• not client-server, no server process, no "connections"
• SQL-like, weakly typed query language
• automatic table locking during updates; row locking available
• support for comma-delimited lists and multi-word user search terms
• Access via command line tool, quisp, or API
• Compact package... all documentation in one web page

Quick links:   Utilities     SQL reference     Indexes     Config file     API

To set up a shsql database
• if you need to compile/build shsql, do that
• decide on a specific file system location and a dbname
• mkdir dbname   (there's no "create database" command)
• cd dbname; mkdir ./data ./indexes ./locks ./logs ./tmp
• create a config file; it can be named ./config
• set environment variable SHSQL_CONFIG to full pathname of config file

Rules for shsql-compliant data files
• ordinary unix text files; use no filename extension
• first non-comment line is field name header (space-separated field names)
• subsequent lines are data comprised of space delimited fields
• null fields are represented by a single equals-sign (=)
• within data fields, embedded spaces are encoded as underscore (_)
• header and data lines must have uniform number of space-delimited fields
• comment lines beginning with // and blank lines are allowed anywhere
• generally created manually in a text editor or by programs   (there's no "create table" command)
• example:
// people enrolled since 2008...
id lastname firstname middlename suffix
47 Smith John_Jacob H =
51 Jones Jill = =
54 Barrington-Smythe Harrington C III

Three types of shsql-compliant data files
database table files are located in ./data node or subnodes thereof. The tablename for SQL purposes is the file's name relative to ./data. They can be updated using SQL "insert", "update", and "delete". They can be accessed using SQL "select", can be indexed, and can be maintained using shsql_maint(1). Index files are located in ./indexes node.
temp tables have SQL names beginning with dollar-sign eg. $tmp1. The only way to create (or append to) temp tables is via SQL "select into". They may be accessed using SQL "select". They are stored in ./tmp node and filenames are managed by shsql. They cannot be indexed nor updated using sql "insert", "update", "delete" or shsql_maint(1). Temp files can accumulate so take measures to clean them up.
other data files may be located anywhere in file system. Their name for SQL purposes is either full pathname beginning with "/" or relative pathname beginning with "./". They may be accessed with SQL "select" and can have indexes. They cannot be updated using sql "insert", "update", "delete", or shsql_maint(1). Index files are located in same directory as primary data file. Typical use is to implement fast retrieval of static content.

Command-line utility:   shsql
Use this command to issue one SQL command and print results to stdout, eg: shsql -nat "select * from people order by lastname"
Some options:
-nat .... native result, same format as stored in tables
-h .... include field name header in result
-nolog .... suppress transaction logging for "insert", "update", etc.
-ver .... display shsql version info
-debug and -debug2 .... display diagnostic info

Command-line utility:   shsql_maint
Use this command to do table cleaning and rebuild indexes. Also allows manual editing, sorting, etc. of data table file (preserves field name header). Sets table locks appropriately while in operation.
shsql_maint players   clean table and rebuild all indexes
shsql_maint players -edit   ...edit the text (vi) then clean/rebuild
shsql_maint players -s "sort -n +0 -1"   ...sort the data then clean/rebuild
shsql_maint players -edit -s "sort -n +0 -1"   ...sort, edit, then clean/rebuild

Command-line utility:   shsql_export
Use this command to create csv or tab-delimited data set from shsql data stream. Use either -csv or -tab option.
cat ./data/players | shsql_export -csv > players.csv

Command-line utility:   shsql_tabinfo
Print a list of fields and index info for a certain table, eg. shsql_tabinfo players

SQL commands / language rules
• sorta-SQL... supported SQL subset with extensions
• no "create database", "create table" or other sql if not mentioned below
• commands can be multi-line and can have parens anywhere
• enclose character literals in single or double quotes
• case-insensitive except character literals
• command tokens must generally be separated by spaces
• weakly typed... datatype an issue only when telling "order by" to do a numeric sort, and when creating index on numerics to be queried using < or >
• no native support for dates, times, etc. but these can be stored as strings (yyyy/mm/dd sorts and works well), or numeric julians
• maxrows: by default a maximum of 2000 rows can be selected, updated, or deleted. For "select" this applies before "distinct" and can be raised using "maxrows" keyword as shown below. For "update" or "delete", if maxrows is exceeded an error is thrown, no data are changed.

SQL insert   Examples:
insert into people values (47, "Smith", "John", "H", "Jr." )
insert into people (id, lastname, firstname, middlename, suffix)
    values (47, 'Smith', 'John', 'H', 'Jr.' )

SQL update       Examples:
update people set lastname = "Smythe", suffix = null where id = 47
update people (orinsert) set id = 47, lastname = "Smythe",
    firstname = "John", suffix = null where id = 47

The where-clause is always required; any rows matching the where-clause will be updated. The "orinsert" modifier does an "update" if row(s) are found matching the where-clause, otherwise an "insert" is done (which is why we set a value to "id" here).

SQL delete
Example: delete from people where id = 47   The where-clause is always required; any rows matching the where-clause will be deleted.

SQL select
Retrieve data. Here's a general template:
select [distinct] itemlist [into dest]   from tablename
[join table2 on on-clause ]
[group by groupspec ]
where where-clause
[order by orderspec ]   [maxrows n]   [for update]

itemlist ... field names to retrieve, or aggregates to compute. Examples:
    select *
    select lastname, firstname
    select lastname (as lname), firstname (as fname)
    select count(*) as nrows
    select sum(amount) as total     ...also avg(), min(), and max()
• into dest ... put result in dest   Examples:
    select * into $tmp1   (shsql temp table)
    select * into /data/a/result1   (file pathname)
    select * into /data/a/result1 append   (append to file)
• from tablename ... table to get data from
    select * from people
    select * from group3/people
    select * from $tmp1   (shsql temp table)
    select * from /data/a/myfile3   (file pathname)
    select * from list2 and list3   (concatenate 2 tables)
• join .... do a relational join with table2 using certain fields.
select * from people (as p) join authors (as a) on =
    where p.lastname like "A*" order by p.lastname, p.firstname

Shown above is the only join code form supported by shsql. If a table has duplicate rows in terms of the join id field(s) then it must be given as table2 Fields named in the sql must be identified with table abbreviations as shown. "Left join", "right join", and "outer join" variants are also available.
• where ... conditionally determine which rows to retrieve.
where lastname = "Jones"
where balance < 0.0 and status = "A"
where color in "red,green,blue"
where tag like "J*"                 where tag !like "0*"
where status = null                   where status != null
where name inlike "a*,b*,c*"       where name !inlike "a*,b*"
where zscore inrange -0.1,0.1       where zscore outrange -2.5,2.5
where authors hasword "smith"
where docname hasword '"rolling stones"'
  (note quoting)
where colorlist hasmember "blue"
where latitude inrange 38.1,38.5 or latitude = null

Supported operators: =   !=   >   >=   <   <=   like   !like   in   !in  
    inlike   !inlike   inrange   outrange   hasmember   hasword     and   or
Wild card chars: *   ?       Case: = and != are case-sensitive, "like" is not
Index-assist is automatically invoked if possible on the field named leftmost in the where-clause. Precedence is simplistic left-to-right, and parentheses are ignored, so "and" and "or" terms should not be mixed.
• order by ... control sorting of result. Example:
select balance from accounts order by balance (num desc), lastname
Sorting is always case-insensitive and based on linux LOCALE=C. Modifiers:
num ... tells shsql to use numeric sort, otherwise alpha sort is done
desc ... descending order, used with numerics to show highest values first
mag ... same as numeric but uses absolute values
dict ... dictionary order, punctuation chars are ignored
• group by ... control grouping for aggregation.
    Syntax is the same as "order by" above.
• distinct ... eliminate duplicate result rows
• maxrows n ... raise the default maxrows.
    select * from bigdata maxrows 100000
    Maxrows applies before any "select distinct" uniqueness processing.
• for update ... set a shsql row lock on selected row(s)

SQL identity
Set a user name which will appear in transaction logs and error logs for shsql activity done subsequently by the linux process. Also used with row locks. Optional unless doing row locking. Example: identity jim4492

SQL sqlmode
This allows control over operational details. Currently just one supported command:
sqlmode translog 0
This turns off transaction logging ie. for bulk updates To turn it back on use same but with 1

Indexes / create index
Indexing (isam method) is supported and can greatly help with retrieval performance on larger tables. We've happily applied shsql indexing on 3+ GB data files. In shsql retrievals, the leftmost field mentioned in the where-clause will be index-assisted if an index exists for that field. Appropriate index behavior is supported on just about all where-clause comparison operators.

To associate an index with a certain field, issue a command like this:   shsql "create index on people (lastname)"   When all desired indexes have been created for a table, run shsql_maint to complete the process (you'll be reminded to do so).

If a numeric field will be subject to > or < comparisons, specify "order=num". Note that things like numeric id numbers usually don't need order=num.
    shsql "create index on ledger (jdate order=num)"

For data files that will be in a certain physical sort order you can gain performance by using a "direct" index type. If the table is updatable (as opposed to static) you'll need to use shsql_maint -s from now and then to maintain the physical sort order and rebuild index.
    shsql "create index (type=direct) on ledger (jdate order=num)"

If an alphanumeric field can contain multiple words and you want each word to be indexed individually use the "word" index type. Then in where-clause use the hasword operator on this field.
    shsql "create index (type=word) on cases (description)"

If an alphanumeric field is a comma-delimited list with multiple members and you want each member to be indexed individually use the "listmems" index type. Then in where-clause use the hasmember operator on this field.
    shsql "create index (type=listmems) on cases (attributes)"

To build a single "word" index based on several fields in a certain table, use the "combinedword" index type [more info]. Then in where-clause use several hasword fields "or'ed" together. The query will hit just the one index. This command creates one combinedword index:
    shsql "create index type=combinedword on books (title, author, keywords)"

To remove indexes or make indexing changes first do a shsql_tabinfo to see index info, then go into the ./index directory and remove all files related to the table in question (based on filename). Then if any indexes need to be created again, create them as described above.

Indexes are not just for data tables... they can also be built on other shsql-compliant data files located anywhere in your server's file system. The tablename must begin with "." or "/" and the resulting index files will be written in the same directory as the data file (and named similarly). You don't use shsql_maint in this case.
    shsql "create index on ./genodata (id)"

Notes and caveats: "distinct" is automatically enabled for certain index-assisted selects such as when "hasword" or "or" are present, or "direct" index along with "in", "inlike", etc. Don't use "order=num" for any fields that will be used with "in", "inlike", "hasword", etc.

More on the 'hasmember' where-clause operator
Comma-delimited lists are supported in shsql as an effective way to handle minor vectorization of data. The "hasmember" comparison operator is true when the leftside database field term (which should be a comma-delimited list or null) contains a member given in the rightside term. Indexes on fields involved in "hasmember" comparisons should be of the "listmems" type. Example: ...where attr hasmember "ac"
For comma-delimited lists in rightside comparison terms use: "in", "inlike", etc.

More on the 'hasword' where-clause operator
The "hasword" comparison operator is true when the leftside database field term contains a word given in the rightside term. "Words" are delimited on any combination of whitespace or punctuation characters (this applies to both requested words and words in the database field). All matching is case insensitive. Nothing will be matched if the rightside term has punctuation chars other than double quotes or asterisk. Indexes on fields involved in "hasword" comparisons should be of the "word" or "combinedword" type.

It has basic handling of multiword user search terms. If the search term itself begins and ends with double quotes (") it is considered a phrase for exact matching. Otherwise, the search term is parsed into words, which are processed one at a time with begins-with matching. An exception to this is with very short words.. when a word is 1 or 2 characters then exact matching is done. Examples (note the quoting):
where authors hasword 'smit' ..... finds rows where the authors field has any word beginning with "smit" e.g. "smith" or "Smitts"
where authors hasword '"smith"' ..... finds rows where the authors field has the exact word "smith" (because enclosed in double quotes).
where authors hasword 'smith jones' ..... finds rows where the authors field has any words like "smith*" or "jones*"
where authors hasword 'smith, jones' ..... same as above
where title hasword '"gene expression"' ..... finds rows where title has "gene expression" as 2 adjacent words only (due to the double-quotes)

More on 'combinedword' indexes
A combinedword index is similar to a word index, except that words from several fields are combined into one index, for situations where a table contains several multiword fields that will typically be searched simultaneously for the same thing. For example, suppose you have the table "books" with fields "title", "authors", and "keywords", and you have a search engine application that allows a user to type in a word and all three of the fields need to be searched. To create a combinedword index: shsql "create index (type=combinedword) on books (title, authors, keywords)" Then a query to search the table would be something like this:
select * from books where title hasword "banana" or author hasword "banana" or keywords hasword "banana"
Because there's a combinedword index, just one index hit occurs for this query.

Cautions / caveats: In where-clauses the left-to-right order that fields are mentioned is critical. When a combinedword index is created, the field named first in the "create index" command is the field that the index is named for. In queries this first-named field must appear leftmost in the where-clause, followed by references to the other fields, as in the example above. Also, no other conditionals can be present in a where-clause involving combinedword indexes.

Searches involving only the primary field, or the primary field and a subset of the secondary fields, will use the combined index and will still give correct results. Searches involving only one or more secondary fields will not invoke the combinedword index but can have their own separate indexes if it makes sense to do so.

Simple row locking
This allows a user/process to lock a selected database row for up to 30 minutes to prevent row collisions (ie. updates by other users). Row locking generates additional write activity and often may not be worth the trouble.. up to you. Feature rules:
• Involved tables must have fields named _locktime and _lockowner.
• All involved processes must have a user identity
• To lock a row use a simple select command like this: select * from mytable where ... for update
• Lock is released by subsequent ordinary UPDATE or DELETE on the row, having same identity as above
• Attempts to UPDATE, DELETE, or SELECT..FOR UPDATE on a row locked by some other identity will fail with error code 7.
• This is an "advisory" system only. Nothing prevents UPDATEs or DELETEs on rows that haven't been locked first.
• Row locks timeout / are released automatically after 30 minutes.
• Multiple rows can be locked by one SELECT..FOR UPDATE command however this can lead to complex scenarios.

Example procedure involving several web page process instances:
1. set identity; do a SELECT...FOR UPDATE
2. capture sql return code; if 7 then row is already locked by another
3. user submits data update to form capture page
4. form capture page sets identity and then does an UPDATE or DELETE
5. capture sql return code, if 7 we fail (our lock must have timed out and someone else locked the row since then)
6. otherwise ok, data has been updated and row lock is automatically cleared.

Config file for shsql
This important file is read first-thing by every shsql process to get its bearings. Create using a text editor when you set up a shsql database and then set your environment var SHSQL_CONFIG to its full pathname. Here's an example, supposing my dbname is "db1":
projectdir: /home/scg/db1
tmpdir: /home/scg/db1/tmp
shsql_bin: /home/scg/db1/bin
putenv: LC_ALL=C
putenv: SHSQL_CONFIG=/home/scg/db1/config
• projectdir .... full pathname of database directory (where ./data etc. located)
• tmpdir .... full pathname of a temp file directory (usually ./tmp)
• shsql_bin ..... full pathname of shsql bin
• shsql_maxrows .... optional, set default maxrows (2000 is starting default)
• shsql_translog .... optional, set to 0 to turn off transaction logging.
• shsql_readonly .... optional, set to 1 to make a database read-only
• putenv .... set an environment variable, to set up environment for spawned child processes. LC_ALL is important; it controls sort(1) behavior which must be consistent with internal qsort() to avoid anomolies.

C language API
To build C applications that use shsql, link to libshsql.a. There are no "include" files necessary for shsql. All functions are of type int. See simple.c in the distribution which demonstrates. Most of these functions take an argument called "dbc" which is an integer channel selector... it is normally 0, but 1, 2, or 3 can be used if channel 0 is already in use.

Read database config file and initialize. This function must be called before any other. Function returns 0 if OK, or a non-zero error code.

SHSQL_sql( int dbc, char *sql )
Submit an SQL command. "dbc" is the channel selector. "sql" is an SQL command. For "select" commands, subsequent call(s) to SHSQL_getrow() will be required to fetch the results. Function returns 0 if the SQL command was processed normally, or a non-zero error code.

SHSQL_getrow( int dbc, char *fields[], int *n )
Get one row of results from the most recently submitted "select" command. Each "fields" pointer will reference one result field. For example, this SQL command: select lastname, id from people would have 2 fields on each returned row. The "lastname" value will be in fields[0] and the "id" value in fields[1]. "n" will be set to the number of fields in the result. Data will be available until the next SHSQL_sql() call on the channel, at which time it will be overwritten. All result fields are represented as character strings. Null fields will be represented using a single equals sign (=). Spaces within data fields will appear as spaces. Function returns 0 if a row was fetched, 1 if there are no more rows, or a non-zero error code.

SHSQL_getnames( int dbc, char *fields[], int *n )
Get the names of the result fields from the most recently submitted "select" command on channel "dbc". Each "fields" pointer will be set to reference one result field name. "n" will be set to the number of fields in the result. Data will be available until the next SHSQL_sql() call on the channel, at which time it will be overwritten. Function returns 0 if OK, or a non-zero error code.

SHSQL_getnrows( int dbc )
Return the number of rows retrieved (or inserted, updated, or deleted) by the most recently submitted SQL command on channel "dbc". For "select" commands, the row count will be exact after all rows are fetched via SHSQL_getrow() but it may not be exact before this time because of late processing related to "distinct", 'group by", or "limit". However, it is always true that a count of zero indicates no rows found and a count greater than zero indicates at least one row found. The row count associated with a "select into" is always exact.

SHSQL_getitem( char *sql, char *result )
This is a convenience call to get a single item from the database on channel 0. "sql" is a "select" command that retrieves one field, and the result field will be copied into "result" Function returns 0 on success, or a non-zero error code.

SHSQL_getitems( char *sql, char *fields[] )
This is a convenience call to get multiple items from the database on channel 0. "sql" is a "select" command that retrieves one or more fields. "fields" is an array of character pointers which will be set to reference the result fields in memory that will be available until the next SQL command is submitted on channel 0. Function returns 0 on success, or a non-zero error code.

Error codes
Applications should check the status of every submitted SQL command. Some error conditions (as noted below) do not result in an error message.

0 .... success
1 .... requested row not found. No message is issued.
7 .... row is locked by another identity. No message issued.
9 .... the table is write-locked and the process couldn't gain access.
20 .... attempt to update a database that has been set to read-only
Other .... see the error message text.