Prev Source
Next Source

SQLite As an Application File Format

original source

We make a distinction between a file format” and an application format”. A file format is used to store a single object. So, for example, a GIF or JPEG file stores a single image, and an XHTML file stores text, so those are file formats” and not application formats”. An EPUB file, in contrast, stores both text and images (as contained XHTML and GIF/JPEG files) and so it is considered an application format”. This article is about application formats”.

Pile-of-Files Formats. Sometimes the application state is stored as a hierarchy of files. Git is a prime example of this, though the phenomenon occurs frequently in one-off and bespoke applications. A pile-of-files format essentially uses the filesystem as a key/value database, storing small chunks of information into separate files.

It is also much less convenient to move a pile-of-files from one place or machine to another, than it is to move a single file.

Custom formats are specifically designed for a single application. DOC, DWG, PDF, XLS, and PPT are examples of custom formats. Custom formats are usually contained within a single file, for ease of transport. They are also usually binary, though the DWG format is a notable exception. Custom file formats require specialized application code to read and write and are not normally accessible from commonly available tools such as unix command-line programs and text editors. In other words, custom formats are usually opaque blobs”. To access the content of a custom application file format, one needs a tool specifically engineered to read and/or write that format

a pile-of-files format breaks the document metaphor”: there is no one file that a user can point to that is the document”.

Some applications use a Pile-of-Files that is then encapsulated into some kind of single-file container, usually a ZIP archive. EPUB, ODT,and ODP are examples of this approach. An EPUB book is really just a ZIP archive that contains various XHTML files for the text of book chapters, GIF and JPEG images for the artwork, and a specialized catalog file that tells the eBook reader how all the XML and image files fit together.

Any application state that can be recorded in a pile-of-files can also be recorded in an SQLite database with a simple key/value schema like this:

CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);  

But an SQLite database is not limited to a simple key/value structure like a pile-of-files database. An SQLite database can have dozens or hundreds or thousands of different tables, with dozens or hundreds or thousands of fields per table, each with different datatypes and constraints and particular meanings, all cross-referencing each other, appropriately and automatically indexed for rapid retrieval, and all stored efficiently and compactly in a single disk file. And all of this structure is succinctly documented for humans by the SQL schema.

the use of an SQLite database as an application file format has compelling advantages. Here are a few of these advantages, enumerated and expounded:

No new code is needed for reading or writing the application file. One has merely to link against the SQLite library, or include the single sqlite3.c” source file with the rest of the application C code, and SQLite will take care of all of the application file I/O.

An SQLite database is contained in a single file, which is easily copied or moved or attached. The document” metaphor is preserved.

SQLite does not have any file naming requirements and so the application can use any custom file suffix that it wants to help identify the file as belonging” to the application. SQLite database files contain a 4-byte Application ID in their headers that can be set to an application-defined value and then used to identify the type” of the document for utility programs such as file(1), further enhancing the document metaphor.

SQLite is a complete relational database engine, which means that the application can access content using high-level queries. Application developers need not spend time thinking about how” to retrieve the information they need from a document. Developers write SQL that expresses what” information they want and let the database engine to figure out how to best retrieve that content.

Information held in an SQLite database file is accessible using commonly available open-source command-line tools - tools that are installed by default on Mac and Linux systems and that are freely available as a self-contained EXE file on Windows. Unlike custom file formats, application-specific programs are not required to read or write content in an SQLite database. An SQLite database file is not an opaque blob.

An SQLite database is a well-defined and well-documented file format that is in widespread use by literally millions of applications and is backwards compatible to its inception in 2004 and which promises to continue to be compatible in decades to come. The longevity of SQLite database files is particularly important to bespoke applications, since it allows the document content to be accessed far in the future, long after all traces of the original application have been lost. Data lives longer than code. SQLite databases are recommended by the US Library of Congress as a storage format for long-term preservation of digital content.

SQLite database files are portable between 32-bit and 64-bit machines and between big-endian and little-endian architectures and between any of the various flavors of Windows and Unix-like operating systems. The application using an SQLite application file format can store binary numeric data without having to worry about the byte-order of integers or floating point numbers.

Writes to an SQLite database are atomic. They either happen completely or not at all, even during system crashes or power failures. So there is no danger of corrupting a document just because the power happened to go out at the same instant that a change was being written to disk.

SQLite is transactional, meaning that multiple changes can be grouped together such that either all or none of them occur, and so that the changes can be rolled back if a problem is found prior to commit.

When writing to an SQLite database file, only those parts of the file that actually change are written out to disk. This makes the writing happen faster and saves wear on SSDs. This is an enormous advantage over custom and wrapped pile-of-files formats, both of which usually require a rewrite of the entire document in order to change a single byte. Pure pile-of-files formats can also do incremental updates to some extent, though the granularity of writes is usually larger with pile-of-file formats (a single file) than with SQLite (a single page).

SQLite also supports continuous update. Instead of collecting changes in memory and then writing them to disk only on a File/Save action, changes can be written back to the disk as they occur. This avoids loss of work on a system crash or power failure. An automated undo/redo stack, managed using triggers, can be kept in the on-disk database, meaning that undo/redo can occur across session boundaries.

As an application grows, new features can be added to an SQLite application file format simply by adding new tables to the schema or by adding new columns to existing tables. Adding columns or tables does not change the meaning of prior queries, so with a modicum of care to ensuring that the meaning of legacy columns and tables are preserved, backwards compatibility is maintained.

In many cases, an SQLite application file format will be faster than a pile-of-files format or a custom format. In addition to being faster for raw read and writes, SQLite can often dramatically improves start-up times because instead of having to read and parse the entire document into memory, the application can do queries to extract only the information needed for the initial screen.

A pile-of-files format can be read incrementally just like SQLite. But many developers are surprised to learn that SQLite can read and write smaller BLOBs (less than about 100KB in size) from its database faster than those same blobs can be read or written as separate files from the filesystem. (See 35% Faster Than The Filesystem and Internal Versus External BLOBs for further information.)

In either case, if performance problems do arise in an SQLite application those problems can often be resolved by adding one or two CREATE INDEX statements to the schema or perhaps running ANALYZE one time and without having to touch a single line of application code.

SQLite automatically coordinates concurrent access to the same document from multiple threads and/or processes. Two or more applications can connect and read from the same document at the same time. Writes are serialized, but as writes normally only take milliseconds, applications simply take turns writing. SQLite automatically ensures that the low-level format of the document is uncorrupted. Accomplishing the same with a custom or pile-of-files format, in contrast, requires extensive support in the application. And the application logic needed to support concurrency is a notorious bug-magnet.

If the application file format is an SQLite database, the complete documentation for that file format consists of the database schema, with perhaps a few extra words about what each table and column represents. The description of a custom file format, on the other hand, typically runs on for hundreds of pages. A pile-of-files format, while much simpler and easier to describe than a fully custom format, still tends to be much larger and more complex than an SQL schema dump, since the names and format for the individual files must still be described.

Fred Brooks, in his all-time best-selling computer science text, The Mythical Man-Month says:

Representation is the essence of computer programming.  
...  
Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.  

Rob Pike, in his Rules of Programming expresses the same idea this way:

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.  

Linus Torvalds used different words to say much the same thing on the Git mailing list on 2006-06-27:

Bad programmers worry about the code. Good programmers worry about data structures and their relationships.  

SQLite is not the perfect application file format for every situation. But in many cases, SQLite is a far better choice than either a custom file format, a pile-of-files, or a wrapped pile-of-files. SQLite is a high-level, stable, reliable, cross-platform, widely-deployed, extensible, performant, accessible, concurrent file format. It deserves your consideration as the standard file format on your next application design.


Date
May 24, 2022