DataBook

As explained by Kurt Cagle and Chloe Shannon in their article, "DataBooks: Markdown as Semantic Infrastructure", a DataBook is effectively a microdatabase.

A DataBook is a document that a human can read, a data file that a computer can process, and a toolbox that carries its own instructions. It is a technique that lets data and its explanation travel together through a data pipeline.

One important part of the magic of DataBook files is that LLMs can also easily read and interpret them.

Another part of the magic is that everything travels together within one file, including:

  • data
  • meaning
  • rules
  • queries
  • documentation

Finally, DataBook files can easily be versioned on GitHub and GitLab. Both GitHub and GitLab support Markdown (.md) files, and each uses its own "flavor" of Markdown based on CommonMark, so the flavors differ slightly but are close.

This technique uses a Markdown file (.md) as the container that holds everything, so there are no separate files to lose and no context to forget. A Markdown file provides a powerful yet straightforward way for users, both technical and non-technical, to write plain text documents that can be rendered richly as HTML and also easily read by software.

Within the Markdown file you can provide fenced blocks (a.k.a. sections) of structured data in formats such as YAML, RDF/Turtle, JSON-LD, SPARQL, SHACL, SQL, CSV, and other well-understood structured formats.
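Because each fenced block is tagged with its format, a program can pull the structured layers out of a DataBook with very little code. The following is a minimal Python sketch, not something taken from the DataBook article: the sample text, the `extract_blocks` function, and the regular expression are all illustrative assumptions, and the pattern only handles simple backtick fences that carry a language tag.

```python
import re

# Three backticks, built in code so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Hypothetical sample DataBook body; the contents are illustrative only.
SAMPLE_DATABOOK = "\n".join([
    "# Customers",
    "",
    "Alice knows Bob.",
    "",
    FENCE + "turtle",
    "@prefix ex: <http://example.org/> .",
    "ex:alice ex:knows ex:bob .",
    FENCE,
    "",
    FENCE + "sparql",
    "SELECT ?s ?o WHERE { ?s ?p ?o }",
    FENCE,
    "",
])

# Opening fence with a language tag, the block body, then a closing fence on its own line.
BLOCK_RE = re.compile(
    r"^" + FENCE + r"(?P<lang>[\w-]+)[^\n]*\n(?P<body>.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_blocks(text):
    """Return a list of (language, body) pairs, one per tagged fenced block."""
    return [(m.group("lang"), m.group("body")) for m in BLOCK_RE.finditer(text)]


for lang, body in extract_blocks(SAMPLE_DATABOOK):
    print(lang, "block with", len(body.splitlines()), "line(s)")
```

Each extracted body could then be handed to a tool that understands that format, for example an RDF library for the Turtle block. A production reader would more likely use a real Markdown parser than a regular expression.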

That's it: one physical file that contains many formally fenced-off layers. Here are some of the different layers that can be provided within a DataBook:

  • Markdown: This is the primary format of the file; it holds the human-readable text and all of the other structured information in fenced blocks.
  • YAML frontmatter: Used at the top of the file for metadata like title, author, version, and provenance. YAML is a Unicode-based data serialization language that is broadly useful for programming needs ranging from configuration files to internet messaging to object persistence to data auditing and visualization.
  • RDF/Turtle: Used to store a graph of data, a.k.a. linked data.
  • JSON‑LD: Popular alternative or complementary linked‑data format.
  • SPARQL: Queries and updates embedded directly in the file.
  • SHACL: Validation rules for the data.
  • Other typed structured blocks: Depending on the use case, a DataBook may also include other formats, such as SQL or CSV.
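The metadata layer in the list above can be read before anything else in the file. The sketch below assumes the common frontmatter convention of a block delimited by lines containing only `---`, and assumes the metadata consists of flat `key: value` pairs. The `split_frontmatter` function and the sample values are hypothetical, and real YAML is much richer than this, so a full YAML parser would be needed for nested or typed values.

```python
def split_frontmatter(text):
    """Split a document into (metadata dict, body text).

    Assumes frontmatter is delimited by lines containing only '---' and
    holds flat 'key: value' pairs. This is not a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at the top of the file
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        return {}, text  # no closing delimiter; treat everything as body
    meta = {}
    for line in lines[1:end]:
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:])
    return meta, body


# Hypothetical sample; the title, author, and version values are made up.
SAMPLE_DOC = "\n".join([
    "---",
    "title: Example DataBook",
    "author: Jane Doe",
    'version: "0.1"',
    "---",
    "# Example DataBook",
    "",
    "Body text for human readers.",
])

meta, body = split_frontmatter(SAMPLE_DOC)
print(meta["title"], meta["version"])
```

Keeping the metadata at the very top means a pipeline can check the version or provenance of a DataBook without parsing any of the blocks that follow.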

Here is a very simple example of a DataBook. You can read the DataBook file on GitHub, and a machine can be sent the raw version of the same file from GitHub.


As I understand it, work is under way to make DataBook a W3C standard. Even if DataBook does not become a global standard, it is a useful convention and might even be considered a best practice.
