Data provenance, typically used within the broader context of data lineage, refers to the source or first occurrence of a given piece of data. As a concept, data provenance (together with data lineage) is positioned to provide validity and encourage confidence related to the origin of data, whether or not data has mutated since its creation, and who the original publisher is, among other important details.
From tracking the origin of scientific studies to big banks complying with financial regulations, data provenance plays an integral role in supporting the authenticity and integrity of data.
Databases and Data Provenance
When it comes to databases, you can start to imagine how critical data provenance is when organizing and tracking files in a data warehouse or citing references from within a curated database. For consumer applications (take social media platforms such as Twitter, for example) that build entire advertising business models around the engagement derived from user-generated content, the claim of unaltered authorship (apart from account hacks) of a given Tweet is a guarantee made by the platform to its users and investors—trust cannot be built without it.
With the implications of data provenance in mind, organizations that rely on centrally controlled data stores within the context of consumer applications are constantly evolving security protocols and authentication measures to safeguard both their users and business from attacks that could result in data leaks, data alterations, data wipes, and more. However, so long as potential attack vectors and adequate user authentication are accounted for, these organizations benefit from inherent assurances related to the authenticity of incoming writes and mutations—after all, their servers are the agents performing these edit actions.
Data Provenance in Peer-to-Peer Protocols
But what about peer-to-peer data protocols and the applications built on them? How do topics such as cryptographic hashing, digital signatures, user authentication, and data origin verifiability in decentralized software coincide with data provenance and data lineage?
This article is meant to provide an initial exploration of how and where these topics converge and build a specific understanding of the overlap between these ideas and the technical architecture, challenges, qualities, and general functionality of ComposeDB on Ceramic. ComposeDB, built on Ceramic, is a decentralized graph database that uses GraphQL to offer developers a familiar interface for interacting with data stored on Ceramic.
The following article sections will set out to help accomplish the goals outlined above.
Smart Contract-Supported Blockchains
Blockchains that contain qualities analogous to a distributed state machine (such as those compatible with the Ethereum Virtual Machine) operate based on a specific set of rules that determine how the machine state changes from block to block. In viewing these systems as traversable open ledgers of data, accounts (both smart contract accounts and those that are externally owned) generate histories of transactions such as token transfers and smart contract interactions, all of which are publicly consumable without the need for permission.
What does this mean in the context of data provenance? Given the viability of public-key infrastructure, externally owned accounts prevent bad actors from broadcasting fake transactions because the sender’s identity (public access) is publicly available to verify. When it comes to the transactions themselves (both account-to-account and account-to-contract), the resulting data that’s publicly stored (once processed) includes information about both who acted, as well as their signature.
Transaction verifiability, in this context, relies on a block finalization process that requires validator nodes to consume multiple transactions, verify them, and include them in a block. Given the deterministic nature of transactions, participating nodes can correctly compute the state for themselves, therefore eventually reaching a consistent state about the transactions.
While there are plenty of nuances and levels of depth we could explore related to the architecture of these systems, the following are the most relevant features related to data provenance:
- The verifiable origin of each transaction represents the data we care about related to provenance
- Transactions are performed by externally owned accounts and contract accounts, both of which attach information about the transaction itself and who initiated it
- Externally owned accounts rely on cryptographic key pairs
ComposeDB vs. Smart Contract-Supported Blockchains
There is plenty to talk about when comparing ComposeDB (and the Ceramic Network more broadly) to chains like Ethereum; however, for this post, we’ll focus on how these qualities relate to data provenance.
Ceramic uses the Decentralized Identifier standard for user accounts (DID PKH and Key DID are supported in production). Similar to blockchains, they require no centralized party or registry. Additionally, both PKH DIDs and Key DIDs ultimately rely on public key infrastructure (PKH DIDs enable blockchain accounts to sign, authorize, and authenticate transactions, while Key DIDs expand cryptographic public keys into a DID document).
Sign in With Ethereum (SIWE)
Like chains such as Ethereum, Ceramic supports authenticated user sessions with SIWE. The user experience then diverges slightly when it comes to signing transactions (outlined below).
While externally-owned accounts must manually sign individual transactions on chains like Ethereum (both when interacting with a smart contract or sending direct transfers), data in Ceramic (or streams) are written by authenticated accounts during a timebound session, offering a familiar, Web2-like experience. The root account (your blockchain wallet if using Ceramic’s SIWE capability, for example) generates a temporary child account for each application environment with tightly-scoped permissions, which then persists for a short period in the user’s browser. For developers familiar with using JWTs in Node.js to authenticate users, this flow should sound familiar.
This capability is ideal for a protocol meant to support mutable data with verifiable origin, thus allowing for multiple writes to happen over a cryptographically authorized period (and with a signature attached to each event that can be validated) without impeding the user’s experience by requiring manual signs for each write.
Ceramic relies on event streams that offer a limited consensus model that makes it possible for a given stream to allow multiple parallel histories while ensuring any two parties consuming the same events for a stream will arrive at the same state. What this means is that all streams and their corresponding tips (latest events within their event logs) are not known by all participants at any given point in time.
However, a mechanism known as the Ceramic Anchor Service (CAS) is responsible for batching transactions across the network into a Merkle tree and regularly publishing its root in a single transaction to Ethereum. Therefore, Ceramic does offer a consensus on the global ordering of Ceramic transactions.
Just as smart contracts provide a deterministic structure that dictates how users can interact with them (while guaranteeing they will not change once deployed), ComposeDB schemas are also immutable, offering guarantees around the types of data a given model can store. When users write data using these definitions, each resulting model instance document can forever only be altered by accounts that created it (or grant limited permission to another account to do so), and can only make changes that conform to the schema’s definition.
Finally, every stream is comprised of an event log of one or more commits, thus making it easy for developers to extract not only the provenance of the stream’s data, based on the cryptographic signature of the account that created it, but also the stream’s data lineage by traversing through the commit history to observe how the data mutated over time.
Similar to networks like Ethereum, the Ceramic Network is public by default, allowing any participating nodes to read any data on any stream. While the values of the data may be plaintext or encrypted, contingent on the objectives of the applications using them, anyone can verify the cryptographic signatures that accompany the individual event logs (explained above).
The broad assumption I’ll line up for this comparison is that a traditional “Web2” platform uses a sandboxed database to store, retrieve, and write data on behalf of its users. Apart from the intricate architecture strategies used to accomplish this at scale with high performance, most of these systems rely on the assurances that their servers alone have sole authority to perform writes. Individual user accounts can be hacked into via brute force or socially engineered attacks, but as long as the application’s servers are not compromised, the data integrity remains intact (though requiring participants to trust a single point of failure).
ComposeDB vs. Centralized Databases
If this article set out to compare ComposeDB to traditional databases in the context of functionality and performance, we’d likely discuss a higher degree of similarities rather than differences; however, when comparing ComposeDB to the paradigm of a “traditional” database setup, in the context of data provenance, we find that the inverse holds true based on much of what was discussed in the previous section.
Embedded Cryptographic Proof
As previously discussed, all valid events in Ceramic include a required DAGJWS signature derived from the stream’s controlling account. While it’s possible (though logically unwise) that an application using a centralized database could fabricate data related to the accounts of its users, event streams in Ceramic are at all times controlled by the account that created the stream. Even if a Ceramic account accidentally delegates temporary write access to a malicious application that then authors inaccurate data on the controller’s behalf, the controlling account never loses admin access and can revert or overwrite those changes.
Unlike Ceramic, the origin of data (along with most accompanying information) is not accessible by design when using a centralized database, at least not in a permissionless way. The integrity of the data within a “traditional” database must therefore be assumed based on other factors requiring trust between the application’s users and the business itself. This architecture is what enables many of the business models these applications use that ultimately have free reign over how they leverage or sell user data.
Conversely, business models like advertising can be (and are currently being) built on Ceramic data, which flips this paradigm on its head. Individual users have the option to encrypt data they write to the network and have an array of tools at their disposal to enable programmatic or selective read access based on conditions they define. Businesses that want to access this data can therefore work directly with the users themselves to define the conditions under which their data can be accessed, putting the sovereignty of that data into individual users’ hands.
Timestamping and Anchoring
While in a private, sandboxed database, development teams can implement a variety of methods to timestamp entries, those teams don’t have to worry about trusting other data providers in a public network to be competent and non-malicious. Conversely, data in Ceramic leverages the IPLD Timestamp Proof specification which involves frequent publishing of the root of a Merkle tree to the blockchain with the sets of IPLD content identifiers as the tree’s leaves which represent Ceramic data. While the underlying data structure (event log) of each stream will preserve the ordering of its events with specific events pointing to the prior one in the stream, the anchoring process allows developers to use event timestamping in a decentralized, trustless way.
Verifiable credentials under the W3C definition unlock the ability for verifiable claims to be issued across a virtually limitless set of contexts, with the guarantee that they can later be universally verified in a cryptographically secure way. This standard relies on several key features (below are only a few of them):
- Verifiable Data Registry: A publicly available repository of the verifiable credential schemas one might choose to create instances of
- Decentralized Identifiers: Verifiable credentials rely on DIDs to both identify the subject of a claim, as well as the cryptographic proof created by the issuer
- Core Data Model: These credentials follow a standard data model that ensures that the credential’s body (made up of one or more claims about a given entity) is inherently tamper-evident, given the fact that the issuer generates a cryptographic proof that guarantees both the values of the claims themselves and the issuer’s identity
For example, an online education platform may choose to make multiple claims about a student’s performance and degree of completion related to a specific student and a specific course they are taking, all of which could be wrapped up into one verifiable credential. While multiple proof formats could be derived (EIP712 Signature vs. JWTs), the provenance of the credential is explicit.
However, unlike blockchains and databases, verifiable credentials are not storage networks themselves and therefore can be saved and later retrieved for verification purposes in a wide variety of ways.
ComposeDB vs. Verifiable Credentials (and other claim formats)
I mentioned earlier that schema definitions (once deployed to the Ceramic network) offer immutable and publicly available data formats that enforce constraints for all subsequent instances. For example, anyone using ComposeDB can deploy a model definition to assert an individual’s course completion and progress, and similarly, any participants can create document instances within that model’s family. Given the cryptographic signatures and immutable model instance controller identity (that’s automatically attached to each Ceramic stream commit discussed above), you can start to see how the qualities verifiable credentials are set out to provide, like tamper-evident claims and credential provenance, are inherent to ComposeDB.
Like a verifiable credential, each commit within a given Ceramic stream is immutable once broadcasted to the network. Within the context of a model instance document within ComposeDB, while the values within the document are designed to be mutated over time, each commit is publicly readable, tamper-evident, and cryptographically signed.
We’ve discussed this extensively above—each event provides publicly-verifiable guarantees about the identity of the controlling account.
Unlike verifiable credentials that offer just a standard, ComposeDB allows developers to both define claim standards (using schema definitions), as well as public availability for those instances to be read and confirmed by other network participants. ComposeDB is therefore also a public schema registry in itself.
In addition to the specific comparisons to other data storage options and verifiable claim standards, what qualities does ComposeDB offer that enable anyone to audit, verify, and prove the origin of data it contains? While parts of this section may be slightly redundant with the first half of this article, we’ll take this opportunity to tie these concepts together in a more general sense.
Auditable, Verifiable, and Provable
For trust to be equitably built in a peer-to-peer network, the barrier to entry to be able to run audits must be sufficiently low, concerning both cost and complexity. This holds especially true when auditing and validating the origin of data within the network. Here are a few considerations and trade-offs related to ComposeDB’s auditability.
No Cost Barrier With Open Access to Audit
Developers building applications on ComposeDB do not need to worry about cost-per-transaction fees related to the read/write activity their users perform. They will, however, need to architect an adequate production node configuration (that should be built around the volume a given application currently has and how it expects to grow over time), which will have separate network-agnostic costs.
This also holds for auditors (or new applications who want to audit data on Ceramic before building applications on that data). Any actor can spin up a node without express network permissions, discover streams representing data relevant to their business goals, and begin to index and read them. Whether an organization chooses to build on ComposeDB or directly on its underlying network (Ceramic), as long as developers understand the architecture of event logs (and specifically how to extract information like cryptographic signatures and controlling accounts), they will have fully transparent insight into the provenance of a given Ceramic dataset.
Trade-Off: Stream Discoverability
While fantastic interfaces, such as s3.xyz, have been built to improve data and model discoverability within the Ceramic Network, one challenge Ceramic faces as it continues to grow is how to further enable developers to discover (and build on) existing data. More specifically, while it’s easy to explain to developers the hypothetical benefits of data composability and user ownership in the context of an open data network (such as the data provenance-related qualities we’ve discussed in this post), showing it in action is a more difficult feat.
The Ceramic Network also exists in an existing, non-conforming territory that does not fit neatly into the on- or off-chain realm. Just as the Ethereum Attestation Service (EAS) mentions on its Onchain vs. Offchain page, a “verifiable data ledger” category of decentralized storage infrastructure is becoming increasingly appealing to development teams who want to gain the benefits of both credible decentralization and maximum performance, especially when dealing with data that’s meant to mutate over time.
As we discussed above, here’s a refresher on key insights into ComposeDB’s structure, and how these impact the provenance of its data.
Ceramic Event Logs
Ceramic relies on a core data structure called an event log, which combines cryptographic proofs (to ensure immutability and enable authentication via DID methods) and IPLD for hash-linked data. All events on the network rely on this underlying data structure, so whether developers are building directly on Ceramic or using ComposeDB, teams always have access to the self-certifying log that they can verify, audit, and use to validate provenance.
ComposeDB Schema Immutability
Developers building on ComposeDB also benefit from the assurances that schema definitions provide, based on the fact that they cannot be altered once deployed. While this may be an issue for some teams who might need regular schema evolution, other teams leverage this quality as a means to ensure constant structure around the data they build on. This feature therefore provides a benefit to teams who care strongly about both data provenance and lineage - more specifically, the origin (provenance) can be derived from the underlying data structure, while the history of changes (lineage) must conform to the immutable schema definition, and is always available when accessing the commit history.
A Decentralized Data Ledger
Finally, Ceramic nodes support the data on Ceramic and the protocol—providing applications access to the network. For ComposeDB nodes, this configuration includes an IPFS service to enable access to the underlying IPLD blocks for event streams, a Ceramic component to enable HTTP API access and networking (among other purposes), and PostgreSQL (for indexing model instances in SQL and providing a read engine). All Ceramic events are regularly rolled into a Merkle tree and the root is published to the Ethereum blockchain.
Within the context of data provenance, teams who wish to traverse these data artifacts back to their sources can use various tools to publicly observe these components in action (for example, the Ceramic Anchor Service on Etherscan), but must be familiar with Ceramic’s distributed architecture to understand what to look for and how these reveal the origins of data.
There’s no question that the distributed nature of the Ceramic Network can be complex to comprehend, at least at first. This is a common problem within P2P solutions that uphold user-data sovereignty and rely on consensus mechanisms, especially when optimizing for performance.
Trade-Off: Late Publishing Risks
As described on the Consensus page in the Ceramic docs, all streams and their potential tips are not universally knowable in the form of a global state that’s available to all participants at any point in time. This setup does allow for individual participants to intentionally (or accidentally) withhold some events while publishing others, otherwise known as engaging in ‘selective publishing’. If you read into the specifics and the hypothetical scenario outlined in the docs, you’ll quickly learn that this type of late publishing attack is illogical in practice since streams can only have one controlling user, so that user would need to somehow be incentivized to attack their data.
What does this have to do with data provenance? While the origin of Ceramic streams (even in the hypothetical situation of a stream with two divergent and conflicting updates) is at all times publicly verifiable, the potential for this type of attack has more to do with the validity of that stream’s data lineage (which is more concerned with tracking the history of data over time).
Finally, another important notion to consider in the context of data provenance and P2P software is replication and sharing. Developers looking to build on this class of data network should not only be concerned with how to verify and extract the origin of data from the protocol but also need assurances that the data they care about will be available in the first place.
ComposeDB presumes that developers will want options around the replication and composability of the data streams they will build on.
You’ll see on the Server Configurations page that there’s an option to deploy a ComposeDB node with historical sync turned on. When configured to the ‘off’ position, a given node can still write data to a model definition that already exists in the network, but the node will simply only index model instance documents written by that node. Conversely, when toggled ‘on’, this setting will sync data from other nodes and write data to a canonical model definition (or many). The latter enables the ‘composability’ factor that development teams can benefit from—this is the mechanism that allows teams to build applications on shared, user-controlled data.
Recon (Ceramic Improvement Proposal)
There is an active improvement proposal underway, called Recon, to improve the efficiency of the network. In short, development related to this proposal aims to streamline the underlying process by which nodes sync data, offering benefits such as significantly lifting the load off of nodes that are uninterested in a given stream set.
Trade-Off: Data Availability Considerations
Of course, the question of data portability and replication necessitates conversation around the persistence and availability of information developers care about. In Ceramic terms, developers can provide instructions to their node to explicitly host commits for a specific stream (called pinning), improving resiliency against data loss. However, developers should know that if only one IPFS node is pinning a given stream and it disappears or gets corrupted, the data within that stream will be lost. Additionally, if only one node is responsible for pinning a stream and it goes offline, that stream won’t be available for other nodes to consume (which is why it’s best practice to have multiple IPFS nodes running in different environments pinning the same streams).