Editor’s note: This is the third post in the Palantir RFx Blog Series, which explores how organizations can better craft RFIs and RFPs to evaluate digital transformation software. Each post focuses on one key capability area within a data ecosystem, with the goal of helping companies ask the right questions to better assess technology.
Introduction
Since Palantir’s inception, we have been obsessed with version control (VC). We saw how VC technologies propelled the most important software projects in the world and recognized their potential to transform the way we think about data. One of Palantir’s first technologies was the Revisioning Database, a framework that applies version control concepts to graph databases in our Palantir Gotham Platform. We continue to refine our VC technologies across the varied and evolving terrains of modern data ecosystems.
Broadly speaking, version control refers to a set of tools and techniques that allow for large-scale collaboration in software development. VC does this by keeping track of all the changes (or “diffs”) that each developer makes to the code and determining how those changes should be incorporated into a new version of the software. Along the way, VC systems also track metadata about the diffs, such as who made them and when they were made.
In the context of a data ecosystem, version control is important for tracking changes to code (the logic applied to data to clean, transform, join, or resolve it) and changes to the data itself. Like a VC system for code, a VC system for data tracks changes to the data and metadata related to those changes. Capturing metadata is often more important than capturing changes to the data itself because, as discussed in our previous Ontology post, data doesn’t have inherent meaning and metadata is one of the ways we add meaning to data. Metadata includes information about which source system the data originated from, what security policies were attached to that data, and which developers have made changes to the data. As data pipelines are created in a data ecosystem, metadata informs much of the end user experience. For example, two users with different roles could be looking at the same data file but will see different versions of the file based on their respective authorizations.
When we launched our Palantir Foundry Platform in 2016, we focused heavily on giving users tooling to build high-quality data pipelines. VC is a key component of strong data pipelines because of the many complex ways that code and data interact in this space. Over the lifetime of a data pipeline, the code used to define data transformations and produce intermediate and final datasets undergoes a complex evolution. Being able to track this evolution — and understand which combination of code and data led to which outcome — is a key element of a data ecosystem.
In this blog post, we explore what defines data pipeline version control, why data pipeline VC matters, and how organizations can craft requirements to ensure the highest quality support for these capabilities.
What is Data Pipeline Version Control?
A data pipeline can be described as a series of datasets connected by incremental logical transforms. Datasets themselves are collections of data files (spreadsheets, documents, images, etc.). Transforms are the software code (or logic) that changes a dataset in some useful way. For example, a developer might write transformation logic to clean, join, resolve, or merge a dataset. A data pipeline can have hundreds of logical transforms, each applying a set of changes to the previous state in a long series of incremental steps. Version control of data pipelines is meant to track the chaos of these many diffs — what changes are made, when, and by whom — within a system where transformation logic becomes deeply intertwined with the data itself.
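As a rough illustration of what a single transform looks like, the sketch below treats one pipeline step as plain logic applied to an input dataset and producing a new one. The clean_orders function, its column names, and the in-memory representation are hypothetical and not tied to any particular platform’s API.

```python
# Minimal, hypothetical sketch of one pipeline step: a transform is just logic
# that takes an input dataset and produces a new dataset.
from typing import Dict, List

Row = Dict[str, object]

def clean_orders(raw_orders: List[Row]) -> List[Row]:
    """Drop rows with missing IDs and normalize the currency column."""
    cleaned = []
    for row in raw_orders:
        if row.get("order_id") is None:
            continue  # discard unusable records
        cleaned.append({**row, "currency": str(row.get("currency", "USD")).upper()})
    return cleaned

raw = [{"order_id": 1, "currency": "usd"}, {"order_id": None, "currency": "eur"}]
print(clean_orders(raw))  # [{'order_id': 1, 'currency': 'USD'}]
```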
Version control systems for code help track and manage changes to software code. They provide a mechanism for different developers to write code within the same shared environment. A key concept in any VC system is branching. Developers create their own “branch” of new code in which they can write, test, and revise independently without affecting the main branch. Once the new branch is approved for production, it can be merged back into the main branch as a new version. By enabling developers, even thousands of them, to collaborate on the same codebase, version control systems provide the backbone for software development operations (“DevOps”) and a mechanism for organizations to “roll back the clock” to a previous version should a new one prove unsatisfactory.
Version control systems for data track changes to the data itself. One of the main differences between data and code is that data often has multiple purposes when leveraged by an enterprise, while code is usually designed for one specific purpose. A given dataset might be used operationally in one context, analytically in a second context, and informationally in a third — each subject to its own set of policies and restrictions. These purpose-based differences introduce unique challenges to VC systems for data. While data also has branches and versions, what really matters about data is where it came from and where it is going. Understanding where data came from is referred to as lineage, which allows visibility into what policies were applied to the data before it was transformed. Understanding where data is going is referred to as inheritance, which helps determine which policies should remain with the data as it travels through the pipeline.
Version control of data also presents a unique challenge because data can take on many forms. Code is a predictable kind of data in that it is text-based, with strong rules that govern how that text can fit together. Other kinds of data generally lack such restrictions. To ensure that all changes to all data are similarly tracked, VC systems for data often require the changes to be ACID-compliant. ACID (atomicity, consistency, isolation, and durability) means that each change can be treated as a separate unit. ACID compliance is relatively simple to implement for textual changes but becomes much harder for more complex data sources such as images or audio. ACID guarantees are an important requirement for the version control of data and therefore a requirement for the version control of data pipelines. Version control of data pipelines is very difficult to accomplish because it requires support for all of these concepts — versions, branching, lineage, inheritance, ACID guarantees — and others.
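To make the idea of treating a change as a single unit concrete, here is a minimal sketch (hypothetical names, not any specific product’s implementation) in which a batch of file changes to a dataset is staged and applied all at once, so readers never observe a half-applied update and earlier snapshots remain durable.

```python
# Minimal, hypothetical sketch: a dataset change is treated as one atomic unit.
# The batch is applied to a staged copy, kept as a durable snapshot, and the
# "current version" pointer is only advanced once the whole batch is in place.
from copy import deepcopy
from typing import Dict, Optional

class Dataset:
    def __init__(self):
        self.versions = [{}]           # snapshots: file name -> file contents
        self.current = 0               # index of the committed version

    def commit(self, changed_files: Dict[str, str]) -> int:
        staged = deepcopy(self.versions[self.current])
        staged.update(changed_files)   # apply the whole batch to a staged copy
        self.versions.append(staged)   # durability: every snapshot is retained
        self.current = len(self.versions) - 1  # atomicity: single pointer swap
        return self.current

    def read(self, version: Optional[int] = None) -> Dict[str, str]:
        return self.versions[self.current if version is None else version]

ds = Dataset()
v1 = ds.commit({"part-0.csv": "id,amount\n1,10"})
v2 = ds.commit({"part-1.csv": "id,amount\n2,25"})
print(sorted(ds.read(v1)))  # ['part-0.csv'] -- the earlier state stays recoverable
print(sorted(ds.read(v2)))  # ['part-0.csv', 'part-1.csv']
```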
As pipelines change and evolve over time, tracking past, present, and parallel versions gives organizations the ability to more effectively manage their data and make trustworthy decisions.
Why does Data Pipeline Version Control matter?
Version control matters because it engenders trust in the data ecosystem itself. By providing mechanisms for both user collaboration and individual experimentation, version control technologies provide a necessary foundation for a healthy development environment. Properly implemented, VC allows creativity and experimentation in ways that are non-destructive. It limits duplicative and redundant work. It prevents users from disrupting one another’s work. It provides fallback options should a newly released set of code prove to be unstable or problematic. And perhaps most importantly, it ensures a tight security posture over all data in the system so that users see the right data at the right time for the right reasons.
In the context of a data pipeline, a strong version control system obviates the need to deploy and maintain separate development environments. Traditional data ecosystems have distinct environments (e.g., a development environment and a QA/staging environment) where developers can write and refine their code before pushing it to the main production environment. These separate environments create significant overhead for organizations, as they need to be individually maintained and synced at defined intervals. They also carry heavy costs in terms of hosting, administrative staffing, data transfer delays, and system outages. By applying principles of code versioning (versions, branches, reviews, commits, etc.) to data, organizations streamline their development operations, avoid the costs of multiple environments and, by allowing developers to test code on actual production data, ultimately build more trust in the data ecosystem itself.
Applying code VC technologies to data is easier said than done. Data-centered workflows are highly variable, touching a wide range of purposes from analytical to operational to informational. Traditional code versioning tools have limited utility when applied to these data-centric workflows due to the sheer scale of large datasets, which can span billions of large binary files. These datasets are not only cumbersome to store, but they also quickly push code version control systems to their limits, since those systems are optimized for text-based code. Git, for example, does not calculate diffs for binary blobs and simply stores full copies. As a result, data versioning is often accomplished by either a full backup store of production data or a staging database that tracks the production database.
When data is efficiently versioned, organizations do not need to deploy, manage, and sync multiple environments for staging and production. The lineage of every dataset is tracked from its source system, ensuring that any data policies associated with those sources can be inherited by descendant datasets as appropriate. This approach effectively keeps an auditable, accurate, restorable history of all enterprise data, while facilitating organization-wide collaboration and guaranteeing data protection. When logic and data are versioned together, users are able to ask questions like: How does what I know now differ from what I knew in the past, and what has contributed to these changes? This meta-understanding of a data ecosystem propels continuous improvement and more trustworthy decision-making.
Requirements
The solution must include a version control system for data pipelines that tracks the lineage of both data and logic together. This version control system should eliminate the need for separate testing, staging, and production environments for data authoring. As described above, maintaining separate environments for testing and production carries significant costs in performance, hosting, and administrative overhead. When the development environment is distinct from the production environment, developers are forced to write and test code on data that is rarely the same as that within the production instance. A strong version control system for data pipelines ensures that developers can build and test logic using real production data while relying on the same configuration as the production environment (since it is the same environment).
The solution must support branching of both data and logic so that users can create test branches off the production codebase to develop and test new capabilities. The most effective data ecosystems treat data much like code — with mechanisms to “fork” production data into a new branch so that developers can work with the data within their own individual sandbox. Once the new transformations are complete and approved, they can be merged back into the production codebase through a pre-defined review process.
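A minimal sketch of that fork-and-merge flow for data, assuming a hypothetical VersionedDataset with named branches (nothing here reflects a specific product API): a branch starts as a cheap pointer to the production snapshot, accumulates its own commits in isolation, and only affects master when it is merged after review.

```python
# Hypothetical sketch of branching data alongside logic.
class VersionedDataset:
    def __init__(self, initial_rows):
        self.snapshots = [list(initial_rows)]
        self.branches = {"master": 0}                      # branch -> snapshot index

    def branch(self, name, from_branch="master"):
        self.branches[name] = self.branches[from_branch]   # no data is copied yet

    def commit(self, branch, rows):
        self.snapshots.append(list(rows))
        self.branches[branch] = len(self.snapshots) - 1

    def merge(self, source, target="master"):
        # after review and approval, the target adopts the source's snapshot
        self.branches[target] = self.branches[source]

    def read(self, branch="master"):
        return self.snapshots[self.branches[branch]]

ds = VersionedDataset([{"id": 1}])
ds.branch("feature/dedupe")
ds.commit("feature/dedupe", [{"id": 1}, {"id": 2}])
print(ds.read("master"))   # unchanged: [{'id': 1}]
ds.merge("feature/dedupe")
print(ds.read("master"))   # after merge: [{'id': 1}, {'id': 2}]
```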
The solution must maintain a persistent log of the correspondence between the versions of a data pipeline’s code and the versions of its output datasets. Within a data ecosystem, data and code are intertwined in many complex ways, and organizations will need ways to track them together at a highly granular level. This correspondence log shows which version of which code produced which dataset, and allows a user to update code and data electively and separately. Data and code can advance in independent versions, allowing isolated workflows while maintaining the relationships crucial for traceability and collaboration.
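One way to picture such a correspondence log, using hypothetical record and dataset names: each build appends a record pairing the output dataset’s data version with the code version (e.g., a commit hash) that produced it, so either side can move forward independently while the pairing remains queryable.

```python
# Hypothetical sketch of a code/data correspondence log.
from dataclasses import dataclass

@dataclass(frozen=True)
class BuildRecord:
    dataset: str
    data_version: int
    code_ref: str        # e.g., a commit hash of the transform repository
    built_at: str

log = [
    BuildRecord("clean_orders", 1, "a1b2c3d", "2023-01-05T10:00:00Z"),
    BuildRecord("clean_orders", 2, "a1b2c3d", "2023-01-06T10:00:00Z"),  # same code, new data
    BuildRecord("clean_orders", 3, "e4f5a6b", "2023-01-07T10:00:00Z"),  # new code version
]

def code_for(dataset: str, data_version: int) -> str:
    """Look up which code version produced a given data version."""
    return next(r.code_ref for r in log
                if r.dataset == dataset and r.data_version == data_version)

print(code_for("clean_orders", 2))  # a1b2c3d
```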
The solution must record transactional metadata about each transformation, including the transformation’s input datasets, output datasets, and the source of the transformation logic. Metadata forms the connective tissue between data and logic within the data ecosystem, enabling entire categories of processes and features. This combination of information (the dataset inputs and outputs plus the transformation code) forms the basis for a job specification, which can be thought of as a unique recipe that can be used to identify and materialize the dataset. This recipe is fully recorded, versioned, and branched, allowing organizations to specify retention policies on downstream datasets while also allowing them to switch back to any previous version of the pipeline.
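The sketch below shows what such a job specification might contain, with hypothetical field names: the input dataset versions, the version of the transformation logic, and the output it materializes. Hashing the recipe yields a stable identifier for that exact combination.

```python
# Hypothetical sketch of a job specification ("recipe") for one dataset build.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class JobSpec:
    inputs: tuple            # ((dataset_name, data_version), ...)
    transform_ref: str       # version of the transformation logic
    output: str              # name of the dataset this job materializes

    def job_id(self) -> str:
        # the same recipe always hashes to the same identifier
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

spec = JobSpec(inputs=(("raw_orders", 7),), transform_ref="e4f5a6b", output="clean_orders")
print(spec.job_id())
```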
The solution must maintain a complete dependency graph of the data pipeline. Tracking dependencies between datasets is a critical element of version control, as it effectively maintains a coherent organization of datasets through a given data pipeline. This dependency graph enables many critical functions, including: computing which datasets are out-of-date relative to their inputs and rebuilding them; enabling “look ahead” error detection before building, particularly around schema conflicts; serving as the skeleton for a system of data-health checks; and enabling visualizations to help users understand data lineage. It also structures the propagation of security and access controls throughout data pipelines.
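As an illustration of the first function in that list, the following sketch (hypothetical dataset names and version numbers) walks the dependency graph in build order and flags any dataset whose recorded input versions are behind the current versions, propagating staleness downstream.

```python
# Hypothetical sketch of a dependency graph and a staleness check.
graph = {                      # dataset -> list of input datasets
    "raw_orders": [],
    "regions": [],
    "clean_orders": ["raw_orders"],
    "orders_by_region": ["clean_orders", "regions"],
}
current_version = {"raw_orders": 8, "regions": 2, "clean_orders": 5, "orders_by_region": 3}
built_against = {              # dataset -> input versions it was last built from
    "clean_orders": {"raw_orders": 7},
    "orders_by_region": {"clean_orders": 5, "regions": 2},
}
build_order = ["raw_orders", "regions", "clean_orders", "orders_by_region"]  # topological

def stale_datasets():
    stale = set()
    for dataset in build_order:
        for inp in graph[dataset]:
            recorded = built_against.get(dataset, {}).get(inp)
            # stale if an input has advanced, was never recorded, or is itself stale
            if inp in stale or recorded is None or recorded < current_version[inp]:
                stale.add(dataset)
                break
    return stale

print(stale_datasets())  # {'clean_orders', 'orders_by_region'} (print order may vary)
```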
The solution must provide guarantees that data transactions are serialized and ACID compliant. Because data transactions are applied as sequential, non-overlapping units, the system can provide a clear version history of the changes made to each dataset. This retained history allows users to see and recover previous states of a dataset. Further, ACID compliance precludes edge cases in reading and writing data that would compromise the data’s trustworthiness.
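To illustrate serialized, non-overlapping transactions, here is a small sketch (a hypothetical API, not any specific product’s) that uses an optimistic check: each commit declares the version it was based on and is rejected if another writer has landed first, so the log remains a single linear history.

```python
# Hypothetical sketch of serializing writes with an optimistic base-version check.
class TransactionLog:
    def __init__(self):
        self.entries = []                       # committed transactions, in order

    @property
    def head(self) -> int:
        return len(self.entries)

    def commit(self, based_on: int, author: str, change: str) -> int:
        if based_on != self.head:
            raise RuntimeError("stale base version; re-read and retry")
        self.entries.append({"author": author, "change": change})
        return self.head                        # new version number

log = TransactionLog()
v = log.commit(based_on=0, author="alice", change="append part-0")
try:
    log.commit(based_on=0, author="bob", change="append part-1")  # conflicts with alice
except RuntimeError as err:
    print(err)                                  # bob must rebase on version 1 and retry
log.commit(based_on=v, author="bob", change="append part-1")
print([e["author"] for e in log.entries])       # ['alice', 'bob'] -- a serial history
```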
The solution must provide mechanisms to group and track datasets, including the ability to define a dataset as source data plus a set of data transformations. Datasets are the sum total of the changes that have been made to their upstream datasets through transformations. Each individual dataset represents one stop in a line of inheritance, in which multiple upstream parents can be transformed into a single output. Knowing this inheritance enables more intuitive and more granular propagation of properties and permissions.
The solution must enable automatic propagation of data policies, access controls, and other restrictions from data sources throughout the entirety of the pipeline. Policies must persist across versions, branches, and downstream applications. Given the number of users and the diversity of their roles, robust access control is critical for handling potentially sensitive data. Access controls at the most granular level (e.g., the row/column level) facilitate maximum freedom for user discovery and collaboration without compromising proper data governance.
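A simple sketch of that propagation, assuming hypothetical dataset names and markings: a derived dataset’s effective markings are the union of everything attached upstream, and a user can read a dataset only if their clearances cover that union.

```python
# Hypothetical sketch of propagating access markings through the dependency graph.
graph = {
    "hr_records": [],
    "sales_raw": [],
    "sales_clean": ["sales_raw"],
    "headcount_vs_sales": ["hr_records", "sales_clean"],
}
source_markings = {"hr_records": {"pii"}, "sales_raw": {"finance"}}

def effective_markings(dataset):
    marks = set(source_markings.get(dataset, set()))
    for parent in graph[dataset]:
        marks |= effective_markings(parent)     # inherit everything upstream
    return marks

def can_read(user_clearances, dataset):
    return effective_markings(dataset) <= set(user_clearances)

print(effective_markings("headcount_vs_sales"))   # {'pii', 'finance'} (order may vary)
print(can_read({"finance"}, "sales_clean"))       # True
print(can_read({"finance"}, "headcount_vs_sales"))  # False -- missing 'pii'
```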
Conclusion
As we’ve discussed the different pillars of a data ecosystem in this RFx Blog Series, one consistent theme has been trust. For data ecosystems to be effective, system users need to trust that the system is performant, reliable, and secure. Concurrent version control of both data and code is a foundational part of building trust for and within a data ecosystem, as it allows for individual creativity and experimentation in a non-destructive way. The best version control systems treat data the way software developers treat code — with mechanisms to create new branches of data to develop and test new capabilities in a safe, sandboxed environment. While these concepts have been a part of code versioning systems for many years, their application to data and code together in data ecosystems has only recently emerged, representing an exciting new path for impact and collaboration.