
Moving Forward with DataOps

Andy Vidan

Organizations that have truly embraced Big Data have shifted away from monolithic, traditional architectural models that fail to manage the new data paradigm of nearly unlimited amounts of high-variety, high-velocity data. Instead, these forward-leaning organizations have adopted a modern data architecture that offers the flexibility and adaptability needed to aggregate, process and analyze cross-silo data coming from an increasing number of sources. Where traditional data architecture models fall short, a modern data architecture can keep up with the rapidly evolving, often unpredictable changes in data and requirements that organizations face.

The modern data architecture, often referred to as a Logical Data Warehouse, a Unified Data Architecture, or an Enterprise Data Hub, depending upon who you follow in the industry, is an agile data infrastructure that delivers a coherent view of data assets with broad functionality to ingest, consolidate and process raw, unmodeled or lightly modeled data. This agile data infrastructure is being built, operated and used by a number of different data professionals, including software developers, data scientists, data engineers, IT and business analysts. We are therefore now seeing a shift among these data professionals to unify around a set of best practices and tools for “Data OPerationS”, or DataOps.

The software development and IT community is already accustomed to such a development: DevOps grew out of the availability of agile (hardware) infrastructure. Initially introduced at the Agile 2008 conference on “Infrastructure and Operations”, DevOps, a set of processes between software development and IT operations, has transformed the deployment process of complex software systems, especially in the era of cloud computing. Enterprises that adopted DevOps practices have experienced more reliable software releases and faster deployment pipelines, among other benefits. Similarly, enterprises that adopt a core set of DataOps practices and tools will be more efficient at operationalizing their data products at scale using their agile data infrastructure.

DataOps practices must begin with the view that data, and the ability to manage it, should be considered a key corporate asset. DataOps professionals must strive to use the agile infrastructure available in a modern data architecture effectively in order to maximize the value of this asset. Ultimately, DataOps best practices should define an organization’s roadmap toward reliable and robust operational data-driven capabilities. This set of practices should include consideration for the following:

  • Data Catalog: Know what data is available and how it can add value to the organization.
  • Data Lineage: Know where, when and how data is moved and consumed, not just within the data warehouse or data lake but across all downstream business functions and the wider enterprise.
  • Data Quality: Enforce policies and processes around data acquisition, transmission, consumption and disposition using automation, while reporting on key metrics through real-time analysis.
  • Metadata: Maintain end-to-end visibility, audit and traceability across all kinds of metadata while maximizing analytics performance.
  • Ingestion: Accommodate a vast variety of Big Data, in various formats, structures and attributes, from a variety of sources, and yet enable efficient processing, query speed and precision.
  • Analytics: Synthesize and master the available data and provision actionable insights when and as required.
  • Data Security: Establish policy-based security and access controls for end-to-end data audit, authentication and protection.
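To make the Data Quality item a little more concrete, here is a minimal sketch of automated rule checks that report a pass rate per rule. The field names (`customer_id`, `amount`), the rule names and the `quality_report` helper are all hypothetical and chosen only for illustration; a real deployment would draw its rules from the organization's own policies.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical data quality rules, keyed by a descriptive name.
RULES: Dict[str, Rule] = {
    "customer_id_present": lambda r: bool(r.get("customer_id")),
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def quality_report(records: List[Record], rules: Dict[str, Rule]) -> Dict[str, float]:
    """Return the fraction of records that pass each rule."""
    total = len(records)
    report: Dict[str, float] = {}
    for name, rule in rules.items():
        passed = sum(1 for r in records if rule(r))
        # An empty batch is treated as fully passing (an assumption).
        report[name] = passed / total if total else 1.0
    return report


# Illustrative sample data, not taken from any real system.
sample = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "", "amount": 5.0},
    {"customer_id": "c3", "amount": -2.0},
    {"customer_id": "c4", "amount": 7.5},
]
report = quality_report(sample, RULES)
print(report)
```

Running checks like these on every batch, and tracking the pass rates over time, is one simple way to turn a quality policy into a metric that can be monitored.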

The data operations landscape within an organization evolves continuously as data requirements, organizational circumstances and technologies change. DataOps professionals should embrace these changes and adopt practices and technologies as they become available and are required. As Andy Palmer argues in a post on the merits of DataOps, enterprises ultimately must consider their data as deriving from “thousands of sources that are not controlled centrally and frequently change” much like “websites being published inside of an organization.”

There are a few underlying capabilities that those of us in the DataOps community should strive to support as we develop a set of reliable processes for operationalizing data at scale and implement next-generation data architectures that are optimized for the “thousands of sources” reality. Here are three such capabilities:

  1. Real-Time Data Flows

Real-time data access, transmission, and analysis allow organizations to achieve faster time-to-value and apply insights as they become available. Timely decisions, on the order of seconds or minutes, can make or break revenue streams in many industry verticals, including retail, finance, and security. The challenge intensifies sharply as the architecture ingests billions of data points at scale, and different architectural patterns may be required for streaming and batch processing use cases. Achieving real-time visibility and control of data flows in action will therefore be of critical importance as we begin to fully practice DataOps.
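As a small illustration of the difference between batch and streaming patterns, the sketch below computes the same average two ways: once over a complete batch, and once incrementally so that an up-to-date value is available after every event. The function names and the sample values are hypothetical; this is a toy example, not a description of any particular streaming engine.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch pattern: wait for all values, then compute once."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming pattern: update the mean incrementally per event."""
    count = 0
    mean = 0.0
    for v in values:
        count += 1
        mean += (v - mean) / count  # incremental update, no history kept
        yield mean


# Illustrative event values.
events = [10.0, 20.0, 30.0, 40.0]
running = list(streaming_mean(events))
print(running)             # a value is available after each event
print(batch_mean(events))  # a value is available only at the end
```

Both approaches agree once all events have arrived, but only the streaming version can support a decision while the data is still flowing.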

  2. Analytics-as-a-Service

The idea behind being a data-driven business organization is that insightful information must support key business decisions. This is only possible if insights, in complete and concise form, are available and accessible to the appropriate personnel with minimal effort and in a timely manner. Modern data architectures are adopted in part to support and streamline these self-service capabilities by incorporating the right set of analytics tools that deliver reports from across the wide pool of data and sources. DataOps processes should impose standards of governance and control without limiting or slowing down user access to data and insights.

  3. Composability to Handle Variability

Composability is a system design principle that can lead to a truly evolutionary architecture that supports information agility. A composable architecture is essentially implemented through small, modular components that each perform a specific function or service, communicate with other components through a well-defined contract, and can be interconnected in various configurations. Information agility is just one benefit of applying the composability principle to DataOps. Another equally, or perhaps more, important benefit to the DataOps community is that a composable architecture can rapidly operationalize a “stand-alone” advanced data science model by integrating the model within a full, enterprise-grade data engineering pipeline that encompasses data orchestration, automation and analytics. In this way, the true value of DataOps is achieved: providing maintainable data-driven capabilities for the enterprise that are robust and reliable.
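The composability principle can be sketched in a few lines of code. In the example below, every component honors the same simple contract (it takes an iterable of records and returns an iterable of records), so components can be chained in any order, and a stand-alone model can be dropped into a pipeline as just another step. All names here (`compose`, `drop_missing`, `add_score`, `toy_model`) are hypothetical and serve only to illustrate the idea; they are not the API of any particular platform.

```python
from typing import Any, Callable, Dict, Iterable

Record = Dict[str, Any]
# The shared contract: records in, records out.
Step = Callable[[Iterable[Record]], Iterable[Record]]


def compose(*steps: Step) -> Step:
    """Chain steps into a single step that honors the same contract."""
    def pipeline(records: Iterable[Record]) -> Iterable[Record]:
        for step in steps:
            records = step(records)
        return records
    return pipeline


def drop_missing(field: str) -> Step:
    """Component that filters out records lacking a value for `field`."""
    def step(records: Iterable[Record]) -> Iterable[Record]:
        return (r for r in records if r.get(field) is not None)
    return step


def add_score(model: Callable[[Record], float]) -> Step:
    """Component that wraps a stand-alone model as a pipeline step."""
    def step(records: Iterable[Record]) -> Iterable[Record]:
        return ({**r, "score": model(r)} for r in records)
    return step


def toy_model(record: Record) -> float:
    """Stand-in for a data science model (purely illustrative)."""
    return 1.0 if record["amount"] > 100 else 0.0


pipeline = compose(drop_missing("amount"), add_score(toy_model))
result = list(pipeline([
    {"id": 1, "amount": 250},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 40},
]))
print(result)
```

Because `compose` itself returns a step, whole pipelines can be reused as components of larger pipelines, which is what lets this kind of design evolve as sources and requirements change.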

At Composable Analytics, we are building a DataOps Enterprise Platform that we believe provides a single coherent ecosystem for DataOps professionals. The Composable DataOps Enterprise Platform provides a complete portfolio of composable capabilities for data orchestration, automation and analytics: essentially, full-stack DataOps-as-a-service. Working closely with a number of forward-leaning companies across several industries, we have had a chance to see first-hand how adopting a set of DataOps processes and tools can lead to more effective, standardized data production environments, and we believe that the community will embrace DataOps in much the same way it has embraced DevOps.

Andy Vidan

Andy has diverse and extensive experience spanning data sciences, information technology and applied physics, with a passion for developing and scaling disruptive technology platforms. At MIT Lincoln Laboratory, Andy served as a key technical contributor to a broad range of homeland security and defense research programs, and was the architect for the Laboratory’s Distributed Disaster Response program, developing advanced information systems for large-scale crisis response and management. Andy has a PhD from Harvard University and a BS from Cornell University.