Data Ops Tech: June 2020

Data Lake is gaining popularity in consumer and enterprise data strategy, as it offers a wide variety of ingestion, conformance, analytical, and visualization capabilities.

As interest in and adoption of Data Lake grow across multiple sectors, best practices, potential pitfalls, and operationalization techniques are blended into solutions and products. More often than not, best practices are followed diligently during the initial days and lose focus in the post-implementation phases.

Although a decade-old term now, "Data Lake" can be quickly explained through the ecology of a freshwater lake, where water (data) is collected from sources ranging from small streams (e.g. batch logs, weblogs) to large rivers (e.g. unstructured data, images, videos).

Like the littoral zone of a freshwater lake, a data lake has staging zone(s) where specific types of data are analyzed, filtered, and consumed. The photic zone is within eyesight and can host batch and ETL/ELT processes within the data lake infrastructure. The aphotic zone stores archived content and has a possible use case in large-scale data mining.

Unlike a Data Mart or a Database, the storage of a Data Lake is usually a network file store with compute capability for transformation and visualization of data. A Data Lake with mature infrastructure can offer a range of storage options, from NAND/SSD (fast I/O use cases) to tapes/vaults (archives), and a range of compute options, from low-intensity compute (IoT) to large-scale GPU farms for streaming analytics (multi-stream CV use cases).
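As a minimal sketch of how such storage tiers might be chosen, the snippet below routes a dataset to a tier based on how recently it was accessed. The `Dataset` class, the tier names, and the 30-day and 365-day thresholds are all hypothetical and only illustrate the idea; a real lake would typically rely on the lifecycle policies of its storage platform.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dataset:
    """Hypothetical descriptor of a dataset stored in the lake."""
    name: str
    last_accessed: date


# Assumed thresholds in days since last access; tune per workload.
HOT_MAX_DAYS = 30
WARM_MAX_DAYS = 365


def choose_tier(dataset: Dataset, today: date) -> str:
    """Pick a storage tier from the number of days since last access."""
    age_days = (today - dataset.last_accessed).days
    if age_days <= HOT_MAX_DAYS:
        return "ssd"  # fast I/O use cases
    if age_days <= WARM_MAX_DAYS:
        return "network-file-store"  # general-purpose lake storage
    return "tape-archive"  # rarely read, archived content
```

The same rule could run as a scheduled job that moves data out of fast storage as it ages, which keeps the expensive tier reserved for data that is actually in use.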

While there is general awareness of data lake implementation and operational procedures, the data office and the CDO often come across challenges after the successful implementation of data transformation programs.

Data Swamp:

Data Lake architecture is one of the strongest solutions for offering accessibility and democratization of data. The biggest hurdle to this vision is the 'Data Swamp'.

A data lake becomes a data swamp when data is accumulated and stored without categorization and there is no process to identify and clean the congestion within the lake. Data swamps undermine the democratization and accessibility aspects of the CDO's vision, and this is the biggest challenge for data office and technology teams.
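One way to make "a process to identify the congestion" concrete is a periodic audit of the lake's catalog. The sketch below assumes a hypothetical `CatalogEntry` record with `owner` and `category` fields and flags every entry where either is missing; the field names and the rule itself are illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one dataset in the lake."""
    path: str
    owner: Optional[str] = None
    category: Optional[str] = None
    tags: List[str] = field(default_factory=list)


def find_swamp_candidates(entries: List[CatalogEntry]) -> List[str]:
    """Return paths of entries that lack an owner or a category."""
    return [e.path for e in entries if not e.owner or not e.category]


catalog = [
    CatalogEntry("raw/weblogs/2020-06", owner="web-team", category="clickstream"),
    CatalogEntry("raw/dump_final_v2"),  # no owner, no category
    CatalogEntry("raw/images/batch7", owner="cv-team"),  # no category
]
candidates = find_swamp_candidates(catalog)
```

Here `candidates` holds the two uncategorized paths, which could then be routed to the data governance team for classification or cleanup.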

Often data swamps are the result of inadequate coordination between data governance and technology implementation teams. As observed, many organizations do not have a clear roadmap, or even a vision, of how data will be consumed internally and externally.

DataOps:

Effective utilization of data with appropriate controls needs an organizational vision and roadmap. Technology and Operations have TechOps ways of working, while the Data Group and Operations define Data Governance.

"A data platform built on an agreed set of principles and a roadmap that helps transform information into actionable insights for the organization."

The essential components of a DataOps strategy and the contributing teams can be as follows.

Components	Team
Source Management	Tech
Infrastructure as Code	Tech
Access, Monitoring and Control	Data & Tech
Continuous Integration and Delivery	Tech
Machine Learning and AI Development	Data & Tech
MLOps and Deployment Strategy	Data & Tech
Data Quality and Validation Framework	Data & Ops
Workflow Management	Data & Ops
Data Modeling	Data & Ops
Business Continuity	Ops & Tech
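To illustrate the "Data Quality and Validation Framework" component, here is a minimal rule-based sketch. The two rules and the record fields (`customer_id`, `amount`) are hypothetical examples; production setups usually use a dedicated validation tool, and this only shows the shape of the idea.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical rules; each returns True when a record passes.
RULES: Dict[str, Rule] = {
    "customer_id is present": lambda r: bool(r.get("customer_id")),
    "amount is non-negative": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
}


def validate(records: List[Record], rules: Dict[str, Rule] = RULES) -> Dict[str, List[int]]:
    """Map each rule name to the indexes of the records that fail it."""
    failures: Dict[str, List[int]] = {name: [] for name in rules}
    for index, record in enumerate(records):
        for name, rule in rules.items():
            if not rule(record):
                failures[name].append(index)
    return failures


sample = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": "", "amount": -5},
    {"amount": 3.5},
]
report = validate(sample)
```

Keeping the rules as named entries in a dictionary lets the Data and Ops teams add or retire checks without touching the validation loop itself.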

Mature cloud implementations often treat DataOps as TechOps + MLOps, with a view to data availability and actionable insights. A reference implementation of a DataOps platform (AWS) can be illustrated as below.

Advancement in ML/AI implementation, backed by cheap storage and packaged products, is pushing more organizations to move from operation-driven to more data-driven business strategies. On this roadmap, DataOps is certainly one of the most significant milestones to implement and practice.

Saturday, 27 June 2020

Data Lake : Swamp and DataOps