Saturday, 27 June 2020

Data Lake : Swamp and DataOps

Data lakes are gaining popularity in consumer and enterprise data strategy because they support a wide variety of ingestion, conformance, analytical, and visualization offerings.

As interest in and adoption of data lakes grow across multiple sectors, best practices, potential pitfalls, and operationalization techniques are blended into solutions and products. More often than not, best practices are followed closely during the initial days and lose focus in the post-implementation phases.


Although the term is now a decade old, "Data Lake" can be quickly understood through the ecology of a freshwater lake, where water (data) is collected from sources ranging from small streams (e.g. batch logs, weblogs) to large rivers (e.g. unstructured data, images, videos).

Like the littoral zone of a freshwater lake, a data lake has staging zone(s) where specific types of data are analyzed, filtered, and consumed. The photic zone is within eyesight and can host batch and ETL/ELT processes within the data lake infrastructure. The aphotic zone stores archived content and has a possible use case in large-scale data mining.
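The zone analogy can be made concrete with a small routing rule. The following is only a minimal sketch: the `Dataset` fields, the `assign_zone` function, and the 365-day archive threshold are hypothetical choices made for illustration, not part of any particular data lake product.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    LITTORAL = "staging"    # data being analyzed, filtered, and consumed
    PHOTIC = "processing"   # batch and ETL/ELT workloads
    APHOTIC = "archive"     # archived content, candidate for data mining


@dataclass
class Dataset:
    name: str
    days_since_last_access: int
    is_raw: bool  # True if the data has not been filtered or conformed yet


def assign_zone(ds: Dataset, archive_after_days: int = 365) -> Zone:
    """Pick a zone for a dataset (hypothetical policy, for illustration)."""
    # Assumption: data untouched for a long time goes to the archive.
    if ds.days_since_last_access >= archive_after_days:
        return Zone.APHOTIC
    # Assumption: recently used raw data stays in staging.
    if ds.is_raw:
        return Zone.LITTORAL
    # Everything else is available to batch and ETL/ELT processing.
    return Zone.PHOTIC
```

For example, `assign_zone(Dataset("weblogs", 2, True))` would land in the staging zone under these assumed rules, while a conformed dataset untouched for two years would be archived.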


Unlike a data mart or database, the storage of a data lake is usually a network file store with compute capability for transformation and visualization of data. A data lake with mature infrastructure can offer a range of storage options, from NAND/SSD (fast I/O use cases) to tapes and vaults (archives), and a range of compute, from low-intensity (IoT) to large-scale GPU farms for streaming analytics (multi-stream CV use cases).


While there is general awareness of data lake implementation and operational procedures, the data office and CDO often come across challenges after the successful implementation of data transformation programs.


Data Swamp:
Data Lake architecture is one of the strongest solutions for offering accessibility and democratization of data. The biggest hurdle to this vision is the 'Data Swamp'.

A data lake becomes a data swamp when data is accumulated and stored without categorization and there is no process to identify and clean the congestion within the lake. Data swamps eliminate the democratization and accessibility aspects of the CDO vision, which makes them the main challenge for data office and technology teams.
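One way to notice a swamp forming is to audit the catalog for entries that were stored without categorization. The sketch below assumes a hypothetical in-memory catalog; `CatalogEntry`, `find_swamp_candidates`, and `swamp_ratio` are illustrative names, and a real lake would read this metadata from its own catalog service.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    path: str
    owner: str = ""      # empty string means no owner recorded
    category: str = ""   # empty string means never categorized


def find_swamp_candidates(entries):
    """Return the paths of entries that lack an owner or a category."""
    return [e.path for e in entries if not e.owner or not e.category]


def swamp_ratio(entries):
    """Fraction of catalog entries that are uncategorized or unowned."""
    if not entries:
        return 0.0
    return len(find_swamp_candidates(entries)) / len(entries)


catalog = [
    CatalogEntry("raw/weblogs/2020-06", owner="web-team", category="logs"),
    CatalogEntry("raw/dump_final_v2"),  # no owner, no category
]
print(find_swamp_candidates(catalog))
print(swamp_ratio(catalog))
```

Tracking a ratio like this over time gives the data office and technology teams a shared signal for when cleanup is due, although the threshold at which to act is an organizational decision.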

Data swamps are often the result of inadequate coordination between data governance and technology implementation teams. As observed, many organizations do not have a clear roadmap or even a vision of how data will be consumed internally and externally.


DataOps:

Effective utilization of data with appropriate controls needs an organizational vision and roadmap. Technology and Operations have TechOps ways of working, while the Data Group and Operations define Data Governance.

"A data platform built on an agreed set of principles and a roadmap that helps transform information into actionable insights for the organization."

 
The essential components of a DataOps strategy and the contributing teams can be as follows.

Component                                 Team
Source Management                         Tech
Infrastructure as Code                    Tech
Access, Monitoring and Control            Data & Tech
Continuous Integration and Delivery       Tech
Machine Learning and AI Development       Data & Tech
MLOps and Deployment Strategy             Data & Tech
Data Quality and Validation Framework     Data & Ops
Workflow Management                       Data & Ops
Data Modeling                             Data & Ops
Business Continuity                       Ops & Tech
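As an illustration of the Data Quality and Validation Framework component, a first version can be as small as a function that checks required columns and simple value rules. This is a minimal sketch under stated assumptions: `validate_rows`, the column names, and the non-negative amount rule are hypothetical and stand in for whatever rules the Data and Ops teams agree on.

```python
def validate_rows(rows, required, checks):
    """Return a list of (row_index, problem) tuples for the given rows.

    rows:     list of dicts, one per record
    required: column names that must be present and non-empty
    checks:   mapping of column name to a predicate the value must satisfy
    """
    problems = []
    for i, row in enumerate(rows):
        # Required columns must be present and not empty.
        for col in required:
            if row.get(col) in (None, ""):
                problems.append((i, f"missing {col}"))
        # Value rules apply only to values that are actually present.
        for col, check in checks.items():
            value = row.get(col)
            if value not in (None, "") and not check(value):
                problems.append((i, f"invalid {col}"))
    return problems


rows = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": -5},
]
# Hypothetical rule: amounts must not be negative.
issues = validate_rows(rows, ["id", "amount"], {"amount": lambda v: v >= 0})
print(issues)
```

Returning the problems instead of raising on the first failure lets a workflow decide whether to quarantine the batch, alert an owner, or continue with the clean rows.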
 

Mature cloud implementations often combine DataOps as TechOps + MLOps with a view to data availability and actionable insights. A reference implementation of a DataOps platform (AWS) can be illustrated as below.



Advancements in ML/AI implementation, backed by cheap storage and packaged products, are pushing more organizations to move from operations-driven to data-driven business strategies. On this roadmap, DataOps is certainly one of the most significant milestones to implement and practice.