About Client
The client builds artificial intelligence (AI) solutions for applications in energy, oil and gas, manufacturing, finance, aerospace, defence, and security sectors. Their primary requirement was to build efficient data pipelines that could seamlessly integrate various data sources and adapt to the ever-changing nature of business requirements.
Business Need
Challenges
- Organizations that deal with large amounts of data face a major challenge in handling various types of data, such as structured, unstructured, CSV, ZIP files, audio, and video. Developing efficient data pipelines that can seamlessly integrate various data sources and adapt to the ever-changing nature of business requirements is a complex problem to solve.
- To overcome this challenge, we need to create a comprehensive framework that is both fully configurable and customizable, enabling it to handle various data inputs from different sources.
- The framework should process the data, extract meaningful insights, and ingest the results for further analysis.
- Fragmented: The data comes from multiple sources, which means that it is stored in different formats and locations. This makes it difficult to combine the data and analyze it together.
- Dirty: The data contains errors and inconsistencies. This can be due to human errors, data entry errors, or problems with the data collection process.
- Lack of context: The data does not contain enough information about the business processes that it relates to. This makes it difficult to understand the meaning of the data and to generate insights from it.
Technical Details
We used the Data Orchestration Ecosystem (DOE) framework to solve this problem. The DOE is a framework that sits on top of Prefect and allows you to ingest, transform, and enrich data in a low/no-code way. It is designed to be flexible and scalable, so you can easily customize it to meet your specific needs.
- Ingesting data from multiple sources: The DOE framework can ingest data from a variety of sources, including databases, files, and APIs. This allows companies to combine data from different sources and analyze it together.
- Cleaning the data: The DOE framework can clean the data to remove errors and inconsistencies. This can be done using a variety of techniques, such as data validation, data scrubbing, and data normalization.
- Enriching the data with additional context: The DOE framework can enrich the data with additional context by adding information about the business processes that it relates to. This can be done by using data dictionaries, business rules, and domain knowledge.
- Automating the data processing tasks: The DOE framework can automate the data processing tasks, such as data ingestion, data cleaning, and data enrichment. This can save companies time and resources.
- The DOE framework can also be used to train and run ML models. ML models can be used to analyze data and identify patterns that would not be visible to humans. This can help companies to make better decisions and to improve their operations.
- The DOE framework can store data in Neo4j, a graph database. Graph databases are well suited to storing and analyzing data with relationships between entities, which makes them a good fit for this kind of highly connected data. The stored data can then be visualized with graphs and charts, helping companies understand the data and identify trends and patterns. Graphs and charts can also be used to communicate the results of data analysis to other stakeholders.
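The DOE framework itself is proprietary, but the configuration-driven pattern described above can be sketched in plain Python. In this hypothetical example, each stage (ingest, clean, enrich) is looked up in a registry, so supporting a new source or cleaning rule only requires a configuration change; all names and structures are illustrative, not the actual DOE API.

```python
import csv
import io

def ingest_csv(raw: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))

def clean_rows(rows: list[dict]) -> list[dict]:
    """Drop rows with missing values and normalize whitespace."""
    cleaned = []
    for row in rows:
        if all(v and v.strip() for v in row.values()):
            cleaned.append({k: v.strip() for k, v in row.items()})
    return cleaned

def enrich_rows(rows: list[dict], context: dict) -> list[dict]:
    """Attach business context (e.g. a data-dictionary lookup) to each row."""
    return [{**row, **context} for row in rows]

# Registry of named stages; the pipeline config refers to stages by name.
STAGES = {"ingest_csv": ingest_csv, "clean": clean_rows}

def run_pipeline(config: dict, raw: str) -> list[dict]:
    rows = STAGES[config["ingest"]](raw)
    rows = STAGES[config["clean"]](rows)
    return enrich_rows(rows, config.get("context", {}))

config = {"ingest": "ingest_csv", "clean": "clean", "context": {"source": "sensor-feed"}}
result = run_pipeline(config, "id,temp\n1, 21.5\n2,\n")
# Row 2 has a missing value and is dropped; row 1 is trimmed and enriched.
```

Changing the insights extracted then becomes a matter of editing the configuration rather than rewriting pipeline code.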
Architecture
Azure Triggers are used to automate processes, execute tasks on a schedule, or respond to events in real time. Whenever a new file is added to Azure Blob Storage, the trigger fires an event that prompts the execution of an Azure Function App.
Azure Function Apps are a serverless compute service provided by Microsoft Azure. They allow you to execute code in response to events without the need to manage the underlying infrastructure. Function Apps are ideal for building event-driven applications and microservices that scale automatically based on demand. This Function App calls the Prefect API with the necessary information, such as where to find the configuration and the input data.
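The call from the Function App to Prefect might look like the following sketch, which builds a POST request against Prefect's create-flow-run-from-deployment endpoint, passing the blob location and configuration location as flow parameters. `PREFECT_API_URL` and `DEPLOYMENT_ID` are placeholders, and the parameter names are assumptions for illustration only.

```python
import json
import urllib.request

PREFECT_API_URL = "https://prefect.example.com/api"   # placeholder
DEPLOYMENT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

def build_flow_run_request(blob_url: str, config_url: str) -> urllib.request.Request:
    """Build the POST request that asks Prefect to start a flow run."""
    payload = {"parameters": {"input_data": blob_url, "config": config_url}}
    return urllib.request.Request(
        url=f"{PREFECT_API_URL}/deployments/{DEPLOYMENT_ID}/create_flow_run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_flow_run_request(
    "https://account.blob.core.windows.net/input/batch-01.zip",
    "https://account.blob.core.windows.net/config/pipeline.json",
)
# In the Function App, this request would be sent with urllib.request.urlopen(req).
```

In production the request would also carry authentication headers and error handling, omitted here for brevity.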
Prefect is a workflow automation tool that enables data engineers and data scientists to define, schedule, and execute complex data workflows in the cloud. It lets you coordinate your workflows - running them on a schedule with automatic retries, caching, reusable configuration, a collaborative UI, and more.
Prefect is implemented in Python and allows users to define workflows using Python code, making it easier for developers. Prefect offers tools for managing flows, including versioning, serialization, and scheduling. This allows for reproducibility and consistency in workflow execution. Prefect provides built-in monitoring and logging capabilities, allowing users to track the progress and performance of workflows.
It also supports distributed execution, enabling workflows to be executed on multiple machines or in a cloud-based environment.
This is where Kubernetes comes into the picture. Kubernetes is used in conjunction with Prefect to enhance the scalability and flexibility of running data workflows in a containerized environment. Kubernetes is a powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications.
When Prefect starts its execution, it loads the configuration and input dataset from the location provided by the trigger function. Prefect then initiates data orchestration: the data is first cleaned to ensure compatibility with the ML models. Once the data is cleaned, it is fed into pre-trained ML models that are hosted as a separate web service. These ML models have also been prepared using the DOE framework. Upon successful execution of the ML models, the resulting data is stored in a graph database (Neo4j, in this case) and a Solr database (depending on the use case), following a predefined structure.
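The final storage step can be sketched as turning each model-output record into a parameterized Cypher MERGE statement before sending it to Neo4j. The schema here (Document, Entity, MENTIONS) is hypothetical; the real predefined structure is use-case specific.

```python
def to_cypher() -> str:
    """A parameterized MERGE linking a document node to a detected entity."""
    return (
        "MERGE (d:Document {id: $doc_id}) "
        "MERGE (e:Entity {name: $entity}) "
        "MERGE (d)-[:MENTIONS {score: $score}]->(e)"
    )

def to_params(record: dict) -> dict:
    """Extract the query parameters from one model-output record."""
    return {"doc_id": record["doc_id"], "entity": record["entity"], "score": record["score"]}

record = {"doc_id": "rpt-42", "entity": "Pump-7A", "score": 0.93}
query, params = to_cypher(), to_params(record)
# With the official Neo4j Python driver, this pair would be executed as
# session.run(query, **params) inside a transaction.
```

Using parameterized queries (rather than string-formatting values into Cypher) is the idiomatic way to avoid injection issues and lets Neo4j cache the query plan.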
The data extracted from various sources is currently stored in graph and Solr databases, making it challenging for humans to comprehend directly. To address this issue, we have developed a web application that transforms this data into a human-readable format. The application fetches the data from the databases where it was stored and presents it in an intuitive graphical view. Through the interactive user interface, users can also take advantage of various functionalities, such as filtering and searching, to easily explore and analyze the information.
Business Impact
- Reduced time to extract new insights from data: often a change to the pipeline configuration is all that is needed.
- Improved data quality. The DOE framework can help to improve data quality by cleaning the data and removing errors and inconsistencies. This can lead to more accurate insights and better decision-making.
- Increased efficiency. The DOE framework can help to increase efficiency by automating data processing tasks. This can free up time for data scientists and ML engineers to focus on other tasks, such as developing new models and insights.
- Reduced costs. The DOE framework can help to reduce costs by automating data processing tasks and by improving data quality. This can lead to lower costs for data storage, data processing, and data analysis.
- Extracted data is presented in a rich yet simple user interface that is easy to understand.
Technology
Tech Prescient was very easy to work with and was always proactive in their response.
The team was technically capable, well-rounded, nimble, and agile. They had a very positive attitude to deliver and could interpret, adopt and implement the required changes quickly.
Amit and his team at Tech Prescient have been a fantastic partner to Measured.
We have been working with Tech Prescient for over three years now and they have aligned to our in-house India development efforts in a complementary way to accelerate our product road map. Amit and his team are a valuable partner to Measured and we are lucky to have them alongside us.
We were lucky to have Amit and his team at Tech Prescient build the CeeTOC platform from the ground up.
Having worked with several other services companies in the past, the difference was stark and evident. The team was able to meaningfully collaborate with us during all the phases and deliver a flawless platform which we could confidently take to our customers.
We have been extremely fortunate to work closely with Amit and his team at Tech Prescient.
The team will do whatever it takes to get the job done and still deliver a solid product with the utmost attention to detail. The team's technical competence on the technology stack and the ability to execute are truly commendable.