About the Client
SparkCognition builds artificial intelligence (AI) solutions for the energy, oil and gas, manufacturing, finance, aerospace, defense, and security sectors. Their primary requirement was to build efficient, configurable, low-code data pipelines that could seamlessly integrate various data sources.
Business Need
Challenges
- Organizations that deal with large amounts of data face a major challenge in handling its many forms: structured and unstructured data arriving as CSV files, ZIP archives, audio, and video. Building efficient data pipelines that can seamlessly integrate these sources and adapt to ever-changing business requirements is a complex problem to solve.
- Overcoming this challenge requires a comprehensive framework that is fully configurable and customizable, enabling it to handle varied data inputs from different sources.
- The framework should process the data, extract meaningful insights, and ingest the results for further analysis.
Raw data is often fragmented and dirty, and it lacks context, all of which makes it difficult to generate insights:
- Fragmented: The data comes from multiple sources, which means that it is stored in different formats and locations. This makes it difficult to combine the data and analyze it together.
- Dirty: The data contains errors and inconsistencies. This can be due to human errors, data entry errors, or problems with the data collection process.
- Lacking context: The data does not carry enough information about the business processes it relates to. This makes it difficult to understand the meaning of the data and to generate insights from it.
Our Solution:
We addressed this problem with the Data Orchestration Ecosystem (DOE). Built on Prefect, the DOE enables seamless data ingestion, transformation, and enrichment through a user-friendly interface that requires minimal coding. Its versatility and scalability make it adaptable to diverse requirements and easy to customize.
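To make the low-code idea concrete, a DOE-style pipeline can be described declaratively rather than coded by hand. The configuration below is an illustrative sketch only; the field names and step vocabulary are our assumptions, not the actual DOE schema.

```python
# Illustrative only: a hypothetical DOE-style pipeline configuration.
# Field names ("source", "steps", "sink", ...) are assumptions, not
# the actual DOE schema.
pipeline_config = {
    "name": "invoice-ingestion",
    "source": {"type": "azure_blob", "container": "raw-data", "pattern": "*.csv"},
    "steps": [
        {"op": "validate", "schema": "invoice_v1"},          # reject malformed rows
        {"op": "normalize", "columns": ["amount", "date"]},  # unify formats
        {"op": "enrich", "lookup": "vendor_dictionary"},     # add business context
        {"op": "predict", "model": "anomaly-detector"},      # optional ML step
    ],
    "sink": {"type": "neo4j", "uri": "bolt://graph-db:7687"},
}
```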
Features:
- Ingesting data from multiple sources: The DOE can ingest data from a variety of sources, including databases, files, and APIs. This allows companies to combine data from different sources and analyze it together.
- Cleaning the data: The DOE can clean the data to remove errors and inconsistencies, using techniques such as data validation, data scrubbing, and data normalization (a minimal sketch follows this list).
- Enriching the data with additional context: The DOE can enrich the data with additional context by adding information about the business processes that it relates to. This can be done by using data dictionaries, business rules, and domain knowledge.
- Automating the data processing tasks: The DOE can automate data processing tasks such as ingestion, cleaning, and enrichment, saving companies time and resources.
- Managing machine learning models: The DOE can also be used to train and run ML models, which can analyze data and identify patterns that would not be visible to humans.
- Storing data to support visualization: The DOE can store data in Neo4j, a graph database. Graph databases are well suited to storing and analyzing data with relationships between entities.
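As an illustration of the cleaning step referenced above, the sketch below shows a minimal validation, scrubbing, and normalization pass using pandas. The column names and input file are hypothetical, assuming a tabular CSV source.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning pass: validation, scrubbing, normalization."""
    # Validation: drop rows missing required fields.
    df = df.dropna(subset=["id", "timestamp"])
    # Scrubbing: remove exact duplicates introduced by re-ingestion.
    df = df.drop_duplicates()
    # Normalization: unify formats so different sources can be combined.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df["category"] = df["category"].str.strip().str.lower()
    return df.dropna(subset=["timestamp"])

# Hypothetical input file for demonstration.
df = clean(pd.read_csv("input.csv"))
```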
Data Orchestration Ecosystem (DOE)
Architecture Overview
Architecture of Data Orchestration Ecosystem (DOE)
Technical Details
The architecture leverages a combination of cloud-native services and open-source technologies to create a robust and scalable data orchestration system. By integrating Azure Triggers, Function Apps, Prefect workflows, Kubernetes for container orchestration, and specialized databases, the system ensures seamless data processing, machine learning model execution, and visualization. Below are the details of the major components of the DOE.
Azure Triggers and Function Apps:
- Azure Triggers automate processes, execute tasks on a schedule, or respond to events in real-time.
- A trigger fires when new files are added to Azure Blob Storage.
- Azure Function Apps, a serverless compute service, execute code in response to events without managing the underlying infrastructure.
- Function Apps call the Prefect API with the necessary configuration and input data, as sketched below.
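A minimal sketch of this trigger-to-Prefect handoff is shown below, using the Azure Functions Python v2 programming model. The container path, environment variables, and deployment ID are placeholders, and the create_flow_run call assumes the shape of Prefect 2's REST API.

```python
import os
import requests
import azure.functions as func

app = func.FunctionApp()

# Hypothetical names: the container path, env vars, and deployment ID
# are placeholders, not the client's actual configuration.
@app.blob_trigger(arg_name="blob", path="raw-data/{name}",
                  connection="AzureWebJobsStorage")
def on_new_file(blob: func.InputStream):
    # Ask Prefect to start a flow run for the pipeline deployment,
    # passing the newly arrived blob as an input parameter.
    resp = requests.post(
        f"{os.environ['PREFECT_API_URL']}/deployments/"
        f"{os.environ['PIPELINE_DEPLOYMENT_ID']}/create_flow_run",
        json={"parameters": {"input_path": blob.name}},
        timeout=30,
    )
    resp.raise_for_status()
```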
Prefect Framework:
- Prefect enables defining, scheduling, and executing complex data workflows in the cloud.
- Workflow coordination includes automatic retries, caching, reusable configuration, and a collaborative UI (a minimal flow sketch follows this list).
- Prefect is implemented in Python, so workflows are defined directly in Python code.
- Tools for managing flows include versioning, serialization, and scheduling, ensuring reproducibility and consistency.
- Built-in monitoring and logging capabilities track workflow progress and performance.
- Supports distributed execution on multiple machines or in a cloud-based environment.
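The sketch below shows what a DOE-style flow could look like in Prefect 2, illustrating the retries and caching mentioned above. The task bodies are placeholders, not the client's actual pipeline logic.

```python
from datetime import timedelta
from prefect import flow, task
from prefect.tasks import task_input_hash

@task(retries=3, retry_delay_seconds=30)
def ingest(path: str) -> list:
    # Read raw records from the source; retried automatically on failure.
    return [{"path": path, "value": 1}]

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def clean(records: list) -> list:
    # Cached: identical inputs within an hour skip recomputation.
    return [r for r in records if r.get("value") is not None]

@flow(name="doe-pipeline")
def pipeline(input_path: str = "raw-data/sample.csv"):
    records = ingest(input_path)
    return clean(records)

if __name__ == "__main__":
    pipeline()
```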
Integration with Kubernetes:
- Kubernetes enhances the scalability and flexibility of running data workflows in a containerized environment.
- Kubernetes automates the deployment, scaling, and management of containerized applications (a sketch of launching a pipeline run as a Job follows).
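In practice Prefect's Kubernetes integration can manage this, but the idea can be sketched directly with the official Kubernetes Python client, as below. The image name, namespace, and arguments are hypothetical placeholders.

```python
from kubernetes import client, config

def launch_pipeline_job(run_id: str) -> None:
    """Sketch: run one pipeline execution as a Kubernetes Job."""
    config.load_kube_config()  # use load_incluster_config() inside the cluster
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"doe-pipeline-{run_id}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="pipeline",
                        # Hypothetical image containing the pipeline runner.
                        image="registry.example.com/doe-pipeline:latest",
                        args=["--run-id", run_id],
                    )],
                )
            ),
            backoff_limit=2,  # Kubernetes retries the pod on failure
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="doe", body=job)
```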
Web Application:
- The web app's front end is developed in React, while the REST server is built with the FastAPI framework (a minimal endpoint sketch follows this list).
- Through the web app, users can easily create and configure their data pipeline using an intuitive interface. It empowers data scientists and ML engineers to update configuration through simple clicks and keyboard inputs.
- Furthermore, it allows users to visualize the data ingested after pipeline execution, with filtering and search for easy exploration and analysis.
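A minimal sketch of what the REST server's pipeline-configuration endpoints might look like in FastAPI is shown below. The route paths, request model, and in-memory store are illustrative assumptions, not the actual DOE API.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Hypothetical request model; field names mirror the illustrative
# configuration sketch earlier, not the actual DOE API.
class PipelineConfig(BaseModel):
    name: str
    source: dict
    steps: list
    sink: dict

# In-memory store standing in for the real persistence layer.
configs: dict = {}

@app.post("/pipelines")
def create_pipeline(cfg: PipelineConfig):
    # Persist the user's pipeline definition for later execution.
    configs[cfg.name] = cfg
    return {"status": "created", "name": cfg.name}

@app.get("/pipelines/{name}")
def get_pipeline(name: str) -> PipelineConfig:
    if name not in configs:
        raise HTTPException(status_code=404, detail="pipeline not found")
    return configs[name]
```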
Orchestration Process
- Data scientists utilize the Web App to create and configure the pipeline. They specify details such as the file's source and the destination for processed data, and choose the cleaning operations and processing steps.
- Upon the file's arrival at the source, an Azure function set to detect incoming files triggers the pipeline.
- Leveraging the Prefect framework, the pipeline configuration is loaded, and a container is initiated within the Kubernetes cluster to execute the pipeline.
- Throughout pipeline execution, various tasks such as data cleaning, transformation, and ML predictions, as specified in the configuration, are carried out.
- Finally, the processed data is deposited into the sink, typically a graph or document database (a Neo4j sketch follows).
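The final step could look like the Neo4j sketch below, which writes processed records into the graph using the official driver. The connection details and the node/relationship model are illustrative assumptions.

```python
from neo4j import GraphDatabase

# Hypothetical connection details for the graph sink.
driver = GraphDatabase.driver("bolt://graph-db:7687", auth=("neo4j", "password"))

def write_records(records: list) -> None:
    """Sink step sketch: persist processed records as graph entities."""
    with driver.session() as session:
        for r in records:
            # MERGE keeps the write idempotent if the pipeline re-runs.
            session.run(
                "MERGE (v:Vendor {name: $vendor}) "
                "MERGE (i:Invoice {id: $id}) "
                "MERGE (i)-[:ISSUED_BY]->(v)",
                vendor=r["vendor"], id=r["id"],
            )

write_records([{"vendor": "Acme Corp", "id": "INV-001"}])
driver.close()
```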
Business Impact
- Reduced time to insight: New insights can be extracted from the data simply by updating the pipeline configuration.
- Improved data quality: The DOE cleans the data and removes errors and inconsistencies, leading to more accurate insights and better decision-making.
- Increased efficiency: The DOE automates data processing tasks, freeing data scientists and ML engineers to focus on other work, such as developing new models and insights.
- Reduced costs: Automated processing and better data quality lower the cost of data storage, processing, and analysis.
- Sophisticated presentation of the extracted data in a rich user interface that remains simple and easy to understand.
Technology
Python, Prefect, Azure Functions, Azure Blob Storage, Kubernetes, FastAPI, React, and Neo4j.