About Client
Business Need
- Empowering Data Scientists: Providing tools and platforms to productize data science work, enabling seamless integration into pipelines.
- Accelerated Development Cycles: Offering configurable workflows and low-code environments for rapid development and deployment.
- Business Ability to Tweak Pipelines: Allowing businesses to customize pipelines, facilitating rapid adaptation to market changes.
- Real-time Data Processing: Supporting real-time or near-real-time processing for timely decision-making.
- Customizable Workflow: Providing configurable workflows to adapt to diverse processing requirements.
- Cost Efficiency and Scalability: Optimizing costs and ensuring scalability to handle growing data volumes efficiently.
Challenges
Raw data is often fragmented, dirty, and lacking in context, which makes it difficult to generate insights from it.
- Data Integration from Multiple Sources: Integrating data from diverse sources stored in varying formats and locations.
- Data Quality Assurance and Cleaning: Ensuring data integrity by addressing errors, inconsistencies, and missing values in raw data.
- Scalable and Efficient Data Processing: Handling large volumes of data efficiently using parallelization and distributed computing.
- Contextual Enrichment and Semantic Understanding: Adding context and semantics to raw data for meaningful insights.
- Real-time Data Processing and Analysis: Processing data streams in real-time for timely decision-making
Our Solution
We addressed this problem by building a Data Orchestration Ecosystem (DOE). The DOE, built upon Prefect, enables seamless data ingestion, transformation, and enrichment through a user-friendly interface requiring minimal coding. Its versatility and scalability make it adaptable to diverse requirements, allowing for effortless customization.
Features
- Ingesting data from multiple sources: The DOE can ingest data from a variety of sources, including databases, files, and APIs. This allows companies to combine data from different sources and analyze it together.
- Cleaning the data: The DOE can clean the data to remove errors and inconsistencies. This can be done using a variety of techniques, such as data validation, data scrubbing, and data normalization.
- Enriching the data with additional context: The DOE can enrich the data with additional context by adding information about the business processes that it relates to. This can be done by using data dictionaries, business rules, and domain knowledge.
- Automating the data processing tasks: The DOE can automate data processing tasks such as data ingestion, data cleaning, and data enrichment, saving companies time and resources (a simplified flow sketch follows this list).
- Managing machine learning models: The DOE can also be used to train and run ML models, which can analyze data and surface patterns that would be difficult to spot manually.
- Storage to Support Visualization: The DOE can store data in Neo4j, a graph database. Graph databases are well-suited for storing and analyzing data that has relationships between different entities.
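To make these capabilities concrete, here is a minimal sketch of what a DOE-style pipeline can look like when expressed as a Prefect flow. It assumes Prefect 2.x and pandas, and the file paths, column names, and enrichment logic are illustrative placeholders rather than the actual production pipeline.

```python
# Minimal sketch of a DOE-style pipeline as a Prefect flow.
# Paths, column names, and the region mapping are illustrative placeholders.
import pandas as pd
from prefect import flow, task


@task
def ingest(source_path: str) -> pd.DataFrame:
    """Read raw data from a file source (could equally be a database or API)."""
    return pd.read_csv(source_path)


@task
def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Basic quality checks: drop duplicates and rows missing a key field."""
    return df.drop_duplicates().dropna(subset=["customer_id"])


@task
def enrich(df: pd.DataFrame, region_map: dict) -> pd.DataFrame:
    """Add business context, e.g. map a raw code to a business-friendly region name."""
    df["region"] = df["region_code"].map(region_map)
    return df


@task
def store(df: pd.DataFrame, sink_path: str) -> None:
    """Persist the processed data to the configured sink."""
    df.to_parquet(sink_path, index=False)


@flow
def doe_pipeline(source_path: str, sink_path: str, region_map: dict) -> None:
    raw = ingest(source_path)
    cleaned = clean(raw)
    enriched = enrich(cleaned, region_map)
    store(enriched, sink_path)


if __name__ == "__main__":
    doe_pipeline(
        "raw/customers.csv",
        "processed/customers.parquet",
        {"01": "APAC", "02": "EMEA"},
    )
```

Because each step is a separate task, individual steps can be reconfigured or swapped without rewriting the rest of the pipeline.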
Data Orchestration Ecosystem (DOE)
Architecture Overview
Architecture of Data Orchestration Ecosystem (DOE)
Technical Details
The architecture leverages a combination of cloud-native services and open-source technologies to create a robust and scalable data orchestration system. By integrating Azure Triggers, Function Apps, Prefect workflows, Kubernetes for container orchestration, and specialized databases, the system ensures seamless data processing, machine learning model execution, and visualization. Below are the details of the major components of the DOE.
Azure Triggers and Function Apps
- Automate processes, execute tasks on a schedule, or respond to events in real-time, including triggering pipeline execution upon new file arrivals in Azure blob storage.
- Utilize Azure Function Apps, a serverless compute service, to execute code in response to events without managing the underlying infrastructure, including calling the Prefect API with configuration and input data (an illustrative sketch of such a function follows below).
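As an illustration, the sketch below shows roughly how such a function might look, assuming the Python programming model for Azure Functions with a blob trigger configured in function.json. The environment variable names, deployment ID, and the shape of the create_flow_run endpoint are assumptions made for the sketch, and authentication headers for Prefect Cloud are omitted.

```python
# Sketch of an Azure Function (Python) reacting to a new blob and kicking off
# a Prefect run via the REST API. The env var names, deployment ID, and the
# endpoint shape are illustrative assumptions, not exact production values.
import logging
import os

import azure.functions as func
import requests

PREFECT_API_URL = os.environ["PREFECT_API_URL"]   # Prefect server/Cloud API URL (assumed env var)
DEPLOYMENT_ID = os.environ["DOE_DEPLOYMENT_ID"]   # deployment of the DOE pipeline flow (assumed)


def main(myblob: func.InputStream) -> None:
    """Blob-trigger entry point (binding configured separately in function.json)."""
    logging.info("New file detected: %s (%s bytes)", myblob.name, myblob.length)

    # Ask Prefect to run the pipeline deployment, passing the blob path as a parameter.
    # Auth headers (e.g. for Prefect Cloud) are omitted for brevity.
    response = requests.post(
        f"{PREFECT_API_URL}/deployments/{DEPLOYMENT_ID}/create_flow_run",
        json={"parameters": {"source_path": myblob.name}},
        timeout=30,
    )
    response.raise_for_status()
    logging.info("Triggered Prefect flow run: %s", response.json().get("id"))
```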
Prefect Workflows
- Prefect enables defining, scheduling, and executing complex workflows with features like automatic retries, monitoring, and distributed execution.
- Implemented in Python for easy workflow definition and management, including versioning, serialization, and scheduling for reproducibility.
- Built-in monitoring and logging capabilities track workflow progress and performance, supporting distributed execution across multiple machines or in a cloud-based environment (a short scheduling and retry snippet follows this list).
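The snippet below illustrates the retry and scheduling features mentioned above, assuming Prefect 2.x or later; the flow name, endpoint, and cron expression are placeholders.

```python
# Illustrative use of Prefect's retry and scheduling features.
import requests
from prefect import flow, task


@task(retries=3, retry_delay_seconds=60)
def fetch_from_api(endpoint: str) -> list:
    """Transient network failures are retried automatically by Prefect."""
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()
    return response.json()


@flow(log_prints=True)
def nightly_refresh(endpoint: str = "https://example.com/api/data") -> None:
    records = fetch_from_api(endpoint)
    print(f"Fetched {len(records)} records")


if __name__ == "__main__":
    # Serve the flow with a nightly schedule; Prefect's UI then provides
    # run history, logs, and monitoring out of the box.
    nightly_refresh.serve(name="nightly-refresh", cron="0 2 * * *")
```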
Integration with Kubernetes
- Kubernetes enhances the scalability and flexibility of running data workflows in a containerized environment.
- It automates the deployment, scaling, and management of the containers that execute each pipeline run, as sketched below.
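For illustration, the sketch below shows what launching a single pipeline run as a Kubernetes Job looks like using the official Python client. In the actual system Prefect's Kubernetes integration creates these jobs; the image name, namespace, and environment variables here are hypothetical.

```python
# Minimal sketch of launching a pipeline run as a Kubernetes Job with the
# official Python client. Image, namespace, and env values are placeholders.
from kubernetes import client, config


def launch_pipeline_job(run_id: str, source_path: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    container = client.V1Container(
        name="doe-pipeline",
        image="myregistry.example.com/doe-pipeline:latest",
        env=[client.V1EnvVar(name="SOURCE_PATH", value=source_path)],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "doe-pipeline", "run": run_id}),
        spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"doe-pipeline-{run_id}"),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )

    # Submit the Job; Kubernetes schedules the container and retries on failure.
    client.BatchV1Api().create_namespaced_job(namespace="data-pipelines", body=job)


if __name__ == "__main__":
    launch_pipeline_job("demo-001", "raw/customers.csv")
```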
Web Application
- The web app's front end is developed using React, while the REST server is built with the FastAPI framework.
- Through the web app, users can easily create and configure their data pipeline using an intuitive interface. It empowers data scientists and ML engineers to update configuration through simple clicks and keyboard inputs.
- Furthermore, it lets users visualize the data ingested after pipeline execution, with filtering and search for easy exploration and analysis (an illustrative FastAPI endpoint is sketched below).
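The sketch below gives a flavor of such a REST server in FastAPI. The models, routes, and in-memory store are assumptions made for the example and stand in for the real DOE API and its configuration database.

```python
# Illustrative FastAPI endpoints for creating and retrieving a pipeline
# configuration; not the actual DOE API.
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="DOE Configuration API (sketch)")


class PipelineConfig(BaseModel):
    name: str
    source: str                      # e.g. an Azure blob container/path
    sink: str                        # e.g. a Neo4j or document-database URI
    cleaning_steps: list[str] = []   # e.g. ["drop_duplicates", "fill_missing"]
    ml_models: list[str] = []        # optional model names to run in the pipeline


# In-memory store standing in for the real configuration database.
_configs: dict[str, PipelineConfig] = {}


@app.post("/pipelines")
def create_pipeline(config: PipelineConfig) -> dict:
    """Persist a pipeline configuration created through the React front end."""
    pipeline_id = str(uuid4())
    _configs[pipeline_id] = config
    return {"pipeline_id": pipeline_id}


@app.get("/pipelines/{pipeline_id}")
def get_pipeline(pipeline_id: str) -> PipelineConfig:
    """Return a stored configuration so the UI can display or edit it."""
    if pipeline_id not in _configs:
        raise HTTPException(status_code=404, detail="Pipeline not found")
    return _configs[pipeline_id]
```

The React front end can then POST a configuration from its form and later GET it back for display or editing.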
Orchestration Process
- Data scientists utilize the Web App to create and configure the pipeline. They input details such as the file's source, the destination for processed data, and choose the cleaning operations and processing steps.
- Upon the file's arrival at the source, an Azure function set to detect incoming files triggers the pipeline.
- Leveraging the Prefect framework, the pipeline configuration is loaded, and a container is initiated within the Kubernetes cluster to execute the pipeline.
- Throughout pipeline execution, various tasks such as data cleaning, transformation, and ML predictions, as specified in the configuration, are carried out.
- Finally, the processed data is deposited into the sink, typically a graph or document database; a sketch of this step using the Neo4j driver follows below.
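As an example of the final step, the sketch below writes processed records to Neo4j using the official Python driver; the connection details, node labels, and properties are placeholders rather than the production schema.

```python
# Sketch of the "deposit to sink" step with the official neo4j driver.
# Connection details, labels, and properties are illustrative placeholders.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("neo4j", "password")  # use proper secrets management in practice


def write_customer_orders(records: list[dict]) -> None:
    """Persist processed rows as Customer->Order relationships in the graph."""
    query = """
    MERGE (c:Customer {id: $customer_id})
    MERGE (o:Order {id: $order_id})
    SET o.amount = $amount
    MERGE (c)-[:PLACED]->(o)
    """
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            for record in records:
                session.run(query, **record)


if __name__ == "__main__":
    write_customer_orders([
        {"customer_id": "C-100", "order_id": "O-1", "amount": 250.0},
        {"customer_id": "C-100", "order_id": "O-2", "amount": 80.0},
    ])
```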
Business Impact
- Efficient Insights Extraction: Significantly reduces the time needed to extract insights from data by simply updating the pipeline configuration, improving the speed of decision-making.
- Improved Data Quality: Enhances data quality by effectively cleaning and correcting errors, leading to more reliable insights and better decisions.
- Enhanced Operational Efficiency: By automating data processing tasks, it boosts operational efficiency, allowing data experts to focus on innovation and development.
- Cost Savings: Contributes to cost savings by automating tasks and improving data quality, resulting in reduced expenses for data management.
- User-Friendly Interface: Offers a simple and intuitive interface for data exploration, making it easier for users to understand and utilize extracted data for decision-making.
Technology
Prefect, Azure Functions, Azure Blob Storage, Kubernetes, Python, FastAPI, React, Neo4j