
About Client

Our customer is a marketing measurement company that provides a single source of truth for media investment decisions. Central to this mission is the collection of raw data from multiple marketing data sources through various methods such as APIs, emails, FTP, and more. The Data Ingestion Framework (DIF) facilitates this process by extracting, transforming, and loading data into a data warehouse for comprehensive analytics.

Business Need

Challenges

Our Solution

We’ve crafted a comprehensive data ingestion solution featuring 150+ integrations and low-code workflows for seamless API management. With intelligent rate throttling, sophisticated data preprocessing, and rapid loading mechanisms, the architecture ensures resilience and scalability. A secure Clean Room deployment option rounds out the solution, keeping data handling efficient and secure while putting data privacy first.

Extensive Integration Support: Support for 150+ integrations spanning search, sales, and analytics endpoints.

Low-Code Workflows: Repeatedly writing code to call APIs can be cumbersome, so we built a low-code platform with ready-made, reusable components covering a variety of API strategies (see the configuration sketch after this list):

  • Single API Calls: Effortlessly retrieve relevant data with a single API call.
  • Paginated APIs: Streamline data retrieval by efficiently handling paginated responses, ensuring all relevant data is captured.
  • Dependent APIs: Manage complex data dependencies by orchestrating sequential API calls to gather comprehensive insights.
  • Asynchronous Data Fetching: Seamlessly manage APIs that return data asynchronously by tracking job statuses and fetching data upon completion.
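
To make this concrete, here is a minimal sketch of what a JSON-driven workflow definition might look like. The shape and every field name (vendor, steps, pagination, async, and so on) are assumptions used only to illustrate composing single, paginated, dependent, and asynchronous API calls from reusable components; they are not the framework's actual schema.

```typescript
// Hypothetical workflow definition; field names are illustrative only.
const campaignReportWorkflow = {
  vendor: "example-ads-vendor",
  throttle: { maxRequestsPerMinute: 60 },          // request throttling
  steps: [
    {
      id: "listAccounts",                          // single API call
      request: { method: "GET", url: "https://api.vendor.example/v1/accounts" },
    },
    {
      id: "fetchCampaigns",                        // dependent, paginated call
      dependsOn: "listAccounts",
      forEach: "steps.listAccounts.response.accounts",   // JS expression over a prior response
      request: {
        method: "GET",
        url: "https://api.vendor.example/v1/accounts/{{item.id}}/campaigns",
      },
      pagination: { type: "cursor", cursorField: "nextPageToken", pageSize: 500 },
    },
    {
      id: "requestReport",                         // asynchronous report job
      dependsOn: "fetchCampaigns",
      request: { method: "POST", url: "https://api.vendor.example/v1/reports" },
      async: {
        pollStatusUrl: "https://api.vendor.example/v1/reports/{{jobId}}",
        until: "status === 'DONE'",                // poll job status, fetch on completion
      },
    },
  ],
};
```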

System Diagram

Architecture Overview

The architecture utilizes a blend of AWS services and open-source technologies to establish a resilient and scalable Data Ingestion Framework. Through integration with Redis, MySQL, and AWS services such as S3, Redshift, ECS, and CloudWatch, the system ensures smooth data ingestion. Below, we outline the key components of the framework:

Low-Code Step Executor

  • Interpret and execute a JSON-driven configuration outlining the data flow across multiple steps.
  • Employ MySQL to store customer-specific information, which serves as input to the workflow.
  • Enable the execution of straightforward JavaScript expressions based on the response from each step.
  • Capable of conducting paginated API calls and handling dependent API calls.
  • Configurable parameters for request throttling.
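
The following is a minimal sketch of how such a step executor might run a paginated step with request throttling. The configuration shape, field names, and the fixed-interval delay are assumptions for illustration, not the framework's actual implementation.

```typescript
// Illustrative step executor loop for a paginated API call (not the actual framework code).
type PaginatedStep = {
  url: string;
  cursorField: string;               // response field holding the next-page token
  maxRequestsPerMinute: number;      // throttling parameter from the workflow config
};

async function executePaginatedStep(step: PaginatedStep, apiKey: string): Promise<unknown[]> {
  const delayMs = 60_000 / step.maxRequestsPerMinute;  // simple fixed-interval throttle
  const rows: unknown[] = [];
  let cursor: string | undefined;

  do {
    const url = cursor ? `${step.url}?pageToken=${cursor}` : step.url;
    const res = await fetch(url, { headers: { Authorization: `Bearer ${apiKey}` } });
    if (!res.ok) throw new Error(`API call failed with status ${res.status}`);

    const body = (await res.json()) as Record<string, unknown>;
    rows.push(...((body.items as unknown[]) ?? []));
    cursor = body[step.cursorField] as string | undefined;  // follow pagination until exhausted

    await new Promise((resolve) => setTimeout(resolve, delayMs)); // respect the rate limit
  } while (cursor);

  return rows;
}
```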

Credential Manager

  • Leverage the AWS KMS service to encrypt secrets such as passwords, API keys, and refresh tokens, and store them in MySQL.
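
As a rough sketch of this pattern, a credential manager can encrypt a secret with KMS and persist only the ciphertext in MySQL. The key alias, table, and column names below are assumptions, not the actual schema.

```typescript
import { KMSClient, EncryptCommand } from "@aws-sdk/client-kms";
import mysql from "mysql2/promise";

const kms = new KMSClient({ region: "us-east-1" });

// Encrypt a plaintext secret with a KMS key and store only the ciphertext in MySQL.
async function storeSecret(customerId: number, name: string, plaintext: string): Promise<void> {
  const { CiphertextBlob } = await kms.send(
    new EncryptCommand({
      KeyId: "alias/dif-credentials",            // hypothetical key alias
      Plaintext: Buffer.from(plaintext, "utf8"),
    })
  );

  const conn = await mysql.createConnection({ host: "localhost", user: "dif", database: "dif" });
  await conn.execute(
    "INSERT INTO credentials (customer_id, name, ciphertext) VALUES (?, ?, ?)", // hypothetical table
    [customerId, name, Buffer.from(CiphertextBlob!)]
  );
  await conn.end();
}
```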

Resource Manager

  • Implemented a distributed counting semaphore using Redis for job management, ensuring a limited number of jobs per API.
  • Utilized account-based locks for vendors with shared accounts and customer-based locks for vendors with individual customer accounts.
  • Each job acquired a lock upon execution; once the limit was reached, additional jobs were prevented from starting, conserving resources.
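
A simplified sketch of such a distributed counting semaphore on Redis is shown below. The key naming scheme and per-vendor limit are illustrative assumptions; the atomic check-and-increment is done in a small Lua script so two jobs cannot both claim the last slot.

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Atomically acquire one slot if fewer than `limit` jobs currently hold the semaphore.
const ACQUIRE = `
  local current = tonumber(redis.call('GET', KEYS[1]) or '0')
  if current < tonumber(ARGV[1]) then
    redis.call('INCR', KEYS[1])
    return 1
  end
  return 0
`;

async function acquire(vendor: string, limit: number): Promise<boolean> {
  const key = `semaphore:${vendor}`;                       // e.g. one counter per API/vendor
  const result = (await redis.eval(ACQUIRE, 1, key, String(limit))) as number;
  return result === 1;
}

async function release(vendor: string): Promise<void> {
  await redis.decr(`semaphore:${vendor}`);                 // free the slot when the job finishes
}
```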

Data Preprocessor

  • Utilized a configuration-driven data preprocessor for transforming data and files into standard CSV format on S3.
  • Configurations stored in MySQL specified the expected data columns and the extraction methods for the Data Preprocessor's input.
  • Employed Node.js streams to efficiently manage large datasets without memory accumulation.
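
A minimal sketch of the streaming approach: vendor records are converted to CSV rows and uploaded to S3 as they flow through, so the full dataset is never held in memory. The bucket, key, and record fields below are assumptions for illustration.

```typescript
import { Readable, Transform } from "node:stream";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

type SpendRecord = { date: string; campaign: string; spend: number }; // hypothetical record shape

async function preprocessToS3(records: AsyncIterable<SpendRecord>): Promise<void> {
  // Convert a stream of JSON records into CSV lines without accumulating them in memory.
  const toCsv = new Transform({
    objectMode: true,
    transform(record: SpendRecord, _enc, cb) {
      cb(null, `${record.date},${record.campaign},${record.spend}\n`);
    },
  });

  const body = Readable.from(records).pipe(toCsv);          // stream records -> CSV rows

  const upload = new Upload({
    client: new S3Client({ region: "us-east-1" }),
    params: { Bucket: "dif-staging", Key: "vendor/report.csv", Body: body }, // hypothetical bucket/key
  });
  await upload.done();                                       // streamed multipart upload to S3
}
```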

Data Loader

  • Harnesses the power of Redshift’s COPY command to swiftly load large CSV files, ensuring optimal performance.
  • Implements a deduplication mechanism to eliminate duplicate records from the final dataset, enhancing data integrity.
  • Prevents potential deadlocks by employing a Redis-based semaphore approach, ensuring exclusive access to tables during data loading.
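
An illustrative sketch of the load-and-deduplicate pattern follows. Redshift speaks the PostgreSQL wire protocol, so a standard client can issue the COPY; the table names, S3 path handling, IAM role, and key columns are assumptions, not the actual schema.

```typescript
import { Client } from "pg";

// Load a CSV from S3 into a staging table, then merge into the target without duplicates.
async function loadCsv(s3Path: string): Promise<void> {
  const client = new Client({ host: "redshift.example.internal", port: 5439, database: "analytics" });
  await client.connect();
  try {
    await client.query("BEGIN");
    await client.query("CREATE TEMP TABLE stage (LIKE marketing.spend)");   // hypothetical target table
    await client.query(
      `COPY stage FROM '${s3Path}'
       IAM_ROLE 'arn:aws:iam::123456789012:role/dif-redshift-load'
       FORMAT AS CSV IGNOREHEADER 1`
    );
    // Deduplicate: remove rows already present for the same keys, then insert the fresh copy.
    await client.query(
      `DELETE FROM marketing.spend
       USING stage
       WHERE marketing.spend.date = stage.date AND marketing.spend.campaign_id = stage.campaign_id`
    );
    await client.query("INSERT INTO marketing.spend SELECT * FROM stage");
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    await client.end();
  }
}
```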

Clean Room

  • A secure deployment environment within the customer’s cloud infrastructure to safeguard sensitive data. 
  • Data is initially ingested into the customer’s cloud, undergoes encryption of personally identifiable information (PII), and is subsequently transferred to our cloud environment for further processing.
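
As a simplified sketch of the PII-encryption step performed inside the customer's cloud before any transfer: each sensitive field is encrypted, and only the ciphertext leaves the clean room. The field choice and key handling below are assumptions; in practice the key would come from a managed key store rather than being generated inline.

```typescript
import { createCipheriv, randomBytes } from "node:crypto";

// Encrypt a PII field with AES-256-GCM before the record leaves the clean room.
function encryptPii(value: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(value, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Ship IV + auth tag + ciphertext together so the record stays self-contained.
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

// Example: only the email leaves the customer's cloud, and only in encrypted form.
const key = randomBytes(32);                                 // placeholder for a managed key
const outbound = { userId: 42, email: encryptPii("jane@example.com", key), spend: 120.5 };
```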

Business Impact

Technology
