
About the Client

Our client is a marketing measurement company that provides a single source of truth for media investment decisions. The first step towards making these decisions is having the data, and this raw data can be fetched from multiple sources through the APIs each channel exposes. This is where the Data Ingestion Framework (DIF) comes into play.

Business Need

  • Extract – Pull data from different sources, each with its own schema, for different clients.
  • Transform – Convert the raw data received from the APIs into meaningful, usable data.
  • Load – Load the data into a warehouse for analytics and downstream applications.

Challenges

  • Integrating APIs of different natures – API sources vary widely: single API calls, dependent API calls, paginated API calls, rate-limited APIs, etc.
  • Scalability – An extensible and scalable architecture is needed to support new client instances with simple configuration changes.
  • Guaranteed data delivery SLA
  • Ingesting and cleaning high volumes of data – Some endpoints return more data than can be held in memory.

Our Solution

  • Storing API credentials – Credentials of different types (access tokens, refresh tokens, API keys, etc.) are encrypted before being stored in the database and decrypted only when the API call is made. Encryption and decryption are handled via AWS KMS (see the KMS sketch after this list).
  • Ingesting data – The ingestion process is built to handle different types of APIs, such as paginated, dependent, and rate-limited API calls. Rate limiting is enforced with Redis distributed locks (see the Redis sketch after this list).
  • Handling data – The raw data received can be in any format, such as CSV, JSON, XML, or ZIP. The first step is to transform the response into JSON; cleaning and transformation of the data then take place. For data too large to handle in memory, streaming is used to process it in small chunks (see the streaming sketch after this list).
  • Loading data – Once clean data is available, it is converted to CSV and loaded into AWS S3 for easy readability and tracking. The same data is loaded into AWS Redshift, where downstream applications can query it for analytics (see the load sketch after this list).
  • Deployment – Jenkins is used for CI/CD. AWS Lambda and AWS ECS serve as the runtime environments and are scheduled to run several times a day to perform the ETL process.
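
A minimal sketch of the credential handling described above, assuming Python with boto3; the key alias, function names, and base64 storage format are illustrative assumptions, not the client's actual implementation:

    import base64
    import boto3

    kms = boto3.client("kms")
    KEY_ID = "alias/dif-credentials"  # hypothetical KMS key alias

    def encrypt_credential(plaintext: str) -> str:
        """Encrypt an API credential before persisting it to the database."""
        resp = kms.encrypt(KeyId=KEY_ID, Plaintext=plaintext.encode("utf-8"))
        # Store the ciphertext as base64 text so it fits a normal text column.
        return base64.b64encode(resp["CiphertextBlob"]).decode("ascii")

    def decrypt_credential(stored: str) -> str:
        """Decrypt a stored credential just before making the API call."""
        resp = kms.decrypt(CiphertextBlob=base64.b64decode(stored))
        return resp["Plaintext"].decode("utf-8")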
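
Rate limiting with a Redis distributed lock, as mentioned in the ingestion step, could look roughly like this; redis-py and requests are assumed, and the key names, interval, and pagination fields are placeholders:

    import time
    import redis
    import requests

    r = redis.Redis(host="localhost", port=6379)  # assumed Redis endpoint
    MIN_INTERVAL_SECONDS = 1.0  # hypothetical limit: one call per second per source

    def rate_limited_get(source: str, url: str, **kwargs) -> requests.Response:
        """Throttle calls to one source across all workers via a Redis lock."""
        # The lock key is per source, so different sources never block each other.
        with r.lock(f"rate-limit:{source}", timeout=30, blocking_timeout=60):
            last_call = r.get(f"last-call:{source}")
            if last_call is not None:
                wait = MIN_INTERVAL_SECONDS - (time.time() - float(last_call))
                if wait > 0:
                    time.sleep(wait)
            r.set(f"last-call:{source}", time.time())
        return requests.get(url, **kwargs)

    def fetch_paginated(source: str, url: str):
        """Walk a paginated endpoint, one rate-limited call per page."""
        while url:
            payload = rate_limited_get(source, url).json()
            yield payload.get("data", [])
            url = payload.get("next")  # hypothetical pagination field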
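
For endpoints whose responses are too large for memory, chunked streaming along these lines could be used; the CSV example, batch size, and URL are assumptions for illustration:

    import csv
    import requests

    CHUNK_ROWS = 5000  # hypothetical batch size

    def stream_csv_in_batches(url: str, **kwargs):
        """Stream a large CSV response and yield rows as dicts in small batches."""
        with requests.get(url, stream=True, **kwargs) as resp:
            resp.raise_for_status()
            # iter_lines keeps only a small window of the response in memory.
            lines = (line.decode("utf-8") for line in resp.iter_lines() if line)
            batch = []
            for row in csv.DictReader(lines):
                batch.append(row)
                if len(batch) >= CHUNK_ROWS:
                    yield batch
                    batch = []
            if batch:
                yield batch

    # Each batch can then be cleaned, transformed, and written out independently:
    # for batch in stream_csv_in_batches("https://api.example.com/report.csv"):
    #     clean_and_store(batch)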
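
Loading the cleaned CSV into S3 and then into Redshift via a COPY command might be sketched as follows; the bucket, table, IAM role, and connection details are hypothetical, and psycopg2 is assumed for the Redshift connection:

    import boto3
    import psycopg2

    s3 = boto3.client("s3")

    def load_to_s3_and_redshift(local_csv: str, bucket: str, key: str,
                                table: str, iam_role: str, conn_params: dict):
        """Upload the cleaned CSV to S3, then COPY it into a Redshift table."""
        s3.upload_file(local_csv, bucket, key)
        copy_sql = (
            f"COPY {table} FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1;"
        )
        conn = psycopg2.connect(**conn_params)
        try:
            with conn:  # commits the COPY transaction on success
                with conn.cursor() as cur:
                    cur.execute(copy_sql)
        finally:
            conn.close()

Keeping the CSV copy in S3 before the COPY gives an easily inspectable audit trail and makes reloads straightforward, which matches the "easy readability and tracking" goal above.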

System Diagram

Business Impact

  • Data availability – Data is fetched from multiple sources on a daily schedule, so the latest possible data is available for analysis.
  • Handling high-volume data – Large volumes of data are ingested and processed with ease, without putting a heavy load on the system.
  • Secrets management – Passwords and sensitive credentials are managed easily and securely.
  • Resource management – Shared resources are consumed judiciously with the help of Redis distributed locks.

Technology
