Our client is a marketing measurement company that provides a single source of truth for media investment decisions. The first step toward making these decisions is acquiring the data, and this raw data can be fetched from multiple sources through the APIs exposed by each channel. This is where the Data Ingestion Framework (DIF) comes into play. The framework is responsible for the classic ETL steps:
- Extract – Pull data from multiple sources, each with its own schema, for different clients.
- Transform – Convert the raw data received from the APIs into meaningful, usable data.
- Load – Load the transformed data into a warehouse for analytics and downstream applications.
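The three steps above can be sketched end to end. This is a minimal illustration, not the DIF itself: the payload, field names, and the CSV stand-in for the warehouse write are all hypothetical.

```python
import csv
import io
import json

# Hypothetical raw API payload; real sources return channel-specific schemas.
RAW_RESPONSE = json.dumps({
    "rows": [
        {"campaign": "spring_sale", "spend": "120.50", "clicks": "340"},
        {"campaign": "brand_awareness", "spend": "75.00", "clicks": "n/a"},
    ]
})

def extract(raw: str) -> list[dict]:
    """Extract: parse the raw API response into Python objects."""
    return json.loads(raw)["rows"]

def transform(records: list[dict]) -> list[dict]:
    """Transform: cast string fields to numeric types, dropping malformed rows."""
    clean = []
    for r in records:
        try:
            clean.append({
                "campaign": r["campaign"],
                "spend": float(r["spend"]),
                "clicks": int(r["clicks"]),
            })
        except (KeyError, ValueError):
            continue  # skip rows that fail validation
    return clean

def load(records: list[dict]) -> str:
    """Load: serialize clean records to CSV (stand-in for an S3/Redshift write)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["campaign", "spend", "clicks"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

csv_out = load(transform(extract(RAW_RESPONSE)))
```

In this toy run the second row is rejected during transform because its clicks field is not numeric, so only the valid row reaches the load step.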
- Integrating APIs of different natures – Sources can involve single API calls, dependent (chained) API calls, paginated API calls, rate-limited APIs, etc.
- Scalability – An extensible and scalable architecture that supports new client instances with simple configuration changes.
- Guaranteed data delivery – Meeting agreed data-delivery SLAs.
- Ingesting and cleaning high volumes of data – Some endpoints return amounts of data that cannot be held in memory.
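For the last challenge above, the usual approach is to iterate over the response in fixed-size batches rather than materializing it. A minimal sketch, where the line-yielding generator is a stand-in for a streamed HTTP response or file handle:

```python
from typing import Iterable, Iterator

def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches so the full payload never sits in memory."""
    batch: list[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

def fake_stream(n: int) -> Iterator[str]:
    """Simulate a large response arriving record by record."""
    for i in range(n):
        yield f"row-{i}"

processed = 0
for chunk in iter_chunks(fake_stream(10_000), chunk_size=500):
    processed += len(chunk)  # a real pipeline would clean/transform each chunk here
```

Because both the source and the consumer are generators, peak memory is bounded by `chunk_size`, not by the total response size.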
- Storing API credentials – Credentials of different types (access tokens, refresh tokens, API keys, etc.) are encrypted before being stored in the database, and the encrypted values are decrypted when the API call is made. Encryption and decryption are handled via AWS KMS.
- Ingesting data – The ingestion process handles different types of APIs, such as paginated, dependent, and rate-limited API calls. Rate limiting is enforced using Redis distributed locks.
- Handling data – Raw responses can arrive in any format (CSV, JSON, XML, ZIP, etc.). The first step is to convert the response to JSON; cleaning and transformation of the data then follow. For payloads too large to handle in memory, streaming is used to process the data in small chunks.
- Loading data – Once clean data is available, it is converted to CSV and loaded into AWS S3 for easy readability and tracking. The same data is loaded into AWS Redshift, where downstream applications can query it for analytics.
- Deployment – Jenkins is used for CI/CD. AWS Lambda and AWS ECS serve as runtime environments and are scheduled to run the ETL process multiple times a day.
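The paginated-call handling mentioned above typically follows a cursor until the API reports no further pages. A small sketch, with a hardcoded dictionary standing in for a real HTTP endpoint and the cursor names invented for illustration:

```python
from typing import Iterator, Optional

# Fake paginated endpoint: each page returns records plus a next-page cursor.
# A real source would be an HTTP GET carrying the cursor as a query parameter.
PAGES = {
    None: (["a", "b"], "p2"),
    "p2": (["c", "d"], "p3"),
    "p3": (["e"], None),  # last page: no next cursor
}

def fetch_page(cursor: Optional[str]) -> tuple[list[str], Optional[str]]:
    return PAGES[cursor]

def fetch_all() -> Iterator[str]:
    """Follow next-page cursors until the API signals there are no more pages."""
    cursor = None
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break

records = list(fetch_all())
```

Yielding records as they arrive also composes naturally with the chunked processing used for large payloads.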
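The Redis distributed locks used for rate limiting usually rely on Redis's atomic "set if not exists, with expiry" operation. The sketch below captures that logic with an in-memory stand-in for the Redis client (in production this would be `SET key value NX PX` via a real client such as redis-py); the API name and TTL are illustrative.

```python
import time
import uuid
from typing import Optional

class InMemoryLockStore:
    """Stand-in for a Redis client; mimics SET key value NX PX semantics."""
    def __init__(self) -> None:
        self._data: dict[str, tuple[str, float]] = {}

    def set_nx_px(self, key: str, value: str, ttl_ms: int) -> bool:
        now = time.monotonic()
        entry = self._data.get(key)
        if entry and entry[1] > now:
            return False  # lock held by someone else and not yet expired
        self._data[key] = (value, now + ttl_ms / 1000)
        return True

    def delete_if_owner(self, key: str, value: str) -> None:
        entry = self._data.get(key)
        if entry and entry[0] == value:
            del self._data[key]  # only the owner may release the lock

def acquire_rate_limit_lock(store: InMemoryLockStore, api_name: str,
                            ttl_ms: int = 1000) -> Optional[str]:
    """Try to take the per-API lock; return an owner token, or None if held."""
    token = str(uuid.uuid4())
    if store.set_nx_px(f"lock:{api_name}", token, ttl_ms):
        return token
    return None

store = InMemoryLockStore()
first = acquire_rate_limit_lock(store, "facebook_ads")
second = acquire_rate_limit_lock(store, "facebook_ads")  # blocked while held
store.delete_if_owner("lock:facebook_ads", first)
third = acquire_rate_limit_lock(store, "facebook_ads")   # succeeds after release
```

The unique owner token plus the expiry guard against two classic failure modes: a crashed worker holding the lock forever, and one worker releasing a lock another worker has since acquired.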
- Data availability – Data is fetched from multiple sources on a daily schedule so that the latest possible data is available for analysis.
- Handling high-volume data – Large datasets are ingested and processed in chunks, keeping memory and system load low.
- Secrets management – Passwords and other sensitive credentials are managed easily and securely.
- Resource management – Shared resources are consumed judiciously with the help of Redis distributed locks.