Data Sync System Design

Syncing customer data or configuration across omnichannel B2C offerings is one of the most common use cases that we address as a prime system integrator. This article walks through a sample system design that addresses this integration use case and can be hosted cost-effectively on a cloud hyperscaler such as AWS or Azure.

System Design Approach & Considerations

One way to design the sync solution is a Change Data Capture (CDC), event-driven approach that replicates data from source to target, so that the work done is directly proportional to the number of changes made in the source.

  • The proposed solution would be highly available, propagating changes in near real time, and elastic enough to scale with variable load.
  • The proposed solution would abstract away the ETL layer and its transformation overhead, including scheduled job runs and monitoring.
  • The messaging system runs on data change events and can retain events when the target system is unavailable.
  • Events are stored persistently and durably, but historical data is truncated.
  • Message sizes are in the kilobyte range.
  • Throughput is high.

High Level System Design

Change Data Capture (CDC) Module

This module identifies and tracks changes to a preconfigured set of models that are to be synced, or to a selected part (a subset or specific attributes) of the model data, across different internal or external systems. CDC provides real-time or near-real-time movement of data by moving and processing data continuously as new data change events occur. This module consists of various types of connectors for obtaining the change data.

CDC Push Connector

In this approach, the source system does the heavy lifting. It implements the logic and processes to capture changes, then sends those updates to the data sync solution through a data change post connector. The records are then pushed to the broker for further processing.
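The push flow can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `post_change`, `CustomerStore`, and the event fields are hypothetical names, and an in-process `queue.Queue` stands in for the broker.

```python
import json
import queue
from datetime import datetime, timezone

# Stand-in for the data sync broker (assumption); a real deployment would
# use a managed messaging service rather than an in-process queue.
broker = queue.Queue()


def post_change(model, operation, record):
    """Hypothetical data change post connector: wrap a change and push it to the broker."""
    event = {
        "model": model,
        "operation": operation,  # "insert" or "update" in this sketch
        "record": record,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    broker.put(json.dumps(event))
    return event


class CustomerStore:
    """Toy source system that does the heavy lifting by capturing its own changes."""

    def __init__(self):
        self._rows = {}

    def save(self, customer_id, attributes):
        # Decide the operation before overwriting the stored row.
        operation = "update" if customer_id in self._rows else "insert"
        self._rows[customer_id] = attributes
        post_change("customer", operation, {"id": customer_id, **attributes})


store = CustomerStore()
store.save(1, {"name": "Ada"})     # emits an "insert" event
store.save(1, {"name": "Ada L."})  # emits an "update" event
```

The point of the sketch is that change detection lives inside the source system's own write path, which is why the push connector places more of a burden on the source than the pull connector does.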

CDC Pull Connector

In this approach, the source system has less to do than with the push connector, since it only logs the data changes. It is the target system's responsibility to poll continuously to retrieve the changes, which are then pushed to the data sync solution's broker for further processing.

There are various patterns for detecting changes, such as:

  • Timestamp-based: Each model record stores its last updated time, and the pull connector reads the records that are newer than the timestamp it last saw.
  • Log-based: The pull connector reads the transaction log and determines the model data changes from it.
  • Trigger-based: Shadow records are created for each data change, and the pull connector reads those records. This is one of the best-suited methods for an RDBMS with trigger support.
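As an illustration of the timestamp-based pattern, here is a minimal sketch using an in-memory SQLite database. The `customer` table, its integer `updated_at` column, and the `pull_changes` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customer (id, name, updated_at) VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 105)],
)


def pull_changes(conn, last_seen):
    """Return rows updated after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customer "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Keep the old mark when nothing changed, otherwise advance to the newest row.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


first_batch, mark = pull_changes(conn, 0)      # both rows are new
conn.execute(
    "UPDATE customer SET name = ?, updated_at = ? WHERE id = ?", ("Ada L.", 110, 1)
)
second_batch, mark = pull_changes(conn, mark)  # only the updated row
```

Note that a poll like this cannot see hard deletes, and a strict `>` comparison can skip a row that is committed later with the same timestamp as the current mark. These limitations are among the reasons to consider the log-based or trigger-based patterns.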

Data Change Event Transformer

In this module, each data change record is examined, enriched if required, has its critical attributes masked or encrypted as per data governance guidelines, and is then transformed into a common data exchange model. If certain types of records require deep processing, such as heavy enrichment involving other systems, those records are pushed to a separate queue for further processing. A ticket or alert is created for each record that encounters a processing error. Successfully processed records are posted to topics with appropriate subjects so that the various data change consumers can listen. These topics support filters, so consumers can consume only the types of data change events they require.
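A rough sketch of this flow is shown below. Everything in it is illustrative: the masked field set, the list of "heavy" models, the `subject` naming scheme, and the use of plain lists in place of a topic, a queue, and an alerting system are all assumptions.

```python
import hashlib

MASKED_FIELDS = {"email"}         # assumed data governance rule
HEAVY_MODELS = {"order_history"}  # assumed models that need deep enrichment

# Plain lists stand in for the topic, the deep-processing queue, and alerting.
topic, heavy_queue, alerts = [], [], []


def mask(value):
    """One-way mask: keep a short SHA-256 digest instead of the raw value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def transform(change):
    """Map a raw change record to a hypothetical common exchange model."""
    record = {
        key: mask(val) if key in MASKED_FIELDS else val
        for key, val in change["record"].items()
    }
    return {
        "subject": f'{change["model"]}.{change["operation"]}',
        "payload": record,
    }


def process(change):
    try:
        if change["model"] in HEAVY_MODELS:
            heavy_queue.append(change)  # deferred for deep enrichment
            return
        topic.append(transform(change))
    except Exception as exc:            # any processing failure becomes an alert
        alerts.append({"change": change, "error": repr(exc)})


process({"model": "customer", "operation": "update",
         "record": {"id": 1, "email": "a@example.com"}})
process({"model": "order_history", "operation": "insert", "record": {"id": 7}})
process({"model": "customer", "operation": "update"})  # malformed: no "record"
```

After these three calls, one event sits on the topic with its email masked, one record is waiting in the deep-processing queue, and the malformed record has produced an alert rather than stopping the pipeline.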

Data Change Event Consumer

Each consumer subscribes to the topic and uses a filter to consume only the required types of changes. Each record is pushed to the queue for its type of data change event. Events of each type are then processed and pushed to the target system's specific endpoint.
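The subscribe, filter, queue, and deliver steps might look like the following sketch. The `Topic` and `Consumer` classes are in-memory stand-ins invented for illustration, and the `send` callable is a placeholder for a call to the target system's endpoint.

```python
from collections import defaultdict, deque


class Topic:
    """In-memory stand-in for a topic that supports per-subscriber filters."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, predicate, handler):
        self._subscribers.append((predicate, handler))

    def publish(self, event):
        # Deliver the event only to subscribers whose filter accepts it.
        for predicate, handler in self._subscribers:
            if predicate(event):
                handler(event)


class Consumer:
    """Queues matching events per event type, then drains them to a target."""

    def __init__(self, send):
        self.queues = defaultdict(deque)  # one queue per event type
        self._send = send                 # stands in for the target endpoint

    def enqueue(self, event):
        self.queues[event["subject"]].append(event)

    def drain(self):
        for subject, pending in self.queues.items():
            while pending:
                self._send(subject, pending.popleft())


delivered = []
crm = Consumer(send=lambda subject, event: delivered.append((subject, event["payload"])))

changes_topic = Topic()
changes_topic.subscribe(lambda e: e["subject"].startswith("customer."), crm.enqueue)

changes_topic.publish({"subject": "customer.update", "payload": {"id": 1}})
changes_topic.publish({"subject": "product.update", "payload": {"id": 9}})  # filtered out
crm.drain()
```

Keeping one queue per event type lets each type be processed and retried independently, so a slow or failing target endpoint for one type does not block the others.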

Cost-Effective Data Sync Design using AWS

We could realize the above data sync solution cost-effectively using a hyperscaler such as AWS, Azure, or GCP, which offers advantages such as a low-code approach, shorter time to market, automated deployments, scalability, security, recovery, and reliability.

The data sync solution above could be implemented in AWS as below.

Conclusion

The data sync solution above could be extended to handle large-scale data integration. We might need to introduce new modules to fulfil the specific needs of certain industries.
