Data synchronization is some of the most difficult logic to write. There are many interactions and transformations to express, and many factors to consider. This article discusses synchronization in the context of integrating with third-party back-end software such as ERP and CRM systems. If you find yourself responsible for creating and implementing synchronization strategies, it will save you a lot of time to list and consider each of the issues. I’ve spent the past three years or better thinking about and planning different aspects of synchronization logic, at different stages of product maturity, and conversations occassionally fire up about it, with diagrams being drawn and plans being outlined and reworked. As we implement it, many aspects have emerged and occassionally surprise us.
Existing & Upcoming Technologies
Remote Data Access (RDA) is pretty nice. In simple applications, I have no doubt it serves its purpose well. But when we’re talking about enterprise-scale applications that need complete control over synchronization behavior (such as collision handling), as well as data shaping/transformation, more is needed. Merge synchronization provides an ability to add some custom collision handling, and it requires schema updates and table schema locks. This is unfriendly to back-end systems that we integrate into. We want to have as light a footprint on external systems as possible. What would happen if our customer’s DynamicsGP database upgrade failed because we were locking the tables with synchronization mechanisms? Not a great idea. This is really too bad, because so many of the ugly details are taken care of nicely by SQL Server replication.
Microsoft Synchronization Services looks like a fascinating new unification of synchronization services for the Microsoft platform, but unfortunately it doesn’t look like Windows Mobile will be supporting this for a while. We needed a solution to deliver to customers now, and can’t wait around, and this is how we come to be involved with this.
I also needed to make sure that multiple back-end company databases could be supported, or that it could run viably without any back-end system. How do you easily set up SQL Server merge replication with a growing set of back-end databases? Can databases be updated easily? It’s about more than product functionality. It’s also about the development process and the costs associated with accurrately performing upgrades (without losing data or falling into an unusable state). Handshakes, recovery tactics, and other protocol details can become sophisticated and difficult to troubleshoot. More about this later.
Data Transformations
What data do you need? How can you map from source and destination of data, plus any transformations in between, in a way that’s as transparent and efficient as possible? This will depend on whether you use an ORM library or direct SQL access. Do you store these data replications centrally or in a distributed way? If distributed, do they themselves get synchronized in some default way? How easy is this metadata to maintain and refactor?
Security & Performance
If your data isn’t secure, you don’t have a viable enterprise system. Not only does data need to be encrypted (and those keys managed well), but you need to restrict access to that data based on roles, users, and ultimately permissions. Sarbanes-Oxley has some strict guidelines that you’ll have to play along with for any of your publicly-traded customers.
Another major concern is performance. Because mobile devices may be synchronizing over slow connections (cellular modems and cell phones), which can be 50 times slower than typical high-speed connections, synchronization speed is crucial. You may be pulling tens or hundreds of thousands of rows of data without the right shortcuts in place. These are the typical ones:
- Minimizing scope of data vertically (data filtering) and horizontally (data shaping). Data can be filtered by any other aspect of your data model that makes sense, and because we don’t always need every column on the server, we can store only that subset we absolutely require on the mobile device or other client node.
- Remoting technology choice. Binary serialization is must faster than XML, but requires third-party or custom solutions on Compact Framework.
- Compression of messages between client and server. Even with the additional processing burden to handle the compression, the overall improvement in throughput is significant.
- Multithreading of synchronization tasks. Some tasks can be performed in parallel, while some tasks (and in some contexts, whole synchronizations) can be performed while the user is doing something else (where it’s safe, such as sitting idle in the menu).
Simplifying Data Flow
It’s tempting to let any data object update any other, but in information systems that are critical to the success of a business, every update typically requires some traceability and accountability. Transactions are used to formally define an addition or update. In integration scenarios, you usually hand off a message which has second-order affects on other tables in that system when executed. How this is done is not usually visible to you.
By stereotyping data flows and differentiating a pull of reference data (from an external system to your system or one if its clients) and a push of transaction data (from your system to the external system), the data synchronization challenge can be divided and therefore simplified. Solving separate push and pull strategies is much easier than trying to implement a true generic merge replication. Reference data pulls are defined as data-shaped transformations. Transaction pushes are defined as queued invocations on third-party entry points. Transactions can have an affect on reference data, which will then be pulled down and therefore update your own system. Data flow is circular.
Using explicit transactions provides some other advantages.
- We are able to protect the ERP/CRM from our system. We can’t assume that the back-end system will accept the messages we send it. They might be formatted poorly, or may contain invalid data that we were unable to identify or validate. If our attempt is rejected, we may need to execute some compensating actions.
- We can provide an approval mechanism to support more complex business workflows, and provide checkpoints before the transactions are submitted or posted.
- We create an audit trail of data updates for business analysis and system troubleshooting.
- We can support integrations with different back-end products. By defining a data model that is abstracted away from any specific back-end system’s model, we have the ability to build integration adapters for each back-end system. This allows us to transform our generic transactions into the vendor-specific ERP or CRM transactions. By swapping out the adapters, we can switch our whole integration to another vendor’s system.
Location-Based Behavior
Being occassionally disconnected presents some big challenges. While connected we can immediately submit transactions, which updates reference data that we have access to in real time. When disconnected, however, we may enqueue transactions that should have an impact on reference data. Because these transactions don’t normally affect reference data until they run on the server, data cached on the disconnected device can become stale—and useless after a while, in many cases.
Just as ASP.NET developers learn the page request event lifecycle, it’s easy to see why a transaction (in the journey that it takes) could have different behaviors in different environments and at different stages. On the handheld, queueing a transaction may need to make an adjustment to reference data so that subsequent workflows can display and use up-to-date data. The eventual success of that transaction can’t be guaranteed, but for that unit of work, that daily work session, it’s as close to the truth as we can get.
Detecting the state of transactions of the host system you’re integrating into can be tricky, too. Sometimes when you save or enqueue a transaction in the back-end system, you get immediate changes to its own reference tables, or it may only update reference data only once it’s approved and “posted”. Only experience and experimentation in working with these systems (lots of trial and error) can give you the necessary insight to do it right.
Schema Versioning
You should define your data schema correctly from the beginning, and then never change it. (Yeah, right.) Unfortunately, requirements change, scope expands, products follow new paths, and schemas will have to change, too. This means you’ll need some way of identifying, comparing, and dealing with version conflicts. Do you define a “lock-down” period during which existing transactions can be submitted to the server but new transactions can’t be started? What happens if the server gets updated and then a client with the old schema finally connects? Can you update the schema with the data in place before submitting the transaction, or have you defined a transaction version format converter? Can schema upgrades deal with multiple schema version hops? Can it roll back if necessary? Do you tell your customers “tough luck” in some of those scenarios?
Summary
This discussion was meant as a rough outline, a preliminary introduction to some of the issues, and is not meant as a comprehensive account of data synchronization in distributed architectures. Fortunately, there is a managable core of useful functionality that can be implemented without solving every conceivable problem mentioned above. There are always opportunities for optimizations and tightening down. I’ll be returning to this topic many times as I discover and work with new scenarios and design goals.