When working with multiple data centers it is often important to make sure that if one data center goes down, another data center is fully capable of picking its load and data. Data center replication is meant to solve exactly this problem. When data center replication is turned on, GridGain in-memory database will automatically make sure that each data center is consistently backing up its data to other data centers (there can be one or more). GridGain supports both active-active and active-passive modes for replication.
Data Center Replication (or simply DR) is a GridGain feature that allows transferring data between caches in distinct topologies, possibly located in different geographic locations. Whenever a cache update occurs in one GridGain topology, it can be transparently transferred to other topologies.
IgniteCache.clear operation does not affect remote data centers when using Data Center Replication.
Each distinct GridGain topology that is capable of sending or receiving cache updates is called a Data Center. Every data center has a unique ID between 1 and 31. This ID is assigned to all GridGain nodes in the topology.
The picture below illustrates a simple view of Data Center Replication.
Data Center Replication (DR) in GridGain can be managed via the GridDr interface, which can be obtained as follows:
GridDr dr = grid.dr();
There are four roles in a DR process: sender cache, sender hub, receiver hub and receiver cache.
- Sender cache - a cache on a data node whose contents should be replicated. All updates to this cache are transferred to a sender hub.
- Sender hub - a node that receives updates from a sender cache and sends them to receiver hubs in the remote cluster.
- Receiver hub - a node that receives updates from a remote sender hub and routes them to a receiver cache.
- Receiver cache - a cache on a data node that receives updates from a receiver hub and applies them.
Sender cache and sender hubs are located in the same cluster. Receiver cache and receiver hubs are located in the same cluster.
There are no restrictions as to which role can be assigned to a node. For example, a node can act as both sender and receiver hub, have a cache which is both sender and receiver, etc.
At the conceptual level, replication is based on two requirements:
- Before replication is started, the data must be synchronized between the data centers.
- When replication is started, every change of data in the primary data center is replicated in the remote data center(s).
The first requirement is achieved by a process called state transfer, which sends the entire cache content to the remote data center(s). State transfer must be performed manually, either by calling
GridDr.stateTransfer(String cacheName, byte... dataCenterIds) for the desired caches or via the Visor GUI (go to the Data Replication - Sender tab and click Bootstrap).
State transfer is required when:
- You enable replication for the first time and need to synchronize data between data centers (if caches already contain data that wasn't synchronized).
- You manually paused and then resumed replication (if the caches were updated during the pause).
- Replication was paused due to a replication failure (for example, sender cache was unable to send data to the remote data centers).
The second requirement is implemented through the interaction between sender hubs and receiver hubs. Each cache whose content needs to be replicated (sender cache) sends updates in batches to sender hubs. Sender hubs transfer data to receiver hubs (nodes in the remote data center(s)). Receiver hubs put data into the receiver caches (replicated sender cache).
GridGain version 8.4.9 introduces a new approach to configuring and performing cache replication.
In the old approach, a cache is replicated through a number of nodes called sender hubs. Each sender hub is configured to work with a specific set of caches, and its configuration cannot be changed dynamically. This means that dynamically created caches cannot be replicated, because the sender hubs are not aware of them.
In the new approach, cache’s data is sent to the remote data center(s) through a predefined group of nodes (sender groups). The cache only 'knows' the name of the group, and the cluster decides which nodes from the group will be used to replicate cache’s data. With this approach, if you create a cache dynamically, you can specify which group will be used to replicate that cache. The cluster will find the nodes from that group and will use them to transfer cache’s data to the remote data center(s) (with no need to reconfigure the sender hubs).
In GridGain version 8.4.9 and later, the new approach is used by default. If you do not specify the group name in your cache configuration, the cache will be replicated through the default sender group, which includes all nodes configured as sender hubs. If you want to use the old approach, set GridGainConfiguration.setDrUseCacheNames(true).
If you start GridGain version >= 8.4.9 with an old configuration of sender nodes (created for version < 8.4.9) and do not set
GridGainConfiguration.setDrUseCacheNames(true), the node will throw an exception and will not start.
To use the new approach, you need to:
- Define one or multiple groups of nodes that will be used to send data to the remote data centers. Each of these nodes must be configured as a sender hub. Note that a node can be included in more than one group. To tell a node that it is included in a specific group or groups, use
- Configure each cache to send its data through a specific group. Specify the name of the group using CacheDrSenderConfiguration.setSenderGroup(String).
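As a sketch of the sender-hub side of this setup, a node's group membership could be declared as follows. This is a configuration fragment for illustration only; DrSenderConfiguration.setSenderGroups and GridGainConfiguration.setDrSenderConfiguration are assumed here, so check the API of your GridGain version for the exact names.

```java
// Sketch only: configuring a node as a sender hub that belongs to two
// sender groups. Method names are assumptions; verify them against the
// API of your GridGain version.
DrSenderConfiguration senderHubCfg = new DrSenderConfiguration();
senderHubCfg.setSenderGroups("group1", "group2");

GridGainConfiguration ggCfg = new GridGainConfiguration()
    .setDrSenderConfiguration(senderHubCfg);
```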
If you want to replicate a dynamically created cache, you need to set the sender group for that cache and make sure that the corresponding cache is created in the remote cluster. The following procedure outlines the steps involved in this process.
1) For caches created via the Java API, specify a sender group by calling the
CacheDrSenderConfiguration.setSenderGroup(String) method, as shown in the example below.
CacheDrSenderConfiguration senderCfg = new CacheDrSenderConfiguration();

// setting the sender group name
senderCfg.setSenderGroup("cache_group_name");

GridGainCacheConfiguration cachePluginCfg = new GridGainCacheConfiguration()
    .setDrSenderConfiguration(senderCfg);

CacheConfiguration cacheCfg = new CacheConfiguration<>()
    .setPluginConfigurations(cachePluginCfg);
If you create caches using the CREATE TABLE command, the only way to specify the sender group name is to use a predefined cache template. You can create a cache template with the desired sender group (and other replication-specific properties) and pass it as a parameter to the CREATE TABLE command. The created cache will have the properties of the specified template. For more information and examples on how to use cache templates, see the Cache Template page in Ignite documentation.
2) Create a similar cache in the remote data center(s) that will receive replicated data.
3) Start using the cache. The data will be sent to the remote data centers.
4) If you started putting data into the cache before creating its counterpart in the remote data center(s), you have to transfer the cache's content to the remote data centers. This will ensure that the remote caches have exactly the same data and sending updates will not cause any issues.
To do a state transfer for the new cache, use the following code snippet:
GridGain gg = Ignition.ignite().plugin(GridGain.PLUGIN_NAME);

gg.dr().stateTransfer("new_cache_name", dataCenterIds);
Updates in a sender cache are not sent to a sender hub immediately. Instead, they are accumulated in batches. Once a batch becomes too big or too old, it is sent to a sender hub. Several batches may be scheduled for sending to a sender hub or waiting for an acknowledgment from it. If there are too many batches in the queue, further cache updates are suspended until there is room for a new batch. This is a form of back-pressure that avoids out-of-memory issues.
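The batching and back-pressure behavior can be illustrated with a simplified, self-contained sketch. This is not GridGain code; the class, its structure, and the size limits are invented purely for illustration:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch only (not GridGain code): a bounded batch queue that
// models how a sender cache accumulates updates into batches and applies
// back-pressure when too many batches await acknowledgment.
class BatchQueueSketch {
    static final int MAX_BATCH_SIZE = 3; // flush the current batch at this size
    static final int MAX_PENDING = 2;    // at most this many unacknowledged batches

    private final Queue<Queue<String>> pending = new ArrayDeque<>();
    private Queue<String> current = new ArrayDeque<>();

    // Returns false when back-pressure is in effect and the update must wait.
    boolean offer(String update) {
        if (current.size() == MAX_BATCH_SIZE) {
            if (pending.size() == MAX_PENDING)
                return false; // too many batches queued: suspend updates
            pending.add(current);
            current = new ArrayDeque<>();
        }
        current.add(update);
        return true;
    }

    // Called when a sender hub acknowledges the oldest batch.
    void ack() {
        pending.poll();
    }
}
```

Real sender caches flush batches by age as well as by size; the sketch omits the time-based trigger for brevity.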
You can provide an optional filter for cache updates by implementing the GridDrSenderCacheEntryFilter interface and passing it to CacheDrSenderConfiguration.setEntryFilter(). Updates that do not pass the filter are not replicated. See the Sender Cache section for more information.
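To illustrate the idea, the sketch below shows a simplified filter that replicates only entries whose keys have a given prefix. The interface shape here is hypothetical and stands in for the real contract defined by GridDrSenderCacheEntryFilter:

```java
// Hypothetical, simplified filter interface for illustration only;
// the actual GridGain interface is GridDrSenderCacheEntryFilter.
interface EntryFilterSketch<K, V> {
    boolean accept(K key, V val);
}

// Replicate only entries whose keys start with a given prefix;
// everything else stays local to this data center.
class KeyPrefixFilter implements EntryFilterSketch<String, Object> {
    private final String prefix;

    KeyPrefixFilter(String prefix) {
        this.prefix = prefix;
    }

    @Override public boolean accept(String key, Object val) {
        return key.startsWith(prefix);
    }
}
```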
GridGain supports various failover features guaranteeing replication batch delivery to all target data centers.
If a sender cache sends a replication batch to a sender hub and the sender hub goes down before acknowledging the batch, the sender cache automatically resends the batch to another available sender hub.
Cache updates are replicated from the primary node. For example, if you have a PARTITIONED cache with one backup and a key K for which N1 is the primary node and N2 is the backup node, then an update to this key is replicated from N1. If N1 goes down before sending a replication batch to the sender hub, the backup node N2 resends the update to the sender hub.
There is a time window between when a replication batch is received by a sender hub and batch delivery to all receiver data centers. The batches that are pending delivery can optionally be saved in a persistent store thus surviving a sender hub restart. For the configuration details, see the Sender Hub section.
If a sender hub sends a replication batch to a receiver hub and the receiver hub goes down before acknowledging the batch, the sender hub automatically resends the batch to another receiver hub in the same remote data center.
If replication fails, for example, when the remote data center is not available and the sender hubs cannot store updated data due to a space shortage, the replication process is paused. When the cause of the failure is eliminated, you need to manually perform a state transfer to force data synchronization and then resume replication by calling
GridGain supports up to 31 data centers participating in DR and virtually any links between them. You can have data center A replicating updates to several other data centers. You can have data center A replicating to data center B, and then data center B forwarding these updates to data center C. You can have data centers A, B and C replicating updates to each other. And so on. To avoid loops in such complex topologies, you can filter some updates on sender hubs. For example, you can configure that updates occurred in data center A should not be replicated from data center B to data center C, while still replicating updates occurred in data center B to data center C.
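A loop-avoidance rule of the kind described above can be sketched as a simple lookup keyed by origin and target data center IDs. This is illustrative only; in GridGain, such filtering is configured on the sender hubs:

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: a forwarding rule for a sender hub in data
// center B (ID 2). Updates that originated in data center A (ID 1) are not
// forwarded to data center C (ID 3), while B's own updates still are.
class ForwardingRule {
    // for each target data center, the origin IDs that must not be forwarded to it
    private final Map<Byte, Set<Byte>> blocked;

    ForwardingRule(Map<Byte, Set<Byte>> blocked) {
        this.blocked = blocked;
    }

    boolean shouldForward(byte originDcId, byte targetDcId) {
        Set<Byte> blockedOrigins = blocked.get(targetDcId);
        return blockedOrigins == null || !blockedOrigins.contains(originDcId);
    }
}
```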
GridGain supports two methods for enabling secure connections between clusters when using DR. You can enable SSL across all channels (Discovery, Communication, etc.) via IgniteConfiguration.setSslContextFactory(), in which case SSL is enabled for DR automatically. If SslContextFactory is not defined, or if you want to override its settings, you can enable SSL just for DR via DrReceiverConfiguration.setSslContextFactory; replication will then be performed through a secure SSL channel created with this factory.
If the DrReceiverConfiguration.setSslContextFactory parameters are not present, the isUseIgniteSslContextFactory flag is evaluated. If that flag is set to true (the default value) and IgniteConfiguration.getSslContextFactory() exists, then the Ignite SSL context factory is used to establish a secure connection between the clusters.
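A minimal sketch of the DR-only option might look like the following configuration fragment. The key store paths and passwords are placeholders, and the SslContextFactory setter names follow the Apache Ignite API:

```java
// Sketch: enabling SSL for DR only, overriding any cluster-wide factory.
// Paths and passwords below are placeholders.
SslContextFactory sslCtx = new SslContextFactory();
sslCtx.setKeyStoreFilePath("/path/to/keystore.jks");
sslCtx.setKeyStorePassword("keyStorePassword".toCharArray());
sslCtx.setTrustStoreFilePath("/path/to/truststore.jks");
sslCtx.setTrustStorePassword("trustStorePassword".toCharArray());

DrReceiverConfiguration drReceiverCfg = new DrReceiverConfiguration();
drReceiverCfg.setSslContextFactory(sslCtx);
```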
For more information, see SSL/TLS.
Replication for a particular sender cache can be paused either manually or automatically. Automatic replication pause occurs when GridGain detects that the data cannot be replicated for some reason. This includes the following situations:
- There are no available sender hubs;
- None of the sender hubs can process the replication batch (e.g., when the persistent store is full or corrupted).
A manual pause can be performed through the DR API. Once paused, replication can only be resumed manually through the DR API.
In active-active scenarios, when updates to the same cache occur in different data centers, updates to the same key from different data centers are referred to as a conflict. To ensure consistent conflict resolution across data centers, you can provide a custom implementation of a conflict resolver.
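A common deterministic strategy is last-write-wins with a fixed tie-break. The sketch below is illustrative only and does not use the actual GridGain conflict-resolver API; the Versioned class and its fields are invented for the example:

```java
// Illustrative sketch only (not the GridGain conflict-resolver API): a
// last-write-wins rule that picks the version with the newer timestamp and
// breaks ties by data center ID, so every data center resolves the same
// conflict to the same value.
class Versioned<V> {
    final V value;
    final long timestamp; // update time recorded by the originating data center
    final byte dcId;      // ID of the data center that produced this version

    Versioned(V value, long timestamp, byte dcId) {
        this.value = value;
        this.timestamp = timestamp;
        this.dcId = dcId;
    }
}

class LastWriteWins {
    static <V> Versioned<V> resolve(Versioned<V> existing, Versioned<V> incoming) {
        if (incoming.timestamp != existing.timestamp)
            return incoming.timestamp > existing.timestamp ? incoming : existing;
        // deterministic tie-break: the lower data center ID wins
        return incoming.dcId < existing.dcId ? incoming : existing;
    }
}
```

Note that timestamp-based resolution assumes reasonably synchronized clocks between data centers; the tie-break merely guarantees that all data centers converge to the same value.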
The GridDr interface provides access to various metrics interfaces that expose replication-related statistics, such as the number of batches sent from and received by a cache or a hub. Each metrics interface can be obtained by calling the corresponding method of GridDr.
For example, the following code snippet can be used to obtain the number of batches that the myReplicatedCache cache has transmitted to sender hubs.
GridDr dr = grid.dr();

// obtaining the number of batches sent for replication
// by the cache named 'myReplicatedCache'
int batchesSent = dr.senderCacheMetrics("myReplicatedCache").batchesSent();