Data Center Replication

When working with multiple data centers, it is often important to make sure that if one data center goes down, another data center is fully capable of picking up its load and data. Data center replication is meant to solve exactly this problem. When data center replication is turned on, the GridGain in-memory database automatically makes sure that each data center consistently backs up its data to one or more other data centers. GridGain supports both active-active and active-passive replication modes.

Data Center Replication (or simply DR) is a GridGain feature that allows data transfer between caches in distinct topologies, possibly located in different geographic locations. Whenever a cache update occurs in one GridGain topology, it can be transparently transferred to other topologies.

The IgniteCache.clear operation does not affect remote data centers when using Data Center Replication.

Each distinct GridGain topology that is capable of sending or receiving cache updates is called a Data Center. Every data center has a unique ID between 1 and 31. This ID is assigned to all GridGain nodes in the topology.


Data Center Replication (DR) in GridGain is managed through the GridDr interface, which can be obtained from the GridGain plugin:

GridGain gg = Ignition.ignite().plugin(GridGain.PLUGIN_NAME);

GridDr dr = gg.dr();

Roles

There are four roles in a DR process: sender cache, sender hub, receiver hub and receiver cache.

  • Sender cache - a cache on a particular data node whose contents should be replicated. All updates to this cache are transferred to a sender hub.
  • Sender hub - a node that receives updates from a sender cache and sends them to receiver hubs in the remote cluster.
  • Receiver hub - a node that receives updates from a remote sender hub and routes them to a receiver cache.
  • Receiver cache - a cache on a particular data node that receives updates from a receiver hub and applies them.

A sender cache and sender hubs are located in the same cluster; likewise, a receiver cache and receiver hubs are located in the same cluster.

There are no restrictions as to which role can be assigned to a node. For example, a node can act as both sender and receiver hub, have a cache which is both sender and receiver, etc.

Replication Process Overview

At the conceptual level, replication is based on two requirements:

  1. Before replication is started, the data must be synchronized between the data centers.
  2. When replication is started, every change of data in the primary data center is replicated in the remote data center(s).

The first requirement is achieved by a process called state transfer. State transfer is the process of sending the entire cache content to the remote data center(s). State transfer must be triggered manually, either by calling GridDr.stateTransfer(String cacheName, byte... dataCenterIds) for the desired caches or via the Visor GUI (go to the Data Replication - Sender tab and click Bootstrap).

State transfer is required when:

  • You enable replication for the first time and need to synchronize data between data centers (if caches already contain data that wasn't synchronized).
  • You manually paused and then resumed replication (if the caches were updated during the pause).
  • Replication was paused due to a replication failure (for example, sender cache was unable to send data to the remote data centers).

The second requirement is implemented through the interaction between sender hubs and receiver hubs. Each cache whose content needs to be replicated (a sender cache) sends updates in batches to sender hubs. Sender hubs transfer the data to receiver hubs (nodes in the remote data center(s)). Receiver hubs put the data into the receiver caches (the remote counterparts of the sender caches).

Sender Groups

GridGain version 8.4.9 introduces a new approach to configuring and performing cache replication.

In the old approach, a cache is replicated through a number of nodes (called sender hubs). Each sender hub is configured to work with a specific set of caches, and its configuration cannot be changed dynamically. This means that dynamically created caches cannot be replicated (because the sender hubs are not aware of them).

In the new approach, the cache's data is sent to the remote data center(s) through a predefined group of nodes (a sender group). The cache only 'knows' the name of the group, and the cluster decides which nodes from the group will be used to replicate the cache's data. With this approach, if you create a cache dynamically, you can specify which group will be used to replicate that cache. The cluster will find the nodes from that group and will use them to transfer the cache's data to the remote data center(s), with no need to reconfigure the sender hubs.

In GridGain version 8.4.9 and later, the new approach is used by default. If you do not specify the group name in your cache configuration, the cache will be replicated through the default sender group, which includes all nodes configured as sender hubs. If you want to use the old approach, set GridGainConfiguration.setDrUseCacheNames(true).

If you start GridGain version >= 8.4.9 with an old configuration of sender nodes (created for version < 8.4.9) and do not set GridGainConfiguration.setDrUseCacheNames(true), the node will throw an exception and will not start.
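If you do need the old behavior, the flag is set on the GridGain plugin configuration. Below is a minimal sketch; the surrounding IgniteConfiguration wiring is assumed here to follow the standard GridGain plugin configuration pattern.

GridGainConfiguration ggCfg = new GridGainConfiguration();

// keep the pre-8.4.9 behavior: sender hubs are matched by cache names
ggCfg.setDrUseCacheNames(true);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setPluginConfigurations(ggCfg);

Ignite ignite = Ignition.start(cfg);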

To use the new approach, you need to:

  • Define one or multiple groups of nodes that will be used to send data to the remote data centers. Each of these nodes must be configured as a sender hub. Note that a node can be included in more than one group. To tell a node that it is included in a specific group or groups, use DrSenderConfiguration.setSenderGroups(String[]).
  • Configure each cache to send its data through a specific group. Specify the name of the group using CacheDrSenderConfiguration.setSenderGroup(String). A configuration sketch follows this list.
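The sketch below illustrates both steps under some assumptions: the group and cache names are made up, and attaching DrSenderConfiguration to the node via GridGainConfiguration.setDrSenderConfiguration (with the remaining sender hub settings omitted) is assumed to follow the usual GridGain plugin configuration pattern.

// Sender hub node: declare the sender groups this node belongs to.
DrSenderConfiguration drSenderCfg = new DrSenderConfiguration();
drSenderCfg.setSenderGroups(new String[] {"group_emea", "group_apac"});
// ... other sender hub settings (remote data center connections, etc.) omitted

GridGainConfiguration ggCfg = new GridGainConfiguration();
ggCfg.setDrSenderConfiguration(drSenderCfg);

// Cache configuration: route this cache's updates through one of those groups.
CacheDrSenderConfiguration cacheSenderCfg = new CacheDrSenderConfiguration();
cacheSenderCfg.setSenderGroup("group_emea");

GridGainCacheConfiguration cachePluginCfg = new GridGainCacheConfiguration();
cachePluginCfg.setDrSenderConfiguration(cacheSenderCfg);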

Dynamic Cache Replication

If you want to replicate a dynamically created cache, you need to set the sender group for that cache and make sure that the corresponding cache is created in the remote cluster. The following procedure outlines the steps involved in this process.

1) For caches created via the Java API, specify a sender group by calling the CacheDrSenderConfiguration.setSenderGroup(String) method, as shown in the example below.

CacheDrSenderConfiguration senderCfg = new CacheDrSenderConfiguration();

// setting the sender group name
senderCfg.setSenderGroup("cache_group_name");

GridGainCacheConfiguration cachePluginCfg = new GridGainCacheConfiguration()
    .setDrSenderConfiguration(senderCfg);

CacheConfiguration cacheCfg = new CacheConfiguration<>()
    .setPluginConfigurations(cachePluginCfg);

If you create caches using the CREATE TABLE command, the only way to specify the sender group name is to use a predefined cache template. You can create a cache template with the desired sender group (and other replication-specific properties) and pass it as a parameter to the CREATE TABLE command. The created cache will have the properties of the specified template. For more information and examples on how to use cache templates, see the Cache Template page in Ignite documentation.
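For illustration, here is a hedged sketch of that flow, assuming the template is registered programmatically via Ignite.addCacheConfiguration; the template name, table definition, and sender group name are placeholders for this example.

// Register a cache template that carries the desired sender group.
CacheDrSenderConfiguration senderCfg = new CacheDrSenderConfiguration();
senderCfg.setSenderGroup("cache_group_name");

CacheConfiguration templateCfg = new CacheConfiguration("replicatedTemplate");
templateCfg.setPluginConfigurations(
    new GridGainCacheConfiguration().setDrSenderConfiguration(senderCfg));

Ignition.ignite().addCacheConfiguration(templateCfg);

// The table can then be created from that template, for example:
// CREATE TABLE Person (id INT PRIMARY KEY, name VARCHAR) WITH "template=replicatedTemplate";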

2) Create a similar cache in the remote data center(s) that will receive replicated data.

3) Start using the cache. The data will be sent to the remote data centers.

4) If you started putting data into the cache before creating its counterpart in the remote data center(s), you have to transfer the cache's content to the remote data centers. This will ensure that the remote caches have exactly the same data and sending updates will not cause any issues.

To do a state transfer for the new cache, use the following code snippet:

GridGain gg = Ignition.ignite().plugin(GridGain.PLUGIN_NAME);

// dataCenterIds - the IDs (byte values) of the remote data centers to synchronize with
gg.dr().stateTransfer("new_cache_name", dataCenterIds);

Features

Batching and Asynchrony

Updates in a sender cache are not sent to the sender hub immediately. Instead, they are accumulated in batches. Once a batch becomes too large or too old, it is sent to a sender hub. There can be several batches scheduled for sending to a sender hub or waiting for an acknowledgment from it. If there are too many batches in the queue, further cache updates are suspended until there is room for a new batch. This is a form of back-pressure that prevents out-of-memory issues.

Filtering

You can provide an optional filter for cache updates by implementing the GridDrSenderCacheEntryFilter interface and passing it to CacheDrSenderConfiguration.setEntryFilter(). Updates that do not pass the filter are not replicated. See the Sender Cache section for more information.

Failover

GridGain supports various failover features guaranteeing replication batch delivery to all target data centers.

If a sender cache sends a replication batch to a sender hub and the sender hub goes down before acknowledging the batch, the sender cache automatically resends the batch to another available sender hub.

Cache updates are replicated from the primary node. For example, if you have a PARTITIONED cache with one backup and a key K for which N1 is the primary node and N2 is the backup node, an update to this key is replicated from N1. If node N1 goes down before sending the replication batch to a sender hub, the backup node N2 resends the update to the sender hub.

There is a time window between the moment a replication batch is received by a sender hub and its delivery to all receiver data centers. Batches that are pending delivery can optionally be saved in a persistent store, thus surviving a sender hub restart. For the configuration details, see the Sender Hub section.

If a sender hub sends a replication batch to a receiver hub and the receiver hub goes down before acknowledging the batch, the sender hub automatically resends the batch to another receiver hub in the same remote data center.

If replication fails, for example, when the remote data center is not available and the sender hubs cannot store the updated data due to a space shortage, the replication process is paused. Once the cause of the failure is eliminated, you need to manually perform a state transfer to force data synchronization and then resume replication by calling GridDr.resume(String cacheName).
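As an example, the recovery sequence can be sketched as follows; the cache name and the remote data center ID are placeholders.

GridGain gg = Ignition.ignite().plugin(GridGain.PLUGIN_NAME);

// force data synchronization with the remote data center (example ID: 2)
gg.dr().stateTransfer("myReplicatedCache", (byte)2);

// then resume replication for the cache
gg.dr().resume("myReplicatedCache");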

Complex Topologies

GridGain supports up to 31 data centers participating in DR and virtually any links between them. You can have data center A replicating updates to several other data centers. You can have data center A replicating to data center B, and then data center B forwarding these updates to data center C. You can have data centers A, B, and C replicating updates to each other. And so on. To avoid loops in such complex topologies, you can filter some updates on sender hubs. For example, you can configure that updates that occurred in data center A are not replicated from data center B to data center C, while updates that occurred in data center B are still replicated to data center C.

Data Center Replication and SSL

GridGain supports two methods for enabling secure connections between clusters when using DR. You can enable SSL across all channels (Discovery, Communication, etc.) via IgniteConfiguration.setSslContextFactory(), in which case SSL is enabled for DR automatically. If the SSL context factory is not defined, or if you want to override its settings, you can enable SSL just for DR via DrSenderConfiguration.setSslContextFactory and DrReceiverConfiguration.setSslContextFactory; replication will then be performed through a secure SSL channel created with this factory.

If the DrSenderConfiguration.setSslContextFactory and DrReceiverConfiguration.setSslContextFactory parameters are not present, the isUseIgniteSslContextFactory flag is evaluated. If that flag is set to true (the default value) and IgniteConfiguration.getSslContextFactory() exists, then the Ignite SSL context factory will be used to establish a secure connection between the clusters.
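As an illustration, a minimal sender-side sketch of enabling SSL only for DR might look as follows; the keystore paths and passwords are placeholders, and the receiver side is configured analogously via DrReceiverConfiguration.setSslContextFactory.

// SSL context factory used only for DR connections
SslContextFactory sslFactory = new SslContextFactory();
sslFactory.setKeyStoreFilePath("/path/to/keystore.jks");
sslFactory.setKeyStorePassword("keyStorePassword".toCharArray());
sslFactory.setTrustStoreFilePath("/path/to/truststore.jks");
sslFactory.setTrustStorePassword("trustStorePassword".toCharArray());

DrSenderConfiguration drSenderCfg = new DrSenderConfiguration();
drSenderCfg.setSslContextFactory(sslFactory);
// ... other sender hub settings omitted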

For more information, see SSL/TLS.

Pause/Resume

Replication for a particular sender cache can be paused either manually or automatically. Automatic replication pause occurs when GridGain detects that the data cannot be replicated for some reason. This includes the following situations:

  • There are no available sender hubs;
  • None of the sender hubs can process the replication batch (e.g. in case a persistent store is full or corrupted).

A manual pause can be performed through the DR API. Once paused, replication can only be resumed manually through the DR API.

Conflict Resolution

In active-active scenarios, when updates to the same cache occur in different data centers, updates to the same key coming from different data centers are referred to as a conflict. To ensure consistent conflict resolution across data centers, you can use a custom implementation of a conflict resolver.

Metrics

The GridDr interface provides access to various metrics interfaces that allow you to obtain replication-related statistics, such as the number of batches sent from and received by a cache or a hub. Each metrics interface can be obtained by calling the corresponding method of GridDr.

For example, the following code snippet can be used to obtain the number of batches that the myReplicatedCache cache has sent to sender hubs.

GridGain gg = Ignition.ignite().plugin(GridGain.PLUGIN_NAME);
GridDr dr = gg.dr();

// obtaining the number of batches sent for replication
// by the cache named 'myReplicatedCache'
int batchesSent = dr.senderCacheMetrics("myReplicatedCache").batchesSent();
