Cloud Exchange High Availability
This document describes how High Availability (HA) works in Cloud Exchange. Review the architecture diagram, feature list, prerequisites, and sizing guidelines, then deploy HA in Cloud Exchange. The sections after deployment cover migrating, upgrading, hardening, known limitations, and troubleshooting.
HA Architecture
Features
- Active-Active configurations for Netskope CE nodes, enhancing system availability and fault tolerance.
- The Netskope CE Core container is engineered to function as a dedicated worker capable of handling multiple tasks concurrently. For a medium-sized setup, it can manage 10 tasks, while a large setup allows for 20 tasks. These tasks may include operations such as polling, ingestion, and heartbeat monitoring. In a High Availability (HA) cluster with three nodes, the processing power effectively triples, enhancing the system’s execution capacity.
- Task assignments are coordinated through RabbitMQ, which distributes tasks to the core container’s workers based on their current load and the number of tasks they are actively processing. In the event of a core container restart or a node failure, transformation and ingestion tasks are requeued to ensure no data is lost. Meanwhile, data retrieval tasks are reassigned to another node, typically within five minutes of the failure, subject to the workload of the new node.
- In summary, the core container is designed with both task-level and node-level HA to ensure continuous operation and data integrity.
- A cluster runs multiple instances of MongoDB and RabbitMQ. If one node goes down, the other nodes continue to serve requests, ensuring zero downtime.
- Multiple identical Netskope CE nodes run concurrently, and all of them actively process plugin tasks at the same time.
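The capacity figures in the feature list reduce to simple arithmetic. The following is a minimal sketch, assuming that capacity scales linearly with the number of active nodes as the list suggests; the function and constant names are hypothetical and are not part of Cloud Exchange.

```python
# Concurrent task slots per Core container, as stated in the feature list.
TASKS_PER_NODE = {"medium": 10, "large": 20}


def cluster_task_capacity(profile: str, active_nodes: int) -> int:
    """Return how many tasks the cluster can run at once (linear-scaling assumption)."""
    if active_nodes < 0:
        raise ValueError("active_nodes must not be negative")
    return TASKS_PER_NODE[profile] * active_nodes
```

For example, `cluster_task_capacity("medium", 3)` gives 30 slots, three times the capacity of a single medium node.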
To watch videos about Cloud Exchange HA, go here.
UI Dashboard for Cluster Status
Check the current cluster status on the Home page of the Netskope CE.
Note:
The Core service depends on the UI service. For security reasons, the Core service is not exposed publicly; it is reachable only through the UI service over an internal network. If the UI service is down, the dashboard shows the Core service with an unknown status.
Prerequisites for HA Deployment on Linux
- The prerequisites for standalone deployment must be met before proceeding with HA.
- Configure and mount the NFS (Network File System) volume on the required machines. Make sure you have permission to read and write files on the NFS server and to modify their permissions. This NFS volume will serve as a shared storage repository for critical assets, including:
- Mongo authentication key
- SSL certificates
- Environment variables
- Plugins and Custom Plugins
- Repositories
- It is highly recommended to deploy Netskope CE on at least three machines. Refer to Cluster Node Count Requirements to see the minimum required operational nodes in a cluster. (Ref: https://www.mongodb.com/docs/manual/core/replica-set-members/)
- Make sure that all Netskope CE instances have identical physical resources, such as CPU, RAM, and disk space. This uniformity is crucial for achieving Active-Active high availability and consistent performance across the cluster.
- SELinux must be disabled before running the deployment scripts. This is required to access the NFS volume from the local machine.
- Install the Python 3 dependencies. Execute the commands below to install the Python modules using the pip package manager.
$ sudo pip3 install "pyyaml>=6.0.0"
$ sudo pip3 install "python-dotenv>=0.20.0,<=1.0.0"
$ sudo pip3 install "pymongo>=4.1.1,<=4.3.3"
- Make sure that every machine can connect to the ports listed below on every machine, including itself. The current machine is included because API calls are made to the server's IP address from inside the Docker container, so connectivity to each port through the IP address must be allowed. Configure the firewall policies for the listed ports on all the machines to ensure seamless connections between them.
- 4369 (A peer discovery service used by RabbitMQ nodes and CLI tools)
- 5672 (Used by AMQP 0-9-1 and AMQP 1.0 clients without and with TLS)
- 15672 (HTTP API clients, management UI and rabbitmqadmin, without and with TLS)
- 25672 (Used for inter-node and CLI tools communication)
- 35672 (Used for CLI tools communication)
- 27017 (The default port for mongod and mongos instances.)
- Selected UI port (Default 443 for HTTPS and 80 for HTTP) (To access the UI and internode healthcheck)
Note
If you are using CE as a VM, firewalld is already installed and disabled by default. Enable the firewall and restart the Docker service by running the following commands.
$ sudo systemctl enable firewalld
$ sudo systemctl start firewalld
$ sudo systemctl restart docker
Sample commands to open port 443 using firewalld and apply the change:
$ sudo firewall-cmd --permanent --add-port=443/tcp
$ sudo firewall-cmd --reload
These ports are used for clustering the MongoDB and RabbitMQ services. The UI port is also required to perform health checks from all the machines.
- If you are transitioning from a standalone configuration to a High Availability (HA) setup, you must know your maintenance password because it is required to migrate the Mongo data into the new setup. You can find the maintenance password in the existing standalone configuration.
- If you are using CE as a VM, check the hostnames and change them where needed.
Make sure all the machines have different hostnames. Use the command below to change the hostname of a particular machine.
$ sudo hostnamectl set-hostname <new_hostname>
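The port prerequisite can be checked from each machine before deployment continues. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the port list assumes the default HTTPS UI port (443). A port only appears open when a service is listening on it, so this check is most useful once the services are running on the target machine.

```python
import socket

# Cluster ports listed in the prerequisites; replace 443 with your selected UI port.
CLUSTER_PORTS = [4369, 5672, 15672, 25672, 35672, 27017, 443]


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_ports(hosts, ports=CLUSTER_PORTS, timeout=3.0):
    """Return the (host, port) pairs that this machine could not reach."""
    return [(h, p) for h in hosts for p in ports if not is_port_open(h, p, timeout)]
```

Run `unreachable_ports([...])` on every node with the full IP list, including the node's own IP address; an empty result means every listed port accepted a connection.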
Sizing Guideline
- Use an NFS server with at least 5 GB, and up to 80 GB, of free disk space.
- Other requirements are the same as a standalone deployment.
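The NFS requirements (free space, plus permission to create files and change their permissions) can be verified on the mounted path before running the setup script. The following is a minimal sketch using the Python standard library; the function name and messages are hypothetical, and the 5 GB threshold is taken from the guideline above.

```python
import os
import shutil
import tempfile

MIN_FREE_BYTES = 5 * 1024**3  # 5 GB minimum from the sizing guideline


def check_shared_volume(path: str, min_free: int = MIN_FREE_BYTES) -> list:
    """Return a list of problems found on the shared volume (empty if none)."""
    problems = []
    try:
        if shutil.disk_usage(path).free < min_free:
            problems.append("less free space than required")
        # Create a probe file, write to it, change its mode, then remove it.
        fd, probe = tempfile.mkstemp(prefix="ce_probe_", dir=path)
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write("ok")
            os.chmod(probe, 0o640)
        finally:
            os.remove(probe)
    except OSError as exc:
        problems.append(f"shared volume check failed: {exc}")
    return problems
```

Run the check as the same user that will run the deployment scripts, because NFS permissions depend on the calling user.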
Deploy HA in Cloud Exchange
- Clone the netskopeoss/ta_cloud_exchange GitHub repository on all the machines where Netskope CE will be deployed:
$ mkdir netskope
$ cd netskope
$ git clone https://github.com/netskopeoss/ta_cloud_exchange
$ cd ta_cloud_exchange
If you already have the repo cloned, remove any local changes and pull the latest version.
$ git reset --hard
$ git pull
Note
If you are using CE as a VM, the repo will be available on the /opt/cloudexchange/cloudexchange path. Change the current directory to the cloudexchange directory.
$ cd /opt/cloudexchange/cloudexchange
- Execute the setup script on the primary node and provide the requested information required for the setup:
$ sudo python3 ./setup
- Respond with “yes” when prompted for HA (High Availability) parameters. The script then requests further HA-related parameters, such as the Shared Volume path, IP list, and the current node’s IP address.
NOTE: For migration from standalone to HA, ensure that the primary node’s IP address is listed first in the IP list, followed by the secondary nodes.
- If you are transitioning from a standalone configuration to a High Availability (HA) setup, you must keep the same maintenance password as in the previous configuration. If the password is lost, the data cannot be retained.
- Execute the setup script on the remaining nodes with the --location option, providing the path of the NFS-mounted directory. When prompted, provide the IP address of the current machine.
$ sudo python3 ./setup --location /path/to/mounted/directory
- If you are migrating from standalone to HA and you were using custom plugins and/or custom repos, copy them to the shared directory. The repos and custom_plugins directories are created once the previous step is completed.
$ cp -R <standalone>/data/custom_plugins/. <shared-storage>/custom_plugins/
$ cp -R <standalone>/data/repos/. <shared-storage>/repos/
- Launch Cloud Exchange on the primary node first. The script will wait until the migrations are complete. Then launch Cloud Exchange on the remaining nodes to join the cluster:
$ sudo ./start
- The UI should be accessible at the IP address of every node joined to the cluster (https://<ip>). You can configure an external load balancer with the list of IPs to distribute the load across all the machines. If the CE UI is configured to use TLSv1.3, make sure the load balancer (HAProxy or any other) supports TLSv1.3 when forwarding requests.
Note
If you want to add your own SSL certificate, place the files in the <path-to-shared-volume>/config/ssl_certs directory. The certificate files must be named cte_cert.crt and cte_cert_key.key.
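Before starting the services, it can help to confirm that the certificate files are present under the expected names. The following is a minimal sketch; the function name is hypothetical, and the directory layout and file names are taken from the note above.

```python
from pathlib import Path

# File names taken from the note above.
REQUIRED_SSL_FILES = ("cte_cert.crt", "cte_cert_key.key")


def missing_ssl_files(shared_volume: str) -> list:
    """Return the expected certificate files that are absent from config/ssl_certs."""
    cert_dir = Path(shared_volume) / "config" / "ssl_certs"
    return [name for name in REQUIRED_SSL_FILES if not (cert_dir / name).is_file()]
```

An empty list means both files were found; the check only looks at file names and does not validate the certificate contents.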
Migrating from Standalone to HA
Go here to see the migration options.
Add a New Node to the Cluster
- Run the setup script on the primary node and update the IP list.
$ sudo python3 ./setup
- Run the setup script on the remaining machines to add the connection information.
$ sudo python3 ./setup --location /path/to/mounted/directory
- Run the start script on the primary node first, and then on the remaining existing machines. Finally, run the start script on the new node.
$ sudo ./start
Note
The existing nodes must be restarted; otherwise, the UI dashboards and the health check will be inconsistent.
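Because every node is configured from the same IP list, it is worth validating the updated list before entering it in the setup script. The following is a minimal sketch; the function is hypothetical and is not part of the setup script. The primary-first check reflects the migration note in the deployment section, and whether the setup script performs any of these checks itself is not stated here.

```python
import ipaddress


def validate_ip_list(ip_list: list, primary_ip: str) -> list:
    """Return a list of problems with an HA IP list (empty if it looks valid)."""
    problems = []
    for ip in ip_list:
        try:
            ipaddress.ip_address(ip)
        except ValueError:
            problems.append(f"invalid IP address: {ip}")
    if len(set(ip_list)) != len(ip_list):
        problems.append("duplicate IP addresses")
    if not ip_list or ip_list[0] != primary_ip:
        problems.append("primary node IP is not first in the list")
    return problems
```

For example, `validate_ip_list(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"], "10.0.0.1")` returns an empty list for a four-node cluster whose primary is 10.0.0.1.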
Remove a Node from the Cluster
- Make sure the cluster is healthy and all the services are up and running. Otherwise, you may see errors while removing the node from the cluster.
- Execute the ./stop script on the node, and it will remove the MongoDB and RabbitMQ nodes from the cluster. Identify the new primary node of the MongoDB cluster and update the shared configuration file. When done, stop the running services.
Note
As per the HA proxy list available in the environment variables, the dashboard will show all the IP addresses in the status.
To remove the node from the UI dashboard as well, run the setup script on the remaining nodes, remove the IP from the HA parameters, and run the start script again. This restarts all the services on the current node and updates the IP list.
Upgrading HA
Upgrade HA to backward compatible version (Rolling Upgrade)
- Fetch the most recent version of the docker-compose repository.
- Execute the setup script as outlined in the comprehensive setup section.
- Launch the start script.
- Repeat the above steps for all the remaining nodes in the cluster.
Upgrade HA to non-backward compatible version (Full Stop Upgrade)
- Fetch the most recent version of the docker-compose repository.
- Stop all nodes, excluding the primary node.
- Execute the setup script as outlined in the comprehensive setup section.
- Execute the start script in the primary node.
- Repeat the setup and start script procedures on the remaining nodes to ensure uniform configuration.
Note
Expect a brief CE outage during this process. It is recommended to schedule a short maintenance window for the upgrade.
Hardening Guidelines
- To establish connectivity between Docker services across machines and to let nodes join the cluster, the ports listed below are exposed from the Docker services. These ports must remain accessible from all machines in the Netskope CE HA deployment. To enhance security, restrict access to these ports from all other IP addresses.
The ports below are used for clustering the MongoDB and RabbitMQ services. The UI port is also required to perform health checks from all the machines.
- 4369 (A peer discovery service used by RabbitMQ nodes and CLI tools)
- 5672 (Used by AMQP 0-9-1 and AMQP 1.0 clients without and with TLS)
- 15672 (HTTP API clients, management UI and rabbitmqadmin, without and with TLS)
- 25672 (Used for inter-node and CLI tools communication)
- 35672 (Used for CLI tools communication)
- 27017 (The default port for mongod and mongos instances)
- Selected UI port (Default 443 for HTTPS and 80 for HTTP) (To access the UI and internode healthcheck)
- Because NFS is used for shared storage, configure the NFS volume so that only Netskope CE instances can access it.
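One way to restrict the cluster ports to cluster members is a set of firewalld rich rules that accept each port only from the other nodes. The following is a minimal sketch that only builds the command strings for review; the function name is hypothetical, and it assumes IPv4 addresses, the default HTTPS UI port, and that firewalld is the component filtering traffic to these ports. It also assumes the ports are not opened to all sources with --add-port, which would make the source restriction ineffective.

```python
# Cluster ports from the hardening list; replace 443 with your selected UI port.
CLUSTER_PORTS = [4369, 5672, 15672, 25672, 35672, 27017, 443]


def rich_rule_commands(node_ips, ports=CLUSTER_PORTS):
    """Build firewall-cmd commands that accept each port only from the given IPs."""
    commands = []
    for ip in node_ips:
        for port in ports:
            rule = (
                f'rule family="ipv4" source address="{ip}" '
                f'port port="{port}" protocol="tcp" accept'
            )
            commands.append(f"firewall-cmd --permanent --add-rich-rule='{rule}'")
    # Permanent rules take effect only after a reload.
    commands.append("firewall-cmd --reload")
    return commands
```

Print the returned commands and review them before running them with sudo on each node. Verify the result in your environment, since container port publishing can interact with host firewall rules.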
Cluster Node Count Requirements
Because of the MongoDB replication and RabbitMQ mirroring requirements, the HA cluster remains operational after a failure only while a majority of the CE nodes are ACTIVE/ONLINE. An HA cluster with an odd number of CE nodes has a higher chance of remaining operational than a cluster with an even number of CE nodes.
Total Number of CE Nodes in an HA Cluster | Minimum Number of ACTIVE/ONLINE CE Nodes Required in an HA Cluster at any Given Time for Successful Operation
3 | 2
4 | 3
5 | 3
6 | 4
7 | 4
8 | 5
9 | 5
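The values in the table follow the strict-majority rule: more than half of the nodes must be online. The following is a minimal sketch of that arithmetic; the function names are hypothetical.

```python
def min_active_nodes(total_nodes: int) -> int:
    """Return the smallest strict majority of total_nodes."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """Return how many nodes can be offline while a majority remains."""
    return total_nodes - min_active_nodes(total_nodes)
```

A 3-node and a 4-node cluster both tolerate one failure, and a 5-node and a 6-node cluster both tolerate two, so adding a node to reach an even count does not increase the number of failures the cluster can survive.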
Limitations
- You must ensure that the NFS volume remains available.
- Running multiple redundant instances of the Netskope CE requires additional hardware and computing resources.
- In some cases, HA setups may introduce increased latency due to the need to replicate data or coordinate between active instances.
- Whenever IP addresses or node-related configurations change, you must run both the setup and start scripts on all machines so that the changes are propagated and synchronized across the cluster. It is recommended to add all the required machines at once and to use fixed IP addresses for the machines.
- A majority of nodes must remain operational at all times; otherwise, the cluster fails. For example, in a 3-node cluster, at least 2 nodes must be operational at any given time; similarly, in a 5-node cluster, at least 3 nodes must be up and running. The pattern continues accordingly for larger clusters. This is because MongoDB will not elect a primary when a majority of nodes is unavailable. Additionally, the “pause_minority” option is configured for RabbitMQ to mitigate inconsistencies caused by network partitions. The cluster should recover once the node comes back up and connects to the cluster again. Refer to Cluster Node Count Requirements to see the minimum required operational nodes in a cluster.
Ref: https://www.mongodb.com/docs/v5.0/core/replica-set-elections/#network-partition | https://www.rabbitmq.com/partitions.html#options
- In rare cases, when a cluster faces back-to-back node failures, data may be lost in RabbitMQ because queues switch between the leader and mirror roles very frequently, and other nodes may lose data that has not been synchronized yet.
Ref: https://www.rabbitmq.com/ha.html#behaviour
- For SSO, only one node’s IP address can be added as the redirect URL. Alternatively, configure a load balancer in front of all the IP addresses and use the IP address of the load balancer to configure SSO. The load balancer then redirects each request to any available node.
Troubleshooting HA in Cloud Exchange
- SELinux must be disabled before running HA, because SELinux blocks the access to the NFS server that the machine requires.
- The ./stop script in HA removes the MongoDB and RabbitMQ node from the cluster and stops the services on that machine. The services must be up when you run the ./stop script; otherwise, the script might fail because it cannot create the Mongo client needed to remove the node.
- Make sure you have both read and write permission on the NFS server. Check that you can create a new file and modify its permissions.
- During migration, if you have any issues with node renaming, check the prerequisites and restore the backup from the first step. Then retry the migration from the second step.
$ sudo rm -rf <path-to-ta_cloud_exchange>/data/mongo-data/data/*
$ sudo cp -R <temp-mongo-path>/* <path-to-ta_cloud_exchange>/data/mongo-data/data/
- Make sure docker/podman services are running. If not, restart the services using this command:
$ sudo systemctl restart docker
- If the UI shows a down status for any container for a long period of time, check the status of the container and perform a reboot if required. Use podman wherever applicable.
- Check the container status. If all the services are running, move to the next step. Otherwise, run the start script again to recover the container.
$ docker ps
- Check the MongoDB cluster status. Execute the commands below to open a shell inside the MongoDB container and query the replica set status.
$ docker-compose exec -- mongodb-primary bash
$ mongosh -u root -p $MONGO_INITDB_ROOT_PASSWORD
$ rs.status()
If the status shows anything other than PRIMARY and SECONDARY for a long period of time, run this command to restart the container on the affected machine.
$ docker-compose restart mongodb-primary
- Check the RabbitMQ cluster status. Execute the commands below to open a shell inside the RabbitMQ container and query the cluster status.
$ docker-compose exec -- rabbitmq-stats bash
$ rabbitmqctl cluster_status
Check the running nodes section in the output to identify the connected and running nodes. Restart the RabbitMQ container if required.
$ docker-compose restart rabbitmq-stats
- If any node experiences a worker lost error, for example when a worker is not connected to RabbitMQ, restart the specific core container to recover the workers. This issue is likely to occur when the RabbitMQ cluster is down for a long period of time.
$ docker-compose restart core
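The MongoDB check in the steps above can also be scripted. The following is a minimal sketch that filters a replica set status document for members that are neither PRIMARY nor SECONDARY; the function name is hypothetical, and it relies on the `members`, `name`, and `stateStr` fields that MongoDB returns for `replSetGetStatus` (the same data shown by `rs.status()`).

```python
HEALTHY_STATES = {"PRIMARY", "SECONDARY"}


def unhealthy_members(status: dict) -> list:
    """Return (name, stateStr) for members that are not PRIMARY or SECONDARY."""
    return [
        (member.get("name"), member.get("stateStr"))
        for member in status.get("members", [])
        if member.get("stateStr") not in HEALTHY_STATES
    ]
```

With the pymongo module from the prerequisites, the status document can be fetched with `MongoClient(<connection-uri>).admin.command("replSetGetStatus")`, where the connection URI and credentials depend on your deployment. A member that stays in the returned list for a long period of time is a candidate for the container restart described above.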