
Cloud Exchange High Availability


This document describes how High Availability (HA) works in Cloud Exchange. Review the architecture, feature list, prerequisites, and sizing guidelines, then deploy HA in Cloud Exchange. The sections after deployment cover migration, upgrades, hardening, known limitations, and troubleshooting.

HA Architecture

Features

  • Active-Active configurations for Netskope CE nodes, enhancing system availability and fault tolerance.
  • Multiple identical Netskope CE nodes run concurrently, and all of them actively process plugin tasks at the same time.

UI Dashboard for Cluster Status

Check the current cluster status on the Home page of the Netskope CE.

Note:

The Core service depends on the UI service. For security reasons, the Core service is not exposed publicly; it is reachable only through the UI service over the internal network. If the UI service is down, the Core service will show an unknown status on the dashboard.

Prerequisite for HA Deployment on Linux

  • The prerequisites for a standalone deployment must be met before proceeding with HA.
  • Configure and mount the NFS (Network File System) volume on the required machines (a sample mount sequence is shown at the end of these prerequisites). Make sure you have permission to read, write, and change the permissions of files on the NFS server. This NFS volume will serve as a shared storage repository for critical assets, including:
    • Mongo authentication key
    • SSL certificates
    • Environment variables
    • Plugins and Custom Plugins
    • Repositories
  • It is highly recommended to have at least three machines where Netskope CE will be deployed.
    (Ref: https://www.mongodb.com/docs/manual/core/replica-set-members/)
  • Make sure that all Netskope CE instances have identical physical resources, such as CPU, RAM, and disk space. This uniformity is crucial for achieving Active-Active high availability and consistent performance across the cluster.
  • SELinux must be disabled before running the deployment scripts; otherwise the NFS volume cannot be accessed from the local machine.
  • Install the Python3 dependencies required for HA.
    Execute the commands below to install the Python modules using the pip package manager.

     

    $ sudo pip3 install "pyyaml>=6.0.0" 
    $ sudo pip3 install "python-dotenv>=0.20.0,<=1.0.0"
    $ sudo pip3 install "pymongo>=4.1.1,<=4.3.3"
  • Make sure that every machine can connect to the ports listed below on every other machine, excluding the NFS server. Configure the firewall policies on all machines for these ports to ensure a seamless connection between the machines.
    • 4369 (A peer discovery service used by RabbitMQ nodes and CLI tools)
    • 5672 (Used by AMQP 0-9-1 and AMQP 1.0 clients without and with TLS)
    • 15672 (HTTP API clients, management UI and rabbitmqadmin, without and with TLS)
    • 25672 (Used for inter-node and CLI tools communication)
    • 27017 (The default port for mongod and mongos instances.)
    • Selected UI port (Default 443 for HTTPS and 80 for HTTP) (To access the UI and internode healthcheck)

Note

If you are using “CE as a VM”, firewalld is already installed but disabled by default. Enable the firewall by running the following commands, then restart the Docker service.

$ sudo systemctl enable firewalld
$ sudo systemctl start firewalld
$ sudo systemctl restart docker

Sample command to open port 443 using the firewall:

$ sudo firewall-cmd --permanent --add-port=443/tcp

These ports are used for clustering the MongoDB and RabbitMQ services. The UI port is also required to perform health checks from all the machines.
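
For example, opening all of the ports listed above with firewalld (assuming the default UI port 443) could look like the following; rules added with --permanent only take effect after a reload.

$ sudo firewall-cmd --permanent --add-port=4369/tcp
$ sudo firewall-cmd --permanent --add-port=5672/tcp
$ sudo firewall-cmd --permanent --add-port=15672/tcp
$ sudo firewall-cmd --permanent --add-port=25672/tcp
$ sudo firewall-cmd --permanent --add-port=27017/tcp
$ sudo firewall-cmd --permanent --add-port=443/tcp
$ sudo firewall-cmd --reload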

  • If you are transitioning from a standalone configuration to a High Availability (HA) setup, you must know your maintenance password; it is required to migrate the MongoDB data into the new setup. You can find the maintenance password in the existing standalone configuration.
  • If you are using “CE as a VM”, check the hostnames and change them if needed.
    Make sure all the machines have different hostnames. Use the command below to change the hostname of a particular machine.

     

    $ sudo hostnamectl set-hostname <new_hostname>
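
As a reference sketch only, disabling SELinux and mounting the NFS volume described above might look like the following on each CE machine. The NFS server address, export path, and mount point below are placeholders; replace them with your own values, and make sure the NFS client utilities are installed on the machine.

# Disable SELinux for the current boot and persistently
$ sudo setenforce 0
$ sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

# Create a mount point and mount the shared NFS volume (placeholder server and paths)
$ sudo mkdir -p /mnt/ce_shared
$ sudo mount -t nfs <nfs-server-ip>:/exports/ce_shared /mnt/ce_shared

# Optionally persist the mount across reboots
$ echo "<nfs-server-ip>:/exports/ce_shared /mnt/ce_shared nfs defaults 0 0" | sudo tee -a /etc/fstab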

Sizing Guideline

  • Use an NFS server with a minimum of 5 GB of free disk space.
  • Other requirements are the same as a standalone deployment.

Deploy HA in Cloud Exchange

  1. Clone the netskopeoss/ta_cloud_exchange GitHub repository on all the machines where Netskope CE will be deployed:
    $ mkdir netskope
    $ cd netskope
    $ git clone https://github.com/netskopeoss/ta_cloud_exchange
    $ cd ta_cloud_exchange

    If you already have the repo cloned, then remove any local changes and pull the latest version.

    $ git reset --hard
    $ git pull

    Note

    If you are using CE as a VM, the repo will be available on the /opt/cloudexchange/cloudexchange path. Change the current directory to the cloudexchange directory.

    $ cd /opt/cloudexchange/cloudexchange
  2. Execute the setup script on the primary node and provide the requested information required for the setup:
    $ sudo python3 ./setup

    • Respond with “yes” when prompted to enable HA (High Availability).
    • The script will then request additional HA-related parameters, such as the Shared Volume path, the IP list, and the current node’s IP address.
      NOTE: For migration from standalone to HA, please ensure that the primary node’s IP address is listed first in the IP list, followed by the secondary nodes.
    • If you are transitioning from a standalone configuration to a High Availability (HA) setup, it is mandatory to keep the same maintenance password as in the previous configuration. If the password is lost, the data cannot be retained.
  3. Execute the setup script on the remaining nodes with the --location option. Provide the path of the NFS-mounted directory in the --location option, and provide the IP address of the current machine.
    $ sudo python3 ./setup --location /path/to/mounted/directory

  4. If you are migrating from standalone to HA and you were using custom plugins and/or custom repos, copy them to the shared directory. The repos and custom_plugins directories are created once the previous step is completed.
    $ sudo cp -R <standalone>/data/custom_plugins/ <shared-storage>/custom_plugins
    $ sudo cp -R <standalone>/data/repos/ <shared-storage>/repos
  5. Launch Cloud Exchange on the primary node first. The script will wait until the migrations are complete. Then launch the Cloud Exchange on the remaining nodes to join the cluster:
    $ sudo ./start
  6. The UI should be accessible through any of the IPs joined to the cluster (e.g., https://<ip>).
    You can add an external load balancer in front of these IPs to distribute the load across all the machines. If the CE UI is configured to use TLSv1.3, the load balancer that redirects the requests (HAProxy or any other) must also support TLSv1.3.
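
As an illustrative sketch only (not a Netskope-provided configuration), a TCP pass-through HAProxy setup that distributes HTTPS traffic across three CE nodes could look like the snippet below. The node IP addresses and the config file path are placeholders; pass-through mode leaves TLS (including TLSv1.3) to be negotiated by CE itself.

$ cat <<'EOF' | sudo tee -a /etc/haproxy/haproxy.cfg
frontend ce_ui
    bind *:443
    mode tcp
    option tcplog
    default_backend ce_nodes

backend ce_nodes
    mode tcp
    balance roundrobin
    server node1 10.10.10.10:443 check
    server node2 10.10.10.11:443 check
    server node3 10.10.10.12:443 check
EOF
$ sudo systemctl restart haproxy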

Note

If you want to add your own SSL certificate, place it in the “<path-to-shared-volume>/config/ssl_certs” directory. The certificate and key files must be named “cte_cert.crt” and “cte_cert_key.key” respectively.
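
For example, assuming your certificate and private key files are named my_certificate.crt and my_private.key (placeholder names) in the current directory, they can be copied into the shared volume as follows:

$ sudo cp my_certificate.crt <path-to-shared-volume>/config/ssl_certs/cte_cert.crt
$ sudo cp my_private.key <path-to-shared-volume>/config/ssl_certs/cte_cert_key.key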

Migration from Standalone (4.2.0) to HA (5.0.0)

Migration Pre-Checks

Before starting with the migration steps, ensure that the machine has free storage space of at least 5x the current data size in RabbitMQ.
Find the RabbitMQ storage usage with the command below; it shows the space occupied by the RabbitMQ data directory.

$ du -sh data/rabbitmq/data

Find total available storage of specific partitions.

$ df -h

Migration Steps

  1. Before proceeding with any configuration changes, it is advisable to take a backup of your MongoDB and RabbitMQ data to a secure temporary location. This ensures that previous data can be recovered in the event of unexpected issues or data loss during the configuration process.
    Stop the standalone deployment and copy the MongoDB and RabbitMQ data to temporary locations (custom plugins and repos, if any, are copied to the shared location in a later step):
    $ sudo ./stop
    $ sudo cp -R data/mongo-data/data/* <temp-mongo-path>/
    $ sudo cp -R data/rabbitmq/data/* <temp-rabbit-path>/
  2. To convert the RabbitMQ data to the new specification, the RabbitMQ node must be renamed to match the IP address or hostname that you will use for the primary machine. After exporting the environment file, start the RabbitMQ container using the command below; it renames the RabbitMQ node, which is required to import the standalone RabbitMQ data into an HA cluster. If you are using podman compose, use “podman run” instead of “docker run”.
    • $ . ./.env

      If you see an error while running this command, ignore it and move on to the next step.

    • $ export RABBIT_NEW_DIR_NAME=<ip-or-hostname-of-primary-node>
      • Run this command if you are using IP addresses to configure HA. Use the IP address of the primary machine.
        $ export RABBIT_NEW_DIR_NAME=10.10.10.10
      • Run the command below if you are using hostnames to configure HA. Use the hostname of the primary machine.
        $ export RABBIT_NEW_DIR_NAME=subdomain.hostname.local
    • $ sudo docker run --rm -v ${RABBITMQ_CUSTOM_CONF_PATH:-./data/rabbitmq/custom.conf}:/etc/rabbitmq/conf.d/custom.conf:z -v ./data/rabbitmq/data:/var/lib/rabbitmq/mnesia:z -u 1001:1001 -e RABBITMQ_NODENAME=rabbit@rabbitmq-stats -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=${MAINTENANCE_PASSWORD} index.docker.io/rabbitmq:3.11.11-management /bin/bash -c "echo ${MAINTENANCE_PASSWORD_ESCAPED} > /var/lib/rabbitmq/.erlang.cookie && chmod 600 /var/lib/rabbitmq/.erlang.cookie && rabbitmqctl rename_cluster_node rabbit@rabbitmq-stats rabbit@${RABBIT_NEW_DIR_NAME} && mv /var/lib/rabbitmq/mnesia/rabbit\@rabbitmq-stats /var/lib/rabbitmq/mnesia/rabbit\@${RABBIT_NEW_DIR_NAME} && mv /var/lib/rabbitmq/mnesia/rabbit\@rabbitmq-stats-feature_flags /var/lib/rabbitmq/mnesia/rabbit\@${RABBIT_NEW_DIR_NAME}-feature_flags && mv /var/lib/rabbitmq/mnesia/rabbit\@rabbitmq-stats-plugins-expand /var/lib/rabbitmq/mnesia/rabbit\@${RABBIT_NEW_DIR_NAME}-plugins-expand && mv /var/lib/rabbitmq/mnesia/rabbit\@rabbitmq-stats-rename /var/lib/rabbitmq/mnesia/rabbit\@${RABBIT_NEW_DIR_NAME}-rename"

    Note

    During the node renaming process, message storage will temporarily double. For instance, if you initially have 10 GB of data in RabbitMQ, it will require 20 GB of space until the renaming is complete. Additionally, this 20 GB of data will need to be moved to a new location for high availability.

  3. Follow the steps below to transfer the data files to the primary node. You can skip this step if you are using the same machine and directory to migrate to HA. Note: Use the cp command for the file transfer if you are moving data from one directory to another within the same VM.
    1. Create a zip file for Mongo and RabbitMQ data.
      $ sudo zip -r ce_backup.zip data/mongo-data/ data/rabbitmq/data
    2. Copy the backup to the new machine using scp command.
      $ scp ce_backup.zip <user>@<ip-of-vm>:<path-to-ta_cloud_exchange>
    3. SSH into the new primary machine.
      $ ssh <user>@<ip-of-vm>
    4. Change the current directory to ta_cloud_exchange.
      $ cd <path-to-ta_cloud_exchange>
    5. Restore the backup data using the unzip command.
      $ sudo unzip ce_backup.zip

    You might be asked whether to replace the data; select “A” to replace all of it.

  4. To transition the copied data into the new HA configuration, refer to the fresh HA deployment section above for instructions on adding the necessary HA parameters and initializing the cluster. This process migrates the MongoDB data into the replica set, imports the RabbitMQ messages into the new HA machine, and then integrates the other nodes into the cluster.
  5. If custom plugins and/or custom repositories are being used, copy them to the designated shared location after executing the setup script. This step is included in the installation steps.
  6. Follow the installation steps mentioned in the previous section to run the Netskope CE.

    Note

    After the HA migration completes successfully, there may be a brief delay during the first few minutes. This is due to MongoDB’s replication process, which distributes data across all nodes in the system.

Add a new Node to the Cluster

    1. Run the setup script in the primary node and update the IP list.
      $ sudo python3 ./setup
    2. Run the setup script in the remaining machines to add the connection info.
      $ sudo python3 ./setup --location /path/to/mounted/directory
    3. Run the start script on the primary node first, then on the remaining existing machines, and finally on the new node.
      $ sudo ./start

Note

The existing nodes must be restarted; otherwise, the UI dashboards and connectivity status will be inconsistent.
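
For example, to restart an existing node, run the stop and start scripts from the ta_cloud_exchange directory on that node:

$ sudo ./stop
$ sudo ./start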

Remove a Node from the Cluster

Execute the stop script on the node, and it will remove the node from the cluster.
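
For example, on the node that should leave the cluster, run the stop script from the ta_cloud_exchange directory:

$ sudo ./stop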

Note

The dashboard status will continue to show all the IP addresses present in the HA proxy list in the environment variables.

Upgrading HA

Upgrade HA to backward compatible version (Rolling Upgrade)

  1. Fetch the most recent version of the ta_cloud_exchange repository.
  2. Execute the setup script as outlined in the comprehensive setup guide.
  3. Launch the start script.
  4. Repeat the above steps for all the remaining nodes in the cluster.
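
For reference, on each node in turn the rolling upgrade amounts to the commands below, assuming the repository path used during deployment (adjust the path for CE as a VM as described earlier).

$ cd <path-to-ta_cloud_exchange>
$ git reset --hard
$ git pull
# Primary node:
$ sudo python3 ./setup
# Secondary nodes (pass the shared-volume path as during deployment):
$ sudo python3 ./setup --location /path/to/mounted/directory
$ sudo ./start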

Upgrade HA to non-backward compatible version (Full Stop Upgrade)

  1. Fetch the most recent version of the ta_cloud_exchange repository.
  2. Stop all nodes, excluding the primary node.
  3. Execute the setup script as outlined in the comprehensive setup guide.
  4. Launch the start script.
  5. Repeat the setup and start script procedures on the remaining nodes to ensure uniform configuration.

Note

During this process, the CE will experience a brief outage or downtime. It is recommended to schedule a small maintenance window for the upgrade.

Hardening Guidelines

  1. To establish the necessary connectivity between the Docker services across machines and allow nodes to join the cluster, the ports listed below are exposed from the Docker services. These ports must remain accessible from all machines in the Netskope CE HA deployment, and access from any other IP addresses should be restricted (see the example rules after this list).
    These ports are used for clustering the MongoDB and RabbitMQ services. The UI port is also required to perform health checks from all the machines.
    • 4369 (A peer discovery service used by RabbitMQ nodes and CLI tools)
    • 5672 (Used by AMQP 0-9-1 and AMQP 1.0 clients without and with TLS)
    • 15672 (HTTP API clients, management UI and rabbitmqadmin, without and with TLS)
    • 25672 (Used for inter-node and CLI tools communication)
    • 27017 (The default port for mongod and mongos instances.)
    • Selected UI port (Default 443 for HTTPS and 80 for HTTP) (To access the UI and internode healthcheck)
  2. Because NFS is used for shared storage, configure the NFS volume so that access is restricted exclusively to the Netskope CE instances.
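
As an illustrative example only, access to a clustering port can be limited to the other cluster members with firewalld rich rules (repeat per port and per cluster IP; the addresses and paths below are placeholders), and the NFS export can be restricted to the CE machines in /etc/exports on the NFS server:

$ sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.10.10.11" port port="27017" protocol="tcp" accept'
$ sudo firewall-cmd --reload

# Example /etc/exports entry on the NFS server, restricting the share to the CE node IPs:
# /exports/ce_shared 10.10.10.10(rw,sync) 10.10.10.11(rw,sync) 10.10.10.12(rw,sync)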

Limitations

  1. The customer is responsible for ensuring the availability of the NFS volume.
  2. Running multiple redundant instances of the Netskope CE requires additional hardware and computing resources.
  3. In some cases, HA setups may introduce increased latency due to the need to replicate data or coordinate between active instances.
  4. Whenever IP addresses or node-related configurations are modified, both the setup and start scripts must be executed on all machines in the system so that the changes are propagated and synchronized across the cluster. It is recommended to add all the required machines at once and to use fixed IP addresses for the machines.
  5. A majority of nodes must remain operational at all times; otherwise, the cluster fails. For example, in a 3-node cluster, at least 2 nodes must be operational at any given time; similarly, in a 5-node cluster, at least 3 nodes must be up and running, and so on for larger clusters. This is because MongoDB will not initiate the Primary election process without a majority. Additionally, the “pause_minority” option is enabled for RabbitMQ to mitigate inconsistencies caused by network partitions. The cluster should recover once a failed node comes back up and reconnects.
    Ref: https://www.mongodb.com/docs/v5.0/core/replica-set-elections/#network-partition | https://www.rabbitmq.com/partitions.html#options
  6. In rare cases, when a cluster experiences back-to-back node failures, data loss is possible in RabbitMQ because queues switch between leader and mirror very frequently, and other nodes may lose data that has not yet been synchronized.
    Ref: https://www.rabbitmq.com/ha.html#behaviour
  7. For SSO, only one node’s IP address can be added as a redirect URL. Optionally, configure a load balancer in front of all the IP addresses and use the load balancer’s IP address to configure SSO; it will redirect requests to any available node.

Troubleshooting HA in Cloud Exchange

  • SELinux must be disabled before running HA; otherwise it blocks access to the NFS server from the machine.
  • Make sure both read and write permissions are granted on the NFS server. Verify that you can create a new file and modify its permissions.
  • During migration, if you face any issues with node renaming, check the prerequisites and restore the backup taken in step 1. Then retry the migration from step 2.
    $ sudo rm -rf <path-to-ta_cloud_exchange>/data/mongo-data/data/* 
    $ sudo cp -R <temp-mongo-path>/* <path-to-ta_cloud_exchange>/data/mongo-data/data/
  • Make sure the docker/podman services are running. If not, start them using the command below:
    $ sudo systemctl restart docker
  • If the UI shows a down status for any container for a long period of time, check the status of the container and restart it if required. Use podman instead of docker wherever applicable in the commands listed below.
    • Check the container status. If all the services are running, move to the next check. Otherwise, run the start script again to recover the container.
      $ docker ps
    • Check for the MongoDB cluster status. Execute the commands below to run the command inside the MongoDB container.
      $ docker-compose exec -- mongodb-primary bash 
      $ mongosh -u root -p $MONGO_INITDB_ROOT_PASSWORD 
      $ rs.status()

      If the status shows anything other than PRIMARY and SECONDARY for a long period of time, run the command below to restart the container on the specific machine.
      $ docker-compose restart mongodb-primary

    • Check for the RabbitMQ cluster status. Execute the commands below to run the command inside the RabbitMQ container.
      $ docker-compose exec -- rabbitmq-stats bash 
      $ rabbitmqctl cluster_status

      Check the Running Nodes section in the output to identify which nodes are connected and running. Restart the RabbitMQ container if required.

      $ docker-compose restart rabbitmq-stats
  • If any node experiences a worker lost error, i.e., the worker is not connected to RabbitMQ, restart the specific core container to recover the workers. This issue is likely to occur when the RabbitMQ cluster is down for a long period of time.
    $ docker-compose restart core

 

 
