Commit b39d43d8 authored by Larkin Heintzman
parents 13e20b4f 9db708c1
# ROS-Kubernetes
Contains scripts and files to create a fielded ROS Kubernetes cluster.

# Table of Contents:
- [ROS-Kubernetes](#ros-kubernetes)
  * [Getting The Images](#getting-the-images)
    + [Pushing To Local Registry](#pushing-to-local-registry)
  * [Cluster Setup](#cluster-setup)
    + [Without Setup Script](#without-setup-script)
  * [ROS Commands](#ros-commands)
    + [Bash Shell in Pods](#bash-shell-in-pods)
    + [SSH Server Bash shell](#ssh-server-bash-shell)
  * [Imaging ROS Test](#imaging-ros-test)
  * [Handy Troubleshooting Commands](#handy-troubleshooting-commands)
  * [Links With More Information](#links-with-more-information)
  * [Old Stuff That Might Still Be Useful](#old-stuff-that-might-still-be-useful)
    + [NFS Setup](#nfs-setup)
    + [Highly Available Cluster Setup](#highly-available-cluster-setup)
<!-- remote desktop tutorial:
https://www.e2enetworks.com/help/knowledge-base/how-to-install-remote-desktop-xrdp-on-ubuntu-18-04/

CUDA container install and tutorial:
https://abhishekbose550.medium.com/deep-learning-for-production-deploying-yolo-using-docker-2c32bb50e8d6

start docker registry: "docker run -d -p 5000:5000 --name registry registry:2.7"
then build and tag images "docker tag <orig>:latest localhost:5000/<orig>:latest" and follow with "docker push localhost:5000/<orig>:latest" then Kubernetes should be able to find the image

test darknet: ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights -ext_output shopping-crowded-mall-17889570.jpg
https://blog.roboflow.com/how-to-train-scaled-yolov4/
-->
## Getting The Images
Go into the [docker][docker_folder] directory for Dockerfiles and bash scripts; each type of image has its own directory.
- Building the base image is relatively straightforward:
`docker build -t <tag_name> -f Dockerfile .`
- This should load the image into your local docker registry (or whatever registry you're using). To cross-build the remote images, designed to run on Nvidia Jetsons mounted on a UAV, the build command is slightly more complex due to the different architectures (though you can of course simply build the images on the Jetson they're intended for):
`docker buildx build --platform linux/arm64 -o type=oci,dest=<image_name>.tar -t <tag_name> .`
- The tarball can then be loaded into the docker registry on the control plane node or (more likely) the ARM processor. The extra step is that the tarball needs to be sent to the remote processor and loaded into the relevant registry; a rough sketch of this transfer is given after this list.
- Once loaded, we're looking for the most recently added un-tagged image. Copy that ID and run:
`docker image tag <image_ID> <image_name>`
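A minimal sketch of the send-and-load step mentioned above (host names and file names are placeholders, and the exact commands used for this repo may differ):
```
# copy the OCI tarball produced by buildx to the Jetson
scp <image_name>.tar <user>@<jetson_ip>:~/

# on the Jetson, load the tarball into the local image store
# (depending on the Docker version, an OCI archive may need to be imported with a tool such as skopeo instead)
docker load -i ~/<image_name>.tar
```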
Some notes on this process. The image you'll want to build on the Jetsons is the YOLO detector image stored in the [yolo directory][yolo_folder]. The Dockerfile is fairly lengthy, including ROS, OpenCV, and darknet itself, so the build process takes approximately 1-2 hours.
Currently the base-station image is stored on Docker Hub and does not have a Dockerfile to build it yet. You can find the AMD64-architecture image [here][base_station_image].
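To pull it locally, something like the following should work (the repository and tag are taken from the Docker Hub link at the bottom of this README, so treat them as an assumption):
```
docker pull larkinh/ros-kubernetes:latest
```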
### Pushing To Local Registry
Sometimes Kubernetes does not find the images available to Docker unless they are pushed to a locally run registry or Docker Hub; this shows up as an `ErrImagePull` (or `ImagePullBackOff`) status when attempting to run a deployment. The steps to solve this are below:
- Start a local Docker registry:
`docker run -d -p 5000:5000 --restart=always --name registry registry:2`
- Retag the image you wish to push so that the local registry's address is included at the front:
`docker tag <original_image_name>:<image_tag> localhost:5000/<original_image_name>:<image_tag>`
- The image can then be pushed as follows; this will take a moment depending on how large the image is:
`docker push localhost:5000/<original_image_name>:<image_tag>`

A similar idea is used to push an image to Docker Hub, except the `localhost:5000` prefix is replaced with your Docker Hub username.
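For example, a sketch of the Docker Hub variant (`<dockerhub_username>` is a placeholder for your own account):
```
docker login
docker tag <original_image_name>:<image_tag> <dockerhub_username>/<original_image_name>:<image_tag>
docker push <dockerhub_username>/<original_image_name>:<image_tag>
```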
## Cluster Setup

After the images are built or obtained, the next step is likely to bring up a cluster. One note about nomenclature: both ROS and Kubernetes have the concept of nodes, but they refer to two different objects in our cluster. ROS nodes are processes that run on Kubernetes pods, and Kubernetes nodes are remote machines that run pods.

The `clusterSetup.sh` script should set up a Kubernetes cluster, with some assumptions about the application. It can be run via (it should prompt for your password, since disabling swap memory requires sudo):
`./clusterSetup.sh`
It is worth noting that the cluster setup script is very much a work in progress and includes many commands that are only used for testing a highly-available cluster architecture; the default settings, however, will work fine for general use. Once the cluster is running, the next step is likely to add remote nodes to it. The cluster setup script also copies the node join command generated when creating a cluster, to make it easier to add remote machines. To add a node, start an SSH session on the remote machine and run:
`sudo swapoff -a` followed by the pasted join command, which should look similar to:
```
sudo kubeadm join <control_plane_ip>:<port> --token qwf631.i58dqjxebj01u3h9 --discovery-token-ca-cert-hash \
sha256:5701ff981a1b22145b4e8a631bd9567525adb76643bece39c0533173b5de579a
```
And now you see why the command gets copied by the setup script.

Deployments can be added by running (note that the `nodeSelector` and `image` fields may require editing depending on hostnames and image names):
`kubectl apply -f Kubernetes/<desired_deployment>`
Which will add the pods listed in that deployment. To add all deployments in the [Kubernetes directory][Kubernetes_folder], you can run:
`kubectl apply -f Kubernetes/`
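To check that the deployments actually came up (generic Kubernetes commands, not specific to this repo):
```
kubectl get pods -o wide           # look for Running status and which node each pod landed on
kubectl describe pod <pod_name>    # inspect events if a pod is stuck in Pending or ImagePullBackOff
```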
### Without Setup Script
If the cluster setup script does not work for the application, the low-level steps are below:
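The first of those steps is initializing the control plane with `kubeadm init`; a minimal sketch, assuming the flannel pod CIDR used elsewhere in this README (the setup script may use different flags):
```
sudo swapoff -a
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
```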
- Move config over:
```
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```
- Apply flannel networking:
`kubectl apply -f flannel.yml`
- Remove taint from control plane (not necessary if the taint is tolerated in the deployment):
`kubectl taint nodes --all node-role.kubernetes.io/master-`
- Label node(s) if the deployment calls for it (the ones in this repo only rely on hostname, but eh):
`kubectl label nodes <node_name> <tag>=<value>`
- Start applying deployments, starting with base:
`kubectl apply -f Kubernetes/base-deployment.yml`
- Continue applying deployments for as many as needed
- To undo the above and reset the cluster:
`kubeadm reset`
potentially followed by [clean-up][cleanupCluster]
[cleanupCluster]: https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-reset/
## ROS Commands
Once the cluster is up, either with the cluster setup bash script or otherwise, the ROS side of the cluster can be used as described below.

### Bash Shell in Pods
To get into a bash shell of a pod (usually after finding which pods are up via `kubectl get pods`), run `kubectl exec -it <pod_name> -- bash`. This will start a persistent bash terminal in the specified pod. Once inside a pod, the file system can be viewed as on any Linux system (`ls`, `cd` and so on).
- To start using ROS commands, the workspace needs to be sourced first. The Darknet pods automatically do this via the `~/.bashrc` file, but in case something goes wrong, the source command is `source /opt/ros_ws/devel/setup.bash` (depending on which image is in use, the location of the ros_ws folder may change, but it should be close to where the bash shell started you).
- Now that the ROS workspace has been sourced, all ROS commands should work; a short example session is sketched after this list.
  * To check what topics are available, and/or to check whether there is a ROS master running, `rostopic list` can be used, which will return a list of all topics currently being published.
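A typical quick check inside a pod might look like this (topic names are placeholders):
```
source /opt/ros_ws/devel/setup.bash   # only needed if ~/.bashrc did not already do it
rostopic list                         # confirms a ROS master is reachable
rostopic echo <topic_name>            # stream messages from one topic
rostopic hz <topic_name>              # check the publishing rate
```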
### SSH Server Bash shell

Alternately, an SSH connection can be created if using the built images discussed above. First look up the pod IPs with:
`kubectl get pods -o wide`
- The pod in question can then be SSH'ed into via (the pods' IPs are usually of the form 10.244.\*.\*, depending on your particular cluster setup):
`ssh root@<pod_ip>`
- If prompted for a password, use `Meepp973`, as it is the default password for the `root` user for now (security is not important for this project, as you can see). There is also the problem of network identity for the pods: since the identity can change between deployments of the same pod (also the reason we need to look up the IP each time), the SSH tool on Linux can sometimes throw a man-in-the-middle error. To get around this, delete the entry in the `~/.ssh/known_hosts` file as suggested by the error message (one command for this is sketched below). Instead of having the cluster setup script delete the known hosts file, we leave it to the user to delete specific entries for now. The advantage of using the SSH command instead of `kubectl exec ...` is that the `roslaunch` command can work straight out of the box, rather than first having to SSH between the Darknet pod and the Base Station pod. In normal use, only the base pod (also referred to as the master pod in some places) should be needed to control the cluster, although both pods should be available via SSH. Headless services in the Base Station and Darknet deployments are used in combination with a `machines.launch` file in the Base Station image to allow seamless ROS launching.
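One way to clear a single stale entry without editing the file by hand (a standard OpenSSH command, not something the setup script runs for you):
`ssh-keygen -R <pod_ip>`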
## Imaging ROS Test
A relatively simple imaging pipeline has been built into the images for testing and characterization purposes. A USB webcam, or an action camera with a USB interface, is needed in the current configuration, although it can easily be switched to work with existing image files. For the pipeline, two ROS-nodes are launched: one capturing images from the camera and one saving those images to a storage volume. Note that it only matters where the image-capturing ROS-node is running, as it requires the camera or other image source to be present; the image-saving ROS-node can be scheduled anywhere in the cluster, as the storage volume is included in the deployment itself.

The steps to run the image pipeline are below:
- Start the cluster and add the base and drone deployments from the [Kubernetes folder][Kubernetes_folder], as discussed above.
- Start an SSH session on the base pod; get the pod's IP via:
`kubectl get pods -o wide`
and start the session by:
`ssh root@<base pod ip>`
- Run the imaging ROS-nodes via a roslaunch file from the base node:
`roslaunch base_station image_test.launch`

The pods are already configured to have access to each other via services, as well as the NFS available on the control-plane node, so we don't need to specify anything more.

If the pipeline is working, the storage directory shared from the control-plane should be filling up with images. Not every image received is saved, so there may be a delay of a few seconds between images at the moment (a Kubernetes-based processing pipeline is TBD).
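A quick way to watch the pipeline from the base pod (the directory and topic names here are assumptions; adjust them to your deployment):
```
rostopic hz <image_topic>                    # confirm images are being published
watch -n 2 'ls <image_directory> | wc -l'    # watch the number of saved files grow
```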
## Handy Troubleshooting Commands

There are a number of non-intuitive Kubernetes commands that are super helpful for specific tasks:
- Drain a particular node so that it can be removed:
`kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets`
- Another option to see pod location:
`kubectl get pods --all-namespaces -o wide`
- List all admin cluster events (can change namespace or omit):
`kubectl get events -n kube-system`
- Regenerate token join command:
`kubeadm token create --print-join-command`
Regenerate join command and save it to clipboard:
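The exact command is not shown here; a minimal sketch, assuming `xclip` is installed:
`kubeadm token create --print-join-command | xclip -selection clipboard`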
- It has happened that the cluster networking fails entirely, making services unreachable from inside pods. The solution seemed to be to delete the flannel deployment on the cluster and re-apply it afterwards. This error can be especially difficult to detect.
- If running the cluster in a scenario without internet access, please follow the guide [here][coredns-fix]. Start by running `export KUBE_EDITOR="nano"` to edit with nano if you don't know how to use VIM. Then run `kubectl edit configmap coredns -n kube-system`; we're trying to delete this block (don't ask me why or how this works, it just does):
```
forward . /etc/resolv.conf {
max_concurrent 1000
}
```
[coredns-fix]: https://serverfault.com/questions/1081685/kubernetes-coredns-is-in-crashloopbackoff-status-with-no-nameservers-found-err
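After editing the ConfigMap, the CoreDNS pods need to pick up the change; restarting the deployment is one way to force that (a generic Kubernetes command, not part of the linked guide):
`kubectl -n kube-system rollout restart deployment coredns`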
## Links With More Information

Debugging clusters can be somewhat difficult, as everything is stuffed into pods; the links below have some steps to try:
- There are some specific steps that got ARM64 builds working in my case, solution [here][docker-buildx-platform]

[podcidr-help]: https://github.com/flannel-io/flannel/blob/master/Documentation/troubleshooting.md
[crashloop]: https://managedkube.com/kubernetes/pod/failure/crashloopbackoff/k8sbot/troubleshooting/2019/02/12/pod-failure-crashloopbackoff.html
[base_station_image]: https://hub.docker.com/layers/larkinh/ros-kubernetes/latest/images/sha256-509ecd7cd9ae232026e602cd474d499bd739a78adc9f490a6380cc42f02db663?context=repo
[docker-buildx]: https://collabnix.com/building-arm-based-docker-images-on-docker-desktop-made-possible-using-buildx/
[docker-buildx-platform]: https://github.com/docker/buildx/issues/464
[Kubernetes_folder]: Kubernetes/
[docker_folder]: docker/
[yolo_folder]: docker/yolo
[ha_folder]: HAcluster/
## Old Stuff That Might Still Be Useful
### NFS Setup
The pod deployments can also make use of an NFS hosted from the control-plane node (the machine on which the cluster setup script is run). The NFS allows all pods to save images or otherwise to the same shared directory across the cluster. The setup steps are as follows:
- Install the relevant NFS tools (usually already installed on standard Linux distros; the package names below are for apt-based distros):
`sudo apt-get install nfs-kernel-server nfs-common`
- To configure which directories are included in the filesystem, edit the `/etc/exports` file on the control-plane to allow connections from all pods as well as the host itself (note, these rules might be overkill for the application, and may need edits for a larger cluster). Also, the shared folder certainly does not need to be in the home directory but it's the example the author used. The contents of the exports file should look similar to:
```
/home/<user_name>/imageNFS 10.244.0.0/24(rw)
/home/<user_name>/imageNFS 10.244.1.0/24(rw)
/home/<user_name>/imageNFS <control_plane_ip>/24(rw)
```
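If `/etc/exports` is edited while the NFS server is already running, the new entries need to be re-exported; a minimal sketch:
```
sudo exportfs -ra   # re-read /etc/exports and apply the changes
sudo exportfs -v    # list what is currently exported, to double check
```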
- The only other thing to do on the host machine is to check that the NFS service is running with:
`service nfs-server status`
- Fortunately, Kubernetes handles the mounting process for remote storage, so the `image\*` files specified in the [Kubernetes][Kubernetes_folder] directory should handle the details of the NFS.
- A quick test can be performed to make sure the NFS is working as intended. Mount the directory from a remote machine with:
`mount -t nfs <host_machine_ip>:/home/<user_name>/imageNFS /home/<remote_user_name>/imageNFS`
Once mounted without errors, create a file in the mounted directory from the remote machine (e.g. `touch /home/user/imageNFS/text.txt`) and check if it appears in the host machine's directory. The same test should also be done from both pods once scheduled on the cluster (though if they do indeed reach scheduling, chances are the storage has been mounted).
### Highly Available Cluster Setup
The cluster can also be brought up using the highly-available paradigm (more details and steps [here][HAtutorialK8] and at the official source [here][HAtutorial]), allowing multiple control-plane nodes to be in the cluster. To achieve this setup, we will create an external etcd cluster with all the nodes to be used in the K8s cluster, configure `kubeadm` to use that external etcd option, create a load balancer with kube-vip pods so there is a single point of contact, and finally add each node to the K8s cluster as a control plane.

[HAtutorial]: https://medium.com/velotio-perspectives/demystifying-high-availability-in-kubernetes-using-kubeadm-3d83ed8c458b
[HAtutorialK8]: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#stacked-control-plane-and-etcd-nodes
#### ETCD Cluster Setup
We can create an external etcd cluster using files from this repo with the following steps:
- Make sure etcd is installed on the machine, steps [here][etcdInstall], though we will be using a different `etcd.conf` file, so stop after step 1. Confirm installation with `etcd --version` (expect a warning about an unsupported ARM arch if using a Jetson; `export ETCD_UNSUPPORTED_ARCH=arm64` fixes it). This also installs the `etcdctl` tool, which we'll need later.
- Create the file `/etc/systemd/system/etcd3.service`, making it a copy of the `etcd3.service` template in the [HA cluster][ha_folder] folder. Then we can remove any other `etcdx.service` files to reduce confusion (we can also just edit the existing `etcd.service` file instead). Do this for all nodes to be added to the cluster. No changes need to be made to the service file, as we will be using an environment file to specify addresses.
- Create the environment file specified in `etcd3.service`, as `/etc/etcd.env`, with the contents of the template `etcd.env` file in the [HA cluster][ha_folder] folder. In this environment file we specify each host's IP address and hostname, and the current host's IP address and hostname. Note, only the last two lines (the ones with `THIS_*`) need to be changed per host; in this way each host can know the others' addresses. On ARM-architecture nodes, the line `ETCD_UNSUPPORTED_ARCH=arm64` should be added to the top of the environment file.
- Bringing up the etcd cluster can be tricky, because each host needs to be able to detect the others during startup, otherwise errors get thrown. So we first stop the etcd service (`sudo systemctl stop etcd3.service` on each host), then reload and restart the service definition on all hosts, one after another:
```
sudo rm -r /var/lib/etcdCluster/
sudo systemctl stop etcd3.service
sudo systemctl daemon-reload
sudo systemctl enable etcd3.service
sudo systemctl start etcd3.service
```
<!-- rm will be /etcdCluster btw -->
Note, if a cluster has already been created in this way, we need to remove the data directory of that cluster with `rm -r /var/lib/etcd/` whilst `etcd3.service` is stopped (this may require a super-user shell). `journalctl -xe` and `systemctl status etcd3.service` can be helpful for troubleshooting, and disabling the existing etcd service can make for less work on reboot.
- If all is well we should be able to confirm all nodes are present in the cluster by first specifying the addresses in the command line then requesting a member list:
```
export ENDPOINTS=<HOST_1>:2379,<HOST_2>:2379,<HOST_3>:2379
etcdctl --endpoints=$ENDPOINTS member list
```
Where `HOST_x` is the ip address of each host, the same values as in `etcd.env`. If all members are accounted for, we can move on to the load balancer steps.
<!-- in our case it'll be: export ENDPOINTS=192.168.111.200:2379,192.168.111.202:2379,192.168.111.201:2379 -->
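Beyond `member list`, the endpoint health subcommand is a useful extra check (standard etcdctl v3 usage, not part of the original steps):
`etcdctl --endpoints=$ENDPOINTS endpoint health`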
[etcdInstall]: https://docs.portworx.com/reference/knowledge-base/etcd-quick-setup/
#### Virtual IP Load Balancer
In order to use multiple nodes as control planes, we need to create a single point of access for the Kubernetes API server (there are about a million ways to do this: keepalived, Google, AWS, but we'll be using the fancy new [kube-vip][load-balancer] option). We'll be running a `kube-vip` pod on each node by placing the `kube-vip.yaml` file in the `/etc/kubernetes/manifests/` directory of each node (the `kube-vip.yaml` template can be found in the [HA cluster][ha_folder] folder, as per usual). In `kube-vip.yaml` there are some networking options; in our case we will likely need to change the interface option depending on the node's configuration. There are other options as well, and the default values are usually fine, but we should note the load-balancer IP address. Note that the pod definition needs to be in the manifests directory before starting the cluster.
#### Starting HA Cluster
To start the cluster, we can still use the cluster startup script, however we now need to specify how to reach the etcd cluster and what nodes it uses. In `kubeadm-config.yaml` (in the root directory) change the external etcd nodes' IP addresses to the values used in the etcd cluster setup step. We can now run
`./clusterStartup` on the first node (order doesn't matter, probably).
[load-balancer]: https://kube-vip.io/control-plane/