## Image building
Go into the [docker][docker_folder] directory for Dockerfiles and bash scripts; each type of image has its own directory.
- Building the base image is relatively straightforward:
`docker build -t <tag_name> -f Dockerfile .`
- This should load the image into your local Docker registry (or whatever registry you're using). To build the remote images, which are designed to run on Nvidia Jetsons mounted on a UAV, the build command is slightly more complex due to the difference in architecture (a buildx setup sketch follows at the end of this list):
`docker buildx build --platform linux/arm64 -o type=oci,dest=<image_name>.tar -t <tag_name> .`
- The tarball can then be loaded into the Docker registry on the control-plane node or (more likely) on the ARM processor. The extra step is that the tarball needs to be copied to the remote processor and loaded into the relevant registry:
`scp <tarball_filename>.tar username@remote:/home/username/Downloads`
- Then run the loading command on the remote processor (from the directory containing the image):
`docker image load -i <image_name>.tar`
- Since an image loaded from a tarball arrives untagged (that has been the author's experience, in any case), tag the newly loaded image. First, find the ID of the new image:
`docker image ls`
- Look for the most recently added untagged image, copy its ID, and run:
`docker image tag <image_ID> <image_name>`
- The image names used in this project are `llh/basestation:v0` for the base image and `llh/drone:v0` for the drone image. Keeping the image names the same avoids the need to edit the deployment YAML files. Once tagged, we can verify the image is in fact the correct architecture with:
`docker image inspect <tag_name>`
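For reference, a minimal sketch of the cross-build and verification steps, assuming a buildx builder and QEMU emulation are not yet set up on the x86 build machine (the builder name `jetson-builder` and the `drone.tar` filename are illustrative, not something defined in this repo):

```
# One-time buildx setup on the build machine (hypothetical builder name)
docker buildx create --name jetson-builder --use
# Register QEMU emulation so arm64 layers can be built on an x86 host
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Cross-build the drone image and export it as an OCI tarball
docker buildx build --platform linux/arm64 -o type=oci,dest=drone.tar -t llh/drone:v0 .

# After loading and tagging on the Jetson, confirm the architecture
docker image inspect --format '{{.Os}}/{{.Architecture}}' llh/drone:v0   # expect linux/arm64
```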
### NFS Setup
The pod deployments also make use of an NFS hosted from the control-plane node (the machine on which the cluster setup script is run). The NFS allows all pods to save images or other data to the same shared directory across the cluster. The setup steps are as follows:
- Install the relevant NFS tools (often already installed on standard Linux distros; on RPM-based systems the package is `nfs-utils` instead):
`sudo apt-get install nfs-kernel-server nfs-common`
- To configure which directories are shared, edit the `/etc/exports` file on the control-plane node to allow connections from all pods as well as from the host itself (these rules might be overkill for the application, and may need edits for a larger cluster). The shared folder certainly does not need to live in the home directory; it is simply the example the author used. The contents of the exports file should look similar to the block below (a sketch for applying and checking the exports follows at the end of this list):
```
/home/<user_name>/imageNFS 10.244.0.0/24(rw)
/home/<user_name>/imageNFS 10.244.1.0/24(rw)
/home/<user_name>/imageNFS <control_plane_ip>/24(rw)
```
- The only other thing to do on the host machine is to check that the NFS service is running with:
`service nfs-server status`
- Fortunately, Kubernetes handles the mounting process for remote storage, so the `image*` files in the [kubernetes][kubernetes_folder] directory should handle the details of the NFS.
- A quick test can be performed to make sure the NFS is working as intended. Mount the directory from a remote machine with:
`mount -t nfs <host_machine_ip>:/home/<user_name>/imageNFS /home/<remote_user_name>/imageNFS`
Once it mounts without errors, create a file in the mounted directory from the remote machine (e.g. `touch /home/user/imageNFS/text.txt`) and check whether it appears in the host machine's directory. The same test should also be done from both pods once they are scheduled on the cluster (though if they do reach scheduling, chances are the storage has been mounted).
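After editing `/etc/exports`, the new shares need to be (re-)exported before clients can see them. A minimal sketch, assuming the standard `nfs-kernel-server` tooling and the example paths above:

```
# Re-export everything listed in /etc/exports and show what is currently shared
sudo exportfs -ra
sudo exportfs -v

# From a remote machine: confirm the exports are visible before attempting the mount
showmount -e <host_machine_ip>
```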
## Cluster Setup
After the images are built, the next step is likely to bring up a cluster. One note about nomenclature: both ROS and Kubernetes have the concept of nodes, but they refer to two different objects in our cluster. ROS nodes are processes that run in Kubernetes pods, while Kubernetes nodes are the machines that run pods.
The `clusterSetup.sh` script should set up a Kubernetes cluster, with some assumptions about the application. It can be run via (it should request a password due to the swap-memory step):
`./clusterSetup.sh`
Once the cluster is running, the next step is likely to add remote nodes. The cluster setup script also copies the node join command generated when creating the cluster, to make it easier to add remote machines. To add a node, start an SSH session on the remote machine and run:
`sudo <pasted_join_command>`
which should look similar to:
```
kubeadm join <control_plane_ip>:<port> --token qwf631.i58dqjxebj01u3h9 --discovery-token-ca-cert-hash sha256:5701ff981a1b22145b4e8a631bd9567525adb76643bece39c0533173b5de579a
```
And now you see why the command gets copied by the setup script.
Deployments can be added by running (note that the `nodeSelector` and `image` fields may require editing depending on hostnames and image names):
`kubectl apply -f kubernetes/`
This will add all objects/pods listed in the [kubernetes directory][kubernetes_folder].
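A few sanity checks after joining nodes and applying the deployments can save debugging time later; these are standard `kubectl` queries rather than anything specific to this repo:

```
# The remote node should appear and eventually report Ready
kubectl get nodes -o wide

# The base and drone pods should reach the Running state
kubectl get pods -o wide

# If a pod is stuck in Pending or CrashLoopBackOff, the events usually say why
kubectl describe pod <pod_name>
```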
If the cluster setup script does not work for the application, the low-level steps are as follows:
- Turn off swap:
`swapoff -a`
- Start the cluster; the config file sets the pod subnet (this may take some time):
`kubeadm init --config kubeadm-config.yml`
- Move the config over:
```
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```
- Apply flannel networking:
`kubectl apply -f flannel.yml`
- Remove the taint from the control plane so pods can be scheduled on it (see the note after this list for newer Kubernetes versions):
`kubectl taint nodes --all node-role.kubernetes.io/master-`
- Label node(s) if a deployment calls for it (the ones in this repo only rely on hostname):
`kubectl label nodes <node_name> <tag>=<value>`
- Start applying deployments, starting with the base deployment:
`kubectl apply -f kubernetes/base-deployment.yml`
- Continue applying as many deployments as needed
- Turn off the cluster:
`kubeadm reset`
potentially followed by [clean-up][cleanupCluster]
[cleanupCluster]: https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-reset/
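One caveat on the taint step above: newer Kubernetes releases renamed the control-plane taint, so on recent versions the `master` key may no longer exist. A hedged sketch covering both variants (an error about a missing taint key can simply be ignored):

```
# Older releases use the master taint key
kubectl taint nodes --all node-role.kubernetes.io/master-
# Newer releases use the control-plane key instead
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```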
### SSH Server Bash shell
Alternatively, an SSH connection can be created when using the built images discussed above (they already include the OpenSSH server environment, set up and ready). Note that the pods need to be fully scheduled in the cluster and in the Running state for the IPs to be available and the SSH servers running. Get the relevant IPs via:
`kubectl get pods -o wide`
- The pod in question can then be SSH'ed into via (the pod IPs are usually of the form 10.244.\*.\*, depending on your particular cluster setup):
`ssh ssher@<pod_ip>`
- If prompted for a password, use `password`; it is the default password for the `ssher` user for now (security is clearly not a priority for this project). There is also the problem of network identity for the pods: since a pod's identity can change between deployments (which is also why we look up the IP each time), the SSH client on Linux can sometimes throw a man-in-the-middle error. To get around this, delete the offending entry in the `~/.ssh/known_hosts` file as suggested by the error message (a one-liner for this is sketched below). Instead of having the cluster setup script delete the known-hosts file, we leave it to the user to delete specific entries for now.
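Rather than editing `~/.ssh/known_hosts` by hand, the stale entry can be removed with `ssh-keygen`; `<pod_ip>` here is whatever IP `kubectl get pods -o wide` reported for the pod last time:

```
# Remove the old host key for the pod's previous IP from ~/.ssh/known_hosts
ssh-keygen -R <pod_ip>
```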
The advantage of using the SSH command instead of `kubectl exec ...` is that the `ssher` user already has the correct ROS environment sourced and ready to use. In normal use, only the base pod (also referred to as the master pod in some places) should be needed to control the cluster, although both pods are available via SSH.
## Imaging ROS Test
- Start the cluster and add the base and drone deployments from the [kubernetes folder][kubernetes_folder], as discussed above.
- Start an SSH session on the base pod. Get the pod's IP via:
`kubectl get pods -o wide`
and start the session with:
`ssh ssher@<base_pod_ip>`
- Run the imaging ROS nodes via a roslaunch file from the base pod:
`roslaunch base_station image_test.launch`
The pods are already configured to have access to each other via services, as well as to the NFS hosted on the control-plane node, so nothing more needs to be specified.
If the pipeline is working, the NFS directory shared from the control-plane should fill up with images. Not every image received is saved, so there may be a delay of a few seconds between images at the moment (a Kubernetes-based processing pipeline is TBD).
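One simple way to watch for new images arriving, run on the control-plane host (assuming the example NFS path used above):

```
# List the newest files in the shared directory every two seconds
watch -n 2 'ls -lt /home/<user_name>/imageNFS | head'
```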
There are a number of non-intuitive Kubernetes commands that are super helpful for specific tasks:
- Drain a particular node so that it can be removed:
`kubectl drain <node_name> --delete-emptydir-data --force --ignore-daemonsets`
then
`kubectl delete node <node_name>`
- Get a bash terminal inside a pod:
`kubectl exec -it <pod_name> -- bash`
- Get which pods are on which nodes (a travesty, this one):
`kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName --all-namespaces`
- Another option to see pod location:
`kubectl get pods --all-namespaces -o wide`
- Regenerate the token join command:
`kubeadm token create --print-join-command`
Regenerate the join command and copy it to the clipboard:
`kubeadm token create --print-join-command | xclip -selection clipboard`
Note that `sudo` is not required for the token commands.
- Add a pod CIDR to remote nodes; this is sometimes required depending on which network interface is used to add a node (more testing is required for this issue):
`kubectl patch node <NODE_NAME> -p '{"spec":{"podCIDR":"<SUBNET>"}}'`
This comes from a really niche [GitHub page][podcidr-help].
- Another option to get individual pod cluster IPs:
`kubectl get pods -o wide`
Alternatively:
`kubectl cluster-info dump | grep IP`
- It has happened that the cluster networking fails entirely, making services unreachable from inside pods. The solution seemed to be to delete the flannel deployment on the cluster and re-apply it afterwards (see the sketch below). This error can be especially difficult to detect.
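A rough sketch of that flannel reset, assuming flannel was applied from the `flannel.yml` manifest used in the manual setup steps (the flannel pods live in the `kube-flannel` namespace on recent manifests, or `kube-system` on older ones):

```
# Tear down and re-apply the flannel networking manifest
kubectl delete -f flannel.yml
kubectl apply -f flannel.yml

# Wait for the flannel pods to come back up before retesting service reachability
kubectl get pods -n kube-flannel -o wide
```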
[docker-buildx]: https://collabnix.com/building-arm-based-docker-images-on-docker-desktop-made-possible-using-buildx/
[docker-buildx-platform]: https://github.com/docker/buildx/issues/464
[kubernetes_folder]: kubernetes/
[docker_folder]: docker/