How to Create, Troubleshoot and Use NFS type Persistent Storage Volume in Kubernetes

Whether you need to simply persists the data or share data among pods, one of the options is to use Network File System (NFS) type Persistent Volumes (PV).
However, you may encounter multiple issues and a lot of times error message(s) you see in the pod's log not detailed enough or even misleading. In this blog post, I'm going to show you step by step process (with real example) of creating PV, Persistence Volume Claims (PVC) and use them in a pod. We'll also discuss the possible issues and how to resolve them.

Prerequisites for this exercise:

  1. Make sure you have working Kubernetes cluster where you can create resources as needed. 
  2. Make sure you have a working Network File System (NFS) server and is accessible from all Kubernetes nodes in the Kubernetes cluster.

Process steps:

1) Allow Kubernetes pod/container to use NFS

1.1) Check, if selinux is enabled on your Kubernetes cluster nodes/hosts (where Kubernetes pod(s) will be created). If it is enabled, we need to make sure it lets container/pod to access remote NFS share.

$> sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 28

1.2) If it is enabled, find out the value of 'virt_use_nfs'. You can use either 'getsebool' or 'semanage' utilities as shown below:

$> getsebool virt_use_nfs
virt_use_nfs --> off
or
$> sudo semanage boolean -l | grep virt_use_nfs
virt_use_nfs (off , off) Allow virt to use nfs

1.3) If value of 'virt_use_nfs' is 'off', make sure to enable it; otherwise, any attempt by Kubernetes pod to access NFS share may be denied and you may get '403 Forbidden error' from your application.You can use 'setsebool' tool to set value as '1' or 'on'

$> sudo setsebool -P virt_use_nfs 1

$> sudo semanage boolean -l | grep virt_use_nfs
virt_use_nfs (on , on) Allow virt to use nfs

Note: -P option is to set the value permanently.

2) Create NFS share on NFS server

2.1) create  a directory on NFS server. My NFS server's IP is 192.168.56.101. Here I'm creating directory '/var/rabbitmq' on NFS server as a NFS share and assigning the ownership to 'osboxes:osboxes'. We'll discuss the ownership of the share and it's relationship to pod/container security context little later in the post.

# Create directory to be shared.
sudo mkdir -p /var/rabbitmq


# Change the ownership
$> sudo chown osboxes:osboxes /var/rabbitmq


Important: The right ownership of the NFS share is crucial.

2.2) Add NFS share in /etc/exports file. Below, I'm adding all of my kubernetes nodes. Pods running on 192.168.56.101-103 will be able to access the NFS share. 'root_squash' option "squashes" the power of the remote root user to the lowest local user, preventing unauthorized alterations.

/var/rabbitmq/ 192.168.56.101(rw,sync,root_squash)
/var/rabbitmq/ 192.168.56.102(rw,sync,root_squash)
/var/rabbitmq/ 192.168.56.103(rw,sync,root_squash)


2.3) Export the NFS share.

sudo exportfs -a


3) Provisioning of PV and PVC

Let's create a PersistentVolume (PV), PersistentVolumeClaim (PVC) for RabbitMQ.
Note: it's important that the PVC and pod that uses it to be in the same namespace. You can create them all in default namespace. However, here I'm going to create a dedicated namespace for this purpose.

3.1) Create a new namespace or use existing one or default namespace.
Below yaml file (shared-services-ns.yml) defines a namespace object called 'shared-services':

apiVersion: v1
kind: Namespace
metadata:
   name: shared-services

To create the “shared-services” namespace, run the following command:

# Create a new namespace:
$> kubectl create -f shared-services-ns.yml
namespace "shared-services" created

# Verify namespace is created successfully
$> kubectl get namespaces shared-services
NAME              STATUS    AGE
shared-services   Active    36s

3.2) Create a new service account or use existing one or default:
If a service account is not set in the pod definition, the pod uses the default service account for the namespace. Here we are defining a new service account called 'shared-svc-accnt'. File: svcAccnt.yml

apiVersion: v1
kind: ServiceAccount
metadata:
   name: shared-svc-accnt
   namespace: shared-services

To create a new service account 'shared-svc-accnt', run the following command:

# Create service account
$> kubectl create -f svcAccnt.yml
serviceaccount "shared-svc-accnt" created

# Verify service account
$> kubectl describe serviceaccount shared-svc-accnt -n shared-services
Name:                shared-svc-accnt
Namespace:           shared-services
Labels:              
Annotations:         
Image pull secrets:  
Mountable secrets:   shared-svc-accnt-token-mgk9w
Tokens:              shared-svc-accnt-token-mgk9w
Events:              

3.3) Assign role/permission to service account:
Once, service account is created, make sure to provide necessary access permission to service account in the given namespace. Based on your Kubernetes platform, you may do it differently. Since, my Kubernetes is part of Docker Enterprise Edition (EE), I do it through Docker Universal Control Plane (UCP) as described in https://docs.docker.com/ee/ucp/authorization/grant-permissions/#kubernetes-grants. I'll assign 'restricted control' role to my service account 'shared-svc-accnt' in namespace 'shared-services'. If you are using MiniKube or other platform, you may want to refer to generic Kuberentes documents for RBAC and service account permission. Basically, you need to basically create the cluster role(s) and bind it to the service account. Here are some links to corresponding documentation. See https://v1-7.docs.kubernetes.io/docs/admin/authorization/rbac/#service-account-permissions and https://kubernetes.io/docs/reference/access-authn-authz/rbac/#role-and-clusterrole

3.4) Define PV object in a yaml file (rabbitmq-nfs-pv.yml):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: rabbitmq-nfs-pv
  namespace: shared-services
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  nfs:
    path: /var/rabbitmq/
    server: 192.168.56.101
  persistentVolumeReclaimPolicy: Retain

Note: currently a PVcan have “Retain”, “Recycle”, or “Delete” reclaim policies. For dynamically provisioned PV, the default reclaim policy is “Delete”. Kubernetes supports following access modes:

  • ReadWriteOnce – the volume can be mounted as read-write by a single node
  • ReadOnlyMany – the volume can be mounted read-only by many nodes
  • ReadWriteMany – the volume can be mounted as read-write by many nodes

To create a new PV 'rabbitmq-nfs-pv', run the following command:

# Create PV
$> kubectl create -f rabbitmq-nfs-pv.yml
persistentvolume "rabbitmq-nfs-pv" created

# Verify PV
$> kubectl describe pv rabbitmq-nfs-pv
Name:            rabbitmq-nfs-pv
Labels:          
Annotations:     
Finalizers:      []
StorageClass:
Status:          Available
Claim:
Reclaim Policy:  Retain
Access Modes:    RWX
Capacity:        5Gi
Node Affinity:   
Message:
Source:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    192.168.56.101
    Path:      /var/rabbitmq/
    ReadOnly:  false
Events:        

3.5) Define PVC object in a yaml file ( rabbitmq-nfs-pvc.yml):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rabbitmq-nfs-pvc
  namespace: shared-services
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi

Note: make sure to create PVC in the same namespace as your pod(s) that use it.

To create a new PVC 'rabbitmq-nfs-pvc', run the following command:

# Create PVC
$> kubectl create -f rabbitmq-nfs-pvc.yml
persistentvolumeclaim "rabbitmq-nfs-pvc" created

# Verify PVC
$> kubectl describe pvc rabbitmq-nfs-pvc -n shared-services
Name:          rabbitmq-nfs-pvc
Namespace:     shared-services
StorageClass:
Status:        Bound
Volume:        rabbitmq-nfs-pv
Labels:        
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
Finalizers:    []
Capacity:      5Gi
Access Modes:  RWX
Events:        

Important: see the status above. It's "Bound" and it's bound to volume "rabbitqm-nfs-pv" that we created in previous step. If your PVC is not able to bind with PV, then it's a problem. It could be problem in defining the PV and PVC. Make sure your PV and PVC are of same storage class (if you are using one. For details refer to https://kubernetes.io/docs/concepts/storage/storage-classes/), and PV can fully satisfy the specification defined in PVC.


3.7) Now let's put together a simple yaml file that defines service and deployment objects for RabbitMQ (rabbitmq-nfs-pv-poc-depl.yml):

apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-nfs-poc-svc
  namespace: shared-services
  labels:
    app: rabbitmq-nfs-poc-svc
spec:
  type: NodePort
  ports:
  - name: http
    port: 15672
    targetPort: 15672
  - name: amqp
    protocol: TCP
    port: 5672
    targetPort: 5672
  selector:
    app: rabbitmq-app
---
apiVersion: apps/v1beta2 # for versions prior to 1.9.0
kind: Deployment
metadata:
  name: rabbitmq-depl
  namespace: shared-services
spec:
  selector:
    matchLabels:
      app: rabbitmq-app
  replicas: 1
  template:
    metadata:
      labels:
        app: rabbitmq-app
    spec:
      serviceAccountName: shared-svc-accnt
      securityContext:
        runAsUser: 1000
        supplementalGroups: [1000,65534]
      containers:
      - name: rabbitmq-cnt
        image: rabbitmq
        imagePullPolicy: IfNotPresent
        #privileged: false
        #securityContext:
          #runAsUser: 1000
        ports:
        - containerPort: 15672
          name: http-port
          protocol: TCP
        - containerPort: 5672
          name: amqp
          protocol: TCP
        volumeMounts:
          # 'name' must match the volume name below.
          - name: rabbitmq-mnt
            # Where to mount the volume.
            mountPath: "/var/lib/rabbitmq/"
      volumes:
      - name: rabbitmq-mnt
        persistentVolumeClaim:
          claimName: rabbitmq-nfs-pvc
 

Note:
As seen in the rabbitmq-nfs-pv-poc-depl.yml above, I'm defining the security context in the pod level as:

securityContext:
  runAsUser: 1000
  supplementalGroups: [1000,65534]

Here runAsUser's value '1000' and supplementalGroups' value '1000' belong to user 'osboxes' and group 'osboxes'. gid '65534' belongs to group 'nfsnobody'.

$> id osboxes
uid=1000(osboxes) gid=1000(osboxes) groups=1000(osboxes),10(wheel),983(docker)

$> id nfsnobody
uid=65534(nfsnobody) gid=65534(nfsnobody) groups=65534(nfsnobody)

My NFS share '/var/rabbitmq' is owned by 'osboxes:osboxes', so I'm specifying those values that belong to osboxes in the securityContext.

Security context can be defined both on pod level as well as container level. Security context defined in the pod level is applied to all containers in the pod. https://kubernetes.io/docs/tasks/configure-pod-container/security-context/ has details about configuring security context for pod or container.


Following command creates rabbitmq deployment and service:

# Create objects 
$> kubectl create -f rabbitmq-nfs-pv-poc-depl.yml
service "rabbitmq-nfs-poc-svc" created
deployment.apps "rabbitmq-depl" created

# Get pods $> kubectl get pods -n shared-services
NAME                            READY     STATUS    RESTARTS   AGE
rabbitmq-depl-775496b9b-d85l7   1/1       Running   0          7s


Let's check the rabbitmq processes inside the container and files under '/var/rabbitmq' share on NFS server.

# Check process inside the container
$> kubectl exec -it rabbitmq-depl-775496b9b-d85l7 /bin/bash -n shared-services
$> ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
1000         1     0  0 12:38 ?        00:00:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
1000       162     1  0 12:38 ?        00:00:00 /usr/lib/erlang/erts-9.3.3.2/bin/epmd -daemon
1000       321     1  5 12:38 ?        00:00:03 /usr/lib/erlang/erts-9.3.3.2/bin/beam.smp -W w -

# Connect to NFS server 


$> ssh osboxes@192.168.56.101
Last login: Sun Aug 26 14:48:19 2018 from centosddcclnt

Make sure rabbitmq successfully created the file and review the file ownership
$> cd /var/rabbitmq
$> ls -la
total 28
drwxr-xr-x.  5 osboxes osboxes   4096 Aug 26 13:40 .
drwxr-xr-x. 25 root    root      4096 Aug 26 13:34 ..
-rw-------.  1 osboxes nfsnobody   40 Aug 26 13:40 .bash_history
drwxr-xr-x.  3 osboxes nfsnobody 4096 Aug 26 13:38 config
-r--------.  1 osboxes nfsnobody   20 Aug 26 01:00 .erlang.cookie
drwxr-xr-x.  4 osboxes nfsnobody 4096 Aug 26 13:38 mnesia
drwxr-xr-x.  2 osboxes nfsnobody 4096 Aug 26 13:38 schema



4) Possible issues & troubleshooting

4.1) Pod remain in pending state and pod description shows 'mount failed: exit status 32' as shown below:

$> kubectl describe pod rabbitmq-shared-app -n shared-services
Name:         rabbitmq-shared-app
Namespace:    shared-services
Node:         centosddcwrk01/192.168.56.103
Start Time:   Thu, 16 Aug 2018 17:03:19 +0100
Labels:       name=rabbitmq-shared-app
Annotations:  
Status:       Pending
IP:
  ...
  ...
...
Events:
  Type     Reason                 Age   From                   Message
  ----     ------                 ----  ----                   -------
  ...
  Warning  FailedMount            50s   kubelet, centosddcucp  MountVolume.SetUp failed for volume .... : mount failed: exit status 32

If you try to run the mount manually from inside the container, you may see following:

$> kubectl exec -it rabbitmq-depl-bd9689c8-7md48 /bin/bash -n shared-services
root@rabbitmq-depl-bd9689c8-7md48:/# pwd
/


root@rabbitmq-depl-bd9689c8-7md48:/# mount -t nfs 192.168.56.101:/var/rabbitmq /tmp/test
mount: wrong fs type, bad option, bad superblock on 192.168.56.101:/var/rabbitmq,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount. helper program)

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

In this case, review the '/etc/exports' file on NFS server.  This file controls which file systems are exported to remote hosts and specifies options. If your Kubernetes host/node is not listed
in this file with appropriate option(s), a pod running on that node will not be able to mount. Make sure to run the command 'sudo exportfs -a' once you have updated the /etc/exports. You can also try to manually mount from your host (instead of from within the container) in order to test if that host/node is authorized to mount. Refer to https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/deployment_guide/s1-nfs-server-config-exports for details.


4.2) Pod fails to instantiate and you see 'chown: changing ownership of '/var/lib/rabbitmq': Operation not permitted' error in the log as shown below:

$> kubectl create -f rabbitmq-nfs-pv-poc-depl.yml
service "rabbitmq-nfs-poc-svc" created
deployment.apps "rabbitmq-depl" created

$> kubectl get pods -n shared-services
NAME                             READY     STATUS             RESTARTS   AGE
rabbitmq-depl-5fff645d95-429vd   0/1       CrashLoopBackOff   1          14s

$> kubectl logs rabbitmq-depl-5fff645d95-429vd -n shared-services
chown: changing ownership of '/var/lib/rabbitmq': Operation not permitted

This means that the pod is able to mount successfully, however, it's not able to change the ownership of file/directory. The easiest way to resolve this issue is to have a common user that owns NFS share on NFS server and runAsUser of Kubernetes pod. For example, for this demo, I have used 'osboxes' user which owns the NFS share and also use this user's uid '1000' in the pod level security context.

$> ls -lZ /var/rabbitmq
drwxr-xr-x. osboxes nfsnobody system_u:object_r:var_t:s0       ...

$> id osboxes
uid=1000(osboxes) gid=1000(osboxes) groups=1000(osboxes),10(wheel),983(docker)

In reality, it may not be that easy. You may not have access to remote NFS server or system administrator of NFS server is not willing to change the ownership of NFS share on NFS server. In this case (as a work-around), you can use 'root' as runAsUser like below in the container level:

securityContext:
  runAsUser: 0

However, for this to work properly, the /etc/exports file on NFS server should not squash (use 'no_root_squash') the root. It should look something like this:

/var/rabbitmq/ 192.168.56.103(rw,sync,no_root_squash)

'no_root_squash' has it's own security consequences. See details here https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/deployment_guide/s1-nfs-server-config-exports

In summary, in order to grant pod's access to PVs you need to take considerations of:

  • Finding the group ID and/or user ID assigned to the actual storage (on NFS server)
  • SELinux considerations,
  • Also making sure that the IDs allowed to access physical storage match the requirements of the particular pod.

The Group IDs, the user ID, and SELinux values can be defined in the pod's SecurityContext section. User IDs can also be defined to each container. So, in short you can use the following user, group and options to control and find the right combination:
  • supplementalGroups
  • fsGroup
  • runAsUser
  • seLinuxOptions

Hope, it helps you a little bit!

Note: yaml files used in this post can be downloaded from Github location: https://github.com/pppoudel/kube-pv-pvc-demo

Upgrade to Docker EE 2.0 and UCP 3.x for Choice of Swarm or Kubernetes Orchestration

Docker Enterprise Edition (EE) 2.0 has introduced integrated Kubernetes orchestration engine along with SWARM. Since Kubernetes is installed and configured as part of the of the upgrade to Docker EE 2.0 and Universal Control Plane (UCP) 3.x, it saves a lot of time which otherwise is needed to install and setup Kubernetes environment.


In this blog post, I'm discussing the upgrade process (not going to go through each step though. Because official Docker documentation is detail enough for this) and going to direct you to the right documentation and also discuss few issues that I encountered during the upgrade and how I resolved them.


Planning for Upgrade

1) Prerequisite check for hardware/software - Docker recommends at least 8 GB of physical memory available on UCP and Docker Trusted Registry (DTR) nodes and 4 GB for other worker nodes. See details hardware and software requirement here: https://docs.docker.com/ee/ucp/admin/install/system-requirements/

2) Firewall ports - since Kubernetes master and worker nodes will be part of the upgraded environment, additional ports required for Kubernetes need to open. Details on port used can be found here: https://docs.docker.com/ee/ucp/admin/install/system-requirements/#ports-used. I put together few lines of shell script to open firewall ports (uses firewall-cmd utility). Use/modify it as needed.

openFWPortsForDockerEE.sh

#!/bin/sh
# openFWPortsForDockerEE.sh
# Opens required ports for Docker EE 2.0/UCP 3.x
# Ref:
# https://docs.docker.com/ee/ucp/admin/install/system-requirements/#ports-used
# https://docs.docker.com/datacenter/ucp/2.1/guides/admin/install/system-requirements/#network-requirements
tcp_ports="179,443,80,2375,2376,2377,2380,4001,4443,4789,6443,6444,7001,7946,8080,10250,12376-12387"
udp_ports="4789,7946"

openFW() {
   IFS=",";
   for _port in $1; do
      echo "Opening ${_port}/$2";
      sudo firewall-cmd --permanent --zone=public --add-port=${_port}/$2;
   done
   IFS=" ";
}

openFW "${tcp_ports}" tcp;
openFW "${udp_ports}" udp;

# Recycle firewall
sudo firewall-cmd --reload;

Backup Docker EE

You need to backup Docker Swarm, UCP, and DTR . Please follow this document (https://docs.docker.com/ee/backup/) for backup.

Upgrade Docker Engine

Very well documented step by step process can be found here: https://docs.docker.com/ee/upgrade/#upgrade-docker-engine

Upgrade UCP

UCP can be upgraded from UCP Web user interface (Web UI) or  command line interface (CLI). Both options are documented here: https://docs.docker.com/ee/ucp/admin/install/upgrade/#use-the-cli-to-perform-an-upgrade.

Note: If all possible try to use CLI instead of Web UI. I had upgraded my personal DEV environment using CLI and did not encounter any issue, however, one of my colleagues initially tried to use Web UI and  had issue. The upgrade process went forever, and failed.

Note: If you have less than 4 GB of memory, you'll get warning during the upgrade. It may complete successfully (as you see below) or may fail. So, it is best practice to fulfil the minimum requirement whenever possible. Below is output from my UCP 3.0 upgrade:

$> sudo docker container run --rm -it --name ucp -v /var/run/docker.sock:/var/run/docker.sock docker/ucp:3.0.0 upgrade --interactive

INFO[0000] Your engine version 17.06.2-ee-10, build 66261a0 (3.10.0-514.el7.x86_64) is compatible
FATA[0000] Your system does not have enough memory. UCP suggests a minimum of 4.00 GB, but you only have 2.92 GB. You may have unexpected errors. You may proceed by specifying the '--force-minimums' fla g, but you may experience scale and performance problems as a result
[osboxes@centosddcucp scripts]$ sudo docker container run --rm -it --name ucp -v /var/run/docker.sock:/var/run/docker.sock docker/ucp:3.0.0 upgrade --interactive --force-minimums
INFO[0000] Your engine version 17.06.2-ee-10, build 66261a0 (3.10.0-514.el7.x86_64) is compatible
WARN[0000] Your system does not have enough memory. UCP suggests a minimum of 4.00 GB, but you only have 2.92 GB. You may have unexpected errors.
WARN[0002] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
INFO[0006] Upgrade the UCP 3.0.0 installation on this cluster to 3.0.0 for UCP ID: nufs9fb696bs6rm4kxaauewly
INFO[0006] Once this operation completes, all nodes in this cluster will be upgraded.
Do you want proceed with the upgrade? (y/n): y
INFO[0017] Pulling required images... (this may take a while)
INFO[0017] Pulling docker/ucp-interlock:3.0.0
INFO[0048] Pulling docker/ucp-compose:3.0.0
INFO[0130] Pulling docker/ucp-dsinfo:3.0.0
INFO[0183] Pulling docker/ucp-interlock-extension:3.0.0
WARN[0000] Your system does not have enough memory. UCP suggests a minimum of 4.00 GB, but you only have 2.92 GB. You may have unexpected errors.
WARN[0002] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
INFO[0007] Checking for version compatibility
INFO[0007] Updating configuration for Interlock service
INFO[0038] Updating configuration for existing UCP service
INFO[0141] Waiting for cluster to finish upgrading
INFO[0146] Success! Please log in to the UCP console to verify your system.

Note: You may also find your upgrade to UCP 3.x process getting stuck while updating ucp-kv, just like we had in one of our environments. The symptom and resolution are documented here: https://success.docker.com/article/upgrade-to-ucp-3-gets-stuck-updating-ucp-kv


After the Upgrade

If you run 'docker ps' after upgrade on UCP host, all UCP related processes (like docker/ucp-*) should be of version '3.x', if you notice any of those processes still in version '2.x', meaning upgrade is not quite successful. You can also run 'docker version' and make sure the output shows 'ucp/3.x'

If your upgrade is successful, after the upgrade, you are going to notice few things right way, some of them are listed below:

1) UCP Web UI looks different now. You are going to see Kubernetes and related resources standing out as the first class citizen.

2) You may also notice that your application is not accessible any more even though corresponding service(s) may seem to be running (specifically, if you used HTTP Routing Mesh (HRM) before the upgrade). We encountered an issue (related to HRM) in our DEV environment. Before the upgrade, we had something like this configuration (fragment from  our yaml file):

version: "3.1"
services:
   testsvc:
      ...
      ...
      ports:
         - "9080"
         - "9443"
      deploy:
         ...
         ...
         labels:
            - "com.docker.ucp.mesh.http.9080=external_route=http://testsvc.devdte.com:8080,internal_port=9080"
            - "com.docker.ucp.mesh.http.9443=external_route=sni://testsvc.devdte.com:8443,internal_port=9443"
...
...



As shown above, internal port 9080 is mapped to external port 8080 (http) and internal port 9443 is mapped to external port 8443 (https) and 'testsvc.devdte.com' is configured as a host. And our routing mesh setting looked like as shown below:


Before the upgrade, the above configuration allowed us to access the service as shown below:

  • http://testsvc.devdte.com:8080/xxx
    or
  • https://testsvc.devdte.com:/8443/xxx

However, after the upgrade, we could access the application only on port 8443. If you encounter similar issue, refer to Layer 7 routing upgrade for more details.


3) Another interesting issue we encountered after the upgrade was related to HTTP header parameter being rejected. One of our applications relied on HTTP header parameter and the parameter had a underscore '_' (something like 'user_name'). After the upgrade, suddenly, application started responding with HTTP status code 502. After investigation, we found out that the Nginx - that's a part of Layer 7 routing solution, was silently rejecting this parameter because it had underscore '_'. Refer to my blog How to override Kubernetes Ingress-Nginx-Controller and Docker UCP Layer 7 Routing Configuration for details.

4) Lastly, if you are planning to use Kubernetes orchestration and 'kubectl' utility to connect to Kubernetes master, you need to download your client certificates bundle again. env.sh/env.cmd has been updated to set Kubernetes cluster, context and credentials configuration so that 'kubectl' command can securely establish connection to Kubernetes master and be able to communicate. Refer to CLI based access and Install the Kubernetes CLI for more details. Once you have installed 'kubectl' and downloaded and extracted client certificates bundle, test connectivity to Kubernetes master as follows:

# Change directory to the folder where you extracted you client certificates bundle
# and run following command to set kubernetes context, credentials and cluster configuration

$> eval "$(<env.sh)"
Cluster "ucp_ddcucphost:6443_ppoudel" set.
User "ucp_ddcucphost:6443_ppoudel" set.
Context "ucp_ddcucphost:6443_ppoudel" created.

# Confirm the connection to UCP. You should see something like this:


$> kubectl config current-context
ucp_ddcucphost:6443_ppoudel

# Inspect Kubernetes resources

$> kubectl get all
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 443/TCP 6d


How to override Kubernetes Ingress-Nginx-Controller and Docker UCP Layer 7 Routing Configuration

One of our dockerized applications mysteriously stopped working after we upgraded to Docker Enterprise Edition (EE) 2.0/Universal Control Plane (UCP) 3.x. After investigation, we found out that the Nginx that is being used as part of Docker Layer 7 routing solution was silently dropping  HTTP header parameter (refer to Missing (disappearing) HTTP Headers) which had underscore '_' (something like 'user_name') in it and our application required the value from that HTTP header parameter in order to function correctly. Note: our name based virtual hosting relied in Docker Layer 7 routing solution.
Later on, as part of migration to Kubernetes from Docker SWARM, we again encountered this issue as we were using Kubernetes' Ingress-Nginx-Controller.
In this post, I'm going to show how to resolve this issue whether it is with Docker UCP Layer 7 routing or Kubernetes' Ingress-Nginx-Controller.


Overriding Kubernetes' Ingress-Nginx-Controller configuration

Create a configMap as shown below. In this example, I'm overriding the 'underscores_in_headers' Nginx configuration to 'on' from default 'off'. Refer to this post to see what parameters are allowed in configMap.

ingress-nginx-config.yml
apiVersion: v1
kind: ConfigMap
data:
   enable-underscores-in-headers: "true"
metadata:
   name: nginx-configuration
   namespace: ingress-nginx
   labels:
      app: ingress-nginx

The key here is:
data:
   enable-underscores-in-headers: "true"


If you have existing configMap object 'nginx-configuration', then you can edit and update the parameter's value that you want to override. If configMap object does not exist, then you can create it using 'kubectl' as shown below, however, make sure you are referring this configMap object in your controller's container spec.

#edit
$> kubectl edit configMap/nginx-configuration -n ingress-nginx
# It opens the configuration into your editor, you can update any configuration and save. Saving the yaml will update the resource in the API server.

# Create
$> kubectl create -f ingress-nginx-config.yml -n ingress-nginx

In order to verify whether the configuration of ingress-nginx-controller has been updated, you can do the following:
  1. Find the ingress-nginx-controller pod using following command;
    $>kubectl get pods -n ingress-nginx
  2. See nginx.conf file and make sure the parameter you are overriding has been updated. In this case we are looking underscores_in_headers value updated from 'off' to 'on'
    $> kubectl exec nginx-ingress-controller-68db848949-ncvj7 -n ingress-nginx cat /etc/nginx/nginx.conf | grep underscores_in_headers
    underscores_in_headers on;


Overriding/customizing Docker Layer 7 routing solution configuration 

The following steps you can using Docker CLI. Make sure, secure connection has been established from where you are running Docker CLI to UCP. You can do it using Client Certificate Bundle.


  1. # export current ucp-interlock configuration to CURRENT_CONFIG_NAME variable
    $> CURRENT_CONFIG_NAME=$(docker service inspect --format '{{ (index .Spec.TaskTemplate.ContainerSpec.Configs 0).ConfigName }}' ucp-interlock)


  2. # Write information to config.toml file
    $> docker config inspect --format '{{ printf "%s" .Spec.Data }}' $CURRENT_CONFIG_NAME > config.toml

  3. # Update config.toml as below. In this case we are overriding the value of nginx
    # configuration 'underscores_in_headers' from 'off' to 'on' by changing ucp-interlock service
    # configuration 'UnderscoresInHeaders' value from 'false' to 'true'

  4. # Create updated config object
    $> docker config create UPDATED_CONFIG_NAME config.toml

  5. # Verify the object created:
    $> docker config ls
    ID NAME CREATED UPDATED
    061xu64qyotlbtrdz9l5e1s0h UPDATED_CONFIG_NAME 6 seconds ago 6 seconds ago

  6. # Update the ucp-interlock service to start using the new configuration:
    $> docker service update \
    --config-rm $CURRENT_CONFIG_NAME \
    --config-add source=$UPDATED_CONFIG_NAME,target=/config.toml \
    ucp-interlock

  7. # Wait for a minute, make sure interlock service started successfully. Look the timestamp
    $> docker ps | grep interlock

  8. # Rollback (if necessary)
    $> docker service update --update-failure-action rollback ucp-interlock

Note: the above steps can be used to update/override any other Layer 7 routing configuration. Refer to Layer 7 routing configuration reference to find out all other configurable properties.

Note: Everytime you restart (disable/enable) the Layer 7 routing solution from UCP UI, it starts with default configuration, so you have to perform above steps again to override the configuration.

Port Conflict and Resolution in WebSphere Application Server and Liberty Profile

Port conflict is one of the frequent issues that integrator and administrator encounter. In this post, I'm going to discuss how to resolve port conflict in WebSphere Application Server Traditional (WASt) and WebSphere Liberty Profile (WLP). Recently, while doing a Proof of Concept (PoC) work, I had to install and launch a WLP server on the same host where WASt was running and I received following error in the WLP's messages.log:

[4/15/18 16:59:00:048 BST] 00000017 ibm.ws.transport.iiop.security.config.ssl.yoko.SocketFactory E CWWKS9580E: The server socket could not be opened on localhost:2,809.  The exception message is Address already in use (Bind failed).
[3/19/18 9:42:51:037 EDT] 00000017 LogService-271-com.ibm.ws.management.j2ee.mejb               E CWWKE0701E: [com.ibm.ws.management.j2ee.mejb.service.ManagementEJBService(77)] The setServerStarted method has thrown an exception Bundle:com.ibm.ws.management.j2ee.mejb(id=271) java.lang.IllegalStateException: com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeException: com.ibm.ws.exception.RuntimeError: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBContainerImpl.startSystemModule(EJBContainerImpl.java:230)
        at com.ibm.ws.management.j2ee.mejb.service.ManagementEJBService.startManagementEJB(ManagementEJBService.java:161)
        ...
        Caused by: com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeException: com.ibm.ws.exception.RuntimeError: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeImpl.startSystemModule(EJBRuntimeImpl.java:968)
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBContainerImpl.startSystemModule(EJBContainerImpl.java:228)
        ... 39 more
Caused by: com.ibm.ws.exception.RuntimeError: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.runtime.AbstractEJBRuntime.startModule(AbstractEJBRuntime.java:587)
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeImpl.startSystemModule(EJBRuntimeImpl.java:964)
        ... 40 more
Caused by: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.remote.internal.EJBRemoteRuntimeImpl.bind(EJBRemoteRuntimeImpl.java:189)

As per error message above, the WLP servr could not start object request broker (ORB) service, because port 2809 is in use. Note: WLP (by default) uses port 2809 for ORB.
Next step is to find out which process is (already) using this port, you can use netstat -ntlp | grep <port> like netstat -ntlp | grep 2809.
Once you find out the PID (process id) that is listening on that particular port, you can get more detail about the process by using ps -ef | grep <PID> In my case the port 2809 was being used by WASt Nodeagent process.
Now, I had two choices: either change the WASt Nodeagent process's port number or change the IIOP Endpoint's port in WLP server. I have tested both and below I'm going to show both options.

Change the iiopEndpoint port for WLP server


It is simple, just open the server.xml for the WLP server and add following lines. Here I'm changing iiopport to 2709 and iiopsport to 9403.
Note: Make sure those new ports are not currently being used by any process. You can find out whether they are currently being used using command: netstat -na | egrep '(2709|9403)'

<iiopendpoint id="defaultIiopEndpoint" iiopport="2709">
   <iiopsoptions iiopsport="9403" sslref="defaultSSLConfig">
   </iiopsoptions>

</iiopendpoint>

Refer to https://www.ibm.com/support/knowledgecenter/en/SS7K4U_liberty/com.ibm.websphere.wlp.zseries.doc/ae/rwlp_portnums.html to find out WLP default port numbers.
Restart the WLP server after updating the port number in server.xml and review the messages.log again to make sure port binding errors gone.

Update/Change the port on WASt side


You can change/update port on WASt side two ways. Either using Administration console or by using wsadmin command. To find out more details about the PortManagement command group for the AdminTask object, refer to https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rxml_atportmgt.html

Using wsadmin command:

  1. Connect to Dmgr
    $> ./wsadmin.sh
    WASX7209I: Connected to process "dmgr" on node ubuntuwas9CellManager01 using SOAP connector; The type of process is: DeploymentManager
    WASX7031I: For help, enter: "print Help.help()"
  2. Find out the current port in question. In my case, it is for nodeagent
    wsadmin>AdminTask.listServerPorts('nodeagent', '[-nodeName ubuntuwas9Node02]') 

    u'[[IPC_CONNECTOR_ADDRESS [[[host localhost] [node ubuntuwas9Node02] [server nodeagent] [port 9629] ]]] ]\n[[CSIV2_SSL_SERVERAUTH_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 9201] ]]] ]\n[[XDAGENT_PORT [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 7062] ]]] ]\n[[OVERLAY_UDP_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 11003] ]]] ]\n[[DCS_UNICAST_ADDRESS [[[host *] [node ubuntuwas9Node02] [server nodeagent] [port 9353] ]]] ]\n[[NODE_DISCOVERY_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 7272] ]]] ]\n[[BOOTSTRAP_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 2809] ]]] ]\n[[NODE_IPV6_MULTICAST_DISCOVERY_ADDRESS [[[host ff01::1] [node ubuntuwas9Node02] [server nodeagent] [port 5001] ]]] ]\n[[SAS_SSL_SERVERAUTH_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 9901] ]]] ]\n[[SOAP_CONNECTOR_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 8878] ]]] ]\n[[NODE_MULTICAST_DISCOVERY_ADDRESS [[[host 232.133.104.73] [node ubuntuwas9Node02] [server nodeagent] [port 5000] ]]] ]\n[[ORB_LISTENER_ADDRESS [[[host ubuntuwas9][node ubuntuwas9Node02] [server nodeagent] [port 9900] ]]] ]\n[[CSIV2_SSL_MUTUALAUTH_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 9202] ]]] ]\n[[OVERLAY_TCP_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 11004] ]]] ]'
  3. As seen from command output above, the nodeagent is listening on port 2809, let's change  it to 2709.
    wsadmin>AdminTask.modifyServerPort ('nodeagent', '[-nodeName ubuntuwas9Node02 -endPointName BOOTSTRAP_ADDRESS -port 2709 -modifyShared true]')

    u'[[BOOTSTRAP_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 2709] ]]] ]'
  4. Save the changes:
    wsadmin>AdminConfig.save()
  5. If it's a federated environment, make sure to synchronise the configuration with the node(s). If your Nodeagent is currently stopped, go to the host machine where Nodeagent is installed and run syncNode.sh <dmgr-host> as shown below:
    wasadmin@ubuntuwas9:/opt/ibm/WebSphere/AppServer/profiles/AppSrv01/bin$ ./syncNode.sh ubuntuwas9 

    ADMU0116I: Tool information is being logged in file /opt/ibm/WebSphere/AppServer/profiles/AppSrv01/logs/syncNode.log
    ADMU0128I: Starting tool with the AppSrv01 profile
    ADMU0401I: Begin syncNode operation for node ubuntuwas9Node02 with Deployment
    Manager ubuntuwas9: 8879
    ADMU0016I: Synchronizing configuration between node and cell.
    ADMU0402I: The configuration for node ubuntuwas9Node02 has been synchronized with Deployment Manager ubuntuwas9: 8879

    If your nodeagent is running. You can use the following command to sync:

    wsadmin>dmgrObj=AdminControl.queryNames('WebSphere:type=DeploymentManager,*') wsadmin>AdminControl.invoke(dmgrObj, 'multiSync', '[false]', '[java.lang.Boolean]')
  6. (Re)start the Nodeagent:
    if nodeagent is stopped, start the nodeagent:
    Note: you can use the stop[start]Node.sh command from profile_root/bin directory. ./startNode.sh
    if it is currently running, stop and start it:
    ./stopNode.sh
    ./startNode.sh

Using WASt Administration console:

  1. Access WASt Administration console: https://<host>:<port>/admin
  2. Once logged in, on the left panel, navigate to "SystemAdministration" and click on "Node agents"
  3. On the right hand side, click on "Ports" link under "Additional Properties". You can see a table that contains port name and port numbers.
  4. Click on "Details" button beside the table in order to update any port.
  5. Select any and click on any port name link.
  6. Editable screen appears, update the port and click "Apply"
  7. Make sure to Synchronise the change with the nodes.
  8. Restart the affected server. (Nodeagent in this case).

Hope, this post will be helpful in your next port conflict resolution!

It's in You to Give

Today (March 27, 2018), I proudly wore my jacket with the Lapel Pin of Canadian Blood Services  on it again. And yes, I donated blood for the 2nd time, and hoping to do so much more often going forward. Don't get me wrong, this post is not really to celebrate my donation, but to encourage others like myself who are just starting to donate or thinking about it. We all need to come forward and do this noble thing, because our people, communities and countries need blood all the time.

It does not cost anything. As the Canadian Blood Services puts it in simple words, "It's in you to give." Surprisingly, blood donors get some health benefits as well. See Donor health benefits section of Wikipedia.
Finally, voluntary blood donation is a very important concept and we need to support it. See World Health Organisation's paper entitled "Towards 100% Voluntary Blood Donation - A Global Framework for Action".
Believe me, it's not hard. If I can do it, anyone can. Just make sure you're well hydrated before sitting in for the donation. If you are in Canada call the Canadian Blood Services at 1 888 2 DONATE (1-888-236-6283) or visit their website at www.blood.ca to schedule your appointment. If you are in any other jurisdiction, contact your national blood services to donate!

Update as of July 05, 2019:
Cheers again! Donated today for the third time and felt awesome.

How to Parse Apache error_log for Troubleshooting & Reporting


Note: if you haven't already, see Log Parsing, Analysis, Correlation, and Reporting Engine post first.

Apache error_log can be useful while troubleshooting production problem. So parsing and analysing the content of this file regularly helps in maintaining the overall health of the system. If mod_mpmstats enabled, error_log also contains Multi-Processing Modules (MPM) stats data. MPM stats can be used for both troubleshooting and performance tuning. http://publib.boulder.ibm.com/httpserv/manual70/mod/mod_mpmstats.html provides more details about MPM stats.
Since, error_log does not contain the Web server name, in order to co-relate the data to corresponding Web server, it is advisable that you put error_log files for each Web server under corresponding directory, named after the Web server. It is specially important when you are parsing logs from multiple Web servers. Script takes directory name as Web server name for the purpose of reporting and analysis. For example, let's say, you have Web servers 'webSrv01, webSrv02, webSrv03 ... etc., then put logs from each Web server under corresponding directories as shown below:

 /tmp/webSrv01
    error_log
    error_log_2017.09.05.log
    access_log
 /tmp/webSrv02
    error_log
    error_log_2017.09.05.log
    access_log

The naming suffix for historical files can be different from one environment to another. So, if you have different suffix for historical files, you can tweak the find script. Currently the fragment of script that finds error_log looks like this:

find $rootcontext -name "error_log*" -type f | grep "$logFileName"
where $rootcontext is root path.

Review the actual script available in github - https://github.com/pppoudel/log-parser/blob/master/webErrorLogParser.sh for details.

Note: script is written to parse the date format like '[Thu Dec 14 08:13:08 2017]' in error_log. If your error_log uses different date format, you may need to tweak the section of script which parses the date.

How to execute:
You can see all the available options, by just launching:
$> ./webErrorLogParser.sh

See below for few examples:
# processing current day's logs
$> ./webErrorLogParser.sh --rootcontext <log-path>

# processing yesterday's logs with historical report updates
$> ./webErrorLogParser.sh --rootcontext <log-path> --rpttype daily

# processing any day's logs updates
$> ./webErrorLogParser.sh --rootcontext <log-path> --recorddate <date in (YYYY-MM-DD) format>


Output
Report/Output files:
  • $rptDir/00_Alert.txt
  • $rptDir/03_WebErrorLogSummaryRpt.txt
  • $rptDir/WebErrorLogMpmStatsRpt_all.csv
  • $rptDir/WebErrorLogRpt_all.csv
Where $rptDir is report directory. Default value is $TMP/$recDate

History Report/Output files:
# These are historical reports. Each run will append record in existing report file.
  • $pDir/RecycleHistoryRpt_all.csv
  • $pDir/MPMStatsHistoryRpt.csv
Where $pDir is parent of $rptDir.

See sample summary report in github - https://github.com/pppoudel/log-parser/blob/master/sample_reports/03_WebErrorLogSummaryRpt.txt
And here is a sample MPM stats report https://github.com/pppoudel/log-parser/blob/master/sample_reports/WebErrorLogMpmStatsRpt_all.csv

See my other posts in this series
  1. websphereLogParser.sh for parsing, analyzing and reporting WebSphere Application Server (WAS) SystemOut.log
  2. webAccessLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) access_log
  3. javaGCStatsParser.sh for parsing, analyzing and reporting Java verbose Garbage Collection (GC) log

Log Parsing, Analysis, Correlation, and Reporting Engine

 
   In the last few months, I have been helping to identify and resolve production issues (both performance and product related). I had to analyze vast amount of logs, identify performance degradation and deviation, and issues related to Java heap and Garbage Collection (GC), as well as  different issues affecting the health of WebSphere Application Server (WAS). In order to do the above-mentioned tasks efficiently, I have employed different tools (both open source and commercial). Even-though these tools are readily available, and usually good what they do, they may not be as effective as we like for our particular circumstances and we end-up writing our own custom tool or script to complement in certain areas. Same story here, I ended up writing a custom tool (let me call it a Log Parser)  for log parsing, analyzing, making correlation, and reporting. I'm sharing my custom Log Parser here, hoping that it may be useful for other people as well. It is written in AWK and Shell script. It processes the following logs:
  • SystemOut.log generated by IBM WebSphere Application Server (WAS)  
  • access_log and error_log logs generated by Apache or IBM HTTP Server(IHS)
  • native_stdout.log or verbose GC logs generated by Java Virtual Machine (JVM).
Let me shed some light on the internal functioning of the Parser visually. See the diagram 1.0 below




Diagram 1.0

As depicted in the diagram, the Parser is made of a set of script files (collection of different Parsers) and wrapper script - together acting like a suite. Each parser can be executed independently or invoked by the wrapper script. The Parser is driven by the logic in the script and is controlled by the input parameters and their values (control parameters, threshold parameters, correlation parameters, and transaction baseline values). It consumes the logs and writes different reports as an output.
Most interesting part is the input here.  The Feedback loop/mechanism as shown in the diagram is to let the analyst know that he/she should continually refine the threshold and other applicable input parameter values based on output analysis.This feedback mechanism makes the Parser - a kind of expert engine. So, it is very important to regularly update your threshold values, filter keywords, and maintain an well established performance baseline. Parser helps you to maintain feedback mechanism, because it collects vital statistics and updates historical data files, doing so, it is not only collecting important data, but also quantifying the system events. Quantification helps to compare, generate alerts and make decisions. For example, you quantify in average how many particular errors you get per day or per hour or per server, or per transaction and based on that you define your threshold value. Let's say, based on a month long of observation, average number of daily transaction errors from server A, fluctuates from 10 to 30 in normal situation. So, your high mark for normal situation is 30. Based on this data, you can define your threshold value 35 for that particular error for that server.
What are the key benefits of using this Parser:
  1. Make troubleshooting faster and effective with built-in intelligence from lesson learned and baseline data. Parser identifies critical errors, their frequency, location, key performance numbers, current state of the environment like how many users, sessions, transactions, (if any) anomalies in the system, which help to narrow down the issue(s). 
  2. Automatically collect key statistical data (performance, error or usage) and build a data mart. Parser collects all vital statistics like performance numbers, performance range, hourly user/session statistics, heap snapshots etc. and updates historical data files. These data can be used to generate history report and also in decision making process.
  3.  Auto generate key summary reports for internal consumption and create delimited data files, which can be imported to spread sheet like Excel to prepare management reports. Basically, it can provide visibility to your entire application infrastructure. 
  4.  Create correlation. Parser creates correlation so that it becomes easier to identify and map transaction path (Web server to the Application server). 
  5. Generate warning for possible future incidents/events. Parser can provide early warning of possible future events. Here is an example of generated warning: "2.18383e+06 : average of Perm Generation After Full GC exceeds threshold 2097152 (K).  There is a possibility of OutOfMemory in near future because of Not sufficient PermGen Space for AppSrv04"

Getting started is very simple. No big-bang installation, or configuration is required. If you are running in Unix like environment, you just download the script, and launch the Parser from the directory where it is downloaded. If you are on Windows, you need Cygwin or Bash Shell that comes with MINGW to execute it.

How to execute?

You can see all the available options, by just executing:

$> ./masterLogParser.sh

Manadatory option '--rootcontext' or '-c' missing

-c|--rootcontext: Required. Source path from where log files are read.
-t|--rpttype: Optional. Values are: 'daily' or 'ondemand'. 'ondemand' is default value.
It is used to control logic - like whether or not to update historical data files.
Only 'daily' option creates and updates historical data files.
-d|--recorddate: Optional. It is the log entry date. Meaning log entries with that date will be processed.
It takes the format 'YYYY-MM-DD'. Default is to use current date. However, if 'daily' is chosen as 2nd argument, and log entry date is not provided, it defaults to 'date - 1 day'.
-l|--rptloc: Optional. It is report directory where all generated reports are written.
Default value is /tmp/
-o|--procoption: Optional. It represents the processing option. Values can be 'full' or 'partial'.
Default value is 'partial'. This option is currently being used only for Verbose GC log parser.

Here are few examples:

# processing current day's logs
$> ./masterLogParser.sh --rootcontext <log-path>

# processing yesterday's logs with historical report updates
$> ./masterLogParser.sh --rootcontext <log-path> --rpttype daily

# processing any day's logs updates
$> ./masterLogParser.sh --rootcontext <log-path> --recorddate <date in (YYYY-MM-DD) format>

See masterLogParser.sh in github: https://github.com/pppoudel/log-parser/blob/master/masterLogParser.sh

Input

1. thresholdValues.csv

As name implies, this file contains pre-defined name and threshold value pair for certain condition or events. Parser lookups these pre-defined condition, and when it detects one in a log file, it compares with threshold value and triggers/writes notification into output file (00_Alert.txt) if logged event exceeds the threshold value. Threshold can be performance based like 'notify if maximum response time exceeds 9 seconds' or event based like 'notify if maximum fatal count for a JVM exceeds 5'
Format:
Each line in thresholdValues.csv has multiple columns separated by pipe '|' and represent threshold definition for one complete event condition. See below:

event-name|value|server-identifier|event-description
e.g.
httpAvgRespTimeTh|2.5|http|Threshold for Average response time in sec.



Where:
event-name: name of the event like httpAvgRespTimeTh (http) Average Response Time threshold.
value: threshold value for this specific event. In this case it is 2.5 seconds.
server-identifier: Which log/server this value belongs to. In this case it is 'http' server.
event-description: provides some details what this threshold is about.

See a sample thresholdValues.csv in github: https://github.com/pppoudel/log-parser/blob/master/thresholdValues.csv

2. perfBaseLine.csv

This file contains pre-defined transactions (request URIs) and their baseline response time (in seconds). I suppose, you can get content for this file from your performance test result.

Format:
Each line in perfBaseLine.csv has two columns separated by pipe '|' which represent performance value for a given transaction (request/response). See below:

request-name|response-time (in seconds)
e.g.
finManagement/account_add.do|1.57756

Where:
request-name: represents request/response URI or transaction name, whatever you call it. In this case it is finManagement/account_add.do
response-time: response time for the transaction to complete in seconds. 1.5 seconds in this case.
See a sample perfBaseLine.csv in github: https://github.com/pppoudel/log-parser/blob/master/perfBaseLine.csv

3. WASCustomFilter.txt

Currently this input file is only consumed by websphereLogParser. It  defines some custom error/keywords. It is to tell parser that you're interested and like to know if certain keywords or string in general are logged (because of certain condition) in a log file, which may be  non-standard and specific to your environment/application.

Format:
It uses Regular expression to define custom error/keywords. Each new error definition goes to new line. See below:

Error.*Getting.*Folder
503.*Service.*Temporarily.*Unavailable
CORBA.*NO_RESPONSE
ORA-01013:

See a sample WASCustomFilter.txt in github: https://github.com/pppoudel/log-parser/blob/master/WASCustomFilter.txt

4. WAS_CloneIDs.csv

This file contains information that defines relationship (mapping) between HTTP session clone ID and WAS name. Clone ID constitutes part of HTTP session and can be logged into Web Server access_log. With the relationship in hand, we can generate helpful analytical data that helps to identify transaction/request path end to end. Easiest way to find out clone ID for each WAS is to look your plugin-cfg.xml file.

Format:
Each line in WAS_CloneIDs.csv four columns separated by pipe '|'. See below:

cloneID|WAS-name|hostname
e.g.
23532em3r|AppSrv01|washost082

Where:
cloneID cloneID is part of jSession. 23532em3r in above example. Refer to https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/txml_httpsessionclone.html
WAS-name  WebSphere Application Server (WAS) name. AppSrv01 in above example.
hostname Hostname of machine/server where particular WAS resides. washost082 in above example.


See a sample WAS_CloneIDs.csv in github: https://github.com/pppoudel/log-parser/blob/master/WAS_CloneIDs.csv

Output:

Each Parser update Alert file, and history reports (only if report type is 'daily') as well as generate summary report and other report files. For the complete list, see '#--------- Report/Output files -------#'  and '#--------- History Report/Output files -------#' sections in each script file.

For further detail of each individual parser, visit the following blog posts:
  1. websphereLogParser.sh for parsing, analyzing and reporting WebSphere Application Server (WAS) SystemOut.log
  2. webAccessLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) access_log
  3. webErrorLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) error_log
  4. javaGCStatsParser.sh for parsing, analyzing and reporting Java verbose Garbage Collection (GC) log