Upgrade to Docker EE 2.0 and UCP 3.x for Choice of Swarm or Kubernetes Orchestration

Docker Enterprise Edition (EE) 2.0 introduces an integrated Kubernetes orchestration engine alongside Swarm. Since Kubernetes is installed and configured as part of the upgrade to Docker EE 2.0 and Universal Control Plane (UCP) 3.x, it saves much of the time that would otherwise be needed to install and set up a Kubernetes environment.


In this blog post, I discuss the upgrade process (without going through each step, because the official Docker documentation covers them in enough detail), point you to the right documentation, and discuss a few issues that I encountered during the upgrade and how I resolved them.


Planning for Upgrade

1) Prerequisite check for hardware/software - Docker recommends at least 8 GB of physical memory available on UCP and Docker Trusted Registry (DTR) nodes and 4 GB for other worker nodes. See detailed hardware and software requirements here: https://docs.docker.com/ee/ucp/admin/install/system-requirements/

2) Firewall ports - since Kubernetes master and worker nodes will be part of the upgraded environment, the additional ports required by Kubernetes need to be opened. Details on the ports used can be found here: https://docs.docker.com/ee/ucp/admin/install/system-requirements/#ports-used. I put together a few lines of shell script to open the firewall ports (it uses the firewall-cmd utility). Use/modify it as needed.

openFWPortsForDockerEE.sh

#!/bin/sh
# openFWPortsForDockerEE.sh
# Opens required ports for Docker EE 2.0/UCP 3.x
# Ref:
# https://docs.docker.com/ee/ucp/admin/install/system-requirements/#ports-used
# https://docs.docker.com/datacenter/ucp/2.1/guides/admin/install/system-requirements/#network-requirements
tcp_ports="179,443,80,2375,2376,2377,2380,4001,4443,4789,6443,6444,7001,7946,8080,10250,12376-12387"
udp_ports="4789,7946"

openFW() {
   # $1 = comma-separated port list, $2 = protocol (tcp/udp)
   _oldIFS=$IFS;
   IFS=",";
   for _port in $1; do
      echo "Opening ${_port}/$2";
      sudo firewall-cmd --permanent --zone=public --add-port=${_port}/$2;
   done
   # Restore the original IFS (resetting it to a single space would lose
   # the default tab/newline separators)
   IFS=$_oldIFS;
}

openFW "${tcp_ports}" tcp;
openFW "${udp_ports}" udp;

# Recycle firewall
sudo firewall-cmd --reload;
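
Once the firewall is reloaded, you can verify that the ports were added (the zone name may differ in your environment):

$> sudo firewall-cmd --zone=public --list-ports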

Backup Docker EE

You need to back up Docker Swarm, UCP, and DTR. Please follow this document (https://docs.docker.com/ee/backup/) for the backup procedure.

Upgrade Docker Engine

A well-documented step-by-step process can be found here: https://docs.docker.com/ee/upgrade/#upgrade-docker-engine

Upgrade UCP

UCP can be upgraded from the UCP web user interface (Web UI) or the command line interface (CLI). Both options are documented here: https://docs.docker.com/ee/ucp/admin/install/upgrade/#use-the-cli-to-perform-an-upgrade.

Note: If at all possible, use the CLI instead of the Web UI. I upgraded my personal DEV environment using the CLI and did not encounter any issue; however, one of my colleagues initially tried the Web UI and had an issue: the upgrade ran forever and eventually failed.

Note: If you have less than 4 GB of memory, you'll get a warning during the upgrade. The upgrade may complete successfully (as you see below) or may fail, so it is best practice to meet the minimum requirements whenever possible. Below is the output from my UCP 3.0 upgrade:

$> sudo docker container run --rm -it --name ucp -v /var/run/docker.sock:/var/run/docker.sock docker/ucp:3.0.0 upgrade --interactive

INFO[0000] Your engine version 17.06.2-ee-10, build 66261a0 (3.10.0-514.el7.x86_64) is compatible
FATA[0000] Your system does not have enough memory. UCP suggests a minimum of 4.00 GB, but you only have 2.92 GB. You may have unexpected errors. You may proceed by specifying the '--force-minimums' flag, but you may experience scale and performance problems as a result
[osboxes@centosddcucp scripts]$ sudo docker container run --rm -it --name ucp -v /var/run/docker.sock:/var/run/docker.sock docker/ucp:3.0.0 upgrade --interactive --force-minimums
INFO[0000] Your engine version 17.06.2-ee-10, build 66261a0 (3.10.0-514.el7.x86_64) is compatible
WARN[0000] Your system does not have enough memory. UCP suggests a minimum of 4.00 GB, but you only have 2.92 GB. You may have unexpected errors.
WARN[0002] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
INFO[0006] Upgrade the UCP 3.0.0 installation on this cluster to 3.0.0 for UCP ID: nufs9fb696bs6rm4kxaauewly
INFO[0006] Once this operation completes, all nodes in this cluster will be upgraded.
Do you want proceed with the upgrade? (y/n): y
INFO[0017] Pulling required images... (this may take a while)
INFO[0017] Pulling docker/ucp-interlock:3.0.0
INFO[0048] Pulling docker/ucp-compose:3.0.0
INFO[0130] Pulling docker/ucp-dsinfo:3.0.0
INFO[0183] Pulling docker/ucp-interlock-extension:3.0.0
WARN[0000] Your system does not have enough memory. UCP suggests a minimum of 4.00 GB, but you only have 2.92 GB. You may have unexpected errors.
WARN[0002] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
INFO[0007] Checking for version compatibility
INFO[0007] Updating configuration for Interlock service
INFO[0038] Updating configuration for existing UCP service
INFO[0141] Waiting for cluster to finish upgrading
INFO[0146] Success! Please log in to the UCP console to verify your system.

Note: You may also find your upgrade to UCP 3.x getting stuck while updating ucp-kv, as we did in one of our environments. The symptom and resolution are documented here: https://success.docker.com/article/upgrade-to-ucp-3-gets-stuck-updating-ucp-kv


After the Upgrade

If you run 'docker ps' on the UCP host after the upgrade, all UCP-related processes (like docker/ucp-*) should be of version '3.x'. If you notice any of those processes still at version '2.x', the upgrade was not entirely successful. You can also run 'docker version' and make sure the output shows 'ucp/3.x'.
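
For a quick check, something like the following should work (a sketch: the exact image names vary by installation, and 'docker version' reports the UCP version string when your client bundle points at UCP):

# List the unique UCP image versions running on this host
$> docker ps --format '{{.Image}}' | grep 'docker/ucp' | sort -u

# Check the server version string
$> docker version --format '{{.Server.Version}}'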

If your upgrade is successful, you are going to notice a few things right away; some of them are listed below:

1) The UCP Web UI looks different now. You are going to see Kubernetes and related resources standing out as first-class citizens.

2) You may also notice that your application is not accessible any more even though the corresponding service(s) seem to be running (specifically if you used the HTTP Routing Mesh (HRM) before the upgrade). We encountered such an issue (related to HRM) in our DEV environment. Before the upgrade, we had something like the following configuration (a fragment from our yaml file):

version: "3.1"
services:
   testsvc:
      ...
      ...
      ports:
         - "9080"
         - "9443"
      deploy:
         ...
         ...
         labels:
            - "com.docker.ucp.mesh.http.9080=external_route=http://testsvc.devdte.com:8080,internal_port=9080"
            - "com.docker.ucp.mesh.http.9443=external_route=sni://testsvc.devdte.com:8443,internal_port=9443"
...
...



As shown above, internal port 9080 is mapped to external port 8080 (http), internal port 9443 is mapped to external port 8443 (https), and 'testsvc.devdte.com' is configured as the host. (Our routing mesh settings screen reflected the same mappings.)


Before the upgrade, the above configuration allowed us to access the service as shown below:

  • http://testsvc.devdte.com:8080/xxx
    or
  • https://testsvc.devdte.com:8443/xxx

However, after the upgrade, we could access the application only on port 8443. If you encounter a similar issue, refer to the Layer 7 routing upgrade documentation for more details.


3) Another interesting issue we encountered after the upgrade was related to an HTTP header parameter being rejected. One of our applications relied on an HTTP header parameter that contained an underscore '_' (something like 'user_name'). After the upgrade, the application suddenly started responding with HTTP status code 502. After investigation, we found out that the Nginx instance that is part of the Layer 7 routing solution was silently rejecting this parameter because of the underscore '_'. Refer to my blog How to override Kubernetes Ingress-Nginx-Controller and Docker UCP Layer 7 Routing Configuration for details.

4) Lastly, if you are planning to use Kubernetes orchestration and the 'kubectl' utility to connect to the Kubernetes master, you need to download your client certificate bundle again. env.sh/env.cmd has been updated to set the Kubernetes cluster, context, and credentials configuration so that the 'kubectl' command can securely establish a connection to the Kubernetes master and communicate with it. Refer to CLI based access and Install the Kubernetes CLI for more details. Once you have installed 'kubectl' and downloaded and extracted the client certificate bundle, test connectivity to the Kubernetes master as follows:

# Change directory to the folder where you extracted your client certificate bundle
# and run the following command to set the kubernetes context, credentials, and cluster configuration

$> eval "$(<env.sh)"
Cluster "ucp_ddcucphost:6443_ppoudel" set.
User "ucp_ddcucphost:6443_ppoudel" set.
Context "ucp_ddcucphost:6443_ppoudel" created.

# Confirm the connection to UCP. You should see something like this:


$> kubectl config current-context
ucp_ddcucphost:6443_ppoudel

# Inspect Kubernetes resources

$> kubectl get all
NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   6d
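
# Optionally, confirm that the masters and workers registered with
# Kubernetes (node names and status will vary by environment)

$> kubectl get nodes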


How to override Kubernetes Ingress-Nginx-Controller and Docker UCP Layer 7 Routing Configuration

One of our dockerized applications mysteriously stopped working after we upgraded to Docker Enterprise Edition (EE) 2.0/Universal Control Plane (UCP) 3.x. After investigation, we found out that the Nginx that is used as part of the Docker Layer 7 routing solution was silently dropping an HTTP header parameter (refer to Missing (disappearing) HTTP Headers) which had an underscore '_' in it (something like 'user_name'), and our application required the value of that HTTP header parameter in order to function correctly. Note: our name-based virtual hosting relied on the Docker Layer 7 routing solution.
Later on, as part of our migration from Docker Swarm to Kubernetes, we encountered this issue again, as we were using Kubernetes' Ingress-Nginx-Controller.
In this post, I'm going to show how to resolve this issue whether it is with Docker UCP Layer 7 routing or Kubernetes' Ingress-Nginx-Controller.


Overriding Kubernetes' Ingress-Nginx-Controller configuration

Create a ConfigMap as shown below. In this example, I'm overriding the 'underscores_in_headers' Nginx configuration from the default 'off' to 'on'. Refer to this post to see what parameters are allowed in the ConfigMap.

ingress-nginx-config.yml
apiVersion: v1
kind: ConfigMap
data:
   enable-underscores-in-headers: "true"
metadata:
   name: nginx-configuration
   namespace: ingress-nginx
   labels:
      app: ingress-nginx

The key here is:
data:
   enable-underscores-in-headers: "true"


If you have an existing ConfigMap object 'nginx-configuration', you can edit it and update the value of the parameter you want to override. If the ConfigMap object does not exist, you can create it using 'kubectl' as shown below; however, make sure your controller's container spec references this ConfigMap, as illustrated in the fragment that follows.
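
Here is a hedged fragment showing how the controller's Deployment typically references that ConfigMap (the image tag and names below are illustrative, not taken from our environment):

containers:
   - name: nginx-ingress-controller
     image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
     args:
        - /nginx-ingress-controller
        - --configmap=$(POD_NAMESPACE)/nginx-configuration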

# Edit
$> kubectl edit configMap/nginx-configuration -n ingress-nginx
# This opens the configuration in your editor; you can update any configuration and save. Saving the yaml will update the resource in the API server.

# Create
$> kubectl create -f ingress-nginx-config.yml -n ingress-nginx

In order to verify that the configuration of the ingress-nginx-controller has been updated, you can do the following:
  1. Find the ingress-nginx-controller pod using the following command:
    $> kubectl get pods -n ingress-nginx
  2. Check the nginx.conf file and make sure the parameter you are overriding has been updated. In this case, we are looking for the underscores_in_headers value to change from 'off' to 'on':
    $> kubectl exec nginx-ingress-controller-68db848949-ncvj7 -n ingress-nginx cat /etc/nginx/nginx.conf | grep underscores_in_headers
    underscores_in_headers on;


Overriding/customizing Docker Layer 7 routing solution configuration 

You can perform the following steps using the Docker CLI. Make sure a secure connection has been established from the machine where you are running the Docker CLI to UCP. You can do this using the client certificate bundle.


  1. # Store the name of the current ucp-interlock configuration object in the CURRENT_CONFIG_NAME variable
    $> CURRENT_CONFIG_NAME=$(docker service inspect --format '{{ (index .Spec.TaskTemplate.ContainerSpec.Configs 0).ConfigName }}' ucp-interlock)


  2. # Write the current configuration out to a config.toml file
    $> docker config inspect --format '{{ printf "%s" .Spec.Data }}' $CURRENT_CONFIG_NAME > config.toml

  3. # Update config.toml as below. In this case we are overriding the value of the nginx
    # configuration 'underscores_in_headers' from 'off' to 'on' by changing the ucp-interlock service
    # configuration 'UnderscoresInHeaders' value from 'false' to 'true' (see the config.toml fragment after this list)

  4. # Create the updated config object under a new name (here, literally UPDATED_CONFIG_NAME)
    $> docker config create UPDATED_CONFIG_NAME config.toml

  5. # Verify the object was created:
    $> docker config ls
    ID NAME CREATED UPDATED
    061xu64qyotlbtrdz9l5e1s0h UPDATED_CONFIG_NAME 6 seconds ago 6 seconds ago

  6. # Update the ucp-interlock service to start using the new configuration:
    $> docker service update \
    --config-rm $CURRENT_CONFIG_NAME \
    --config-add source=UPDATED_CONFIG_NAME,target=/config.toml \
    ucp-interlock

  7. # Wait a minute and make sure the interlock service started successfully. Check the timestamp:
    $> docker ps | grep interlock

  8. # Rollback (if necessary)
    $> docker service update --update-failure-action rollback ucp-interlock
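
For reference, the relevant fragment of the updated config.toml looks something like the following (a sketch: surrounding sections are omitted and may differ by setup; the key is the UnderscoresInHeaders field):

[Extensions]
  [Extensions.default]
    [Extensions.default.Config]
      UnderscoresInHeaders = true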

Note: the above steps can be used to update/override any other Layer 7 routing configuration. Refer to the Layer 7 routing configuration reference to find all other configurable properties.

Note: Every time you restart (disable/enable) the Layer 7 routing solution from the UCP UI, it starts with the default configuration, so you have to perform the above steps again to override the configuration.

Port Conflict and Resolution in WebSphere Application Server and Liberty Profile

Port conflicts are among the most frequent issues that integrators and administrators encounter. In this post, I'm going to discuss how to resolve port conflicts in WebSphere Application Server Traditional (WASt) and WebSphere Liberty Profile (WLP). Recently, while doing Proof of Concept (PoC) work, I had to install and launch a WLP server on the same host where WASt was running, and I received the following error in the WLP's messages.log:

[4/15/18 16:59:00:048 BST] 00000017 ibm.ws.transport.iiop.security.config.ssl.yoko.SocketFactory E CWWKS9580E: The server socket could not be opened on localhost:2,809.  The exception message is Address already in use (Bind failed).
[3/19/18 9:42:51:037 EDT] 00000017 LogService-271-com.ibm.ws.management.j2ee.mejb               E CWWKE0701E: [com.ibm.ws.management.j2ee.mejb.service.ManagementEJBService(77)] The setServerStarted method has thrown an exception Bundle:com.ibm.ws.management.j2ee.mejb(id=271) java.lang.IllegalStateException: com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeException: com.ibm.ws.exception.RuntimeError: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBContainerImpl.startSystemModule(EJBContainerImpl.java:230)
        at com.ibm.ws.management.j2ee.mejb.service.ManagementEJBService.startManagementEJB(ManagementEJBService.java:161)
        ...
        Caused by: com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeException: com.ibm.ws.exception.RuntimeError: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeImpl.startSystemModule(EJBRuntimeImpl.java:968)
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBContainerImpl.startSystemModule(EJBContainerImpl.java:228)
        ... 39 more
Caused by: com.ibm.ws.exception.RuntimeError: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.runtime.AbstractEJBRuntime.startModule(AbstractEJBRuntime.java:587)
        at com.ibm.ws.ejbcontainer.osgi.internal.EJBRuntimeImpl.startSystemModule(EJBRuntimeImpl.java:964)
        ... 40 more
Caused by: java.lang.IllegalStateException: The orb is not available
        at com.ibm.ws.ejbcontainer.remote.internal.EJBRemoteRuntimeImpl.bind(EJBRemoteRuntimeImpl.java:189)

As per the error message above, the WLP server could not start the object request broker (ORB) service, because port 2809 was in use. Note: WLP (by default) uses port 2809 for the ORB.
The next step is to find out which process is already using this port. You can use netstat -ntlp | grep <port>, e.g. netstat -ntlp | grep 2809.
Once you find the PID (process id) that is listening on that particular port, you can get more detail about the process by using ps -ef | grep <PID>. In my case, port 2809 was being used by the WASt Nodeagent process.
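
Putting the two commands together, here is a small sketch (assuming a Linux host with the net-tools netstat available) that resolves a port to its owning process:

# Print full details of the process listening on a given port
PORT=2809
PID=$(sudo netstat -ntlp 2>/dev/null | awk -v p=":${PORT}$" '$4 ~ p {split($7,a,"/"); print a[1]; exit}')
[ -n "$PID" ] && ps -fp "$PID"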
Now, I had two choices: either change the WASt Nodeagent process's port number or change the IIOP endpoint port in the WLP server. I tested both, and below I show both options.

Change the iiopEndpoint port for WLP server


It is simple: just open the server.xml for the WLP server and add the following lines. Here I'm changing iiopPort to 2709 and iiopsPort to 9403.
Note: Make sure those new ports are not currently being used by any process. You can check whether they are in use with the command: netstat -na | egrep '(2709|9403)'

<iiopEndpoint id="defaultIiopEndpoint" iiopPort="2709">
   <iiopsOptions iiopsPort="9403" sslRef="defaultSSLConfig"/>
</iiopEndpoint>

Refer to https://www.ibm.com/support/knowledgecenter/en/SS7K4U_liberty/com.ibm.websphere.wlp.zseries.doc/ae/rwlp_portnums.html to find out WLP default port numbers.
Restart the WLP server after updating the port number in server.xml and review messages.log again to make sure the port binding errors are gone.

Update/Change the port on WASt side


You can change the port on the WASt side in two ways: using the Administration console or using the wsadmin command. To find out more about the PortManagement command group for the AdminTask object, refer to https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rxml_atportmgt.html

Using wsadmin command:

  1. Connect to Dmgr
    $> ./wsadmin.sh
    WASX7209I: Connected to process "dmgr" on node ubuntuwas9CellManager01 using SOAP connector; The type of process is: DeploymentManager
    WASX7031I: For help, enter: "print Help.help()"
  2. Find out the current port in question. In my case, it is for nodeagent
    wsadmin>AdminTask.listServerPorts('nodeagent', '[-nodeName ubuntuwas9Node02]') 

    u'[[IPC_CONNECTOR_ADDRESS [[[host localhost] [node ubuntuwas9Node02] [server nodeagent] [port 9629] ]]] ]\n[[CSIV2_SSL_SERVERAUTH_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 9201] ]]] ]\n[[XDAGENT_PORT [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 7062] ]]] ]\n[[OVERLAY_UDP_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 11003] ]]] ]\n[[DCS_UNICAST_ADDRESS [[[host *] [node ubuntuwas9Node02] [server nodeagent] [port 9353] ]]] ]\n[[NODE_DISCOVERY_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 7272] ]]] ]\n[[BOOTSTRAP_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 2809] ]]] ]\n[[NODE_IPV6_MULTICAST_DISCOVERY_ADDRESS [[[host ff01::1] [node ubuntuwas9Node02] [server nodeagent] [port 5001] ]]] ]\n[[SAS_SSL_SERVERAUTH_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 9901] ]]] ]\n[[SOAP_CONNECTOR_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 8878] ]]] ]\n[[NODE_MULTICAST_DISCOVERY_ADDRESS [[[host 232.133.104.73] [node ubuntuwas9Node02] [server nodeagent] [port 5000] ]]] ]\n[[ORB_LISTENER_ADDRESS [[[host ubuntuwas9][node ubuntuwas9Node02] [server nodeagent] [port 9900] ]]] ]\n[[CSIV2_SSL_MUTUALAUTH_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 9202] ]]] ]\n[[OVERLAY_TCP_LISTENER_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 11004] ]]] ]'
  3. As seen from the command output above, the nodeagent is listening on port 2809; let's change it to 2709.
    wsadmin>AdminTask.modifyServerPort ('nodeagent', '[-nodeName ubuntuwas9Node02 -endPointName BOOTSTRAP_ADDRESS -port 2709 -modifyShared true]')

    u'[[BOOTSTRAP_ADDRESS [[[host ubuntuwas9] [node ubuntuwas9Node02] [server nodeagent] [port 2709] ]]] ]'
  4. Save the changes:
    wsadmin>AdminConfig.save()
  5. If it's a federated environment, make sure to synchronise the configuration with the node(s). If your Nodeagent is currently stopped, go to the host machine where the Nodeagent is installed and run syncNode.sh <dmgr-host> as shown below:
    wasadmin@ubuntuwas9:/opt/ibm/WebSphere/AppServer/profiles/AppSrv01/bin$ ./syncNode.sh ubuntuwas9 

    ADMU0116I: Tool information is being logged in file /opt/ibm/WebSphere/AppServer/profiles/AppSrv01/logs/syncNode.log
    ADMU0128I: Starting tool with the AppSrv01 profile
    ADMU0401I: Begin syncNode operation for node ubuntuwas9Node02 with Deployment Manager ubuntuwas9:8879
    ADMU0016I: Synchronizing configuration between node and cell.
    ADMU0402I: The configuration for node ubuntuwas9Node02 has been synchronized with Deployment Manager ubuntuwas9:8879

    If your Nodeagent is running, you can use the following commands to sync:

    wsadmin>dmgrObj=AdminControl.queryNames('WebSphere:type=DeploymentManager,*')
    wsadmin>AdminControl.invoke(dmgrObj, 'multiSync', '[false]', '[java.lang.Boolean]')
  6. (Re)start the Nodeagent:
    If the Nodeagent is stopped, start it:
    Note: you can use the startNode.sh/stopNode.sh commands from the profile_root/bin directory.
    ./startNode.sh
    If it is currently running, stop and start it:
    ./stopNode.sh
    ./startNode.sh

Using WASt Administration console:

  1. Access the WASt Administration console: https://<host>:<port>/admin
  2. Once logged in, on the left panel, navigate to "System Administration" and click on "Node agents".
  3. On the right-hand side, click on the "Ports" link under "Additional Properties". You will see a table that contains port names and port numbers.
  4. Click on the "Details" button beside the table in order to update any port.
  5. Click on the name link of the port you want to change.
  6. An editable screen appears; update the port and click "Apply".
  7. Make sure to synchronise the change with the nodes.
  8. Restart the affected server (the Nodeagent in this case).

I hope this post will be helpful in your next port conflict resolution!

It's in You to Give

Today (March 27, 2018), I proudly wore my jacket with the Canadian Blood Services lapel pin on it again. And yes, I donated blood for the 2nd time, and I hope to do so much more often going forward. Don't get me wrong, this post is not really to celebrate my donation, but to encourage others like myself who are just starting to donate or thinking about it. We all need to come forward and do this noble thing, because our people, communities and countries need blood all the time.

It does not cost anything. As the Canadian Blood Services puts it in simple words, "It's in you to give." Surprisingly, blood donors get some health benefits as well. See the Donor health benefits section on Wikipedia.
Finally, voluntary blood donation is a very important concept and we need to support it. See the World Health Organisation's paper entitled "Towards 100% Voluntary Blood Donation - A Global Framework for Action".
Believe me, it's not hard. If I can do it, anyone can. Just make sure you're well hydrated before sitting down for the donation. If you are in Canada, call the Canadian Blood Services at 1 888 2 DONATE (1-888-236-6283) or visit their website at www.blood.ca to schedule your appointment. If you are in any other jurisdiction, contact your national blood services to donate!

Update as of July 05, 2019:
Cheers again! Donated today for the third time and felt awesome.

How to Parse Apache error_log for Troubleshooting & Reporting


Note: if you haven't already, see the Log Parsing, Analysis, Correlation, and Reporting Engine post first.

The Apache error_log can be useful when troubleshooting production problems, so parsing and analysing its content regularly helps in maintaining the overall health of the system. If mod_mpmstats is enabled, error_log also contains Multi-Processing Module (MPM) stats data. MPM stats can be used for both troubleshooting and performance tuning. http://publib.boulder.ibm.com/httpserv/manual70/mod/mod_mpmstats.html provides more details about MPM stats.
Since error_log does not contain the web server name, in order to correlate the data to the corresponding web server, it is advisable to put the error_log files for each web server under a directory named after that web server. This is especially important when you are parsing logs from multiple web servers. The script takes the directory name as the web server name for reporting and analysis purposes. For example, if you have web servers 'webSrv01, webSrv02, webSrv03', etc., put the logs from each web server under the corresponding directories as shown below:

 /tmp/webSrv01
    error_log
    error_log_2017.09.05.log
    access_log
 /tmp/webSrv02
    error_log
    error_log_2017.09.05.log
    access_log

The naming suffix for historical files can differ from one environment to another, so if you have a different suffix for historical files, you can tweak the find command. Currently, the fragment of the script that finds error_log looks like this:

find $rootcontext -name "error_log*" -type f | grep "$logFileName"
where $rootcontext is the root path.
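
For example, if your rotated files were named like error_log.2017-09-05 instead, the fragment could be tweaked along these lines (a hypothetical pattern; adapt it to your own rotation scheme):

find $rootcontext -type f \( -name "error_log" -o -name "error_log.????-??-??" \) | grep "$logFileName"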

Review the actual script available in github - https://github.com/pppoudel/log-parser/blob/master/webErrorLogParser.sh for details.

Note: the script is written to parse date formats like '[Thu Dec 14 08:13:08 2017]' in error_log. If your error_log uses a different date format, you may need to tweak the section of the script that parses the date.

How to execute:
You can see all the available options by just launching:
$> ./webErrorLogParser.sh

See below for a few examples:
# processing current day's logs
$> ./webErrorLogParser.sh --rootcontext <log-path>

# processing yesterday's logs with historical report updates
$> ./webErrorLogParser.sh --rootcontext <log-path> --rpttype daily

# processing a specific day's logs
$> ./webErrorLogParser.sh --rootcontext <log-path> --recorddate <date in (YYYY-MM-DD) format>


Output
Report/Output files:
  • $rptDir/00_Alert.txt
  • $rptDir/03_WebErrorLogSummaryRpt.txt
  • $rptDir/WebErrorLogMpmStatsRpt_all.csv
  • $rptDir/WebErrorLogRpt_all.csv
Where $rptDir is the report directory. The default value is $TMP/$recDate.

History Report/Output files:
# These are historical reports. Each run appends records to the existing report file.
  • $pDir/RecycleHistoryRpt_all.csv
  • $pDir/MPMStatsHistoryRpt.csv
Where $pDir is the parent of $rptDir.

See a sample summary report in github - https://github.com/pppoudel/log-parser/blob/master/sample_reports/03_WebErrorLogSummaryRpt.txt
And here is a sample MPM stats report: https://github.com/pppoudel/log-parser/blob/master/sample_reports/WebErrorLogMpmStatsRpt_all.csv

See my other posts in this series
  1. websphereLogParser.sh for parsing, analyzing and reporting WebSphere Application Server (WAS) SystemOut.log
  2. webAccessLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) access_log
  3. javaGCStatsParser.sh for parsing, analyzing and reporting Java verbose Garbage Collection (GC) log

Log Parsing, Analysis, Correlation, and Reporting Engine

 
In the last few months, I have been helping to identify and resolve production issues (both performance and product related). I had to analyze a vast amount of logs; identify performance degradation and deviation, and issues related to the Java heap and Garbage Collection (GC); and troubleshoot different issues affecting the health of WebSphere Application Server (WAS). In order to do the above-mentioned tasks efficiently, I employed different tools (both open source and commercial). Even though these tools are readily available and usually good at what they do, they may not be as effective as we would like for our particular circumstances, and we end up writing our own custom tool or script to complement them in certain areas. Same story here: I ended up writing a custom tool (let me call it a Log Parser) for log parsing, analyzing, making correlations, and reporting. I'm sharing my custom Log Parser here, hoping that it may be useful for other people as well. It is written in AWK and shell script. It processes the following logs:
  • SystemOut.log generated by IBM WebSphere Application Server (WAS)  
  • access_log and error_log logs generated by Apache or IBM HTTP Server(IHS)
  • native_stdout.log or verbose GC logs generated by Java Virtual Machine (JVM).
Let me shed some light on the internal functioning of the Parser visually; see Diagram 1.0 below.




Diagram 1.0

As depicted in the diagram, the Parser is made of a set of script files (a collection of different parsers) and a wrapper script, together acting like a suite. Each parser can be executed independently or invoked by the wrapper script. The Parser is driven by the logic in the script and is controlled by the input parameters and their values (control parameters, threshold parameters, correlation parameters, and transaction baseline values). It consumes the logs and writes different reports as output.
The most interesting part is the input. The feedback loop/mechanism shown in the diagram is to let the analyst know that he/she should continually refine the threshold and other applicable input parameter values based on output analysis. This feedback mechanism makes the Parser a kind of expert engine, so it is very important to regularly update your threshold values and filter keywords, and to maintain a well-established performance baseline. The Parser helps you maintain this feedback mechanism because it collects vital statistics and updates historical data files; in doing so, it is not only collecting important data, but also quantifying system events. Quantification helps to compare, generate alerts, and make decisions.

For example, you quantify on average how many occurrences of a particular error you get per day, per hour, per server, or per transaction, and based on that you define your threshold value. Let's say, based on a month of observation, the number of daily transaction errors from server A fluctuates from 10 to 30 under normal conditions. Your high mark for a normal situation is therefore 30, and based on this data you can set the threshold value for that particular error on that server to 35.
What are the key benefits of using this Parser?
  1. Makes troubleshooting faster and more effective with built-in intelligence from lessons learned and baseline data. The Parser identifies critical errors, their frequency and location, key performance numbers, and the current state of the environment, like how many users, sessions, transactions, and (if any) anomalies are in the system, which helps to narrow down the issue(s).
  2. Automatically collects key statistical data (performance, error, or usage) and builds a data mart. The Parser collects vital statistics like performance numbers, performance ranges, hourly user/session statistics, heap snapshots, etc. and updates historical data files. These data can be used to generate history reports and also in the decision-making process.
  3. Auto-generates key summary reports for internal consumption and creates delimited data files, which can be imported into a spreadsheet application like Excel to prepare management reports. Basically, it can provide visibility into your entire application infrastructure.
  4. Creates correlations. The Parser creates correlations so that it becomes easier to identify and map the transaction path (web server to application server).
  5. Generates warnings for possible future incidents/events. The Parser can provide early warning of possible future events. Here is an example of a generated warning: "2.18383e+06 : average of Perm Generation After Full GC exceeds threshold 2097152 (K).  There is a possibility of OutOfMemory in near future because of Not sufficient PermGen Space for AppSrv04"

Getting started is very simple. No big-bang installation or configuration is required. If you are running in a Unix-like environment, you just download the scripts and launch the Parser from the directory where they were downloaded. If you are on Windows, you need Cygwin or the Bash shell that comes with MinGW to execute it.

How to execute?

You can see all the available options by just executing:

$> ./masterLogParser.sh

Mandatory option '--rootcontext' or '-c' missing

-c|--rootcontext: Required. Source path from where log files are read.
-t|--rpttype: Optional. Values are: 'daily' or 'ondemand'. 'ondemand' is default value.
It is used to control logic - like whether or not to update historical data files.
Only 'daily' option creates and updates historical data files.
-d|--recorddate: Optional. It is the log entry date. Meaning log entries with that date will be processed.
It takes the format 'YYYY-MM-DD'. Default is to use current date. However, if 'daily' is chosen as 2nd argument, and log entry date is not provided, it defaults to 'date - 1 day'.
-l|--rptloc: Optional. It is report directory where all generated reports are written.
Default value is /tmp/
-o|--procoption: Optional. It represents the processing option. Values can be 'full' or 'partial'.
Default value is 'partial'. This option is currently being used only for Verbose GC log parser.

Here are a few examples:

# processing current day's logs
$> ./masterLogParser.sh --rootcontext <log-path>

# processing yesterday's logs with historical report updates
$> ./masterLogParser.sh --rootcontext <log-path> --rpttype daily

# processing a specific day's logs
$> ./masterLogParser.sh --rootcontext <log-path> --recorddate <date in (YYYY-MM-DD) format>

See masterLogParser.sh in github: https://github.com/pppoudel/log-parser/blob/master/masterLogParser.sh

Input

1. thresholdValues.csv

As the name implies, this file contains pre-defined name and threshold value pairs for certain conditions or events. The Parser looks up these pre-defined conditions, and when it detects one in a log file, it compares it with the threshold value and triggers/writes a notification into the output file (00_Alert.txt) if the logged event exceeds the threshold value. A threshold can be performance based, like 'notify if the maximum response time exceeds 9 seconds', or event based, like 'notify if the maximum fatal count for a JVM exceeds 5'.
Format:
Each line in thresholdValues.csv has multiple columns separated by a pipe '|' and represents the threshold definition for one complete event condition. See below:

event-name|value|server-identifier|event-description
e.g.
httpAvgRespTimeTh|2.5|http|Threshold for Average response time in sec.



Where:
event-name: name of the event, like httpAvgRespTimeTh, the (http) average response time threshold.
value: threshold value for this specific event. In this case it is 2.5 seconds.
server-identifier: which log/server this value belongs to. In this case it is the 'http' server.
event-description: provides some details about what this threshold is for.
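
To illustrate how such a line can drive an alert, here is a minimal awk sketch (not the actual parser code) that compares an observed value against the threshold recorded in thresholdValues.csv:

# Alert if an observed value exceeds the threshold defined for an event
awk -F'|' -v event="httpAvgRespTimeTh" -v observed="3.1" '
   $1 == event && observed+0 > $2+0 {
      printf "ALERT: %s observed=%s exceeds threshold=%s (%s)\n", $1, observed, $2, $4
   }' thresholdValues.csv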

See a sample thresholdValues.csv in github: https://github.com/pppoudel/log-parser/blob/master/thresholdValues.csv

2. perfBaseLine.csv

This file contains pre-defined transactions (request URIs) and their baseline response times (in seconds). You can typically derive the content for this file from your performance test results.

Format:
Each line in perfBaseLine.csv has two columns separated by a pipe '|', which represent the baseline response time for a given transaction (request/response). See below:

request-name|response-time (in seconds)
e.g.
finManagement/account_add.do|1.57756

Where:
request-name: represents the request/response URI or transaction name, whatever you call it. In this case it is finManagement/account_add.do
response-time: time for the transaction to complete, in seconds. About 1.58 seconds in this case.
See a sample perfBaseLine.csv in github: https://github.com/pppoudel/log-parser/blob/master/perfBaseLine.csv

3. WASCustomFilter.txt

Currently, this input file is only consumed by the websphereLogParser. It defines custom errors/keywords. It tells the parser that you are interested in knowing whether certain keywords or strings, possibly non-standard and specific to your environment/application, are logged (because of certain conditions) in a log file.

Format:
It uses regular expressions to define custom errors/keywords. Each new error definition goes on a new line. See below:

Error.*Getting.*Folder
503.*Service.*Temporarily.*Unavailable
CORBA.*NO_RESPONSE
ORA-01013:
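
Before adding a new pattern, you can sanity-check it against a sample log with egrep; a match count greater than zero means the expression matches real entries:

$> egrep -c 'CORBA.*NO_RESPONSE' SystemOut.log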

See a sample WASCustomFilter.txt in github: https://github.com/pppoudel/log-parser/blob/master/WASCustomFilter.txt

4. WAS_CloneIDs.csv

This file contains information that defines the relationship (mapping) between the HTTP session clone ID and the WAS name. The clone ID constitutes part of the HTTP session ID and can be logged in the web server access_log. With this relationship in hand, we can generate helpful analytical data that helps to identify the transaction/request path end to end. The easiest way to find the clone ID for each WAS is to look at your plugin-cfg.xml file.

Format:
Each line in WAS_CloneIDs.csv has three columns separated by a pipe '|'. See below:

cloneID|WAS-name|hostname
e.g.
23532em3r|AppSrv01|washost082

Where:
cloneID: part of the JSESSIONID. 23532em3r in the above example. Refer to https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/txml_httpsessionclone.html
WAS-name: WebSphere Application Server (WAS) name. AppSrv01 in the above example.
hostname: hostname of the machine/server where the particular WAS resides. washost082 in the above example.
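
As an illustration of how this mapping can be used, here is a small awk sketch (hedged: it assumes the clone ID has already been extracted into the last whitespace-separated field of each access_log record, which is a simplification of the parser's actual logic):

# Tag each request with its WAS name by joining on the clone ID
awk 'NR==FNR { split($0, f, "|"); was[f[1]]=f[2]; next }
     ($NF in was) { print was[$NF], $0 }' WAS_CloneIDs.csv access_log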


See a sample WAS_CloneIDs.csv in github: https://github.com/pppoudel/log-parser/blob/master/WAS_CloneIDs.csv

Output:

Each parser updates the Alert file and history reports (only if the report type is 'daily'), and generates a summary report and other report files. For the complete list, see the '#--------- Report/Output files -------#' and '#--------- History Report/Output files -------#' sections in each script file.

For further detail of each individual parser, visit the following blog posts:
  1. websphereLogParser.sh for parsing, analyzing and reporting WebSphere Application Server (WAS) SystemOut.log
  2. webAccessLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) access_log
  3. webErrorLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) error_log
  4. javaGCStatsParser.sh for parsing, analyzing and reporting Java verbose Garbage Collection (GC) log

How to Parse WebSphere Application Server Logs for Troubleshooting & Reporting


Note: if you haven't already, see the Log Parsing, Analysis, Correlation, and Reporting Engine post first.

The websphereLogParser parses the IBM WebSphere Application Server SystemOut.log. It is one of the parsers included in the suite that I have posted. This particular parser script expects the SystemOut log to follow the default/basic message formats outlined by IBM in the JVM log interpretation document. Since SystemOut.log does not contain the WAS server name, in order to relate the data to the corresponding WAS JVM, it is advisable to put the SystemOut logs for each WAS under a directory named after the WAS name. This is especially important when you are parsing logs from multiple WAS servers. The script takes the directory name as the WAS name for reporting purposes. For example, if you have application servers 'appSrv01, appSrv02, appSrv03', etc., put the logs from each application server under corresponding directories like:

 /tmp/appSrv01
    SystemOut.log
    SystemOut_2017.09.05.log
    SystemErr.log
 /tmp/appSrv02
    SystemOut.log
    SystemOut_2017.09.05.log
    SystemErr.log

It parses both zipped and regular files. By default, it finds and processes the following files in a given path:

SystemOut.log
SystemOut.log.zip
SystemOut.zip
SystemOut_'$recYY'.'$rec0MM'.'$rec0DD'_.*
SystemOut_'$recNYY'.'$recN0MM'.'$recN0DD'_.*
Where:
recYY is the year, like 17 (17 represents the year 2017)
rec0MM is the month, like 01 (01 represents the month of January)
rec0DD is the day, like 01 (01 represents the first day of a month)
recNYY/recN0MM/recN0DD = (recYY/rec0MM/rec0DD) + 1 day

The naming suffix for historical files can differ from one environment to another, so if you have a different suffix for historical files, you need to tweak the find command. Currently it looks like this:

find $rootcontext -name "SystemOut*" -type f | \
  egrep '(SystemOut.log$|SystemOut.log.zip$|SystemOut.zip$|SystemOut_'$recYY'.'$rec0MM'.'$rec0DD'_.*|SystemOut_'$recNYY'.'$recN0MM'.'$recN0DD'_.*)'
where $rootcontext is the root path.

Review the actual script available in github - https://github.com/pppoudel/log-parser/blob/master/websphereLogParser.sh for details.

Note: the script is written to parse date formats like '[4/23/17 8:13:22:137 EDT]' in SystemOut.log. If your SystemOut.log uses a different date format, you may need to tweak the section of the script that parses the date.

How to execute:

You can see all the available options by just launching:
$> ./websphereLogParser.sh

Here are a few examples:
# processing current day's logs
$> ./websphereLogParser.sh --rootcontext <log-path>

# processing yesterday's logs with historical report updates
$> ./websphereLogParser.sh --rootcontext <log-path> --rpttype daily

# processing a specific day's logs
$> ./websphereLogParser.sh --rootcontext <log-path> --recorddate <date in (YYYY-MM-DD) format>


Output
Report/Output files:
  • $rptDir/00_Alert.txt
  • $rptDir/01_WASLogSummaryRpt.txt
  • $rptDir/WASLogErrRpt_all.csv
  • $rptDir/WASLogFilteredErrRpt.csv
  • $rptDir/WASLogSummaryByErrCmpRpt.csv
  • $rptDir/WASLogSummaryByErrClassRpt.csv
  • $rptDir/WASLogSummaryByErrExpRpt.csv
  • $rptDir/WASLogSummaryByErrMsgRpt.csv
  • $rptDir/WASLogSummaryByWarnCmpRpt.csv
  • $rptDir/WASLogSummaryByWarnClassRpt.csv
  • $rptDir/WASLogSummaryByWarnExpRpt.csv
  • $rptDir/WASLogSummaryByWarnMsgRpt.csv
Where $rptDir is the report directory. The default value is $TMP/$recDate.

History Report/Output files:
# These are historical reports. Each run appends records to the existing report file.
  • $pDir/RecycleHistoryRpt_all.csv
  • $pDir/WASOutOfMemoryHistoryRpt.csv
  • $pDir/WASTransactionTimeOutHistoryRpt.csv
  • $pDir/WASSHungThreadHistoryRpt.csv
Where $pDir is the parent of $rptDir.

See a sample summary report in github - https://github.com/pppoudel/log-parser/blob/master/sample_reports/01_WASLogSummaryRpt.txt
See my other posts in this series
  1. webAccessLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) access_log
  2. webErrorLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) error_log
  3. javaGCStatsParser.sh for parsing, analyzing and reporting Java verbose Garbage Collection (GC) log