In the last few months, I have been helping to identify and resolve production issues (both performance and product related). I had to analyze vast amounts of logs, identify performance degradation and deviation, and troubleshoot issues related to the Java heap and Garbage Collection (GC), as well as various issues affecting the health of WebSphere Application Server (WAS). In order to do the above-mentioned tasks efficiently, I have employed different tools (both open source and commercial). Even though these tools are readily available, and usually good at what they do, they may not be as effective as we would like for our particular circumstances, and we end up writing our own custom tool or script to complement them in certain areas. Same story here: I ended up writing a custom tool (let me call it a Log Parser) for log parsing, analysis, correlation, and reporting. I'm sharing my custom Log Parser here, hoping that it may be useful for other people as well. It is written in AWK and Shell script. It processes the following logs:
- SystemOut.log generated by IBM WebSphere Application Server (WAS)
- access_log and error_log generated by Apache or IBM HTTP Server (IHS)
- native_stdout.log or verbose GC logs generated by the Java Virtual Machine (JVM)
Diagram 1.0
As depicted in the diagram, the Parser is made up of a set of script files (a collection of different Parsers) and a wrapper script, together acting like a suite. Each parser can be executed independently or invoked by the wrapper script. The Parser is driven by the logic in the script and is controlled by the input parameters and their values (control parameters, threshold parameters, correlation parameters, and transaction baseline values). It consumes the logs and writes different reports as output.
The most interesting part here is the input. The feedback loop/mechanism shown in the diagram is there to remind the analyst that he/she should continually refine the threshold and other applicable input parameter values based on analysis of the output. This feedback mechanism makes the Parser a kind of expert engine. So, it is very important to regularly update your threshold values and filter keywords, and to maintain a well-established performance baseline. The Parser helps you maintain this feedback mechanism because it collects vital statistics and updates historical data files; in doing so, it is not only collecting important data but also quantifying the system events. Quantification helps you compare, generate alerts, and make decisions. For example, you quantify how many occurrences of a particular error you get on average per day, per hour, per server, or per transaction, and based on that you define your threshold value. Let's say that, based on a month-long observation, the number of daily transaction errors from server A fluctuates between 10 and 30 in a normal situation. So, your high mark for a normal situation is 30. Based on this data, you can define a threshold value of 35 for that particular error on that server.
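As a minimal sketch of how such a daily count could be gathered (this one-liner is only an illustration, not part of the Parser), assume SystemOut.log entries that begin with a timestamp like '[10/15/17 13:22:01:123 EDT]' and a hypothetical error string 'TransactionRolledbackException':
$> awk -F'[][ ]+' '/TransactionRolledbackException/ {count[$2]++} END {for (d in count) print d, count[d]}' SystemOut.log
Running something like this over a few weeks of logs gives you the per-day distribution from which a sensible threshold value can be chosen.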
What are the key benefits of using this Parser:
- Make troubleshooting faster and more effective with built-in intelligence from lessons learned and baseline data. The Parser identifies critical errors, their frequency and location, key performance numbers, and the current state of the environment (how many users, sessions, transactions, and any anomalies in the system), which helps to narrow down the issue(s).
- Automatically collect key statistical data (performance, error, or usage) and build a data mart. The Parser collects vital statistics like performance numbers, performance ranges, hourly user/session statistics, heap snapshots, etc., and updates historical data files. These data can be used to generate history reports and also in the decision-making process.
- Auto-generate key summary reports for internal consumption and create delimited data files, which can be imported into a spreadsheet like Excel to prepare management reports. Basically, it can provide visibility into your entire application infrastructure.
- Create correlations. The Parser creates correlations so that it becomes easier to identify and map the transaction path (Web server to Application server).
- Generate warnings for possible future incidents/events. The Parser can provide early warning of possible future events. Here is an example of a generated warning:
"2.18383e+06 : average of Perm Generation After Full GC exceeds threshold 2097152 (K). There is a possibility of OutOfMemory in near future because of Not sufficient PermGen Space for AppSrv04"
Getting started is very simple. No big-bang installation or configuration is required. If you are running in a Unix-like environment, you just download the scripts and launch the Parser from the directory where they were downloaded. If you are on Windows, you need Cygwin or the Bash shell that comes with MinGW to execute them.
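If you prefer to pull everything at once, a minimal sketch (assuming git is available and using the GitHub repository linked later in this post) would be:
$> git clone https://github.com/pppoudel/log-parser.git
$> cd log-parser
$> chmod +x *.sh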
How to execute?
$> ./masterLogParser.sh
Manadatory option '--rootcontext' or '-c' missing
-c|--rootcontext: Required. Source path from where log files are read.
-t|--rpttype: Optional. Values are 'daily' or 'ondemand'. 'ondemand' is the default value.
-d|--recorddate: Optional. It is the log entry date, meaning log entries with that date will be processed.
-l|--rptloc: Optional. It is the report directory where all generated reports are written.
-o|--procoption: Optional. It represents the processing option. Values can be 'full' or 'partial'.
Here are a few examples:
# processing current day's logs
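The invocations below are a sketch only; the log directory /logs/env1 and report directory /tmp/rpt are hypothetical values, and <record-date> is a placeholder for whatever date format your logs use:
$> ./masterLogParser.sh -c /logs/env1 -t daily
# processing an earlier day's logs on demand, writing reports to a custom location
$> ./masterLogParser.sh -c /logs/env1 -t ondemand -d <record-date> -l /tmp/rpt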
See masterLogParser.sh in github: https://github.com/pppoudel/log-parser/blob/master/masterLogParser.sh
Input:
1. thresholdValues.csv
As the name implies, this file contains pre-defined name and threshold value pairs for certain conditions or events. The Parser looks up these pre-defined conditions, and when it detects one in a log file, it compares it with the threshold value and triggers/writes a notification into the output file (00_Alert.txt) if the logged event exceeds the threshold value. A threshold can be performance based, like 'notify if maximum response time exceeds 9 seconds', or event based, like 'notify if maximum fatal count for a JVM exceeds 5'.
Format:
Each line in thresholdValues.csv has multiple columns separated by a pipe '|' and represents the threshold definition for one complete event condition. See below:
event-name|value|server-identifier|event-description
Where:
event-name: name of the event, like httpAvgRespTimeTh, the (http) Average Response Time threshold.
value: threshold value for this specific event. In this case it is 2.5 seconds.
server-identifier: which log/server this value belongs to. In this case it is the 'http' server.
event-description: provides some details about what this threshold is for.
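Putting those columns together, an example entry might look like the following (the first three values come from the column descriptions above; the description text itself is just illustrative):
httpAvgRespTimeTh|2.5|http|Average http response time threshold in seconds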
See a sample thresholdValues.csv in github: https://github.com/pppoudel/log-parser/blob/master/thresholdValues.csv
2. perfBaseLine.csv
This file contains pre-defined transactions (request URIs) and their baseline response times (in seconds). I suppose you can get the content for this file from your performance test results.
Format:
Each line in perfBaseLine.csv has two columns separated by a pipe '|', which represent the performance value for a given transaction (request/response). See below:
request-name|response-time (in seconds)
Where:
request-name: represents the request/response URI or transaction name, whatever you call it. In this case it is finManagement/account_add.do.
response-time: response time for the transaction to complete, in seconds. 1.5 seconds in this case.
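An example entry, assembled from the values in the column descriptions above, might look like:
finManagement/account_add.do|1.5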
3. WASCustomFilter.txt
Currently this input file is only consumed by websphereLogParser. It defines some custom errors/keywords. It tells the parser that you're interested in knowing if certain keywords, or strings in general, are logged (because of a certain condition) in a log file; these may be non-standard and specific to your environment/application.
Format:
It uses regular expressions to define custom errors/keywords. Each new error definition goes on a new line. See below:
Error.*Getting.*Folder
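Conceptually, the filtering behaves roughly like running the patterns in this file against SystemOut.log as extended regular expressions, along the lines of the following (a sketch only; the Parser itself does its matching in AWK/shell):
$> grep -E -f WASCustomFilter.txt SystemOut.log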
See a sample WASCustomFilter.txt in github: https://github.com/pppoudel/log-parser/blob/master/WASCustomFilter.txt
4. WAS_CloneIDs.csv
This file contains information that defines the relationship (mapping) between the HTTP session clone ID and the WAS name. The clone ID constitutes part of the HTTP session ID and can be logged in the Web server access_log. With this relationship in hand, we can generate helpful analytical data that helps identify the transaction/request path end to end. The easiest way to find out the clone ID for each WAS is to look at your plugin-cfg.xml file.
Format:
Each line in WAS_CloneIDs.csv has three columns separated by a pipe '|'. See below:
cloneID|WAS-name|hostname
Where:
cloneID: part of the JSESSIONID. 23532em3r in the example entry below. Refer to https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/txml_httpsessionclone.html
WAS-name: WebSphere Application Server (WAS) name. AppSrv01 in the example entry below.
hostname: hostname of the machine/server where the particular WAS resides. washost082 in the example entry below.
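The example entry referenced above would look like this (reconstructed from the column descriptions; substitute your own clone IDs and hostnames):
23532em3r|AppSrv01|washost082
If you need to confirm the clone IDs, plugin-cfg.xml typically carries them on the Server elements, roughly like this hypothetical fragment:
<ServerCluster Name="Cluster01">
   <Server CloneID="23532em3r" Name="washost082Node01_AppSrv01" .../>
</ServerCluster>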
See a sample WAS_CloneIDs.csv in github: https://github.com/pppoudel/log-parser/blob/master/WAS_CloneIDs.csv
Output:
Each Parser updates the Alert file and history reports (only if the report type is 'daily'), as well as generating a summary report and other report files. For the complete list, see the '#--------- Report/Output files -------#' and '#--------- History Report/Output files -------#' sections in each script file. For further details on each individual parser, visit the following blog posts:
- websphereLogParser.sh for parsing, analyzing and reporting WebSphere Application Server (WAS) SystemOut.log
- webAccessLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) access_log
- webErrorLogParser.sh for parsing, analyzing and reporting Apache/IBM HTTP Server (IHS) error_log
- javaGCStatsParser.sh for parsing, analyzing and reporting Java verbose Garbage Collection (GC) log