Logging in a distributed environment
In a distributed environment, logging and log-analysis tools become more important. In most troubleshooting sessions, we need logs to answer:
- What was the problem?
- When did the problem happen?
- Where was the problem (node, service, region, etc.)?
- How did the problem happen?
However, while a distributed system is more scalable, it poses challenges for logging.
This document summarizes logging for data-intensive applications deployed in the cloud:
- What to log?
- How to log?
- How to categorize the logged information?
General concepts
What information to log?
Logging serves several purposes, and we need to keep all of them in mind:
- Troubleshooting
- Auditing
- Profiling
- Statistics
Logs are meant for human analysis, so we should consider the following when writing them as human-readable text:
- Include a timestamp;
- Include an identifier for the request, session, service, etc.;
- Log in text format;
- Make it developer friendly and machine parseable;
- Put semantic context in the log;
- Log the source: file, class, function.
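As a minimal sketch of the checklist above, the helper below renders one entry as a single `key="value"` line that stays readable for people and is easy to parse by tools. The function name `formatLogLine` and the field names (`ts`, `requestId`, `source`, `msg`) are assumptions made for illustration, not an established format.

```kotlin
import java.time.Instant

// Hypothetical helper: renders one log entry as a single key="value" line.
fun formatLogLine(
    timestamp: Instant,
    level: String,
    requestId: String,
    source: String,
    message: String,
    context: Map<String, String> = emptyMap()
): String {
    // A linked map keeps the fixed fields first and in a stable order.
    val fields = linkedMapOf(
        "ts" to timestamp.toString(), // ISO-8601 timestamp in UTC
        "level" to level,
        "requestId" to requestId,     // ties the line to one request
        "source" to source,           // class and function that logged
        "msg" to message
    )
    fields.putAll(context)            // extra semantic context, if any
    return fields.entries.joinToString(" ") { (key, value) ->
        // Replace double quotes so every value stays a single quoted token.
        val safe = value.replace('"', '\'')
        "$key=\"$safe\""
    }
}
```

A call such as `formatLogLine(Instant.now(), "INFO", "req-42", "OrderService.process", "order accepted", mapOf("region" to "us-east-1"))` yields one line with the timestamp first and the extra context last.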
Logging can also cause security issues, so we should not log sensitive information:
- User information
- Financial information
- Health related information
- Credentials
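One way to enforce the list above is to mask sensitive fields before they reach the logger. The sketch below assumes log context is passed around as a map; the key names in `SENSITIVE_KEYS` and the `redact` function are hypothetical and would need to match the service's own data model.

```kotlin
// Hypothetical set of field names treated as sensitive (compared in lower case).
val SENSITIVE_KEYS = setOf("password", "token", "ssn", "cardnumber", "email")

// Returns a copy of the fields with sensitive values masked; keys are kept
// so a reader can still see that the field was present.
fun redact(fields: Map<String, String>): Map<String, String> =
    fields.mapValues { (key, value) ->
        if (key.lowercase() in SENSITIVE_KEYS) "[REDACTED]" else value
    }
```

A key-based filter like this only catches values stored under known names; sensitive data embedded in free-text messages still has to be avoided at the call site.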
In the end, we need to learn and adjust based on how the service is used. It also takes time for teams to learn what the appropriate amount of logging is.
How to log?
As with all distributed services, we design logging for failure:
- Log locally to files, which still works during network failures or system congestion;
- Use rotation policies to keep log files from growing too large;
- Collect events from everything, everywhere;
- Log at the right level.
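Local file logging with rotation can be sketched with the JDK's built-in `java.util.logging.FileHandler`, used here only because it needs no extra dependency; the document's Log4j setup would use its own rolling appender instead. The function name, the 10 MiB size limit, and the count of 5 files are assumptions chosen for illustration.

```kotlin
import java.util.logging.FileHandler
import java.util.logging.Level
import java.util.logging.Logger
import java.util.logging.SimpleFormatter

// Hypothetical helper: a logger that writes to local files and rotates them.
fun createRotatingLogger(name: String, directory: String): Logger {
    val handler = FileHandler(
        "$directory/$name.%g.log", // %g is the generation number of the rotated file
        10 * 1024 * 1024,          // approximate size limit per file, in bytes
        5,                         // number of files to cycle through
        true                       // append to an existing file on restart
    )
    handler.formatter = SimpleFormatter() // plain text rather than the XML default
    val logger = Logger.getLogger(name)
    logger.addHandler(handler)
    logger.level = Level.INFO
    return logger
}
```

Because the handler writes to the local disk, entries are still recorded when the network or a remote log collector is unavailable; shipping the files elsewhere can then be a separate step.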
How to log at the right level?
Logging at the right level is easier said than done. Here we provide a guide for categorizing information into the levels defined in Log4j.
Level | Description |
---|---|
TRACE | Designates finer-grained informational events than DEBUG. Use this only during development to track down bugs, and never commit it to your code repository. It is a code smell if used in production. |
DEBUG | Designates fine-grained information for debugging. Log at this level anything that happens in the program. This is mostly used during debugging. Trim down the number of debug statements before entering the production stage, so that only the most meaningful entries are left and can be activated during troubleshooting. |
INFO | Designates information that highlights application progress at a coarse-grained level. Log at this level all actions that are user-driven or system-specific (e.g. regularly scheduled operations). Typically used in production. |
WARN | Designates potentially harmful situations. Log at this level all events that could potentially become an error, for instance an API call that took longer than a predefined time, or an in-memory cache that is near capacity. This allows proper automated alerting, and during troubleshooting it helps explain how the system was behaving before the failure. |
ERROR | Designates error events that might still allow the application to continue running. Log every error condition at this level, such as API calls that return errors or internal error conditions. |
FATAL | Designates severe error events that will bring the service down. Use this very sparingly; it should not happen often in a real program. Logging at this level usually signifies the end of the program. For instance, if a network daemon can’t bind a network socket, logging at this level and exiting is the only sensible thing to do. |
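The WARN and ERROR rows can be illustrated with a small decision function for an outgoing API call. This is only a sketch: the `LogLevel` enum mirrors the Log4j level names from the table, while `levelForApiCall` and the default threshold of 500 ms are hypothetical.

```kotlin
// Level names as listed in the table above.
enum class LogLevel { TRACE, DEBUG, INFO, WARN, ERROR, FATAL }

// Hypothetical rule: a failed call is an ERROR, a slow but successful call is
// a WARN, and a normal call is only worth a DEBUG entry.
fun levelForApiCall(
    succeeded: Boolean,
    elapsedMillis: Long,
    slowThresholdMillis: Long = 500
): LogLevel = when {
    !succeeded -> LogLevel.ERROR
    elapsedMillis > slowThresholdMillis -> LogLevel.WARN
    else -> LogLevel.DEBUG
}
```

Keeping the rule in one place makes it easier to tune the threshold as the team learns what is normal for the service.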
The right log level may also depend on the deployment stage:
Stage | Log level |
---|---|
Desktop | DEBUG |
Alpha | DEBUG |
Beta | DEBUG |
Gamma | INFO |
Prod | INFO |
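The stage table can be encoded directly so that the default level follows the deployment stage. In this sketch, the `Stage` enum and both function names are assumptions, and falling back to `PROD` for an unknown stage is one possible conservative choice rather than a rule from this guide.

```kotlin
enum class Stage { DESKTOP, ALPHA, BETA, GAMMA, PROD }

// Default level per stage, as listed in the table above.
fun defaultLogLevel(stage: Stage): String = when (stage) {
    Stage.DESKTOP, Stage.ALPHA, Stage.BETA -> "DEBUG"
    Stage.GAMMA, Stage.PROD -> "INFO"
}

// Hypothetical: parse a stage name (for example from an environment variable),
// falling back to PROD so an unknown value never enables verbose logging.
fun parseStage(value: String?): Stage =
    Stage.values().firstOrNull { it.name.equals(value, ignoreCase = true) } ?: Stage.PROD
```

For example, `defaultLogLevel(parseStage("beta"))` selects DEBUG, while a missing or misspelled stage name selects INFO.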
Good and bad examples
We usually don’t make mistakes when errors need to be logged. The tricky cases are DEBUG and INFO, and sometimes WARN. This also varies from one application to another: logging every event at INFO may be fine for a control-plane service, but it is excessive for data-intensive applications.
- A bad example: excessive development information is logged at INFO level for every event written
override fun process(key: String, timestampedData: TimestampedData<String>) {
    val value = timestampedData.data
    try {
        val headerReader = HeaderReader(context.headers())
        if (headerReader.isIntegrationTest) {
            ...
        } else {
            // This information is useful for diagnosis, but logging it at INFO for every event is excessive.
            // Any failure is already logged by the catch block below.
            log.info()
                .message("Writing event ($value) to SNS (${config.snsTopicArn}) for topic (${config.kafkaTopic})")
                .log()
            snsProxy.publishObject(config.snsTopicArn, value)
        }
    } catch (exception: Exception) {
        log.error() ...
        cloudWatchMetricsLogger.addCount(SNS_PUBLISH_FAIL_METRIC_NAME, 1.0, dimensions)
    }
}
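One possible alternative to a per-event INFO line, not taken from the code above, is to keep per-event detail at DEBUG and emit a periodic INFO summary. The sketch below assumes a hypothetical `PublishSummaryLogger` class and takes the logging call as a plain function so that it does not depend on any particular logging API.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper: counts published events and emits one summary line
// every `every` events instead of one line per event.
class PublishSummaryLogger(
    private val every: Long,
    private val logInfo: (String) -> Unit
) {
    private val published = AtomicLong()

    fun recordPublish(topic: String) {
        val count = published.incrementAndGet()
        if (count % every == 0L) {
            logInfo("Published $count events so far to topic ($topic)")
        }
    }
}
```

With `every` set to 1000, a stream of a million events produces a thousand INFO lines rather than a million, while failures are still logged individually at ERROR.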
- A good example of logging warnings for system delays
It isn’t an error, but may need some investigation.
val wallClockTime = clock.millis()
if (checkpointTimestampMillis < wallClockTime) {
    // A checkpoint in the past is a warning sign that is worth investigating.
    log.warn("Received an alert event where the checkpoint is in the past. wall clock time: $wallClockTime, key: $key, event: $alertEvent, ${context.contextInfoString()}")
    return
}