How (not) to troubleshoot

As developers and engineers we’re blessed to get to play with (almost) completely deterministic systems. Even so, debugging them can often become overwhelming and push us to reconsider what we know or understand. We’ll showcase a particular troubleshooting adventure we recently endured, highlighting the mistakes we made and how we overcame the despair.

TLDR

For the short on time, here are the main principles in our troubleshooting bible:

  1. If operating on a live system and under extreme stress, look for ways to work around or patch the problem without losing data.
  2. Try to reproduce in a local environment so you have full control and no constraints.
  3. Always try to narrow down the problem under investigation.
  4. Don’t jump to immediate conclusions without finding substantial proof.
  5. When finding clues, always catalog them. They might come in handy later, and you can present the catalog as part of a bug report to whoever picks up the task.
  6. Methodically familiarize yourself with the architecture of the environment in which the problem is reproduced.
  7. Once familiar with the architecture, try to find logs for each of the moving parts following a strict top-down approach.

As simple as those principles are, there is always a temptation to stray from them and hack away like the smart superhumans we all pretend to be. In our experience, this usually does not end well and causes damage both to the software and to the engineer’s ego. A nice example of that occurred very recently to a developer of ours – let’s call him agent Smith. Through his misdeeds we’ll illustrate how to properly apply those principles to a practical problem.

Context

We have a non-critical component that requires a set of statistics to be regularly extracted from a significant amount of HTTP proxy access logs. For this purpose, we are using a boring MapReduce cluster – specifically Amazon EMR, which bundles Apache Hive and Hadoop. Smith was tasked with extracting a new piece of info from the logs and needed to parse the user agent stored in them. After a bit of research, the well-documented and performant YAUAA (Yet Another UserAgent Analyzer) library was chosen; it provides a standard way to be used in a Hive job as a UDF (User Defined Function).
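To make the premise concrete, here is a minimal sketch of what wiring in a Hive UDF looks like. The jar path and class name are copied verbatim from the repro steps later in this post; the sample call itself is only an illustrative guess, not a line from the real script:

# Load the UDF jar, register the function class, and call it on a literal
# user agent string (the SELECT is illustrative, not from the real script):
hive -e "
ADD JAR /home/hadoop/crtlibs/yauaa-hive-5.11-udf.jar;
CREATE TEMPORARY FUNCTION ParseUserAgent AS 'nl.basjes.parse.useragent.hive.ParseUserAgent';
SELECT ParseUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36');
"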

If you are not particularly familiar with Hive, EMR or YAUAA, that’s even better for the purposes of this doc – agent Smith was not well versed in them either. The premise was that this is a straightforward task in a non-crucial component of our services – so bang away, Smith.

Unfortunately, Smith quickly hit a strange issue that culminated in the unhelpful error “Vertex killed”.

Smith’s path

At first, Smith followed procedure and:

  1. He tried to reproduce the issue in a local environment, but could not. He opted for a stage one instead, which is still fine.
  2. He tried to narrow down the issue by removing some of the other UDFs used in the Hive script.
  3. He searched for more verbose logs from the components of the system.

Around this time, Smith started making grave mistakes:

  1. Smith found some verbose logs but thought the messages there didn’t point to a root cause either. As he wasn’t completely familiar with the infrastructure architecture and had found the logs by haphazard searching, he had no reason to give more weight to them, so he logically discarded them as inconsequential.
  2. He knew a version of the script that included the YAUAA UDF had run on production the day before. Therefore, the problem was definitely caused by another UDF in the script – a GeoIP lookup one.
  3. As Smith could not reproduce the same issue without the YAUAA UDF, he jumped to another conclusion – it is the combination of the two UDFs that is causing the issue.
  4. As it could only be the combination of the two UDFs, which were both relatively big for simple JARs, the issue could only be with the size of those libs.

At this point, Smith was doomed. He was now poking at the code haphazardly and praying to pagan gods and demons. Amidst the prayers, an ancient demon by the name of Stack Overflow answered and in a cryptic message hinted that the issue might be with the Hive engine itself. Smith hastily tried running the script with another, deprecated, Hive engine (a sketch of that switch follows the list below) and, upon success, confirmed two things:

  1. He had a workaround solution for production!
  2. The root cause was so deep in the Hive / Hadoop / Java ecosystem that a simple engineer such as Smith would need eons to dig down to it.
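For reference, the engine switch itself is a one-liner – hive.execution.engine is the standard Hive setting for this, while the script path below is a hypothetical stand-in for our real script:

# Run the same script under the legacy MapReduce engine instead of Tez
# (the .hql path is a hypothetical stand-in for our real script):
hive --hiveconf hive.execution.engine=mr -f /home/hadoop/our-stats-script.hql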

Luckily, Smith was not completely out of his mind and decided to escalate the issue to another engineer – of course we’ll call him Neo.

Initial bug report (by Smith)

When loading both the yauaa and geoip UDF jars in Hive, any query that needs a reducer (any GROUP BY query) crashes with a very generic and incomprehensible error: “Vertex killed”. This happens only when using Tez, the default Hive engine on Amazon EMR – with the deprecated MR engine, all works fine.

Steps to reproduce

For some reason the hive shell’s readline breaks on the newlines in the following snippets – please make sure to copy them line by line to reproduce.

On a stage cluster (as Tez is not available in the local dockerized Hive cluster), execute the following in a hive shell:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=50000;
SET hive.exec.max.dynamic.partitions.pernode=20000;
ADD JAR /home/hadoop/crtlibs/cf-log.jar;

If this is a brand new cluster, we need to initialize the table as well (pre-uploading data so it’s not completely empty):

CREATE EXTERNAL TABLE logs
ROW FORMAT SERDE 'com.snowplowanalytics.hive.serde.CfLogDeserializer'
LOCATION 's3://crtemrdev/cflogs/';

Then the following query succeeds:

SELECT count(*), distid
FROM (
  SELECT *, regexp_extract(domain, '([a-zA-Z0-9]*).cloudfront.net', 1) AS distid
  FROM logs
) augmented
GROUP BY distid;

When loading the geoip jar, all is still good:

ADD JAR /home/hadoop/crtlibs/elsix-geoip.jar;
CREATE TEMPORARY FUNCTION geoip AS 'com.spuul.hive.GeoIP2';
ADD FILE /home/hadoop/crtlibs/GeoLite2-Country.mmdb;
SELECT count(*), distid
FROM (
  SELECT *, regexp_extract(domain, '([a-zA-Z0-9]*).cloudfront.net', 1) AS distid
  FROM logs
) augmented
GROUP BY distid;

When we load the yauaa jar:

ADD JAR /home/hadoop/crtlibs/yauaa-hive-5.11-udf.jar;
CREATE TEMPORARY FUNCTION ParseUserAgent AS 'nl.basjes.parse.useragent.hive.ParseUserAgent';
SELECT count(*), distid
FROM (
  SELECT *, regexp_extract(domain, '([a-zA-Z0-9]*).cloudfront.net', 1) AS distid
  FROM logs
) augmented
GROUP BY distid;

it fails with this error:

Container id: container_1565940228452_0030_01_000008
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 255
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1565940228452_0030_1_00 [Map 4] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=Reducer 3, vertexId=vertex_1565940228452_0030_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1565940228452_0030_1_02 [Reducer 3] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=Map 1, vertexId=vertex_1565940228452_0030_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1565940228452_0030_1_01 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2

Neo’s path

Neo was not a Hive guru either, so he first got familiar with how to run hive queries and what the stdout logs, results, and possible stack traces look like – all on a local system.

As the initial report stated that the query works on MR but not on Tez, Neo read through the documentation to check all the available engine alternatives – currently MR, Tez, and Spark. Spark was suggested for fragmented data in smaller chunks (which is our case). He made a mental note of that, but for low-maintenance projects such as this one we prefer to stick to the defaults preselected by the IaaS provider as much as possible, both to reuse the community’s crowd testing and to get quicker community support in case of future issues.

Even though the report blamed a multi-jar combination for the issue, and there was also mention of a version with another geoip UDF working a day prior, Neo chose to ignore this info, as it was still an unproven hypothesis. And as the report was a bit fragmented, it looked like the issue could be narrowed down further.

  1. Neo combined the snippets into one chunk that could easily be rerun (re-pasted) in a fresh hive shell. While working on the problem, he found an unrelated readline bug in the hive shell that caused issues for some of the multi-line snippets, so to minimize headaches he reformatted everything into a single line. The snippets presented here are formatted back to multi-line for better readability.
ADD JAR /home/hadoop/crtlibs/cf-log.jar;
ADD JAR /home/hadoop/crtlibs/elsix-geoip.jar;
CREATE TEMPORARY FUNCTION geoip AS 'com.spuul.hive.GeoIP2';
ADD FILE /home/hadoop/crtlibs/GeoLite2-Country.mmdb;
ADD JAR /home/hadoop/crtlibs/yauaa-hive-5.11-udf.jar;
CREATE TEMPORARY FUNCTION ParseUserAgent AS 'nl.basjes.parse.useragent.hive.ParseUserAgent';
SELECT count(*), distid
FROM (
  SELECT *, regexp_extract(domain, '([a-zA-Z0-9]*).cloudfront.net', 1) AS distid
  FROM logs
) augmented
GROUP BY distid;
  2. Neo tried to simplify by dropping the geoip jar altogether and still got a failure.
ADD JAR /home/hadoop/crtlibs/cf-log.jar;
ADD JAR /home/hadoop/crtlibs/yauaa-hive-5.11-udf.jar;
CREATE TEMPORARY FUNCTION ParseUserAgent AS 'nl.basjes.parse.useragent.hive.ParseUserAgent';
SELECT count(*), distid
FROM (
  SELECT *, regexp_extract(domain, '([a-zA-Z0-9]*).cloudfront.net', 1) AS distid
  FROM logs
) augmented
GROUP BY distid;
  3. Neo tried to simplify further by dropping all references to the source data (and the data-parsing jar with them).
ADD JAR /home/hadoop/crtlibs/yauaa-hive-5.11-udf.jar;
CREATE TEMPORARY FUNCTION ParseUserAgent AS 'nl.basjes.parse.useragent.hive.ParseUserAgent';
SELECT "foo.cloudfront.net" AS domain
UNION
SELECT "bar.cloudfront.net" AS domain;

Having simplified the script this far, it was evident that the issue lay solely within the yauaa jar. However, the stdout logs were completely insufficient for debugging:

Container id: container_1565940228452_0030_01_000008
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 255
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1565940228452_0030_1_00 [Map 4] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=Reducer 3, vertexId=vertex_1565940228452_0030_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1565940228452_0030_1_02 [Reducer 3] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=Map 1, vertexId=vertex_1565940228452_0030_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1565940228452_0030_1_01 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2

Having only partial knowledge of the Hive / Hadoop / Tez architecture, Neo needed a visual representation of it, so he searched for web admin interfaces for Hive and Tez – especially ones preconfigured on Amazon EMR. That quickly led to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/tez-web-ui.html
Of course, that does not work out of the box, as the UI ports are blocked by the security policies, but Neo navigated around those hurdles with a few SSH port forwards (see the sketch after the list below) and managed to clarify that:

  • Hive breaks down the specific job into mappers
  • Each mapper is executed in one or more tasks
  • Each task is started in a separate container
  • Each container is executed on the Hadoop DataNode and seems to have logs stored there
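The tunnelling itself might look roughly like this – port 8080 is where the Tez UI lives according to the AWS doc linked above, while the key file and hostname are placeholders:

# Forward the Tez UI port from the EMR master node to the local machine,
# then browse http://localhost:8080/tez-ui (key and host are placeholders)
ssh -i ~/emr-key.pem -N -L 8080:localhost:8080 hadoop@<emr-master-public-dns>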

Going onto the DataNode and searching the filesystem for the failing container name, Neo managed to find the actual log of the failing mapper run.
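That search might have looked something like the following – the log directory is an assumption based on typical YARN defaults on EMR, not something recorded in the report:

# Hunt for files mentioning the failing container id on the DataNode
# (/var/log/hadoop-yarn is an assumed, typical YARN log location):
sudo find /var/log/hadoop-yarn -name '*container_1565940228452_0030_01_000008*'
sudo grep -rl 'container_1565940228452_0030_01_000008' /var/log/hadoop-yarn

Here is the log that search turned up: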

2019-08-16 08:46:15,118 [INFO] [main] |task.TezChild|: Refreshing UGI since Credentials have changed. Credentials : #Tokens=1, #SecretKeys=0
2019-08-16 08:46:15,138 [INFO] [main] |metrics.TaskCounterUpdater|: Using ResourceCalculatorProcessTree : org.apache.tez.util.TezMxBeanResourceCalculator
2019-08-16 08:46:15,138 [INFO] [main] |runtime.LogicalIOProcessorRuntimeTask|: Initializing LogicalIOProcessorRuntimeTask with TaskSpec: DAGName : SELECT "foo.cloudfront.net" as doma...domain(Stage-1), VertexName: Map 4, VertexParallelism: 1, TaskAttemptID:attempt_1565940228452_0030_1_00_000000_3, processorName=org.apache.hadoop.hive.ql.exec.tez.MapTezProcessor, inputSpecListSize=1, outputSpecListSize=2, inputSpecList=[{{ sourceVertexName=_dummy_table, physicalEdgeCount=1, inputClassName=org.apache.tez.mapreduce.input.MRInputLegacy }}, ], outputSpecList=[{{ destinationVertexName=out_Map 4, physicalEdgeCount=0, outputClassName=org.apache.tez.mapreduce.output.MROutput }}, {{ destinationVertexName=Reducer 3, physicalEdgeCount=1, outputClassName=org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput }}, ], taskConfEntryCount=3
2019-08-16 08:46:15,140 [INFO] [main] |resources.MemoryDistributor|: InitialMemoryDistributor (isEnabled=true) invoked with: numInputs=1, numOutputs=2, JVM.maxFree=2291138560, allocatorClassName=org.apache.tez.runtime.library.resources.WeightedScalingMemoryDistributor
2019-08-16 08:46:15,148 [ERROR] [main] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[main,5,main] threw an Error. Shutting down now...
java.lang.NoSuchMethodError: com.google.common.util.concurrent.Futures.addCallback(Lcom/google/common/util/concurrent/ListenableFuture;Lcom/google/common/util/concurrent/FutureCallback;)V
    at org.apache.tez.runtime.task.TaskReporter.registerTask(TaskReporter.java:110)
    at org.apache.tez.runtime.task.TezTaskRunner2.run(TezTaskRunner2.java:157)
    at org.apache.tez.runtime.task.TezChild.run(TezChild.java:264)
    at org.apache.tez.runtime.task.TezChild.main(TezChild.java:508)
2019-08-16 08:46:15,152 [INFO] [main] |util.ExitUtil|: Exiting with status -1

As it later turned out, that was the exact same log Smith had previously found and discarded. That’s why cataloging clues is so important – so much energy would have been spared…

Unlike Smith, Neo was convinced this log message brought him closer to the root cause, so he did not discard it. Instead, he hypothesized that the error was caused by wrong classpath resolution.

Neo resisted the temptation to jump to conclusions and methodically checked that the supposedly nonexistent method was actually present in the relevant lib – Google Guava in particular (at least in the version yauaa depends on). Neo also checked the classpath on the DataNode and saw that the guava version loaded there was different from the one the yauaa lib depends on. Combined with the yauaa jar we were loading ourselves, this narrowed the problem down to the build and shading of the jar itself.
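A hypothetical rendition of those checks: javap can confirm whether a given guava jar exposes the method from the NoSuchMethodError above, and hadoop classpath shows which guava the node actually loads (the jar paths and version number are illustrative):

# Does this guava jar expose the two-argument Futures.addCallback?
javap -classpath guava-11.0.2.jar com.google.common.util.concurrent.Futures | grep addCallback
# Which guava jars end up on the node's classpath?
hadoop classpath | tr ':' '\n' | grep -i guava
# Does the yauaa UDF jar bundle its own, unrelocated guava classes?
unzip -l yauaa-hive-5.11-udf.jar | grep -m 5 'com/google/common'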

Now that Neo was almost certain of his hypothesis, he proceeded with further reading and a custom build of the yauaa jar (with dependency shading). When the newly built version of yauaa worked on production, Neo had his final proof and could calmly follow up with the developers of yauaa so they could address the issue on their side as well.
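One quick way to sanity-check such a rebuild is to confirm that the bundled guava classes were actually relocated out of the com.google.common namespace – the shaded jar name and the relocation prefix below are hypothetical:

# Unrelocated guava classes in the original jar...
unzip -l yauaa-hive-5.11-udf.jar | grep -c 'com/google/common'
# ...versus relocated ones in the rebuilt jar (names are illustrative):
unzip -l yauaa-hive-5.11-udf-shaded.jar | grep -c 'shaded/com/google/common'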

Important lessons

To reiterate the principles we began with:

  1. Always look for ways to work around an issue if it’s on a live system, but first persist as much of the data needed to reproduce and investigate it.
  2. Prefer reproducing locally over on stage, and on stage over on production.
  3. Don’t jump to immediate conclusions without finding substantial proof of them – e.g. “The issue is caused by the size of the two jar libraries being loaded”.
  4. When finding clues/logs, always catalog them and present them as part of the bug report – e.g. the container log Neo eventually found had actually been noticed and discarded by Smith at the beginning of his troubleshooting.
  5. Always try to simplify the testing environment/snippet/codebase – e.g. try to discard unneeded libraries, try to shorten complicated queries, try to remove parts of the environment setup.
  6. Methodically familiarize yourself with the architecture/infrastructure on which the problem is reproduced.
  7. Once familiar with the architecture, try to find logs for each of the moving parts following a strict top-down approach.