Date

Attendees

Goals

    1. Identify a way to get the logs we need in the right format given the current implementation or...
    2. Identify what needs to be done/changed/re-configured to support current reporting needs and enable simple triaging of operational applications. 

Discussion items

  • NGAP-2296
  • Doug noticed a new format in the logs coming from NGAP 1.1. As a result, all of the queries need to be updated to support the current metric reporting.
    • That's a bit of a setback, given that the hope was to just update sourcetypes and have everything work.
    • This works for EDSC with some manual work.
    • The biggest concern for right now is triaging issues.
  • Applications are just providing text. Splunk has some logic/black magic that identifies events and formats them in certain ways.
  • Potential solution that needs to be researched - https://answers.splunk.com/answers/390219/how-to-parse-docker-logs-with-multiple-events-from.html (a rough props.conf sketch is included at the end of this list)
  • Another link from Tim - https://github.com/moby/moby/issues/22920
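  • A rough props.conf sketch of the parsing approach from the link above. The sourcetype name and every setting here are assumptions to validate against our data, not our current configuration:

      [ngap:docker]
      # Extract the JSON payload (line, tag, attrs.*) as search-time fields
      KV_MODE = json
      # If raw multi-line text is indexed, merge continuation lines
      # (e.g. stack trace frames) into the preceding event
      SHOULD_LINEMERGE = true
      BREAK_ONLY_BEFORE = ^\d{4}-\d{2}-\d{2}
      TRUNCATE = 100000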


Questions

  • What is Splunk doing to format the text file into what we see in Splunk for NGAP 1.0?
    • Marcus: No reformatting is happening. Splunk adds sourcetype and timestamp, but not heavy reformatting or anything.
  • What was done in NGAP 1.0 for formatting/sourcetypes and can it be done for 1.1?
    • Who put the sourcetypes together? How did they do it?
    • Marcus suspects it is someone on the NGAP side. Maybe Andrew?
  • Did Docker break this? Hard to tell if we don't have sourcetypes.
    • Should be able to use sourcetypes if we want (and configure them); see the sketch at the end of this list.
  • Would using Cloudwatch solve our problems? It doesn't solve the concatenation of multi-line events, but it would provide the raw lines.
  • Assess what is a go/no-go for EDSC into Prod on Wednesday.
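  • A minimal sketch of setting a sourcetype from the Docker side, using the Splunk log driver's splunk-sourcetype option. The URL, token, and sourcetype values here are made up, and a matching sourcetype would still need to be configured in Splunk:

      docker run --log-driver=splunk \
        --log-opt splunk-url=https://<splunk-hec-endpoint>:8088 \
        --log-opt splunk-token=<HEC-token> \
        --log-opt splunk-sourcetype=ngap:docker \
        <image>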

Action items

1 Comment

  1. Some more notes on what Docker/NGAP 1.1 is doing to set up application logs:

    We use the Splunk log driver that's built into Docker, configuring it with the Splunk URL and HEC Token that Docker uses to send events (every line of the container's output). The events that Docker sends package the line of text in a JSON object, containing metadata about the event, container, ECS service, and instance; this info is critical for debugging any issues that come up, as it's the only way we can track a log event down to the specific ECS service, task and container that produced it. Example event:

    {
      "line": "198.x.x.x - - [02/Nov/2017:20:57:04 +0000] \"GET / HTTP/1.1\" 200 1157 \"-\" \"ELB-HealthChecker/2.0\" \"-\"",
      "source": "stdout",
      "tag": "x.dkr.ecr.us-east-1.amazonaws.com/ngap-app/ecs-prod/cumulus-dashboard-lpdaac-sit:v27-431/ecs-ngap-app-ecs-prod-7f91dde7a5e6205a-431-web-1-ngap-app-ecs-prod-7f91dde7a5e6205a-431-web-aef49ece8a8bfa95ad01/8aab07c47b28672d00bb0db7d5de6e04f30f25f4510c61485d89fdd4a5f60d47",
      "attrs": {
        "com.amazonaws.ecs.cluster": "gsfc-eosdis-ngap-ecs-prod-ecs-cluster-cumulus-dashboard-lpdaac-sit-ECSCluster-1TIHCJ1ATGEMC",
        "com.amazonaws.ecs.container-name": "ngap-app-ecs-prod-7f91dde7a5e6205a-431-web",
        "com.amazonaws.ecs.task-arn": "arn:aws:ecs:us-east-1:OBSCURED:task/7bd3d8d5-ff9a-48b0-870c-b3528faf350a",
        "com.amazonaws.ecs.task-definition-family": "ngap-app-ecs-prod-7f91dde7a5e6205a-431-web",
        "com.amazonaws.ecs.task-definition-version": "1"
      }
    }

    Screenshot
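
    For reference, a hedged sketch of what this log driver configuration might look like in an ECS task definition. The option values are placeholders, and the tag template is a guess at how the tag above was produced, not our confirmed setting:

    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-hec-endpoint>:8088",
        "splunk-token": "<HEC-token>",
        "splunk-format": "inline",
        "tag": "{{.ImageName}}/{{.Name}}/{{.ID}}"
      }
    }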

    Extracting just the line field can be done through queries such as `... | table line`. Searching within a specific field can be done like: `... | search line="*error*"`.
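
    For triaging a specific service, those pieces can be combined. A rough example, assuming the attrs.* fields are extracted the same way the line field is (the index, container-name pattern, and error string here are illustrative):

    index=main attrs.com.amazonaws.ecs.container-name="*-web" line="*ERROR*"
    | table _time, attrs.com.amazonaws.ecs.task-arn, line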


    This is using the Splunk log driver's default "inline" format. There are also "json" and "raw" options:

    "json" - assumes the text (line) is JSON, and parses it (falls back to `inline` if it can't be parsed). The parsed JSON is still, as with the `inline` format, passed under the `line` field; just as a structured JSON object, rather than as a string.

    "raw" - Prefixes each line of text w/ all of the attributes and tags. The text of the line is otherwise unchanged. This still leaves alot for consumers to wade through, but possibly easier to read/trim
    than the JSON-formatted output--as long as we don't lose all the container metadata (or preserve it in event attributes rather than in the text of the event). Example from docs:

    MyImage/MyContainer env1=val1 label1=label1 my message

    Since the "raw" format adds metadata at the beginning of each line, it's likely to still break whatever logic Splunk has for concatenating lines of a stack trace.

    All three formats treat a single line of output as an event.


    NGAP also has a CloudWatch-to-Splunk forwarder that's used for things like getting Lambda logs into Splunk. We're starting to rely on this more and more, since CloudWatch is the one logging mechanism that's (most likely to be) built into any given AWS service. Docker also has a CloudWatch log driver that we could use instead of the Splunk driver to get the logs to where they would be picked up by the CloudWatch-to-Splunk forwarder. The CloudWatch driver uses the log group and log stream to provide container-specific metadata, rather than embedding it in the log text:

    Screenshot

    (Query: "index=main log_group::/ngap/ecs-prod/app/nclark4-prod-test-app")
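
    A hedged sketch of the corresponding ECS task definition snippet for the CloudWatch (awslogs) driver; the values are placeholders following the log group naming above:

    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ngap/ecs-prod/app/<app-name>",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }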

    This doesn't solve the problem of concatenating stack trace lines, though, unless Splunk has built-in logic for this concatenation that can now recognize and handle the traces (since the event text is just the raw output text, without the additional metadata).
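
    If that built-in merging doesn't happen, one search-time workaround worth evaluating is stitching a container's lines back together with Splunk's transaction command. A rough sketch, where the log_stream field name and the maxpause value are guesses to verify, not tested:

    index=main log_group::/ngap/ecs-prod/app/<app-name>
    | transaction log_stream maxpause=2s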