Centralized Logging Solution for Google Cloud Platform (Cloud Next '18... (2)
So my first stop and first stop for many customers
is let's check out Error Reporting
to see if it's automatically identified anything for us.
And sure enough, having looked through our logs,
out of all of those errors, it's identified a specific error
in our application that someone tried
to set the quantity to less than zero,
which caused a specific error.
And I can click on that.
I can see exactly how many customers have
been impacted by this error.
I can see the stack trace so that I
can tell exactly where in my code this came from,
and I could go to Stackdriver Debugger
to investigate further, if I wanted to, from here.
I can also jump directly back to the logs to view the raw logs.
I can link to an issue, so this one here
links to a GitHub issue, for example.
And I can also track the resolution status.
So for example, if one of our developers tells me,
hey, I fixed this error.
It should be all set.
It's deployed.
I can go ahead and say, this error should be resolved,
and then we say, OK, no known errors.
That's great.
If I go back to my catalog here--
and we'll test it to see if this actually works--
and I add something to my cart, let's see
what happens if we try to set the quantity
to a negative number.
We'll update the basket, and I'll
come back over to Error Reporting,
and reload it here for just a moment.
And this usually updates in about five to 10 seconds,
and we can see that it automatically
identified that the error had been seen again and reopened
the issue.
If I wanted to, I can also turn on notifications
so that rather than me going to this dashboard,
it'll proactively push alerts for new or re-opened errors
to my inbox.
So that's a very common use case we
hear from a lot of developers who
are using Stackdriver Logging.
But we hear a number of other examples
for, like, security use cases.
So something that I hear a lot is,
I want to be alerted if anyone adds an email,
let's say from a Gmail.com domain.
So in the UI, I can interact with it
in sort of a point and click mode,
but I can also interact with the logs in the Advanced Filter,
and then type in more advanced queries.
So in the case of identifying something
that comes from a Gmail.com account,
I need to create a logs-based metric so that I can then
alert on that.
So I'm going to go over to my Logs-Based Metrics,
and I've created one here.
I'm going to go ahead and edit that metric.
For anything that comes from logs
that are SetIamPolicy and that bind a member at @gmail.com,
I want to count all of these.
And that's kind of the first step.
So I'll be able to have dashboards about everyone who's
been added from an @gmail.com account,
and then I can also create alerting policies on that.
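As a rough sketch of what such a metric counts, the snippet below checks sample audit log entries for a SetIamPolicy call that adds an @gmail.com member. The field layout (protoPayload.serviceData.policyDelta.bindingDeltas), the sample entries, and the helper name adds_gmail_member are assumptions for illustration, not the filter shown in the demo.

```python
def adds_gmail_member(entry):
    """Return True if a log entry is a SetIamPolicy call adding an @gmail.com member.

    Assumes the entry is a dict shaped like an audit log entry, with binding
    changes under protoPayload.serviceData.policyDelta.bindingDeltas.
    """
    payload = entry.get("protoPayload", {})
    if payload.get("methodName") != "SetIamPolicy":
        return False
    deltas = (
        payload.get("serviceData", {})
        .get("policyDelta", {})
        .get("bindingDeltas", [])
    )
    return any(
        delta.get("action") == "ADD"
        and delta.get("member", "").lower().endswith("@gmail.com")
        for delta in deltas
    )


# Hypothetical entries: one adds a gmail.com user, one adds a corporate user.
sample_entries = [
    {"protoPayload": {"methodName": "SetIamPolicy", "serviceData": {"policyDelta": {
        "bindingDeltas": [{"action": "ADD", "role": "roles/viewer",
                           "member": "user:someone@gmail.com"}]}}}},
    {"protoPayload": {"methodName": "SetIamPolicy", "serviceData": {"policyDelta": {
        "bindingDeltas": [{"action": "ADD", "role": "roles/viewer",
                           "member": "user:someone@example.com"}]}}}},
]

# The logs-based metric is essentially this count, tracked over time.
gmail_additions = sum(1 for entry in sample_entries if adds_gmail_member(entry))
print(gmail_additions)
```

A counter like this is what the dashboards and alerting policies are then built on.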
And I can see that earlier this morning, my Google account
did add somebody who was @gmail.com.
So let me go back to our logs-based metrics
and go ahead and show you how you would create
an alerting policy on this.
So I'm going to Create Alert from Metric.
This will pop me over into Stackdriver Monitoring, which
is where I manage all of my dashboards
and my alerting policies.
So I'm able to, here, see the logs-based metric.
Take a look here.
I don't have any recent ones, but I'm
going to go ahead and create an alerting policy if I ever
see this, because I don't ever expect to see this.
And instead of duration, I'm going to say most recent value.
So if this ever happens, I want to be alerted.
Go ahead and save the condition, and create a notification.
So I'll send it to my favorite email address,
I can add some documentation in terms of who to contact.
Name this policy.
And I'll go ahead and save it.
So now if I go ahead and add a new member to an IAM policy
anywhere in this project, I will receive
a notification about this.
Another common use case we hear from customers
is that they want to send their logs someplace else.
So this is the beauty of the centralized logging solution
is that it all comes in centrally,
and then we can slice and dice it and send it
to many different places using exports, which
is the log router that we were talking
about just a moment ago.
So I'll start out with a very common use
case, which is I want to send all of my audit logs
to BigQuery.
So I select a filter.
In this case, I'm going to say everything
that matches the activity audit logs in the log name.
And then I simply select BigQuery as the destination
and create a sink.
And any future logs that match this will automatically
be sent to BigQuery, which is great,
but that only helps me for this project.
What if I want to do this at the organization level?
So in this case, I need to pull up my Cloud Shell here.
And I can use a gcloud command to set this
at the organizational level.
And I'll call out right here.
We have "include children," and we're setting this up
at the organization level.
Same thing, though-- the log name
matches the filter of anything that
has cloudaudit.googleapis.com.
And I'm going to go ahead and save this,
and now any audit logs from anywhere
in my organization will all go to the same BigQuery
destination.
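A minimal sketch of how that organization-level command could be assembled is shown below. The sink name, organization ID, project, and dataset are hypothetical placeholders, and the exact command typed in the demo is not shown in this transcript, so treat the argument list as an assumption based on the gcloud logging sinks create command.

```python
import shlex


def org_sink_command(sink_name, org_id, project_id, dataset):
    """Build the argument list for an organization-level BigQuery sink.

    All four parameters are placeholders supplied by the caller.
    """
    destination = (
        f"bigquery.googleapis.com/projects/{project_id}/datasets/{dataset}"
    )
    return [
        "gcloud", "logging", "sinks", "create", sink_name, destination,
        f"--organization={org_id}",
        "--include-children",  # also capture logs from every child project
        '--log-filter=logName:"cloudaudit.googleapis.com"',
    ]


# Hypothetical values; print the command so it can be reviewed before running.
cmd = org_sink_command(
    "audit-logs-to-bq", "123456789012", "my-logging-project", "audit_logs"
)
print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. After a sink is created, its writer identity still needs write access to the destination dataset.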
Now one tricky thing just to remember
is now that I set it up at the org level,
any audit logs that come from this project
will be written twice.
So in that case, I'd probably want
to go back and get rid of the one at the project level.
Another thing that we see users do
a lot is act on logs.
So in this case, let's say every time a new VM is spun up,
I want to take a look at it, process it
with Cloud Functions, maybe, and add some labels
or apply some rules to it.
So I'm going to create a sink for any time
that I have created an instance, which is the insert command.
And I'll be able to send all of these to Pub/Sub.
I can then use a Cloud Function to pull all of the logs
from Pub/Sub, process them, take whatever action I want on them.
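Here is a rough sketch of what the function on the receiving end could look like. It assumes the sink delivers each log entry as base64-encoded JSON in the Pub/Sub message data, and that instance creation shows up with a methodName ending in compute.instances.insert. The function names and the sample event are hypothetical.

```python
import base64
import json


def instance_from_pubsub_event(event):
    """Return the created instance's resource name, or None for other entries.

    Assumes event["data"] holds a base64-encoded JSON log entry.
    """
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    payload = entry.get("protoPayload", {})
    if not payload.get("methodName", "").endswith("compute.instances.insert"):
        return None
    return payload.get("resourceName")


def handle_log_entry(event, context=None):
    """Entry point in the style of a Pub/Sub-triggered function."""
    name = instance_from_pubsub_event(event)
    if name is not None:
        # Placeholder for the real action: add labels, apply rules, and so on.
        print(f"New VM created: {name}")
    return name


# Simulate one exported entry arriving from Pub/Sub (hypothetical values).
fake_entry = {"protoPayload": {
    "methodName": "v1.compute.instances.insert",
    "resourceName": "projects/my-project/zones/us-central1-a/instances/demo-vm",
}}
fake_event = {"data": base64.b64encode(json.dumps(fake_entry).encode("utf-8"))}
created = handle_log_entry(fake_event)
```

Keeping the parsing in its own function makes it easy to test without deploying anything.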
So this is another common use case we see.
And then last but not least, we'll
talk about log exclusions.
So we have a page dedicated to helping
you understand what your log volume is
across your various Google Cloud resources or AWS resources
here as well.
And I can see, for example, taking a look here,
a lot of my volume in this project
is coming from Kubernetes, so that's the bulk of my logs.
I can see, though, that projected
through the end of the month is 43 gigabytes, which is well
within the free limit of 50 gigabytes,
so I'm not too, too worried.
But I could go ahead and say, you know what?
I'm going to send these logs maybe to my ELK Stack.
I don't want to pay for them in Stackdriver.
I could go ahead and disable the logs altogether here,
or I could create an exclusion filter based
on this and, for example, say, anything that is less
than a warning level, I want to maybe just sample those logs.
So instead of excluding 100% of them,
maybe I will set this to 99%.
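To make the sampling idea concrete, here is a small simulation of an exclusion that drops 99% of entries below warning level. In the Logging filter language this kind of rule is typically written with the sample function, along the lines of severity < WARNING AND sample(insertId, 0.99). The hash used below is only a stand-in and is not how Stackdriver actually samples; the entry shape and function names are assumptions.

```python
import hashlib

# Severity levels in increasing order of importance.
SEVERITY_ORDER = [
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
]


def sample(insert_id, fraction):
    """Deterministically select roughly `fraction` of entries by hashing the ID."""
    digest = hashlib.sha256(insert_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < fraction


def is_excluded(entry, fraction=0.99):
    """Exclude `fraction` of the entries whose severity is below WARNING."""
    severity = entry.get("severity", "DEFAULT")
    below_warning = (
        SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index("WARNING")
    )
    return below_warning and sample(entry["insertId"], fraction)


# Simulate 10,000 INFO entries and count how many survive the exclusion.
entries = [{"severity": "INFO", "insertId": f"id-{i}"} for i in range(10000)]
kept = [entry for entry in entries if not is_excluded(entry)]
print(f"kept {len(kept)} of {len(entries)} low-severity entries")
```

Because the decision is based on a hash of the entry ID rather than a random draw, the same entry always gets the same answer, and warnings and errors are never dropped.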
I can also deep dive into the logs volume in Stackdriver
Monitoring using the Metrics Explorer tool
and visualize exactly where my volume of logs is coming from.
And if we could cut back please to the presentation, awesome.
EDUARDO SILVA: So now with a solution like Stackdriver,
we can say that logging is no longer boring.
But before that, handling logs in different formats,
from many sources, and correlating data is quite hard.
And I'm sure if you are already working with distributed systems
or clusters or with Kubernetes, there are
quite a few challenges that need to be solved.
So what I'm going to explain now is
how logging operates at the Kubernetes level.
So if you understand the problem, how
it works behind the scenes, it means
you can optimize the engine queries
and get better insights from your data.
Do we have any Kubernetes users here?
Oh, good.
So I'm going to do a little introduction
about how logging works in Kubernetes
or when you play with Docker containers.
And basically, you have one application
that triggers a message, and that message
is meant to be like a log message saying a status, an alert,
a warning, or anything related.
For example, we have, like, "Hey Next."
But it's not just about the message.
That message has some metadata that needs to be correlated.
One of them is at what time this message was generated,
and the other is what's the channel that this message was
generated from.
If we speak about containers, there
are two main channels, standard output, standard error.
And here, for example, "Hey Next" is not just a text.
We have a JSON example with metadata.
That message then needs to be stored somewhere.
There are many ways to store container logs
or Kubernetes logs, if, for example, we have systemd.
But here we are going to refer to how Docker operates.
Basically, your message is stored in the file system,
in a pod.
So every message becomes a new JSON map,
and every message is appended at the end of that file.
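A minimal sketch of that append-only layout follows. It assumes the common Docker json-file shape, where each line is a JSON map with log, stream, and time keys. The file path and helper names are made up for the example.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_log_line(path, message, stream="stdout"):
    """Append one message as a JSON map on its own line at the end of the file."""
    record = {
        "log": message + "\n",  # the raw message
        "stream": stream,       # channel: stdout or stderr
        "time": datetime.now(timezone.utc).isoformat(),  # when it was generated
    }
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")


def read_log_lines(path):
    """Parse every line of the file back into a dict."""
    with open(path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]


# Write two messages to a throwaway file, then read them back.
log_path = os.path.join(tempfile.mkdtemp(), "container-json.log")
append_log_line(log_path, "Hey Next")
append_log_line(log_path, "something failed", stream="stderr")

records = read_log_lines(log_path)
for record in records:
    print(record["stream"], record["log"].rstrip())
```

Each message carries its own timestamp and channel, which is the metadata a log processor later has to correlate.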
But things will become a little bit complex later,
because if we think about how Kubernetes
works from an architectural perspective,
we can think about this.
You have your application, the most simple use case.
That application is running in a container.
A container adds limitations and restrictions
and allows you to set up certain policy rules
and also how this process that is running
can communicate with others.
And when I say the "container," you
know that container is just a concept, right?
From an operating system perspective,
it's all about namespaces and cgroups.
So your application runs in a container,
and that container runs in a concept, which is called a pod.
So things get complex, because a pod
can have multiple containers.
Multiple pods can be running on the same node.
And by a node I'm referring to a bare metal
machine or a virtual machine.
So here is just one single machine, but in a real cluster,
you have many of them.
So imagine that you have your distributed application.
You told Kubernetes, please deploy my application.
Kubernetes, based on the scaling policies,
is going to decide where these containers will run
or will scale up.
It's going to create some replicas on which
node these replicas will run.
And likely, not all of them will run in the same node.
So this becomes complex.
If I told you 20 years ago, please
look at my logs from my application, well,
you open the terminal.
You do some SSH and look what's going on with your application.
Just cat the file, and you get the message.
But with this, if you have a huge cluster,
you cannot do an SSH on every node and try to find which is
the local JSON file that belongs to the application that you just
deployed, and maybe that application was destroyed,
because it failed or it was scaled up.
So things become more complex, and complex