Configuring Prometheus
Prometheus configuration is YAML. The Prometheus download comes with a sample configuration in a file called prometheus.yml that is a good place to get started.
We’ve stripped out most of the comments in the example file to make it more succinct (comments are the lines prefixed with a #).
There are three blocks of configuration in the example configuration file: global, rule_files, and scrape_configs.
The global block controls the Prometheus server’s global configuration. We have two options present. The first, scrape_interval, controls how often Prometheus will scrape targets. You can override this for individual targets. In this case the global setting is to scrape every 15 seconds. The evaluation_interval option controls how often Prometheus will evaluate rules. Prometheus uses rules to create new time series and to generate alerts.
The rule_files block specifies the location of any rules we want the Prometheus server to load. For now we’ve got no rules.
The last block, scrape_configs, controls what resources Prometheus monitors. Since Prometheus also exposes data about itself as an HTTP endpoint it can scrape and monitor its own health. In the default configuration there is a single job, called prometheus, which scrapes the time series data exposed by the Prometheus server. The job contains a single, statically configured, target, the localhost on port 9090. Prometheus expects metrics to be available on targets on a path of /metrics. So this default job is scraping via the URL: http://localhost:9090/metrics.
The time series data returned will detail the state and performance of the Prometheus server.
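The configuration described above can be sketched roughly as follows. This is a reconstruction from the prose, not a verbatim copy of the shipped file: the 15s value for evaluation_interval and the commented-out rule file names are assumptions, and recent releases also ship an alerting block that is omitted here.

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped (stated in the text)
  evaluation_interval: 15s  # how often rules are evaluated (assumed value)

rule_files:
  # No rules are loaded yet; the file names below are placeholders.
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    # metrics_path defaults to /metrics, so this scrapes http://localhost:9090/metrics
    static_configs:
      - targets: ['localhost:9090']
```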
For a complete specification of configuration options, see the
configuration documentation.
Latest news
25.03.2020 |
NRC "Kurchatov Institute" – CRISM "Prometey" is taking comprehensive measures to prevent the spread of the novel coronavirus infection (COVID-19). |
20.03.2020 |
On 16 March, NRC "Kurchatov Institute" announced the laureates of the prize named after Academician |
11.03.2020 |
On 10 March, at NRC "Kurchatov Institute" |
08.03.2020 | |
04.03.2020 |
Works and teams of authors among the scientists of NRC "Kurchatov Institute" |
04.03.2020 |
On 27 February 2020, the State Research Center of the Russian Federation FSUE "I.P. Bardin Central Research Institute for Ferrous Metallurgy" (TsNIIchermet) once again gathered young specialists at |
02.03.2020 |
Current industry issues discussed at the Scientific and Technical Society of Shipbuilders. On 27 February, the "Shipbuilding Materials" section of the Academician A. N. Krylov Scientific and Technical Society of Shipbuilders met under its new section head, Doctor of Technical Sciences Andrey Valentinovich Anisimov, deputy head of the research and production complex of NRC "Kurchatov Institute" – CRISM "Prometey". |
CustomResourceDefinitions
- Prometheus, which defines a desired Prometheus deployment.
  The Operator ensures at all times that a deployment matching the resource definition is running.
- ServiceMonitor, which declaratively specifies how groups
  of services should be monitored. The Operator automatically generates Prometheus scrape configuration
  based on the definition.
- PodMonitor, which declaratively specifies how groups
  of pods should be monitored. The Operator automatically generates Prometheus scrape configuration
  based on the definition.
- PrometheusRule, which defines a desired Prometheus rule file, which can
  be loaded by a Prometheus instance containing Prometheus alerting and
  recording rules.
- Alertmanager, which defines a desired Alertmanager deployment.
  The Operator ensures at all times that a deployment matching the resource definition is running.
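As an illustration of the declarative style, a minimal ServiceMonitor might look like the sketch below. The resource name, the labels, and the port name web are hypothetical; only the apiVersion and kind follow from the CRD group named in this document.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # hypothetical name
  labels:
    team: frontend           # hypothetical label, e.g. for selection by a Prometheus resource
spec:
  selector:
    matchLabels:
      app: example-app       # monitor Services carrying this (assumed) label
  endpoints:
    - port: web              # named Service port that exposes metrics (assumed)
```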
To learn more about the CRDs introduced by the Prometheus Operator have a look
at the design doc.
To automate validation of your CRD configuration files, see the section on linting.
What is Prometheus?
Prometheus is an open-source systems
monitoring and alerting toolkit originally built at
SoundCloud. Since its inception in 2012, many
companies and organizations have adopted Prometheus, and the project has a very
active developer and user community. It is now a standalone open source project
and maintained independently of any company. To emphasize this, and to clarify
the project’s governance structure, Prometheus joined the
Cloud Native Computing Foundation in 2016
as the second hosted project, after Kubernetes.
For more elaborate overviews of Prometheus, see the resources linked from the
media section.
Features
Prometheus’s main features are:
- a multi-dimensional data model with time series data identified by metric name and key/value pairs
- PromQL, a flexible query language to leverage this dimensionality
- no reliance on distributed storage; single server nodes are autonomous
- time series collection happens via a pull model over HTTP
- pushing time series is supported via an intermediary gateway
- targets are discovered via service discovery or static configuration
- multiple modes of graphing and dashboarding support
Components
The Prometheus ecosystem consists of multiple components, many of which are
optional:
- the main Prometheus server which scrapes and stores time series data
- client libraries for instrumenting application code
- a push gateway for supporting short-lived jobs
- special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
- an alertmanager to handle alerts
- various support tools
Most Prometheus components are written in Go, making
them easy to build and deploy as static binaries.
Architecture
This diagram illustrates the architecture of Prometheus and some of
its ecosystem components:
Prometheus scrapes metrics from instrumented jobs, either directly or via an
intermediary push gateway for short-lived jobs. It stores all scraped samples
locally and runs rules over this data to either aggregate and record new time
series from existing data or generate alerts. Grafana or
other API consumers can be used to visualize the collected data.
histogram_quantile()
histogram_quantile(φ float, b instant-vector) calculates the φ-quantile (0 ≤ φ
≤ 1) from the buckets b of a
histogram. (See
histograms and summaries for
a detailed explanation of φ-quantiles and the usage of the histogram metric type
in general.) The samples in b are the counts of observations in each bucket.
Each sample must have a label le where the label value denotes the inclusive
upper bound of the bucket. (Samples without such a label are silently ignored.)
The histogram metric type
automatically provides time series with the _bucket suffix and the appropriate
labels.
Use the rate() function to specify the time window for the quantile
calculation.
Example: A histogram metric is called http_request_duration_seconds. To
calculate the 90th percentile of request durations over the last 10m, use the
following expression:
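A sketch of such an expression, assuming the histogram is exposed as http_request_duration_seconds (so that its buckets carry the _bucket suffix):

```promql
# 90th percentile of request durations over the last 10 minutes
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
```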
The quantile is calculated for each label combination in
http_request_duration_seconds. To aggregate, use the sum() aggregator
around the rate() function. Since the le label is required by
histogram_quantile(), it has to be included in the by clause. The following
expression aggregates the 90th percentile by job:
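For instance, assuming the same http_request_duration_seconds histogram and a job label on its series:

```promql
# le must survive the aggregation, otherwise the bucket boundaries are lost
histogram_quantile(0.9, sum by (job, le) (rate(http_request_duration_seconds_bucket[10m])))
```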
To aggregate everything, specify only the le label:
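Under the same assumption about the metric name, this could be written as:

```promql
# one quantile across all series of the histogram
histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[10m])))
```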
The histogram_quantile() function interpolates quantile values by
assuming a linear distribution within a bucket. The highest bucket
must have an upper bound of +Inf. (Otherwise, NaN is returned.) If
a quantile is located in the highest bucket, the upper bound of the
second highest bucket is returned. A lower limit of the lowest bucket
is assumed to be 0 if the upper bound of that bucket is greater than
0. In that case, the usual linear interpolation is applied within that
bucket. Otherwise, the upper bound of the lowest bucket is returned
for quantiles located in the lowest bucket.
If b contains fewer than two buckets, NaN is returned. For φ < 0, -Inf is
returned. For φ > 1, +Inf is returned.
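As a worked illustration of the linear interpolation, assume hypothetical cumulative bucket counts of 20 for le="0.1", 80 for le="0.5" and 100 for le="+Inf", and φ = 0.5. The target rank is 0.5 · 100 = 50, which falls into the bucket with bounds (0.1, 0.5], so the estimate is:

```latex
\text{quantile} = 0.1 + (0.5 - 0.1) \cdot \frac{50 - 20}{80 - 20} = 0.1 + 0.4 \cdot 0.5 = 0.3
```

The result depends only on the bucket boundaries and counts, which is why the choice of bucket layout limits the accuracy of the estimate.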
increase()
increase(v range-vector) calculates the increase in the
time series in the range vector. Breaks in monotonicity (such as counter
resets due to target restarts) are automatically adjusted for. The
increase is extrapolated to cover the full time range as specified
in the range vector selector, so that it is possible to get a
non-integer result even if a counter increases only by integer
increments.
The following example expression returns the number of HTTP requests as measured
over the last 5 minutes, per time series in the range vector:
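A sketch of such an expression, assuming a counter named http_requests_total and a job label value of api-server (both illustrative):

```promql
# number of requests added to each matching series over the last 5 minutes
increase(http_requests_total{job="api-server"}[5m])
```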
increase should only be used with counters. It is syntactic sugar
for rate(v) multiplied by the number of seconds under the specified
time range window, and should be used primarily for human readability.
Use rate in recording rules so that increases are tracked consistently
on a per-second basis.
Aggregation operators
Prometheus supports the following built-in aggregation operators that can be
used to aggregate the elements of a single instant vector, resulting in a new
vector of fewer elements with aggregated values:
- sum (calculate sum over dimensions)
- min (select minimum over dimensions)
- max (select maximum over dimensions)
- avg (calculate the average over dimensions)
- stddev (calculate population standard deviation over dimensions)
- stdvar (calculate population standard variance over dimensions)
- count (count number of elements in the vector)
- count_values (count number of elements with the same value)
- bottomk (smallest k elements by sample value)
- topk (largest k elements by sample value)
- quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
These operators can either be used to aggregate over all label dimensions
or preserve distinct dimensions by including a without or by clause. These
clauses may be used before or after the expression.
<aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)
or
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]
label list is a list of unquoted labels that may include a trailing comma, i.e.
both (label1, label2) and (label1, label2,) are valid syntax.
without removes the listed labels from the result vector, while
all other labels are preserved in the output. by does the opposite and drops
labels that are not listed in the by clause, even if their label values are
identical between all elements of the vector.
parameter is only required for count_values, quantile, topk and
bottomk.
count_values outputs one time series per unique sample value. Each series has
an additional label. The name of that label is given by the aggregation
parameter, and the label value is the unique sample value. The value of each
time series is the number of times that sample value was present.
topk and bottomk are different from other aggregators in that a subset of
the input samples, including the original labels, are returned in the result
vector. by and without are only used to bucket the input vector.
Example:
If the metric http_requests_total had time series that fan out by
application, instance, and group labels, we could calculate the total
number of seen HTTP requests per application and group over all instances via:
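One way to write this, assuming the metric is named http_requests_total and carries application, instance, and group labels:

```promql
# drop the instance label, keep every other dimension
sum without (instance) (http_requests_total)
```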
Which is equivalent to:
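With the same assumed labels, the by form names the dimensions to keep instead of the one to drop:

```promql
sum by (application, group) (http_requests_total)
```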
If we are just interested in the total of HTTP requests we have seen in all
applications, we could simply write:
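Without any by or without clause, all label dimensions are aggregated away (metric name assumed as before):

```promql
sum(http_requests_total)
```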
To count the number of binaries running each build version we could write:
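A sketch using count_values, assuming a hypothetical gauge build_version whose sample value encodes the version; the string "version" is the name of the label that will hold each distinct value:

```promql
# one output series per distinct build_version value, valued by how often it occurs
count_values("version", build_version)
```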
To get the 5 largest HTTP request counts across all instances we could write:
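For example, again assuming the http_requests_total metric:

```promql
# returns the five input series with the highest values, original labels included
topk(5, http_requests_total)
```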
Removal
To remove the operator and Prometheus, first delete any custom resources you created in each namespace. The
operator will automatically shut down and remove Prometheus and Alertmanager pods, and associated ConfigMaps.
for n in $(kubectl get namespaces -o jsonpath={..metadata.name}); do
  kubectl delete --all --namespace=$n prometheus,servicemonitor,podmonitor,alertmanager
done
After a couple of minutes you can go ahead and remove the operator itself.
kubectl delete -f bundle.yaml
The operator automatically creates services in each namespace where you created a Prometheus or Alertmanager resource,
and defines five custom resource definitions. You can clean these up now.
for n in $(kubectl get namespaces -o jsonpath={..metadata.name}); do
  kubectl delete --ignore-not-found --namespace=$n service prometheus-operated alertmanager-operated
done
kubectl delete --ignore-not-found customresourcedefinitions \
  prometheuses.monitoring.coreos.com \
  servicemonitors.monitoring.coreos.com \
  podmonitors.monitoring.coreos.com \
  alertmanagers.monitoring.coreos.com \
  prometheusrules.monitoring.coreos.com
When does it fit?
Prometheus works well for recording any purely numeric time series. It fits
both machine-centric monitoring as well as monitoring of highly dynamic
service-oriented architectures. In a world of microservices, its support for
multi-dimensional data collection and querying is a particular strength.
Prometheus is designed for reliability, to be the system you go to
during an outage to allow you to quickly diagnose problems. Each Prometheus
server is standalone, not depending on network storage or other remote services.
You can rely on it when other parts of your infrastructure are broken, and
you do not need to set up extensive infrastructure to use it.
irate()
irate(v range-vector) calculates the per-second instant rate of increase of
the time series in the range vector. This is based on the last two data points.
Breaks in monotonicity (such as counter resets due to target restarts) are
automatically adjusted for.
The following example expression returns the per-second rate of HTTP requests
looking up to 5 minutes back for the two most recent data points, per time
series in the range vector:
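A sketch of such an expression, with the metric name http_requests_total and the job value api-server as illustrative assumptions:

```promql
# per-second rate from the two most recent samples within the last 5 minutes
irate(http_requests_total{job="api-server"}[5m])
```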
irate should only be used when graphing volatile, fast-moving counters.
Use rate for alerts and slow-moving counters, as brief changes
in the rate can reset the FOR clause and graphs consisting entirely of rare
spikes are hard to read.
Note that when combining irate() with an
aggregation operator (e.g. sum())
or a function aggregating over time (any function ending in _over_time),
always take an irate() first, then aggregate. Otherwise irate() cannot detect
counter resets when your target restarts.
<aggregation>_over_time()
The following functions allow aggregating each series of a given range vector
over time and return an instant vector with per-series aggregation results:
- avg_over_time(range-vector): the average value of all points in the specified interval.
- min_over_time(range-vector): the minimum value of all points in the specified interval.
- max_over_time(range-vector): the maximum value of all points in the specified interval.
- sum_over_time(range-vector): the sum of all values in the specified interval.
- count_over_time(range-vector): the count of all values in the specified interval.
- quantile_over_time(scalar, range-vector): the φ-quantile (0 ≤ φ ≤ 1) of the values in the specified interval.
- stddev_over_time(range-vector): the population standard deviation of the values in the specified interval.
- stdvar_over_time(range-vector): the population standard variance of the values in the specified interval.
Note that all values in the specified interval have the same weight in the
aggregation even if the values are not equally spaced throughout the interval.
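Two brief examples of these functions, assuming the gauge process_resident_memory_bytes is being scraped (it is commonly exposed by client libraries, but its presence here is an assumption):

```promql
# average resident memory of each process over the last hour
avg_over_time(process_resident_memory_bytes[1h])

# 95th percentile of the same samples over the last hour
quantile_over_time(0.95, process_resident_memory_bytes[1h])
```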
This documentation is open-source. Please help improve it by filing issues or pull requests.
Binary operators
Prometheus’s query language supports basic logical and arithmetic operators.
For operations between two instant vectors, the matching behavior
can be modified.
Arithmetic binary operators
The following binary arithmetic operators exist in Prometheus:
- + (addition)
- - (subtraction)
- * (multiplication)
- / (division)
- % (modulo)
- ^ (power/exponentiation)
Binary arithmetic operators are defined between scalar/scalar, vector/scalar,
and vector/vector value pairs.
Between two scalars, the behavior is obvious: they evaluate to another
scalar that is the result of the operator applied to both scalar operands.
Between an instant vector and a scalar, the operator is applied to the
value of every data sample in the vector. E.g. if a time series instant vector
is multiplied by 2, the result is another vector in which every sample value of
the original vector is multiplied by 2.
Between two instant vectors, a binary arithmetic operator is applied to
each entry in the left-hand side vector and its matching element
in the right-hand vector. The result is propagated into the result vector with the
grouping labels becoming the output label set. The metric name is dropped. Entries
for which no matching entry in the right-hand vector can be found are not part of
the result.
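The vector/scalar and vector/vector cases can be sketched as follows. The metric names are assumptions: process_resident_memory_bytes is typically exposed by client libraries, and the node_memory_* series require node_exporter to be scraped.

```promql
# vector/scalar: every sample value is divided by 1024 * 1024 (bytes to MiB)
process_resident_memory_bytes / (1024 * 1024)

# vector/vector: entries are paired by identical label sets; the metric name is dropped
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
```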
Comparison binary operators
The following binary comparison operators exist in Prometheus:
- == (equal)
- != (not-equal)
- > (greater-than)
- < (less-than)
- >= (greater-or-equal)
- <= (less-or-equal)
Comparison operators are defined between scalar/scalar, vector/scalar,
and vector/vector value pairs. By default they filter. Their behavior can be
modified by providing bool after the operator, which will return 0 or 1
for the value rather than filtering.
Between two scalars, the bool modifier must be provided and these
operators result in another scalar that is either 0 (false) or
1 (true), depending on the comparison result.
Between an instant vector and a scalar, these operators are applied to the
value of every data sample in the vector, and vector elements between which the
comparison result is false get dropped from the result vector. If the bool
modifier is provided, vector elements that would be dropped instead have the value
0 and vector elements that would be kept have the value 1.
Between two instant vectors, these operators behave as a filter by default,
applied to matching entries. Vector elements for which the expression is not
true or which do not find a match on the other side of the expression get
dropped from the result, while the others are propagated into a result vector
with the grouping labels becoming the output label set.
If the bool modifier is provided, vector elements that would have been
dropped instead have the value 0 and vector elements that would be kept have
the value 1, with the grouping labels again becoming the output label set.
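The difference between filtering and the bool modifier can be sketched like this; the metric name http_requests_total and the threshold 100 are illustrative assumptions:

```promql
# filter: only series whose current value exceeds 100 remain in the result
http_requests_total > 100

# bool: every series is kept, with value 1 where the comparison holds and 0 elsewhere
http_requests_total > bool 100

# scalar/scalar comparisons require bool; this evaluates to the scalar 0
1 > bool 2
```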
Logical/set binary operators
These logical/set binary operators are only defined between instant vectors:
- and (intersection)
- or (union)
- unless (complement)
vector1 and vector2 results in a vector consisting of the elements of
vector1 for which there are elements in vector2 with exactly matching
label sets. Other elements are dropped. The metric name and values are carried
over from the left-hand side vector.
vector1 or vector2 results in a vector that contains all original elements
(label sets + values) of vector1 and additionally all elements of vector2
which do not have matching label sets in vector1.
vector1 unless vector2 results in a vector consisting of the elements of
vector1 for which there are no elements in vector2 with exactly matching
label sets. All matching elements in both vectors are dropped.
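The three set operators side by side, using two hypothetical metrics metric_a and metric_b whose series share the same label names:

```promql
# intersection: elements of metric_a that have a counterpart in metric_b (values from metric_a)
metric_a and metric_b

# union: all elements of metric_a, plus elements of metric_b without a counterpart in metric_a
metric_a or metric_b

# complement: elements of metric_a that have no counterpart in metric_b
metric_a unless metric_b
```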
Use cases
There are different use cases for federation. Commonly, it is used to either
achieve scalable Prometheus monitoring setups or to pull related metrics from
one service’s Prometheus into another.
Hierarchical federation
Hierarchical federation allows Prometheus to scale to environments with tens of
data centers and millions of nodes. In this use case, the federation topology
resembles a tree, with higher-level Prometheus servers collecting aggregated
time series data from a larger number of subordinated servers.
For example, a setup might consist of many per-datacenter Prometheus servers
that collect data in high detail (instance-level drill-down), and a set of
global Prometheus servers which collect and store only aggregated data
(job-level drill-down) from those local servers. This provides an aggregate
global view and detailed local views.
Cross-service federation
In cross-service federation, a Prometheus server of one service is configured
to scrape selected data from another service’s Prometheus server to enable
alerting and queries against both datasets within a single server.
For example, a cluster scheduler running multiple services might expose
resource usage information (like memory and CPU usage) about service instances
running on the cluster. On the other hand, a service running on that cluster
will only expose application-specific service metrics. Often, these two sets of
metrics are scraped by separate Prometheus servers. Using federation, the
Prometheus server containing service-level metrics may pull in the cluster
resource usage metrics about its specific service from the cluster Prometheus,
so that both sets of metrics can be used within that server.
Configuring federation
On any given Prometheus server, the /federate endpoint allows retrieving the
current value for a selected set of time series in that server. At least one
match[] URL parameter must be specified to select the series to expose. Each
match[] argument needs to specify an
instant vector selector like
up or {job="api-server"}. If multiple match[] parameters are provided,
the union of all matched series is selected.
To federate metrics from one server to another, configure your destination
Prometheus server to scrape from the /federate endpoint of a source server,
while also enabling the honor_labels scrape option (to not overwrite any
labels exposed by the source server) and passing in the desired
match[] parameters. For example, the following scrape_configs federates any series
with the label job="prometheus" or a metric name starting with job: from
the Prometheus servers at source-prometheus-{1,2,3}:9090 into the scraping
Prometheus:
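A sketch of such a scrape configuration. The job name federate, the 15s interval, the source host names, and the two selectors are illustrative assumptions; metrics_path, honor_labels, and the match[] parameter are the settings the text describes.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true          # keep labels exposed by the source server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'      # any series with this job label
        - '{__name__=~"job:.*"}'    # any metric whose name starts with job:
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'
          - 'source-prometheus-3:9090'
```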
rate()
rate(v range-vector) calculates the per-second average rate of increase of the
time series in the range vector. Breaks in monotonicity (such as counter
resets due to target restarts) are automatically adjusted for. Also, the
calculation extrapolates to the ends of the time range, allowing for missed
scrapes or imperfect alignment of scrape cycles with the range’s time period.
The following example expression returns the per-second rate of HTTP requests as measured
over the last 5 minutes, per time series in the range vector:
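A sketch of such an expression, assuming a counter named http_requests_total and a job label value of api-server:

```promql
# per-second average rate of requests over the last 5 minutes, per series
rate(http_requests_total{job="api-server"}[5m])
```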
rate should only be used with counters. It is best suited for alerting,
and for graphing of slow-moving counters.
Note that when combining rate() with an aggregation operator (e.g. sum())
or a function aggregating over time (any function ending in _over_time),
always take a rate() first, then aggregate. Otherwise rate() cannot detect
counter resets when your target restarts.
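The recommended ordering can be sketched as follows, with http_requests_total and the job label as assumed names. Because rate() is applied to each series before the sum, a counter reset on one target is corrected for that series alone.

```promql
# rate first, then aggregate across instances
sum by (job) (rate(http_requests_total[5m]))
```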
What to alert on
Aim to have as few alerts as possible, by alerting on symptoms that are
associated with end-user pain rather than trying to catch every possible way
that pain could be caused. Alerts should link to relevant consoles
and make it easy to figure out which component is at fault.
Allow for slack in alerting to accommodate small blips.
Online serving systems
Typically alert on high latency and error rates as high up in the stack as possible.
Only page on latency at one point in a stack. If a lower-level component is
slower than it should be, but the overall user latency is fine, then there is
no need to page.
For error rates, page on user-visible errors. If there are errors further down
the stack that will cause such a failure, there is no need to page on them
separately. However, if some failures are not user-visible, but are otherwise
severe enough to require human involvement (for example, you are losing a lot of
money), add pages to be sent on those.
You may need alerts for different types of request if they have different
characteristics, or problems in a low-traffic type of request would be drowned
out by high-traffic requests.
Offline processing
For offline processing systems, the key metric is how long data takes to get
through the system, so page if that gets high enough to cause user impact.
Batch jobs
For batch jobs it makes sense to page if the batch job has not succeeded
recently enough, and this will cause user-visible problems.
This should generally be at least enough time for 2 full runs of the batch job.
For a job that runs every 4 hours and takes an hour, 10 hours would be a
reasonable threshold. If you cannot withstand a single run failing, run the
job more frequently, as a single failure should not require human intervention.
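As a rough sketch, the 10-hour threshold from the example could be expressed as an alerting rule like the one below. The metric batch_job_last_success_timestamp_seconds is hypothetical (a Unix timestamp the job would have to export on success, for instance via a Pushgateway), as are the group name, alert name, and severity label.

```yaml
groups:
  - name: batch-jobs
    rules:
      - alert: BatchJobNoRecentSuccess
        # fires when the last recorded success is more than 10 hours old
        expr: time() - batch_job_last_success_timestamp_seconds > 10 * 60 * 60
        labels:
          severity: page
        annotations:
          summary: "Batch job has not succeeded in the last 10 hours"
```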
Capacity
While not a problem causing immediate user impact, being close to capacity
often requires human intervention to avoid an outage in the near future.
Metamonitoring
It is important to have confidence that monitoring is working. Accordingly, have
alerts to ensure that Prometheus servers, Alertmanagers, PushGateways, and
other monitoring infrastructure are available and running correctly.
Supplementing the whitebox monitoring of Prometheus with external blackbox
monitoring can catch problems that are otherwise invisible, and also serves as
a fallback in case internal systems completely fail.
Using the expression browser
Let us try looking at some data that Prometheus has collected about itself. To
use Prometheus’s built-in expression browser, navigate to
http://localhost:9090/graph and choose the "Console" view within the "Graph"
tab.
As you can gather from http://localhost:9090/metrics, one metric that
Prometheus exports about itself is called
promhttp_metric_handler_requests_total (the total number of /metrics requests the Prometheus server has served). Go ahead and enter this into the expression console:
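The expression is just the bare metric name. The name below is an assumption based on what recent Prometheus versions expose for their own /metrics handler; check the /metrics page of your server if it returns nothing.

```promql
promhttp_metric_handler_requests_total
```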
This should return a number of different time series (along with the latest value recorded for each), all with the metric name promhttp_metric_handler_requests_total, but with different labels. These labels designate different request statuses.
If we were only interested in requests that resulted in HTTP code 200, we could use this query to retrieve that information:
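A sketch of such a query, assuming the metric name promhttp_metric_handler_requests_total and a code label that holds the HTTP status:

```promql
# label matcher restricting the selection to one status code
promhttp_metric_handler_requests_total{code="200"}
```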
To count the number of returned time series, you could write:
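Using the same assumed metric name, the count aggregator returns a single value, the number of matching series:

```promql
count(promhttp_metric_handler_requests_total)
```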
For more about the expression language, see the
expression language documentation.
Vector matching
Operations between vectors attempt to find a matching element in the right-hand side
vector for each entry in the left-hand side. There are two basic types of
matching behavior: One-to-one and many-to-one/one-to-many.
One-to-one vector matches
One-to-one finds a unique pair of entries from each side of the operation.
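In the default case, that is an operation following the format vector1 <operator> vector2.
Two entries match if they have the exact same set of labels and corresponding values.
The ignoring keyword allows ignoring certain labels when matching, while the
on keyword allows reducing the set of considered labels to a provided list:
<vector expr> <bin-op> ignoring(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) <vector expr>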
Example input:
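A hypothetical pair of input vectors that fits the discussion in this section. The recording-rule style names and all sample values are illustrative assumptions.

```promql
# error rates per method and status code
method_code:http_errors:rate5m{method="get", code="500"}   24
method_code:http_errors:rate5m{method="get", code="404"}   30
method_code:http_errors:rate5m{method="put", code="501"}   3
method_code:http_errors:rate5m{method="post", code="500"}  6
method_code:http_errors:rate5m{method="post", code="404"}  21

# request rates per method
method:http_requests:rate5m{method="get"}   600
method:http_requests:rate5m{method="del"}   34
method:http_requests:rate5m{method="post"}  120
```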
Example query:
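A one-to-one query in this style, assuming error rates are recorded as method_code:http_errors:rate5m (labels method and code) and request rates as method:http_requests:rate5m (label method only):

```promql
# ignoring(code) lets the left side match the right side, which has no code label
method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m
```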
This returns a result vector containing the fraction of HTTP requests with status code
of 500 for each method, as measured over the last 5 minutes. Without ignoring(code) there
would have been no match as the metrics do not share the same set of labels.
The entries with methods put and del have no match and will not show up in the result:
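Assuming hypothetical 500-error rates of 24 (get) and 6 (post) and request rates of 600 (get) and 120 (post), the result would contain two elements:

```promql
{method="get"}   0.04   # 24 / 600
{method="post"}  0.05   #  6 / 120
```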
Many-to-one and one-to-many vector matches
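Many-to-one and one-to-many matchings refer to the case where each vector element on
the "one"-side can match with multiple elements on the "many"-side. This has to
be explicitly requested using the group_left or group_right modifier, where
left/right determines which vector has the higher cardinality.
<vector expr> <bin-op> ignoring(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> ignoring(<label list>) group_right(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_right(<label list>) <vector expr>
The label list provided with the group modifier contains additional labels from
the "one"-side to be included in the result metrics. For on a label can only
appear in one of the lists. Every time series of the result vector must be
uniquely identifiable.
Grouping modifiers can only be used for
comparison and
arithmetic. The and, unless and or
operations match with all possible entries in the right vector by
default.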
Example query:
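A many-to-one query in this style, again assuming the hypothetical series method_code:http_errors:rate5m (labels method and code) and method:http_requests:rate5m (label method only):

```promql
# group_left: several left-hand entries may match the same right-hand entry
method_code:http_errors:rate5m / ignoring(code) group_left method:http_requests:rate5m
```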
In this case the left vector contains more than one entry per method label
value. Thus, we indicate this using group_left. The elements from the right
side are now matched with multiple elements with the same method label on the
left:
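Assuming hypothetical error rates of 24 (get, 500), 30 (get, 404), 6 (post, 500) and 21 (post, 404) against request rates of 600 (get) and 120 (post), the result would look like this:

```promql
{method="get", code="500"}   0.04    # 24 / 600
{method="get", code="404"}   0.05    # 30 / 600
{method="post", code="500"}  0.05    #  6 / 120
{method="post", code="404"}  0.175   # 21 / 120
```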
Many-to-one and one-to-many matching are advanced use cases that should be carefully considered.
Often a proper use of ignoring(<labels>) provides the desired outcome.
Areas of accreditation
Certification Center "Prometey" today
Certification Center "Prometey" has certified more than 12,000 specialists
in six types (methods) of non-destructive testing: ultrasonic (УК),
radiographic (РК), magnetic particle (МК), visual and measurement
(ВИК), and penetrant testing (ПВ), which comprises liquid penetrant testing (ПВК) and
leak testing (ПВТ).
In accordance with ISO/IEC 17024:2012,
СДА-13-2009, СДА-24-2009, ПБ 03-440-02, ISO 9712, СДСПНК-06-2013 and
ISO 9712:2012, Certification Center "Prometey" holds all the certificates and
attestations of conformity required to certify
non-destructive testing (NDT) personnel in the following systems:
- in the Unified System of Conformity Assessment in the field of
  industrial and environmental safety, safety in the energy sector and
  construction (ЕС ОС), Accreditation Certificate
  No. НОАП-0024;
- in the Voluntary Certification System for Personnel in Non-Destructive Testing and
  Technical Diagnostics of the Russian Society for Non-Destructive Testing and
  Technical Diagnostics (СДСПНК РОНКТД), Certificate of Conformity No. 36,
  which also covers certification of NDT specialists in the scope of the Russian
  Maritime Register of Shipping;
- in nuclear power (FSUE CRISM
  "Prometey" is the Lead Materials Science Organization in accordance with
  Rosatom State Corporation Order No. 1/505-П of 09.06.2012. The functions of the Lead
  Materials Science Organization are defined by the inspection rules ПНАЭ
  Г-7-010-89, which are in force in nuclear power).
Certification of non-destructive testing specialists
is carried out for the following inspection objects:
- Inspection objects in the field of industrial and environmental
  safety, safety in the energy sector and construction (in accordance with the ЕС ОС system):
  - boiler inspection objects,
  - gas supply (gas distribution) systems,
  - lifting structures,
  - oil and gas industry equipment,
  - metallurgical industry equipment,
  - equipment for explosion- and fire-hazardous and chemically hazardous production facilities,
  - buildings and structures (construction objects, metal
    structures, including steel bridge structures).
- Inspection objects in industrial and
  production sectors (in accordance with the СДСПНК РОНКТД system and the Russian
  Maritime Register of Shipping):
  - maritime register objects (including infrastructure objects),
  - river register objects (including infrastructure objects),
  - buildings and structures (construction objects, bridge structures),
  - energy sector objects (including turbine construction),
  - general industrial objects (including mechanical engineering and
    metal production, pipeline transport),
  - paint and varnish coatings (ЛКП).
- Inspection objects in nuclear power:
  - welded joints of categories I, II and III and weld overlays,
  - base materials (semi-finished products).