Native histograms were introduced as an experimental feature in November 2022.
They are a concept that touches almost every part of the Prometheus stack. The
first version of the Prometheus server supporting native histograms was
v2.40.0. The support had to be enabled via the feature flag
--enable-feature=native-histograms. (TODO: This is still the case with the
current releases v2.55 and v3.00. Update this section with the stable release,
once it has happened.)
Due to the pervasive nature of the changes related to native histograms, the documentation of those changes and the explanation of the underlying concepts are widely distributed over various channels (like the documentation of affected Prometheus components, doc comments in source code, sometimes the source code itself, design docs, conference talks, …). This document intends to gather all these pieces of information and present them concisely in a unified context. This document prefers to link existing detailed documentation rather than restating it, but it contains enough information to be comprehensible without referring to other sources. With all that said, it should be noted that this document is neither suitable as an introduction for beginners nor does it focus on the needs of developers. For the former, the plan is to provide an updated version of the Best Practices article on histograms and summaries. (TODO: And a blog post or maybe even a series of them.) For the latter, there is Carrie Edwards's Developer’s Guide to Prometheus Native Histograms.
While formal specifications are supposed to happen in their respective context (e.g. OpenMetrics changes will be specified in the general OpenMetrics specification), some parts of this document take the shape of a specification. In those parts, the key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” are used as described in RFC 2119.
This document still contains a lot of TODOs. In most cases, they are not just referring to incompleteness of this doc but more importantly to incomplete implementation or open questions. For now, this is essentially a living document that will receive updates as implementations and specifications catch up.
The core idea of native histograms is to treat histograms as first class citizens in the Prometheus data model. Elevating histograms to a “native” sample type is the fundamental prerequisite for the key properties listed below, which explains the choice of the name native histograms.
Prior to the introduction of native histograms, all Prometheus sample values were 64-bit floating point values (short: float64 or just float). These floats can directly represent gauges or counters. The Prometheus metric types summary and (the classic version of) histogram, as they exist in exposition formats, are broken down into float components upon ingestion: a sum and a count component for both types, a number of quantile samples for a summary, and a number of bucket samples for a (classic) histogram.
With native histograms, a new structured sample type is introduced. A single sample represents the previously known sum and count plus a dynamic set of buckets. This is not limited to ingestion, but PromQL expressions may also return the new sample type where previously it was only possible to return float samples.
Native histograms have the following key properties:
These key properties are fully realized with standard bucketing schemas. There are other schemas with different trade-offs that might only feature a subset of these properties. See the Schema section below for details.
Compared to the previously existing “classic” histograms, native histograms (with standard bucketing schemas) allow a higher bucket resolution across arbitrary ranges of observed values at a lower storage and query cost with very little to no configuration required. Even partitioning histograms by labels is now much more affordable.
Because the sparse representation (property 1 in the list above) is so crucial for many of the other benefits of native histograms, sparse histograms was a common name for native histograms early during the design process. However, other key properties like the exponential bucketing schema or the dynamic nature of the buckets are also very important, but they are not captured at all by the term sparse histograms.
These are the design docs that guided the development of native histograms. Some details are obsolete now, but they describe rather well the underlying concepts and how they evolved.
A more approachable way of learning about native histograms is to watch conference talks, of which a selection is presented below. As an introduction, it might make sense to watch these talks and then return to this document to learn about all the details and technicalities.
This section describes the data model of native histograms in general. It avoids implementation specifics as far as possible. This includes terminology. For example, a list described in this section will become a repeated message in a protobuf implementation and (most likely) a slice in a Go implementation.
Similar to a classic histogram, a native histogram has a field for the count of observations and a field for the sum of observations. In addition, it contains the following components, which are described in detail in dedicated sections below:
Any native histogram has a specific flavor along each of two independent dimensions:
Float histograms are occasionally used in direct instrumentation for “weighted”
observations, for example to count the number of seconds an observed value was
falling into different buckets of a histogram. The far more common use case for
float histograms is within PromQL, though. PromQL generally only acts on float
values, so the PromQL engine converts every histogram retrieved from the TSDB
to a float histogram first, and any histogram stored back into TSDB via
recording rules is a float histogram. If such a histogram is effectively an
integer histogram (because the value of all non-sum fields can be represented
precisely as uint64), a TSDB implementation MAY convert them back to integer
histograms to increase storage efficiency. (As of Prometheus v3.00, the TSDB
implementation within Prometheus is not utilizing this option.) Note, however,
that the most common PromQL function applied to a counter histogram is rate,
which generally produces non-integer numbers, so that results of recording
rules will commonly be float histograms with non-integer values anyway.
Treating native histograms explicitly as integer histograms vs. float histograms is a notable deviation from the treatment of conventional simple numeric samples, which are always treated as floats throughout the whole stack for the sake of simplicity.
The main reason for the more involved treatment of histograms is the easy efficiency gains in protobuf-based exposition formats. Protobuf uses varint encoding for integers, which reduces the data size for small integer values without requiring an additional compression layer. This benefit is amplified by the delta encoding of integer buckets, which generally results in smaller integer values. Floats, in contrast, always require 8 bytes in protobuf. In practice, many integers in an integer histogram will fit in 1 byte, and most will fit in 2 bytes, so that the explicit presence of integer histograms in a protobuf-based exposition format results directly in a data size reduction approaching 8x for histograms with many buckets. This is particularly relevant as the overwhelming majority of histograms exposed by instrumented targets are integer histograms.
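The size argument above can be checked with a few lines of Python. This is only a sketch of the varint and ZigZag arithmetic, not the protobuf library itself; the delta values are made up for illustration, and field tags and length prefixes are ignored.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (protobuf sint64 style)."""
    return (n << 1) ^ (n >> 63)


def varint_size(u: int) -> int:
    """Number of bytes needed to varint-encode the unsigned integer u."""
    size = 1
    while u >= 0x80:
        u >>= 7
        size += 1
    return size


def delta_payload_size(deltas: list[int]) -> int:
    """Payload bytes for a list of sint64 deltas (tags and length prefix ignored)."""
    return sum(varint_size(zigzag(d)) for d in deltas)


# Hypothetical bucket deltas of an integer histogram.
deltas = [3, 2, -4, 2, -1, 70, -200]
as_integers = delta_payload_size(deltas)  # five 1-byte and two 2-byte values
as_floats = 8 * len(deltas)               # every double takes 8 bytes

print(as_integers, as_floats)
```

For these seven buckets, the integer payload is 9 bytes compared to 56 bytes for doubles. The ratio approaches 8x only if nearly all deltas fit into a single byte.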
For similar reasons, the representation of integer histograms in RAM and on disk is generally more efficient than that of float histograms. This is less relevant than the benefits in the exposition format, though. For one, Prometheus uses Gorilla-style XOR encoding for floats, which reduces their size, albeit not as much as the double-delta encoding used for integers. More importantly, an implementation could always decide to internally use an integer representation for histogram fields that are effectively integer values (see above). (Historical note: Prometheus v1 used exactly this approach to improve the compression of float samples, and Prometheus v3 might very well adopt this approach again in the future.)
In a counter histogram, the total count of observations and the counts in the buckets individually behave like Prometheus counters, i.e. they only go down upon a counter reset. However, the sum of observations may decrease as a consequence of the observation of negative values. PromQL implementations MUST detect counter resets based on the whole histogram (see the counter reset considerations section below for details). (Note that this has always been a problem for the sum component of classic histograms and summaries, too. The approach so far was to accept that counter reset detection silently breaks for sum in those cases. Fortunately, negative observations are a very rare use case for Prometheus histograms and summaries.)
The schema is a signed integer value with a size of 8 bits (short: int8). It defines the way bucket boundaries are calculated. The currently valid values are -53 and the range between and including -4 and +8. More schemas may be added in the future. -53 is a schema for so-called custom bucket boundaries or short custom buckets, while the other schema numbers represent the different standard exponential schemas (short: standard schemas).
The standard schemas are mergeable with each other and are RECOMMENDED for general use cases. Larger schema numbers correspond to higher resolutions. Schema n has half the resolution of schema n+1, which implies that a histogram with schema n+1 can be converted into a histogram with schema n by merging neighboring buckets.
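The merging of neighboring buckets can be sketched as follows. The function name and the dict-based bucket representation are illustrative assumptions, not part of any Prometheus library. The sketch relies on the bucket boundaries described in the next paragraphs, under which the buckets with index 2i-1 and 2i of schema n+1 together cover exactly the bucket with index i of schema n.

```python
from collections import defaultdict


def reduce_schema(buckets: dict[int, int], steps: int = 1) -> dict[int, int]:
    """Merge buckets of schema n into buckets of schema n - steps.

    `buckets` maps a bucket index to its population. The same index mapping
    applies to the positive and to the negative bucket list.
    """
    merged = defaultdict(int)
    for index, population in buckets.items():
        # ceil(index / 2**steps) in integer math, also correct for index <= 0.
        new_index = -((-index) >> steps)
        merged[new_index] += population
    return dict(merged)


high_res = {1: 1, 2: 2, 3: 4, 4: 8}
print(reduce_schema(high_res))           # one schema step down
print(reduce_schema(high_res, steps=2))  # two schema steps down
```

Going in the opposite direction is not possible without guessing, because the populations of the merged buckets cannot be split up again.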
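For any standard schema n, the boundaries of a bucket with index i are calculated as follows (using Python syntax):
- The upper inclusive limit of a positive bucket: (2**2**-n)**i
- The lower exclusive limit of a positive bucket: (2**2**-n)**(i-1)
- The lower inclusive limit of a negative bucket: -((2**2**-n)**i)
- The upper exclusive limit of a negative bucket: -((2**2**-n)**(i-1))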
i is an integer number that may be negative.
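As a minimal sketch, the formulas can be wrapped into two helper functions for positive buckets. The function names are made up for this example. The sketch assumes that (2**2**-n)**i is the upper inclusive limit and (2**2**-n)**(i-1) the lower exclusive limit of a positive bucket, and it ignores the exceptions for very large values and for values close to zero.

```python
import math


def positive_bucket_bounds(schema: int, index: int) -> tuple[float, float]:
    """Return (lower exclusive limit, upper inclusive limit) of a positive bucket."""
    base = 2 ** 2 ** -schema
    return float(base ** (index - 1)), float(base ** index)


def positive_bucket_index(schema: int, value: float) -> int:
    """Return the index of the positive bucket an observation > 0 falls into."""
    return math.ceil(math.log2(value) * 2 ** schema)


index = positive_bucket_index(3, 1.05)
lower, upper = positive_bucket_bounds(3, index)
print(index, lower, upper)
```

Because of floating point rounding, an observation extremely close to a bucket boundary might end up in the neighboring bucket with this naive calculation. A real implementation needs more care at the boundaries.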
There are exceptions to the rules above concerning the largest and smallest finite values representable as a float64 (called MaxFloat64 and MinFloat64 in the following) and the positive and negative infinity values (+Inf and -Inf):
- The positive bucket that contains MaxFloat64 (according to the boundary formulas above) has an upper inclusive limit of MaxFloat64 (rather than the limit calculated by the formulas above, which would overflow float64).
- The next positive bucket has a lower exclusive limit of MaxFloat64 and an upper inclusive limit of +Inf. (It could be called a positive overflow bucket.)
- The negative bucket that contains MinFloat64 (according to the boundary formulas above) has a lower inclusive limit of MinFloat64 (rather than the limit calculated by the formulas above, which would underflow float64).
- The next negative bucket has an upper exclusive limit of MinFloat64 and a lower inclusive limit of -Inf. (It could be called a negative overflow bucket.)
- Buckets beyond the +Inf and -Inf buckets described above MUST NOT be used.
There are more exceptions for values close to zero, see the zero bucket section below.
The current limits of -4 for the lowest resolution and 8 for the highest resolution have been chosen based on practical usefulness. Should a practical need arise for even lower or higher resolution, an extension of the range will be considered. However, a schema greater than 52 does not make sense as the growth factor from one bucket to the next would then be smaller than the difference between representable float64 numbers. Likewise, a schema smaller than -9 does not make sense either, as the growth factor would then exceed the largest float representable as float64. Therefore, the schema numbers between (and including) -9 and +52 are reserved for future standard schemas (following the formulas for bucket boundaries above) and MUST NOT be used for any other schemas.
For schema -53, the bucket boundaries are set explicitly via custom values, described in detail in the custom values section below. This results in a native histogram with custom bucket boundaries (or short custom buckets, often further abbreviated to NHCB). Such a histogram can be used to represent a classic histogram as a native histogram. It can also be used if the exponential bucketing featured by the standard schemas is a bad match for the distribution to be represented by the histogram. Histograms with different custom bucket boundaries are generally not mergeable with each other. Therefore, schema -53 SHOULD only be used as an informed decision in specific use cases. (TODO: NHCB aren't fully merged into main as of now (2024-11-03). They are worked into this document as far as possible already. This information might not yet be relevant for released Prometheus versions.)
For standard schemas, buckets are represented as two lists, one for positive buckets and one for negative buckets. For custom buckets (schema -53), only the positive bucket list is used, but repurposed for all buckets.
Any unpopulated buckets MAY be excluded from the lists. (Which is the reason why the buckets are often called sparse buckets.)
For float histograms, the elements of the lists are float64 and represent the bucket population directly.
For integer histograms, the elements of the lists are signed 64-bit integers (short: int64), and each element represents the bucket population as a delta to the previous bucket in the list. The first bucket in each list contains an absolute population (which can also be seen as a delta relative to zero).
To map buckets in the lists to the indices as defined in the previous section, there are two lists of so-called spans, one for the positive buckets and one for the negative buckets.
Each span consists of a pair of numbers, a signed 32-bit integer (short: int32) called offset and an unsigned 32-bit integer (short: uint32) called length. Only the first span in each list can have a negative offset. It defines the index of the first bucket in its corresponding bucket list. (Note that for NHCBs, the index is always positive, see the custom values section below for details.) The length defines the number of consecutive buckets the bucket list starts with. The offsets of the following spans define the number of excluded (and thus unpopulated) buckets. The lengths define the number of consecutive buckets in the list following the excluded buckets.
The sum of all length values in each span list MUST be equal to the length of the corresponding bucket list.
Empty spans (with a length of zero) are valid and MAY be used, although they are generally not useful and they SHOULD be eliminated by adding their offset to the offset of the following span. Similarly, spans that are not the first span in a list MAY have an offset of zero, although those offsets SHOULD be eliminated by adding their length to the previous span. Both cases are allowed so that producers of native histograms MAY pick whatever representation has the best resource trade-offs at that moment. For example, if a histogram is processed through various stages, it might be most efficient to only eliminate redundant spans after the last processing stage.
In a similar spirit, there are situations where excluding every unpopulated bucket from the bucket list is most efficient, but in other situations, it might be better to reduce the number of spans by representing small numbers of unpopulated buckets explicitly.
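The elimination of redundant spans described above can be sketched like this. The function name and the representation of spans as (offset, length) tuples are assumptions for this example. The bucket list itself is not touched because the set of covered bucket indices stays the same.

```python
def compact_spans(spans):
    """Drop empty spans and merge spans that are separated by a zero offset.

    `spans` is a list of (offset, length) pairs as described above.
    """
    compacted = []
    carried_offset = 0
    for offset, length in spans:
        if length == 0:
            # Empty span: hand its offset over to the following span.
            carried_offset += offset
            continue
        offset += carried_offset
        carried_offset = 0
        if compacted and offset == 0:
            # No gap to the previous span: extend the previous span instead.
            prev_offset, prev_length = compacted[-1]
            compacted[-1] = (prev_offset, prev_length + length)
        else:
            compacted.append((offset, length))
    return compacted


print(compact_spans([(-2, 2), (1, 0), (1, 1), (0, 3)]))
```

Note that this sketch also removes a single empty span from an otherwise empty list. Exposition formats that assign a meaning to such a span have to treat that case separately.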
Note that future high resolution schemas might require offsets that are too large to be represented with an int32. An extension of the data model will be required in that case. (The current standard schema with the highest resolution is schema 8, for which the bucket that contains MaxFloat64 has index 262144, and thus the +Inf overflow bucket has index 262145, while the largest number representable with int32 is 2147483647. The highest standard schema that would still work with int32 offsets would be schema 20, corresponding to a growth factor from bucket to bucket of only ~1.000000661.)
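The numbers in the note above can be reproduced with integer arithmetic. The sketch uses the observation that MaxFloat64 is just below 2**1024 and that schema n divides each power of two into 2**n buckets. The function names are made up for this example.

```python
INT32_MAX = 2**31 - 1


def max_float_bucket_index(schema: int) -> int:
    """Index of the bucket that contains MaxFloat64 for a schema >= 0."""
    return 1024 * 2**schema


def highest_schema_for_int32() -> int:
    """Highest schema whose +Inf overflow bucket index still fits into int32."""
    schema = 0
    # The +Inf overflow bucket sits one index above the MaxFloat64 bucket.
    while max_float_bucket_index(schema + 1) + 1 <= INT32_MAX:
        schema += 1
    return schema


print(max_float_bucket_index(8))   # bucket containing MaxFloat64 for schema 8
print(highest_schema_for_int32())  # last schema that works with int32 offsets
print(2 ** 2 ** -20)               # growth factor of that schema
```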
An integer histogram has the following positive buckets (index→population):
-2→3, -1→5, 0→0, 1→0, 2→1, 3→0, 4→3, 5→2
They could be represented in this way:
- Positive bucket list: [3, 2, -4, 2, -1]
- Positive span list: [[-2, 2], [2, 1], [1, 2]]
The second and third span could be merged into one if the single unpopulated bucket with index 3 is represented explicitly, leading to the following result:
- Positive bucket list: [3, 2, -4, -1, 3, -1]
- Positive span list: [[-2, 2], [2, 4]]
Alternatively, all spans could be merged into one by representing all unpopulated buckets explicitly:
- Positive bucket list: [3, 2, -5, 0, 1, -1, 3, -1]
- Positive span list: [[-2, 8]]
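To verify that the three representations describe the same histogram, the spans and deltas can be expanded back into an index→population mapping. The following decoder is a minimal sketch with a made-up function name. It is not taken from any Prometheus code base.

```python
def decode_integer_buckets(spans, deltas):
    """Expand spans and delta-encoded populations into {index: population}."""
    if sum(length for _, length in spans) != len(deltas):
        raise ValueError("span lengths must add up to the number of buckets")
    buckets = {}
    index = 0
    population = 0
    remaining = iter(deltas)
    for offset, length in spans:
        # Start index for the first span, gap size for all following spans.
        index += offset
        for _ in range(length):
            population += next(remaining)
            buckets[index] = population
            index += 1
    return buckets


three_spans = decode_integer_buckets([(-2, 2), (2, 1), (1, 2)], [3, 2, -4, 2, -1])
two_spans = decode_integer_buckets([(-2, 2), (2, 4)], [3, 2, -4, -1, 3, -1])
one_span = decode_integer_buckets([(-2, 8)], [3, 2, -5, 0, 1, -1, 3, -1])
print(three_spans)
```

The three results only differ in the explicitly represented buckets with a population of zero.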
Observations of exactly zero do not fit into any bucket as defined by the standard schemas above. They are counted in a dedicated bucket called the zero bucket.
The number of observations in the zero bucket is tracked by a single uint64 (for integer histograms) or float64 (for float histograms).
The zero bucket has an additional parameter called the zero threshold, which is a float64 ≥ 0. If the threshold is set to zero, only observations of exactly zero go into the zero bucket, which is the case described above. If the threshold has a positive value, all observations within the closed interval [-threshold, +threshold] go to the zero bucket rather than a regular bucket. This has two use cases:
The threshold of the zero bucket SHOULD coincide with a boundary of a regular bucket, which avoids the complication of the zero bucket overlapping with parts of a regular bucket. However, if such an overlap is happening, the observations that are counted in the regular bucket overlapping with the zero bucket MUST be outside of the [-threshold, +threshold] interval.
To merge histograms with the same zero threshold, the two zero buckets are simply added. If the zero thresholds in the source histograms are different, however, the largest threshold in any of the source histograms is chosen. If that threshold happens to be within any populated bucket in the other source histograms, the threshold is increased until one of the following is true for each source histogram:
Then the source zero buckets and any source buckets now inside the new threshold are added up to yield the population of the new zero bucket.
The zero bucket is not used if the schema is -53 (custom buckets).
The list of custom values is unused for standard schemas. It is used by non-standard schemas in a custom way in case there is a need to store additional data.
The only currently defined schema for which custom values are used is -53 (custom buckets). The remaining part of this section describes the usage of the custom values in more detail for this specific case.
The custom values represent the upper inclusive boundaries of the custom buckets. They are sorted in ascending fashion. The custom buckets themselves are stored using the positive bucket list and the positive span list, although their boundaries, as determined via the custom values, can be negative. The index of each of those “positive” buckets defines the zero-based position of their upper boundary within the custom values list.
The lower exclusive boundary is defined by the custom value preceding the upper boundary. For the first custom value (at position zero in the list), there is no preceding value, in which case the lower boundary is considered to be -Inf. Therefore, the custom bucket with index zero counts all observations between -Inf and the first custom value. In the common case that only positive observations are expected, the custom bucket with index zero SHOULD have an upper boundary of zero to clearly mark if there have been any observations at zero or below. (If there are indeed only positive observations, the custom bucket with index zero will stay unpopulated and therefore will never be represented explicitly. The only cost is the additional zero element at the beginning of the custom values list.)
The last custom value MUST NOT be +Inf. Observations greater than the last custom value go into an overflow bucket with an upper boundary of +Inf. This overflow bucket is added with an index equal to the length of the custom values list.
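Since the custom values are sorted and act as upper inclusive boundaries, finding the bucket for an observation amounts to a binary search. The following is a minimal sketch using Python's bisect module. The function name and the example boundaries are made up.

```python
import bisect
import math
from typing import Optional


def custom_bucket_index(custom_values: list[float], value: float) -> Optional[int]:
    """Return the index of the custom bucket for an observation (None for NaN)."""
    if math.isnan(value):
        return None  # NaN cannot be compared to any boundary.
    # Position of the first custom value >= value, i.e. the first upper
    # inclusive boundary that still contains the observation. If there is
    # none, the result is len(custom_values), the index of the overflow bucket.
    return bisect.bisect_left(custom_values, value)


boundaries = [0, 0.1, 0.5, 1]
for observation in (-3, 0, 0.05, 0.1, 2):
    print(observation, custom_bucket_index(boundaries, observation))
```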
A native histogram sample can have zero, one, or more exemplars. They work in the same way as conventional exemplars, but they are organized in a list (as there can be more than one), and they MUST have a timestamp.
Exemplars exposed as part of a classic histogram MAY be used by native histograms, if they have a timestamp.
Instrumented code SHOULD avoid observing values of NaN and ±Inf because they make limited sense in the context of a histogram. However, those values MUST still be handled properly, as described in the following.
The sum of observations is calculated as usual by adding the observation to the sum of observations, following normal floating point arithmetic. (For example, an observation of NaN will set the sum to NaN. An observation of +Inf will set the sum to +Inf, unless it is already NaN or -Inf, in which case the sum is set to NaN.)
An observation of NaN goes into no bucket, but increments the count of observations. This implies that the count of observations can be greater than the sum of all buckets (negative, positive, and zero buckets), and the difference is the number of NaN observations. (For an integer histogram without any NaN observations, the sum of all buckets is equal to the count of observations. Within the usual floating point precision limits, the same is true for a float histogram without any NaN observations.)
An observation of +Inf or -Inf increments the count of observations and increments a bucket chosen in the following way:
- With a standard schema, a +Inf observation increments the positive overflow bucket as described above.
- With a standard schema, a -Inf observation increments the negative overflow bucket as described above.
- With schema -53 (custom buckets), a +Inf observation increments the bucket with an index equal to the length of the custom values list.
- With schema -53 (custom buckets), a -Inf observation increments the bucket with index zero.
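The rules for special values can be illustrated with a toy histogram using custom buckets (schema -53). The class below is a hypothetical sketch for this document only. It keeps a sparse dict of bucket populations and is not modeled on any instrumentation library.

```python
import bisect
import math


class TinyCustomHistogram:
    """Toy integer histogram with custom buckets (schema -53)."""

    def __init__(self, custom_values):
        self.custom_values = list(custom_values)  # upper inclusive boundaries
        self.count = 0
        self.sum = 0.0
        self.buckets = {}  # sparse: bucket index -> population

    def observe(self, value):
        self.count += 1
        self.sum += value  # plain float arithmetic, NaN and Inf propagate
        if math.isnan(value):
            return  # counted, but put into no bucket
        # -Inf lands in bucket 0, +Inf in the overflow bucket.
        index = bisect.bisect_left(self.custom_values, value)
        self.buckets[index] = self.buckets.get(index, 0) + 1


h = TinyCustomHistogram([0, 1, 10])
h.observe(0.5)
h.observe(math.inf)
sum_after_pos_inf = h.sum   # +Inf
h.observe(-math.inf)
sum_after_neg_inf = h.sum   # +Inf plus -Inf yields NaN
h.observe(math.nan)
nan_observations = h.count - sum(h.buckets.values())
print(h.count, h.sum, h.buckets, nan_observations)
```

After these four observations, the count exceeds the sum of all buckets by one, which is the number of NaN observations.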
Prometheus (Prom) native histograms with a standard schema can be easily mapped into an OpenTelemetry (OTel) exponential histogram and vice versa, as detailed in the following.
The Prom schema is equal to the scale in OTel, with the restriction that OTel allows lower values than -4 and higher values than +8. As described above, Prom has reserved more schema numbers to extend its range, should it ever be required in practice.
The index is offset by one, i.e. a Prom bucket with index n has index n-1 for OTel.
OTel has a dense rather than a sparse representation of buckets. One might see OTel as “Prom with only one span”.
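The index shift and the dense representation can be sketched as a conversion of one bucket list. The function name is made up, the bucket populations are assumed to be absolute counts (as in a float histogram or an already decoded integer histogram), and the result is modeled as the offset and bucket_counts pair of an OTel exponential histogram.

```python
def prom_to_otel_buckets(spans, counts):
    """Convert sparse Prom buckets to OTel's dense (offset, bucket_counts)."""
    if not spans:
        return 0, []
    dense = []
    position = 0
    for number, (offset, length) in enumerate(spans):
        if number > 0:
            dense.extend([0] * offset)  # represent the gap explicitly
        dense.extend(counts[position:position + length])
        position += length
    # A Prom bucket with index n has index n-1 in OTel.
    return spans[0][0] - 1, dense


offset, bucket_counts = prom_to_otel_buckets(
    [(-2, 2), (2, 1), (1, 2)], [3, 5, 1, 3, 2]
)
print(offset, bucket_counts)
```

The conversion in the other direction can pick any of the equivalent span layouts, with a single span covering all buckets as the simplest choice.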
The Prom zero bucket is called zero count in OTel. (Prom also uses zero count to name the field storing the count of observations in the zero bucket.) Both work the same, including the existence of a zero threshold. Note that OTel implies a threshold of zero if none is given.
(TODO: The OTel spec reads: “When zero_threshold is unset or 0, this bucket stores values that cannot be expressed using the standard exponential formula as well as values that have been rounded to zero.” Double-check if this really creates the same behavior. If there are problems close to zero, we could make Prom's spec more precise. If OTel counts NaN in the zero bucket, we have to add a note here.)
OTel exponential histograms only support standard exponential bucketing schemas (as the name suggests). Therefore, NHCBs (or native histograms with other future bucketing schemas) cannot be cleanly converted to OTel exponential histograms. However, conversion to a conventional OTel histogram with fixed buckets is still possible.
OTel histograms of any kind have optional fields for the minimum and maximum value observed in the histogram. These fields have no equivalent concept in Prometheus because counter histograms accumulate data over a long and unpredictable timespan and can be scraped at any time, so that tracking a minimum and maximum value is either infeasible or of limited use. Note, though, that native histograms enable a fairly accurate estimation of the maximum and minimum observation during arbitrary timespans, see the PromQL section.
Metrics exposition in the classic Prometheus use case is dominated by strings because all the metric names, label names, and label values take much more space than the float64 sample values, even if the latter are represented in a potentially more verbose text form. This was one of the reasons why abandoning protobuf-based exposition seemed advantageous in the past.
In contrast, a native histogram, following the data model described above, consists of a lot more numerical data. This amplifies the advantages of a protobuf-based format. Therefore, the previously abandoned protobuf-based exposition was revived to efficiently expose and scrape native histograms.
At the time native histograms were conceived, OpenMetrics adoption was still lacking, and in particular, the protobuf version of OpenMetrics had no known applications at all. Therefore, the initial approach was to extend the classic Prometheus protobuf format to support native histograms. (An additional practical consideration was that the Go instrumentation library was still using the classic protobuf spec as its internal data model, simplifying the initial development.)
The classic Prometheus text form was not extended for native histograms, and such an extension is not planned. (See also the OpenMetrics section below.)
There is a proto2 and a proto3 version of the protobuf specification, which both create the same wire format:
These files have comprehensive comments, which should enable an easy mapping from the proto spec to the data model described above.
Here are relevant parts from the proto3 file:
// [...]
message Histogram {
  uint64 sample_count = 1;
  double sample_count_float = 4; // Overrides sample_count if > 0.
  double sample_sum = 2;
  // Buckets for the classic histogram.
  repeated Bucket bucket = 3 [(gogoproto.nullable) = false]; // Ordered in increasing order of upper_bound, +Inf bucket is optional.

  google.protobuf.Timestamp created_timestamp = 15;

  // Everything below here is for native histograms (also known as sparse histograms).
  // Native histograms are an experimental feature without stability guarantees.

  // schema defines the bucket schema. Currently, valid numbers are -4 <= n <= 8.
  // They are all for base-2 bucket schemas, where 1 is a bucket boundary in each case, and
  // then each power of two is divided into 2^n logarithmic buckets.
  // Or in other words, each bucket boundary is the previous boundary times 2^(2^-n).
  // In the future, more bucket schemas may be added using numbers < -4 or > 8.
  sint32 schema = 5;
  double zero_threshold = 6; // Breadth of the zero bucket.
  uint64 zero_count = 7; // Count in zero bucket.
  double zero_count_float = 8; // Overrides sb_zero_count if > 0.

  // Negative buckets for the native histogram.
  repeated BucketSpan negative_span = 9 [(gogoproto.nullable) = false];
  // Use either "negative_delta" or "negative_count", the former for
  // regular histograms with integer counts, the latter for float
  // histograms.
  repeated sint64 negative_delta = 10; // Count delta of each bucket compared to previous one (or to zero for 1st bucket).
  repeated double negative_count = 11; // Absolute count of each bucket.

  // Positive buckets for the native histogram.
  // Use a no-op span (offset 0, length 0) for a native histogram without any
  // observations yet and with a zero_threshold of 0. Otherwise, it would be
  // indistinguishable from a classic histogram.
  repeated BucketSpan positive_span = 12 [(gogoproto.nullable) = false];
  // Use either "positive_delta" or "positive_count", the former for
  // regular histograms with integer counts, the latter for float
  // histograms.
  repeated sint64 positive_delta = 13; // Count delta of each bucket compared to previous one (or to zero for 1st bucket).
  repeated double positive_count = 14; // Absolute count of each bucket.

  // Only used for native histograms. These exemplars MUST have a timestamp.
  repeated Exemplar exemplars = 16;
}

message Bucket {
  uint64 cumulative_count = 1; // Cumulative in increasing order.
  double cumulative_count_float = 4; // Overrides cumulative_count if > 0.
  double upper_bound = 2; // Inclusive.
  Exemplar exemplar = 3;
}

// A BucketSpan defines a number of consecutive buckets in a native
// histogram with their offset. Logically, it would be more
// straightforward to include the bucket counts in the Span. However,
// the protobuf representation is more compact in the way the data is
// structured here (with all the buckets in a single array separate
// from the Spans).
message BucketSpan {
  sint32 offset = 1; // Gap to previous span, or starting point for 1st span (which can be negative).
  uint32 length = 2; // Length of consecutive buckets.
}
// [...]
(TODO: The above does not yet contain the custom values needed for NHCBs. Update once merged into main.)
Note the following:
- Native histograms and classic histograms share the same Histogram proto message, i.e. the existing Histogram message got extended with fields for native histograms.
- The fields for the sum and the count of observations and the created_timestamp are shared between classic and native histograms and keep working in the same way for both.
- Classic histograms can also carry float counts (see the fields sample_count_float, cumulative_count_float).
- The Bucket field and the Bucket message are used for the buckets of a classic histogram. It is perfectly possible to create a Histogram message that represents both a classic and a native version of the same histogram. Parsers have the freedom to pick either or both versions (see also the scrape configuration section).
- The bucket populations of integer histograms are encoded as deltas, which keeps the numbers small and thereby the message compact because protobuf uses varint encoding for the sint64 type.
- A native histogram without any observations yet and with a zero threshold of 0 would be indistinguishable from a classic histogram. In that case, a Histogram message that is meant to be parsed as a native histogram MUST contain a “no-op span”, i.e. a BucketSpan with offset and length set to 0, in the repeated positive_span field.
- Any number of exemplars can be added in the repeated Exemplar field of the Histogram message, but each one MUST have a timestamp. If there are no exemplars provided in this way, a parser MAY use timestamped exemplars provided for classic buckets (as at most one exemplar per bucket in the Exemplar field of the Bucket message).
- […] Histogram message, and the exemplars SHOULD fall into different buckets and cover the whole spread of buckets approximately evenly. (This is generally preferred over an exemplar distribution that proportionally represents the distribution of observations, as the latter will rarely yield exemplars from the long tail of a distribution, which are often the most interesting exemplars to look at.)
Currently (2024-11-03), OpenMetrics does not support native histograms.
Adding support to the protobuf version of OpenMetrics is relatively straightforward due to its similarity to the classic Prometheus protobuf format. A proposal in the form of a PR is under review.
Adding support to the text version of OpenMetrics is harder, but also highly desirable because there are many situations where the generation of protobuf is infeasible. A text format has to make a trade-off between readability for humans and efficient handling by machines (encoding, transport, decoding). Work on it is in progress. See the design doc for more details.
(TODO: Update section as progress is made.)
The protobuf specification enables low-level creation of metrics exposition including native histograms using the language specific bindings created by the protobuf compiler. However, for direct code instrumentation, an instrumentation library is needed.
Currently (2024-11-03), there are two official Prometheus instrumentation libraries supporting native histograms:
Adding native histogram support to other instrumentation libraries is relatively easy if the library already supports protobuf exposition. For purely text based libraries, the completion of a text based exposition format is a prerequisite. (TODO: Update this as needed.)
This section does not cover details of how to use individual instrumentation libraries (see the documentation linked above for that) but focuses on the common usage patterns and also provides general guidelines on how to implement native histogram support as part of an instrumentation library. The already existing Go implementation is used for examples. The sections about the data model and the exposition formats are highly relevant for the implementation of instrumentation libraries (but not restated in this section!).
The actual instrumentation API for histograms does not change for native histograms. Both classic histograms and native histograms receive observations in the same way (with subtle differences concerning exemplars, see next paragraph). Instrumentation libraries can even maintain a classic and a native version of the same histogram and expose them in parallel so that the scraper can choose which version to ingest (see the section about exposition formats for details). The user chooses whether to expose classic and/or native histograms via configuration settings.
Exemplars for classic histograms are usually tracked by storing and exposing the most recent exemplar for each bucket. As long as classic buckets are defined, an instrumentation library MAY expose the same exemplars for the native version of the same histogram, as long as each exemplar has a timestamp. (In fact, a scraper MAY use the exemplars provided with the classic version of the histogram even if it is otherwise only ingesting the native version, see details in the exposition formats section.) However, a native histogram MAY be assigned any number of exemplars, and an instrumentation library SHOULD use this liberty to meet the best practices for exemplars as described in the exposition formats section.
An instrumentation library SHOULD offer the following configuration parameters for native histograms following standard schemas. Names are examples from the Go library – they have to be adjusted to the idiomatic style in other languages. The value in parentheses is the default value that the library SHOULD offer.
- NativeHistogramBucketFactor (1.1): A float greater than one to determine the initial resolution. The library picks a starting schema that results in a growth of the bucket width from one bucket to the next by a factor not larger than the provided value. See table below for example values.
- NativeHistogramZeroThreshold (2^-128): A float of value zero or greater to set the initial threshold for the zero bucket.
The resolution is set via a growth factor rather than providing the schema directly because most users will not know the mathematics behind the schema numbers. The notion of an upper limit for the growth factor from bucket to bucket is understandable without knowing about the internal workings of native histograms. The following table lists an example factor for each valid schema.
NativeHistogramBucketFactor | resulting schema |
---|---|
65536 | -4 |
256 | -3 |
16 | -2 |
4 | -1 |
2 | 0 |
1.5 | 1 |
1.2 | 2 |
1.1 | 3 |
1.05 | 4 |
1.03 | 5 |
1.02 | 6 |
1.01 | 7 |
1.005 | 8 |
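The mapping in the table can be reproduced with a small calculation. The following is a minimal sketch, assuming that the growth factor of a standard schema n is 2^(2^-n) and that valid schemas range from -4 to 8; the function name pickSchema is chosen for illustration and is not claimed to be the exact library API.

```go
package main

import (
	"fmt"
	"math"
)

// pickSchema returns the lowest-resolution standard schema (between -4 and 8)
// whose bucket growth factor 2^(2^-schema) does not exceed bucketFactor.
// If even schema 8 grows faster than bucketFactor, schema 8 is returned.
func pickSchema(bucketFactor float64) int {
	if bucketFactor <= 1 {
		panic("bucketFactor must be greater than 1")
	}
	// 2^(2^-schema) <= bucketFactor  <=>  schema >= -log2(log2(bucketFactor)).
	floor := math.Floor(math.Log2(math.Log2(bucketFactor)))
	switch {
	case floor <= -8:
		return 8
	case floor >= 4:
		return -4
	default:
		return -int(floor)
	}
}

func main() {
	for _, f := range []float64{65536, 2, 1.5, 1.1, 1.005} {
		fmt.Printf("factor %v -> schema %d\n", f, pickSchema(f))
	}
}
```

With the default factor of 1.1, this yields schema 3, whose actual growth factor is 2^(1/8) ≈ 1.09.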
Buckets of native histograms are created dynamically when they are populated for the first time. An unexpectedly broad distribution of observed values can lead to an unexpectedly high number of buckets, requiring more memory than anticipated. If the distribution of observed values can be manipulated from the outside, this could even be used as a DoS attack vector via exhausting all the memory available to the program. Therefore, an instrumentation library SHOULD offer a bucket limitation strategy. It MAY set one by default, depending on the typical use cases the library is used for. (TODO: Maybe we should say that a strategy SHOULD be set by default. The Go library is currently not limiting the buckets by default, and no issues have been reported with that so far.)
The following describes the bucket limitation strategy implemented by the Go instrumentation library. Other libraries MAY follow this example, but other strategies might be feasible as well, depending on the typical usage pattern of the library.
The strategy is defined by three parameters: an unsigned integer NativeHistogramMaxBucketNumber, a duration NativeHistogramMinResetDuration, and a float NativeHistogramMaxZeroThreshold. If NativeHistogramMaxBucketNumber is zero (which is the default), buckets are not limited at all, and the other two parameters are ignored. If NativeHistogramMaxBucketNumber is set to a positive value, the library attempts to keep the bucket count of each histogram at or below the provided value. A typical value for the limit is 160, which is also the default value used by OTel exponential histograms in a similar strategy. (Note that partitioning by labels will create a number of histograms. The limit applies to each of them individually, not to all of them in aggregate.) If the limit would be exceeded, a number of remedies are applied in order until the number of buckets is within the limit again:
1. If at least NativeHistogramMinResetDuration has passed since the last reset of the histogram (which includes the creation of the histogram), the whole histogram is reset, i.e. all buckets are deleted and the sum and count of observations as well as the zero bucket are set to zero. Prometheus handles this as a normal counter reset, which means that some observations will be lost between scrapes, so resetting should happen rarely compared to the scraping interval. Additionally, frequent counter resets might lead to less efficient storage in the TSDB (see the TSDB section for details). A NativeHistogramMinResetDuration of one hour is a value that should work well in most situations.
2. If not enough time has passed since the last reset (or if NativeHistogramMinResetDuration is set to zero, which is the default value), no reset is performed. Instead, the zero threshold is increased to merge buckets close to zero into the zero bucket, reducing the number of buckets in that way. The increase of the threshold is limited by NativeHistogramMaxZeroThreshold. If this value is already reached (or it is set to zero, which is the default), nothing happens in this step.
3. If the number of buckets still exceeds the limit, the resolution of the histogram is reduced by merging neighboring buckets (i.e. decreasing the schema) until the bucket count is within the limit.
If step 2 or 3 has changed the histogram, a reset will be performed once NativeHistogramMinResetDuration has passed since the last reset, not only to remove the buckets but also to return to the initial values for the zero threshold and the bucket resolution. Note that this is treated like a reset for other reasons in all aspects, including updating the so-called created timestamp.
It is tempting to set a very low NativeHistogramBucketFactor (e.g. 1.005)
together with a reasonable NativeHistogramMaxBucketNumber (e.g. 160). In this
way, each histogram always has the highest possible resolution that is
affordable within the given bucket count “budget”. (This is the default
strategy used by the OTel exponential histogram. It starts with an even higher
schema (20), which is currently not even available in Prometheus native
histograms.) However, this strategy is generally not recommended for the
Prometheus use case. The resolution will be reduced quite often after creation
and after each reset as observations come in. This creates churn both in the
instrumented program as well as in the TSDB, which is particularly problematic
for the latter. All of this effort is mostly in vain because the typical
queries involving histograms require many histograms to get merged, during
which the lowest common resolution is used so that the user ends up with a
lower resolution anyway. The TSDB can be protected against the churn by
limiting the resolution upon ingestion (see below), but if a reasonably low
resolution will be enforced upon ingestion anyway, it is more straightforward
to set this resolution during instrumentation already. However, this strategy
might be worth the resource overhead within the instrumented program in
specific cases where a reasonable resolution cannot be assumed at
instrumentation time, and the scraper should have the flexibility to pick the
desired resolution at scrape time.
While partitioning of a classic histogram with many buckets by labels has to be done judiciously, the situation is more relaxed with native histograms. Partitioning a native histogram still creates a multiplicity of individual histograms. However, the resulting partitioned histograms will often populate fewer buckets each than the original unpartitioned histogram. (For example, if a histogram tracking the duration of HTTP requests is partitioned by HTTP status code, the individual histogram tracking requests answered with status code 404 might have a very sharp bucket distribution around the typical duration it takes to identify an unknown path, populating only a few buckets.) The total number of populated buckets for all partitioned histograms will still go up, but by a smaller factor than the number of partitioned histograms. (For example, if adding labels to an already quite heavy classic histogram results in 100 labeled histograms, the total cost will go up by a factor of 100. In case of a native histogram, the cost for the single histogram might already be lower if the classic histogram featured a high resolution. After partitioning, the total number of populated buckets in the labeled native histograms will be significantly smaller than 100 times the number of buckets in the original native histogram.)
Currently (2024-11-03), instrumentation libraries offer no way to directly configure native histograms with custom bucket boundaries (NHCBs). The use case for NHCBs is to allow native-histogram enabled scrapers to convert classic histograms to NHCBs upon ingestion (see next section). However, there are valid use cases where custom buckets are desirable directly during instrumentation. In those cases, the current approach is to instrument with a classic histogram and configure the scraper to convert it to an NHCB upon ingestion. However, a more direct treatment of NHCBs in instrumentation libraries might happen in the future.
To enable the Prometheus server to scrape native histograms, the feature flag --enable-feature=native-histograms is required. This flag also changes the content negotiation to prefer the classic protobuf-based exposition format over the OpenMetrics text format. (TODO: This behavior will change once native histograms are a stable feature.)
With Prometheus v2.49 and later, it is possible to fine-tune the scrape protocol negotiation globally or per scrape config via the scrape_protocols config setting. It is a list defining the content negotiation priorities. Its default value depends on the --enable-feature=native-histograms flag. If the flag is set, it is [ PrometheusProto, OpenMetricsText1.0.0, OpenMetricsText0.0.1, PrometheusText0.0.4 ], otherwise the first element, PrometheusProto, is removed from the list, resulting in [ OpenMetricsText1.0.0, OpenMetricsText0.0.1, PrometheusText0.0.4 ]. These default values result in the behavior described above, i.e. protobuf is unused without the --enable-feature=native-histograms flag, while it is the first priority with the flag set.
The setting can be used to configure protobuf scrapes without ingesting native histograms or enforce a non-protobuf format for certain targets even with the --enable-feature=native-histograms flag set. As long as the classic Prometheus protobuf format (PrometheusProto in the configured list) is the only format supporting native histograms, both the feature flag and negotiation of protobuf are required to actually ingest native histograms.
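As an illustration, the following config fragment sketches how a single scrape config could be forced to a text format while the server runs with the feature flag set. The job name and target are hypothetical placeholders; only the scrape_protocols setting and its values are taken from the description above.

```yaml
scrape_configs:
  # Hypothetical job that must not be scraped via protobuf,
  # even with --enable-feature=native-histograms set.
  - job_name: "text-only-targets"
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText0.0.4
    static_configs:
      - targets: ["localhost:8080"]
```

Scrape configs without an explicit scrape_protocols setting keep using the global value or the default described above.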
(TODO: Update this section once native histograms are a stable feature or native histograms are supported by other formats.)
Note that the exposition format can also affect the formatting of the values of quantile labels (used in summaries) and le labels (used in classic histograms). This problem only affects v2 of the Prometheus server (v3 has consistent formatting under all circumstances) and is not directly related to native histograms, but might show up in the same context because enabling native histograms requires the protobuf exposition format. See details in the documentation for the native-histograms feature flag for v2.55.
While instrumentation libraries SHOULD offer configuration options to limit the resolution and bucket count of a native histogram, there is still a need to enforce those limits upon ingestion. Users might be unable to change the instrumentation of a given program, or a program might be deliberately instrumented with high-resolution histograms to give different scrapers the option to reduce the resolution as they see fit.
The Prometheus scrape config offers two settings to address this need:
- native_histogram_bucket_limit sets an upper inclusive limit for the number of buckets in an individual histogram. If the limit is exceeded, the resolution of a histogram with a standard schema is repeatedly reduced (by doubling the width of the buckets, i.e. decreasing the schema) until the limit is satisfied. In case an NHCB exceeds the limit, or in the rare case that the limit cannot be satisfied even with schema -4, the scrape fails.
- native_histogram_min_bucket_factor sets a lower inclusive limit for the growth factor from bucket to bucket. This setting is only relevant for standard schemas and has no effect on NHCBs. Again, if the limit is violated (i.e. the growth factor is smaller than the configured value), the resolution of the histogram is repeatedly reduced (by doubling the width of the buckets, i.e. decreasing the schema) until the limit is satisfied. However, once schema -4 is reached, the scrape will still succeed, even if a higher growth factor has been specified.
Both settings accept zero as a valid value, which implies “no limit”. In case of the bucket limit, this means that the number of buckets is indeed not checked at all. In the case of the bucket factor, Prometheus will still ensure that a standard schema will not exceed the capabilities of the used storage backend. (TODO: This currently means the schema is at most +8, which is also the limit we allow in the exposition format. OTel allows higher exponential schemas, and Prometheus might therefore allow them in ingestion paths, too, but reduce the schema to +8 upon ingestion, or to whatever limit the current implementation requires. See https://github.com/prometheus/prometheus/issues/14168 for final clarification.)
If both settings have non-zero values, the schema is decreased sufficiently to satisfy both limits.
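A scrape config using both settings could look like the following fragment. The job name, target, and the chosen values are illustrative assumptions, not recommendations.

```yaml
scrape_configs:
  - job_name: "limited-histograms"   # hypothetical job name
    # At most 160 buckets per histogram; the scrape fails if this
    # cannot be satisfied even at schema -4.
    native_histogram_bucket_limit: 160
    # Growth factor from bucket to bucket of at least 1.1.
    native_histogram_min_bucket_factor: 1.1
    static_configs:
      - targets: ["localhost:8080"]
```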
Note that the bucket factor set during instrumentation is an upper limit (exposed bucket growth factor ≤ configured value), while the bucket factor set in the scrape config is a lower limit (ingested bucket growth factor ≥ configured value). The schemas resulting from certain limits are therefore slightly different. Some examples:
native_histogram_min_bucket_factor | resulting max schema |
---|---|
65536 | -4 |
256 | -3 |
16 | -2 |
4 | -1 |
2 | 0 |
1.4 | 1 |
1.1 | 2 |
1.09 | 3 |
1.04 | 4 |
1.02 | 5 |
1.01 | 6 |
1.005 | 7 |
1.002 | 8 |
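The table can be reproduced by rounding in the opposite direction compared to the instrumentation side. The following is a minimal sketch, assuming that the growth factor of a standard schema n is 2^(2^-n) and that valid schemas range from -4 to 8; the function name is hypothetical, and this is not claimed to be how the Prometheus server computes the limit internally.

```go
package main

import (
	"fmt"
	"math"
)

// maxSchemaForMinFactor returns the highest standard schema (between -4 and 8)
// whose bucket growth factor 2^(2^-schema) is at least minFactor. Values of
// minFactor not greater than 1 (including zero, meaning "no limit") yield 8.
func maxSchemaForMinFactor(minFactor float64) int {
	if minFactor <= 1 {
		return 8
	}
	// 2^(2^-schema) >= minFactor  <=>  schema <= -log2(log2(minFactor)).
	ceil := math.Ceil(math.Log2(math.Log2(minFactor)))
	switch {
	case ceil <= -8:
		return 8
	case ceil >= 4:
		return -4 // Lowest available schema, even if minFactor is larger.
	default:
		return -int(ceil)
	}
}

func main() {
	for _, f := range []float64{65536, 2, 1.4, 1.1, 1.09, 1.002} {
		fmt.Printf("min factor %v -> max schema %d\n", f, maxSchemaForMinFactor(f))
	}
}
```

Note how a factor of 1.1 results in schema 2 here, while the same value used as an upper limit during instrumentation results in schema 3.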
General considerations about setting the limits:
native_histogram_bucket_limit is suitable to set a hard limit for the cost of an individual histogram. The same cannot be accomplished by native_histogram_min_bucket_factor because histograms can have many buckets even with a low resolution if the distribution of observations is sufficiently broad. native_histogram_min_bucket_factor is well suited to avoid needless overall resource costs. For example, if the use case at hand only requires a certain resolution, setting a corresponding native_histogram_min_bucket_factor for all histograms might free up enough resources to accept a very high bucket count on a few histograms with broad distributions of observed values. Another example is the case where some histograms have low resolution for some reason (maybe already on the instrumentation side). If aggregations regularly include those low resolution histograms, the outcome will have that same low resolution (see the PromQL details below). Storing other histograms regularly aggregated with the low resolution histograms at higher resolution might not be of much use.
As described above, a histogram exposed by an instrumented program might contain both a classic and a native histogram, and some parts are even shared (like the count and sum of observations). This section explains which parts will be scraped by Prometheus, and how to control the behavior.
Without the --enable-feature=native-histograms flag, Prometheus will
completely ignore the native histogram parts during scraping. (TODO: Update
once the feature flag has been no-op'd.) With the flag set, Prometheus will
prefer the native histogram parts over the classic histogram parts, even if
both are exposed for the same histogram. Prometheus will still scrape the
classic histogram parts for histograms with no native histogram data.
In situations like migration scenarios, it might be desired to scrape both versions, classic and native, for the same histogram, provided both versions are exposed by the instrumented program. To enable this behavior, there is a boolean setting always_scrape_classic_histograms in the scrape config. It defaults to false, but if set to true, both versions of each histogram will be scraped and ingested, provided there is at least one classic bucket and at least one native bucket span (which might be a no-op span). This will not cause any conflicts in the TSDB because classic histograms are ingested as a number of suffixed series, while native histograms are ingested as just one series with their unmodified name. (Example: A histogram called rpc_latency_seconds results in a native histogram series named rpc_latency_seconds and in a number of series for the classic part, namely rpc_latency_seconds_sum, rpc_latency_seconds_count, and a number of rpc_latency_seconds_bucket series with different le labels.)
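For a migration scenario, the setting could be enabled for a single job as in the following fragment. The job name and target are hypothetical placeholders.

```yaml
scrape_configs:
  - job_name: "migrating-service"   # hypothetical job name
    # Ingest both the classic and the native version of each histogram
    # that exposes both.
    always_scrape_classic_histograms: true
    static_configs:
      - targets: ["localhost:8080"]
```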
The aforementioned NHCB is capable of modeling a classic histogram as a native histogram. Prometheus can be configured to ingest classic histograms as NHCBs rather than classic histograms. (TODO: Explain how to do that once it is merged.)
NHCBs have the same issue with limited mergeability as classic histograms, but they are generally much less expensive to store.
The TSDB stores integer histograms and float histograms differently. Generally, integer histograms are expected to compress better, so a TSDB implementation MAY store a float histogram as an integer histogram if all bucket counts and the count of observations have an integer value within the int64 range so that the conversion to an integer histogram creates a numerically precise representation of the original float histogram. (Note that the Prometheus TSDB is not utilizing this option yet.)
Native histograms require two new chunk encodings (Go type chunkenc.Encoding) in the TSDB: chunkenc.EncHistogram (string representation histogram, numerical value 2) for integer histograms, and chunkenc.EncFloatHistogram (string representation floathistogram, numerical value 3) for float histograms.
Similarly, there are two new record types for the WAL and the in-memory snapshot (Go type record.Type): record.HistogramSamples (string representation histogram_samples, numerical value 9) for integer histograms, and record.FloatHistogramSamples (string representation float_histogram_samples, numerical value 10) for float histograms. For backwards compatibility reasons, there are two more histogram record types: record.HistogramSamplesLegacy (histogram_samples_legacy, 7) and record.FloatHistogramSamplesLegacy (float_histogram_samples_legacy, 8). They were used prior to the introduction of custom values needed for NHCB. They are supported so that reading old WALs is still possible.
Prometheus identifies time series just by their labels. Whether a sample in a series is a float (and as such a counter or a gauge) or a histogram (no matter what flavor) does not contribute to the series's identity. Therefore, a series MAY contain a mix of samples of different types and flavors. Changes of the sample type within a time series are expected to be very rare in practice. They usually happen after changes in the instrumentation of a target (in the rare case that the same metric name is used for e.g. a gauge float prior to the change and a counter histogram after the change) or after a change of a recording rule (e.g. where the old version of a rule created a gauge float and the new version of the rule now creates a gauge histogram while retaining its name). Frequent changes of the sample type are usually the consequence of a misconfiguration (e.g. two different recording rules creating different sample types feeding into the same series). Therefore, a TSDB implementation MUST handle a change in sample type, but it MAY do so in a relatively inefficient way. When the Prometheus TSDB encounters a sample type that cannot be written to the currently used chunk, it closes that chunk and starts a new one with the appropriate encoding. (A time series that switches sample types back and forth for each sample will lead to a new chunk for each sample, which is indeed very inefficient.)
Histogram chunks use a number of custom encodings for numerical values, in order to reduce the data size by encoding common values in fewer bits than less common values. The details of each custom encoding are described in the low level chunk format documentation (and ultimately in the code linked from there). The following three encodings are used for a number of different fields and are therefore named here for later reference:
Histogram chunks start as usual with the number of samples in the chunk (as a uint16), followed by one byte describing if the histogram is a gauge histogram or a counter histogram and providing counter reset information for the latter. See the corresponding section below for details. This is followed by the so called chunk layout, which contains the following information, shared by all histograms in the chunk:
The chunk layout is followed by a repeated sequence of sample data. The sample data is different for integer histograms and float histograms. For an integer histogram, the data of each sample contains the following:
The sample data of a float histogram has the following differences:
The following events trigger cutting a new chunk (for the reasons described in parentheses):
Differences in the spans would also change the chunk layout, but they are reconciled by adding (explicitly represented) unpopulated buckets as needed so that all histograms in a chunk share the same span structure. This is straightforward if a bucket disappears, because the missing bucket is simply added to the new histogram as an unpopulated bucket while the histogram is appended to the chunk. However, disappearance of a formerly populated bucket constitutes a counter reset (see below), so this case can only happen for gauge histograms (which do not feature counter resets). The far more common case is that buckets exist in a newly appended histogram that did not exist in the previously appended histograms. In this case, these buckets have to be added as explicitly unpopulated buckets to all previously appended histograms. This requires a complete re-encoding of the entire chunk. (There is some optimization potential in only re-encoding the affected parts. Implementing this would be quite complicated. So far, the performance impact of the full re-encoding did not stick out as problematic.)
Staleness in a float series is marked by a sample with a very specific NaN value. This very specific float value is called “special stale NaN value” in the following section. It is (almost certainly) never returned by the usual arithmetic float operations and as such different from a “naturally occurring” NaN value, including those discussed in Special cases of observed values. In fact, the special stale NaN value is never returned directly when querying the TSDB, but it is handled internally before it reaches the caller.
To mark staleness in histogram series, the usual special stale NaN value could be used. However, this would require cutting a new chunk, just for the purpose of marking the series as stale, because a float value following a histogram value has to be stored in a different chunk (see above). Therefore, there is also a histogram version of a stale marker where the field for the sum of observations is set to the special stale NaN value. In this case, all other fields are ignored, which enables setting them to values suitable for efficient storage (as the histogram version of a stale marker is essentially just a storage optimization). This works for both float and integer histograms (as the sum field is a float value even in an integer histogram), and the appropriate version can be used to avoid cutting a new chunk. All versions of a stale marker (float, integer histogram, float histogram) MUST be treated as equivalent by the TSDB.
The size of float chunks is limited to 1024 bytes. The same size limitation is generally used for histogram chunks, too. However, individual histograms can become very large if they have many buckets, so blindly enforcing the size limit could lead to chunks with very few histograms. (In the most extreme case, a single histogram could even take more than 1024 bytes so that the size limit could not be enforced at all.) With very few histograms per chunk, the compression ratio becomes worse. Therefore, a minimum number of 10 histograms per chunk has to be reached before the size limit of 1024 bytes kicks in. This implies that histogram chunks can be much larger than 1024 bytes.
Requiring a minimum of 10 histograms per chunk is an initial, very simplistic approach, which might be improved in the future to find a better trade-off between chunk size and compression ratio.
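The interplay of the two thresholds can be written down as a one-line rule. This is a sketch of the rule as described here, with hypothetical names, and deliberately ignores all other triggers for cutting a new chunk.

```go
package main

import "fmt"

const (
	maxChunkBytes         = 1024 // Size limit, as for float chunks.
	minHistogramsPerChunk = 10   // Minimum before the size limit applies.
)

// shouldCutForSize reports whether a histogram chunk should be cut because of
// its size. The size limit only applies once the chunk holds enough samples.
func shouldCutForSize(numHistograms, chunkBytes int) bool {
	return numHistograms >= minHistogramsPerChunk && chunkBytes >= maxChunkBytes
}

func main() {
	fmt.Println(shouldCutForSize(3, 5000))  // false: too few histograms so far
	fmt.Println(shouldCutForSize(10, 5000)) // true
}
```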
Generally, Prometheus considers a counter to have reset whenever its value drops from one sample to the next (but see also the next section about the created timestamp). The situation is more complex when detecting a counter reset between two histogram samples.
First of all, gauge histograms and counter histograms are explicitly different (whereas Prometheus generally treats all float samples equally after ingestion, no matter if they were ingested as a gauge or a counter metric). Counter resets do not apply to gauge histograms.
If a gauge histogram is followed by a counter histogram in a time series, a counter reset is assumed to have happened, because a change from gauge to counter is considered equivalent to the gauge being deleted and the counter being newly created from zero.
The most common case is a counter histogram being followed by another counter histogram. In this case, a possible counter reset is detected by the following procedure:
If the two histograms differ in schema or in the zero bucket width, these changes could be part of a compatible resolution reduction (which happens regularly to reduce the bucket count of a histogram). Both of the following are true for a compatible resolution reduction:
If any of the conditions are not met, the change is not a compatible resolution reduction. Because such a change is only possible by resetting or newly creating a histogram, it is considered a counter reset and the detection procedure is concluded.
If both conditions are met, the first histogram has to be converted so that its schema and zero bucket width match those of the second histogram. This happens in the same way as previously described: Neighboring buckets are merged to reduce the schema, and regular buckets are merged with the zero bucket to increase the width of the zero bucket.
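Merging neighboring buckets to get from schema n to schema n-1 can be sketched as follows. The sketch assumes the index convention of the data model section, in which the bucket with index i has the upper boundary 2^(i·2^-n), so that buckets 2j-1 and 2j merge into bucket j. The map-based representation and the function name are chosen for illustration only.

```go
package main

import "fmt"

// reduceResolution converts sparse buckets (keyed by bucket index) from
// schema n to schema n-1 by merging each pair of neighboring buckets,
// thereby doubling the bucket width.
func reduceResolution(buckets map[int]uint64) map[int]uint64 {
	merged := make(map[int]uint64, (len(buckets)+1)/2)
	for idx, count := range buckets {
		// The arithmetic shift rounds towards negative infinity, so this
		// maps 2j-1 and 2j to j for negative indices, too.
		merged[(idx+1)>>1] += count
	}
	return merged
}

func main() {
	in := map[int]uint64{-1: 1, 0: 2, 1: 3, 2: 4, 3: 5}
	fmt.Println(reduceResolution(in)) // map[0:3 1:7 2:5]
}
```

Applying the function repeatedly reduces the schema by one each time.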
At this point in the procedure, both histograms have the same schema and zero bucket width, either because this was the case from the beginning, or because the first histogram was converted accordingly. (Note that NHCBs do not use the zero bucket. Their zero bucket widths and population counts are considered equal for the sake of this procedure.) In this situation, any of the following constitutes a counter reset:
If none of the above is the case, there is no counter reset.
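For two counter histograms that already share schema and zero bucket width, the final comparison can be sketched as below. The exact conditions are an assumption based on statements elsewhere in this document (a decrease of any count, or the disappearance of a formerly populated bucket, constitutes a counter reset). The types and names are hypothetical and much simpler than the actual chunk-level implementation.

```go
package main

import "fmt"

// histogram is a simplified counter histogram. The bucket maps are keyed by
// bucket index and only contain populated buckets.
type histogram struct {
	Count     uint64
	ZeroCount uint64
	Positive  map[int]uint64
	Negative  map[int]uint64
}

// bucketsDecreased reports whether any bucket populated in prev has a lower
// population in cur. A bucket missing from cur reads as zero, so the
// disappearance of a populated bucket is covered as well.
func bucketsDecreased(prev, cur map[int]uint64) bool {
	for idx, p := range prev {
		if cur[idx] < p {
			return true
		}
	}
	return false
}

// isCounterReset assumes prev and cur have the same schema and zero bucket width.
func isCounterReset(prev, cur histogram) bool {
	return cur.Count < prev.Count ||
		cur.ZeroCount < prev.ZeroCount ||
		bucketsDecreased(prev.Positive, cur.Positive) ||
		bucketsDecreased(prev.Negative, cur.Negative)
}

func main() {
	prev := histogram{Count: 5, ZeroCount: 1, Positive: map[int]uint64{1: 2, 2: 2}}
	next := histogram{Count: 8, ZeroCount: 1, Positive: map[int]uint64{1: 3, 2: 2, 3: 2}}
	fmt.Println(isCounterReset(prev, next)) // false: all counts grew or stayed
}
```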
As this whole procedure is relatively involved, the counter reset detection preferably happens once during ingestion, with the result being persisted for later use. Counter reset detection during ingestion has to happen anyway because a counter reset is one of the triggers to cut a new chunk.
Cutting a new chunk after a counter reset aims to improve the compression ratio. A counter reset sets all bucket populations to zero, so there are fewer buckets to represent. A chunk, however, has to represent the superset of all buckets of all histograms in the chunk, so cutting a new chunk enables a simpler set of buckets for the new chunk.
This in turn implies that there will never be a counter reset after the first sample in a chunk. Therefore, the only counter reset information that has to be persisted is that of the 1st histogram in a chunk. This happens in the so-called histogram flags, a single byte stored directly after the number of samples in the chunk. This byte is currently only used for the counter reset information, but it may be used for other flags in the future. The counter reset information uses the first two bits. The four possible bit patterns are represented as Go constants of type CounterResetHeader in the chunkenc package. Their names and meanings are the following:
- GaugeType (bit pattern 11): The chunk contains gauge histograms. Counter resets are irrelevant for gauge histograms.
- CounterReset (bit pattern 10): A counter reset happened between the last histogram of the previous chunk and the 1st histogram of this chunk. (It is likely that the counter reset was actually the reason why the new chunk was cut.)
- NotCounterReset (bit pattern 01): No counter reset happened between the last histogram of the previous chunk and the 1st histogram of this chunk. (This commonly happens if a new chunk is cut because the previous chunk hit the size limit.)
- UnknownCounterReset (bit pattern 00): It is unknown if there was a counter reset between the last histogram of the previous chunk and the 1st histogram of this chunk.
UnknownCounterReset is always a safe choice. It does not prevent counter reset detection, but merely requires that the counter reset detection procedure has to be performed (again) whenever counter reset information is needed.
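Reading the counter reset information from the histogram flags byte can be sketched as follows. The sketch assumes that "the first two bits" are the two most significant bits of the byte; the lower-case names are local stand-ins and not the identifiers of the chunkenc package.

```go
package main

import "fmt"

type counterResetHeader byte

const (
	counterReset        counterResetHeader = 0b10000000
	notCounterReset     counterResetHeader = 0b01000000
	gaugeType           counterResetHeader = 0b11000000
	unknownCounterReset counterResetHeader = 0b00000000

	// counterResetMask selects the two bits carrying the counter reset
	// information. The remaining six bits are reserved for future flags.
	counterResetMask byte = 0b11000000
)

// headerFromFlags extracts the counter reset information from the flags byte.
func headerFromFlags(flags byte) counterResetHeader {
	return counterResetHeader(flags & counterResetMask)
}

// needsDetection reports whether the counter reset detection procedure has to
// be performed (again) for the 1st histogram of the chunk.
func needsDetection(h counterResetHeader) bool {
	return h == unknownCounterReset
}

func main() {
	h := headerFromFlags(0b10000000)
	fmt.Println(h == counterReset, needsDetection(h)) // true false
}
```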
The counter reset information is propagated to the caller when querying the TSDB (in the Go code as a field of type CounterResetHint in the Go types Histogram and FloatHistogram, using enumerated constants with the same names as the bit pattern constants above).
For gauge histograms, the CounterResetHint is always GaugeType. Any other CounterResetHint value implies that the histogram in question is a counter histogram. In this way, queriers (including the PromQL engine, see below) obtain the information if a histogram is a gauge or a counter (which is notably different from float samples).
As long as counter histograms are returned in order from a single chunk, the CounterResetHint for the 2nd and following histograms in a chunk is set to NotCounterReset. (Overlapping blocks and out-of-order ingestion may lead to histogram sequences coming from multiple chunks, which requires special treatment, see below.)
When returning the 1st histogram from a counter histogram chunk, the CounterResetHint MUST be set to UnknownCounterReset unless the TSDB implementation can ensure that the previously returned histogram was indeed the same histogram that was used as the preceding histogram to detect the counter reset at ingestion time. Only in the latter case, the counter reset information from the chunk MAY be used directly as the CounterResetHint of the returned histogram.
This precaution is needed because there are various ways how chunks might get removed or inserted (e.g. deletion via tombstones or adding blocks for backfilling). A counter reset, while attributed to one sample, is in fact happening between the marked sample and the preceding sample. Removing the preceding sample or inserting another sample in between the two samples invalidates the previously performed counter reset detection.
(TODO: Currently, Prometheus returns UnknownCounterReset for all 1st histograms from a counter histogram chunk. See tracking issue for efforts to change that.)
As already implied above, the querier MUST perform the counter reset detection procedure (again) if the CounterResetHint is set to UnknownCounterReset.
Special caution has to be applied when processing overlapping blocks or out-of-order samples (for querying or during compaction). Both overdetection and underdetection of counter resets may happen in these cases, as illustrated by the following examples:
CounterResetHint of UnknownCounterReset has to be set for each sample that comes from a different chunk than the previously returned sample, which mandates a counter reset detection by the querier (utilizing the safe fallback described above).
CounterResetHint of CounterReset, although there is now no counter reset between C and D. Similar to the situation in the previous example, a new counter reset detection has to be performed between A and B, and another one between C and D. Or both B and D have to be returned with a CounterResetHint of UnknownCounterReset.
In summary, whenever the TSDB cannot safely establish that a counter reset detection between two samples has happened upon ingestion, it either has to perform another counter reset detection or it has to return a CounterResetHint of UnknownCounterReset for the second sample.
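The conservative variant of these rules can be sketched as follows. This is an illustration only, not the Prometheus implementation: the Sample record, its chunk_id field, and the function name hints_for_query are hypothetical, and the sketch never reuses the hint stored for the 1st histogram of a chunk.

```python
from dataclasses import dataclass
from enum import Enum


class CounterResetHint(Enum):
    # Names follow the constants mentioned in the text.
    UNKNOWN_COUNTER_RESET = "UnknownCounterReset"
    COUNTER_RESET = "CounterReset"
    NOT_COUNTER_RESET = "NotCounterReset"
    GAUGE_TYPE = "GaugeType"


@dataclass
class Sample:
    timestamp: int
    chunk_id: int           # hypothetical: the chunk the sample was read from
    is_gauge: bool = False


def hints_for_query(samples):
    """Return one CounterResetHint per sample of a time-ordered sequence."""
    hints = []
    previous_chunk = None
    for sample in samples:
        if sample.is_gauge:
            hint = CounterResetHint.GAUGE_TYPE
        elif sample.chunk_id != previous_chunk:
            # First sample overall, or the previously returned sample came
            # from another chunk: force a new detection by the querier.
            hint = CounterResetHint.UNKNOWN_COUNTER_RESET
        else:
            # 2nd or later histogram returned in order from the same chunk.
            hint = CounterResetHint.NOT_COUNTER_RESET
        hints.append(hint)
        previous_chunk = sample.chunk_id
    return hints
```

With samples interleaved from two chunks, every switch between chunks yields UnknownCounterReset, so the querier falls back to explicit detection exactly where the ingestion-time result cannot be trusted.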
Note that there is the possibility of counter resets that are not detected by the procedure described above, namely if the counts in the reset histogram have increased quickly enough so that the 1st sample after the counter reset has no counts that have decreased compared to the last sample prior to the counter reset. (This is also a problem for float counters, where it is actually more likely to happen.) With the mechanisms explained above, it is possible to store a counter reset even in this case, provided that the counter reset was detected by other means. However, due to the complications caused by insertion and removal of chunks, out-of-order samples, and overlapping blocks (as explained above), this information MAY get lost if a second round of counter reset detection is required. (TODO: Currently, this information is reliably lost, see TODO above.) A better way to safely mark a counter reset is via created timestamps (see next section).
OpenMetrics introduced so-called created timestamps for counters, summaries, and classic counter histograms. (The term is probably short for “created-at timestamp”. The more appropriate term might have been “creation timestamp” or “reset timestamp”, but the term “created timestamp” is firmly established by now.)
The created timestamp provides the most recent time the metric was created or reset. A design doc describes how Prometheus handles created timestamps.
Created timestamps are also useful for native histograms. In the same way a synthetic zero sample is inserted for float counters, a zero value of a histogram sample is inserted for counter histograms. A zero value of a histogram has no populated buckets, and the sum of observations, the count of observations, and the zero bucket population are all zero. Schema, zero bucket width, custom values, and the float vs. integer flavor of the histogram SHOULD match the sample that directly follows the synthetic zero sample (to not trigger the detection of a spurious counter reset).
The counter reset information of the synthetic zero sample is always set to CounterReset. (TODO: Currently, Prometheus probably sets it to UnknownCounterReset for the first sample of a series, which is not wrong, but I think setting it to CounterReset makes more sense.)
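A minimal sketch of how such a synthetic zero sample could be derived from the sample that follows it. The Histogram record and the function name synthetic_zero_sample are hypothetical and only model the fields named in this section. The zero_threshold field stands for the zero bucket width.

```python
from dataclasses import dataclass, field


@dataclass
class Histogram:
    schema: int
    zero_threshold: float                         # width of the zero bucket
    is_float: bool = False                        # float vs. integer flavor
    custom_values: tuple = ()                     # only for custom-bucket schemas
    count: float = 0
    sum: float = 0.0
    zero_count: float = 0
    buckets: dict = field(default_factory=dict)   # bucket index -> population
    counter_reset_hint: str = "UnknownCounterReset"


def synthetic_zero_sample(following: Histogram) -> Histogram:
    """Build the zero histogram to insert at the created timestamp.

    Layout fields are copied from the following sample so that the pair does
    not trigger a spurious counter reset. Everything else stays zero/empty.
    """
    return Histogram(
        schema=following.schema,
        zero_threshold=following.zero_threshold,
        is_float=following.is_float,
        custom_values=following.custom_values,
        counter_reset_hint="CounterReset",
    )
```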
Exemplars for native histograms are attached to the histogram sample as a whole, not to individual buckets. (See also the exposition formats section.) Therefore, it is allowed (and in fact the common case) that a single native histogram sample comes with multiple exemplars attached.
Exemplars may or may not change from one scrape to the next. Scrapers SHOULD detect unchanged exemplars to avoid storing many duplicate exemplars. Duplicate detection is potentially expensive, though, given that a single sample might have many exemplars, of which any subset could be repeated exemplars from the last scrape. The TSDB MAY rely on the assumption that any new exemplar has a more recent timestamp than any of the previously exposed exemplars. (Remember that exemplars of native histograms MUST have a timestamp.) Duplicate detection is then possible in an efficient way:
Exemplars are only counted as out of order if all exemplars of an ingested histogram would be sorted before the last successfully appended exemplar. This does not detect out-of-order exemplars that are mixed with newer exemplars or with a duplicate of the last successfully appended exemplar, which is considered acceptable.
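The timestamp-based approach can be sketched as follows. This is a simplified illustration under the assumption stated above (new exemplars are always more recent than previously exposed ones). The Exemplar record and the function name select_exemplars are hypothetical, and comparing by timestamp alone is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    timestamp: int
    value: float


def select_exemplars(incoming, last_appended_ts):
    """Return (to_append, out_of_order) for the exemplars of one histogram.

    last_appended_ts is the timestamp of the last successfully appended
    exemplar of the series, or None if nothing has been appended yet.
    """
    ordered = sorted(incoming, key=lambda e: e.timestamp)
    if last_appended_ts is None:
        return ordered, False
    # Anything not newer than the last appended exemplar is treated as a repeat.
    to_append = [e for e in ordered if e.timestamp > last_appended_ts]
    # Out of order only if every incoming exemplar sorts before the last one.
    out_of_order = bool(ordered) and all(
        e.timestamp < last_appended_ts for e in ordered
    )
    return to_append, out_of_order
```

A batch that mixes older exemplars with a duplicate of the last appended one yields nothing to append and is not reported as out of order, which matches the accepted limitation.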
This section describes how PromQL handles native histograms. It focuses on general concepts rather than every single detail of individual operations. For the latter, refer to the PromQL documentation about operators and functions.
The introduction of native histograms creates certain situations where a PromQL expression returns unexpected results, most commonly the case where some or all elements in the output vector are unexpectedly missing. To help users detect and understand those situations, operations acting on native histograms often use annotations. Annotations can have warn and info level and describe possible issues encountered during the evaluation. Warn level is used to mark situations that are most likely an actual problem the user has to act on. Info level is used for situations that might also be deliberate, but are still unusual enough to flag them.
PromQL always acts on float histograms. Native histograms that are stored as integer histograms are automatically converted to float histograms when retrieved from the TSDB.
When an operator or function acts on two or more native histograms, the histograms involved need to have the same schema and zero bucket width. Within certain limits, histograms can be converted on the fly to meet these compatibility criteria:
If incompatibility prevents an operation, a warn-level annotation is added to the result.
Counter resets are defined as described above. Counter reset hints returned from the TSDB MAY be taken into account to avoid explicit counter reset detection and to correctly process counter resets that are not detectable by the usual procedure. (This implies that these counter resets are only taken into account on a best effort basis. However, the same is true for the TSDB itself, see above.) A notable difference to the counter reset handling for classic histograms and summaries is that a decrease of the sum of observations does not constitute a counter reset by itself. (For example, calculating the rate of a native histogram will still work correctly even if the histogram has observed negative values.)
Note that the counter reset hints of counter histograms returned by sub-queries MUST NOT be taken into account to avoid explicit counter reset detection, unless the PromQL engine can safely detect that consecutive counter histograms returned from the sub-query are also consecutive in the TSDB. (TODO: This is not implemented yet.)
Via the counter reset hint returned from the TSDB, PromQL is aware if a native histogram is a gauge or a counter histogram. To mirror PromQL's treatment of float samples (where it cannot reliably distinguish between float counters and gauges), functions that act on counters will still process gauge histograms, and vice versa, but a warn-level annotation is returned with the result. Note that explicit counter reset detection has to be performed on a gauge histogram in that case, treating it as if it were a counter histogram.
When estimating quantiles or fractions, PromQL has to apply interpolation within a bucket. In classic histograms, this interpolation happens in a linear fashion. It is based on the assumption that observations are equally distributed within the bucket. In reality, this assumption might be far off. (For example, an API endpoint might respond to almost all requests with a latency of 110ms. The median latency and maybe even the 90th percentile latency would then be close to 110ms. If a classic histogram has bucket boundaries at 100ms and 200ms, it would see most observations in that range and estimate the median at 150ms and the 90th percentile at 190ms.) The worst case is an estimation at one end of a bucket where the actual value is at the other end of the bucket. Therefore, the maximum possible error is the whole width of a bucket. Not doing any interpolation and using some fixed midpoint within a bucket (for example the arithmetic mean or even the harmonic mean) would minimize the maximum possible error (which would then be half of the bucket width in case of the arithmetic mean), but in practice, the linear interpolation yields an error that is lower on average. Since the interpolation has worked well over many years of classic histogram usage, interpolation is also applied for native histograms.
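The arithmetic behind the 100ms/200ms example can be sketched like this. The function name and parameters are hypothetical, and the sketch assumes, as in the example, that all observations fall into the single bucket from 100ms to 200ms.

```python
def linear_quantile(q, lower, upper, count_below, count_in_bucket, total):
    """Estimate quantile q, assuming observations are evenly spread in the bucket.

    count_below is the number of observations in all buckets below this one,
    count_in_bucket is the population of the bucket containing the quantile.
    """
    rank = q * total
    fraction = (rank - count_below) / count_in_bucket
    return lower + (upper - lower) * fraction


if __name__ == "__main__":
    # 1000 requests of about 110ms each, all counted in the 100ms..200ms bucket.
    print(linear_quantile(0.5, 100, 200, 0, 1000, 1000))
    print(linear_quantile(0.9, 100, 200, 0, 1000, 1000))
```

The two calls reproduce the estimates of roughly 150ms and 190ms, although the true values are both close to 110ms.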
For NHCBs, PromQL applies the same interpolation method as for classic histograms to keep results consistent. (The main use case for NHCBs is a drop-in replacement for classic histograms.) However, for standard exponential schemas, linear interpolation can be seen as a misfit. While exponential schemas primarily intend to minimize the relative error of quantile estimations, they also benefit from a balanced usage of buckets, at least over certain ranges of observed values. The basic assumption is that for most practically occurring distributions, the density of observations tends to be higher for smaller observed values. Therefore, PromQL uses exponential interpolation for the standard schemas, which models the assumption that dividing a bucket into two when increasing the schema number by one (i.e. doubling the resolution) will on average see similar populations in both new buckets. A more detailed explanation can be found in the PR implementing the interpolation method.
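A sketch of the difference between the two methods for a positive bucket. The function names are hypothetical, and the formula is one straightforward way to express interpolation on a logarithmic scale, not necessarily the exact code used by Prometheus.

```python
import math


def linear_interpolation(lower, upper, fraction):
    """Value at the given fraction (0..1) of the bucket on a linear scale."""
    return lower + (upper - lower) * fraction


def exponential_interpolation(lower, upper, fraction):
    """Value at the given fraction (0..1) of a positive bucket on a log scale."""
    return lower * math.pow(upper / lower, fraction)


if __name__ == "__main__":
    # The schema 0 bucket from 1 to 2 is split at sqrt(2) in schema 1.
    print(linear_interpolation(1, 2, 0.5))
    print(exponential_interpolation(1, 2, 0.5))
```

At a fraction of 0.5, the exponential variant lands on the boundary at which the bucket would be split when the schema number is increased by one, which is the "similar populations in both new buckets" assumption. The linear variant lands above that boundary.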
A special case is interpolation within the zero bucket. The zero bucket breaks the exponential bucketing schema. Therefore, linear interpolation is applied within the zero bucket. Furthermore, if all populated regular buckets of a histogram are positive, it is assumed that all observations in the zero bucket are also positive, i.e. the interpolation is done between zero and the upper bound of the zero bucket. In the case of a histogram where all populated regular buckets are negative, the situation is mirrored, i.e. the interpolation within the zero bucket is done between the lower bound of the zero bucket and zero.
As already discussed above, neither the sample type nor the flavor of a native histogram is part of the identity of a series. Therefore, one and the same series might contain a mix of different sample types and flavors.
A mix of counter histograms and gauge histograms doesn't prevent any PromQL operation, but a warn-level annotation is returned with the result if some of the input samples have an inappropriate flavor (see above).
A mix of float samples and histogram samples is more problematic. Many functions that operate on range vectors will remove elements from the result where the input elements contain a mix of floats and histograms. If this happens, a warn-level annotation is added to the result. Concrete examples can be found below.
The unary minus can be used on native histograms. It returns a histogram where all bucket populations and the count and the sum of observations have their sign inverted. Everything else stays the same, including the counter reset hint. Note, however, that explicit counter reset detection will be thrown off by the inverted signs. (TODO: Maybe we should mark all negative histograms as gauges?) Negative histograms do not really make sense on their own and are only supposed to act as intermediate results inside other expressions.
Most binary operators do not work between two histograms or between a histogram and a float or between a histogram and a scalar. If an operator processes such an impossible combination, the corresponding element is removed from the output vector and an info-level annotation is added to the result. (This situation is somewhat similar to label matching, where the sample type plays a role similar to a label. Therefore, such a mismatch might be known and deliberate, which is the reason why the level of the annotation is only info.)
The following describes all the operations that actually do work.
Addition (+) and subtraction (-) work between two compatible histograms. These operators add or subtract all matching bucket populations and the count and the sum of observations. Missing buckets are assumed to be empty and treated accordingly. Subtraction might result in negative histograms, see notes above. Generally, both operands should be gauges. Adding and subtracting counter histograms requires caution, but PromQL allows it. Adding a gauge histogram and a counter histogram results in a gauge histogram. Adding two counter histograms with contradicting counter reset hints triggers a warn-level annotation. (TODO: The latter not yet implemented. Also, subtraction doesn't check/modify counter reset hints yet. This should be documented in detail in the PromQL docs.)
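The bucket-wise arithmetic can be sketched as follows, leaving out counter reset hints and the schema compatibility checks. The dictionary layout and the function name combine are hypothetical.

```python
def combine(a, b, sign=1):
    """Add (sign=1) or subtract (sign=-1) two compatible histograms.

    Each histogram is a dict with "count", "sum", and "buckets", where
    "buckets" maps a bucket identifier to its population. The zero bucket can
    be stored under an identifier of its own.
    """
    buckets = {}
    for key in a["buckets"].keys() | b["buckets"].keys():
        # A bucket missing on one side is treated as empty.
        buckets[key] = a["buckets"].get(key, 0) + sign * b["buckets"].get(key, 0)
    return {
        "count": a["count"] + sign * b["count"],
        "sum": a["sum"] + sign * b["sum"],
        "buckets": buckets,
    }
```

Subtracting a histogram with larger bucket populations produces negative populations, which is the "negative histogram" case mentioned above.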
Multiplication (*) works between a float sample or a scalar on the one side and a histogram on the other side, in any order. It multiplies all bucket populations and the count and the sum of observations by the float (sample or scalar). This will lead to “scaled” and sometimes even negative histograms, which are usually only useful as intermediate results inside other expressions (see also notes above). Multiplication works for both counter histograms and gauge histograms, and their flavor is left unchanged by the operation.
Division (/) works between a histogram on the left hand side and a float sample or a scalar on the right hand side. It is equivalent to multiplication with the inverse of the float (sample or scalar). Division by zero results in a histogram with no regular buckets and the zero bucket population and the count and sum of observations all set to +Inf, -Inf, or NaN, depending on their values in the input histogram (positive, negative, or zero/NaN, respectively).
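The division rule, including the division-by-zero case, might look like this in a sketch. The dictionary layout and function names are hypothetical. Python raises an exception on float division by zero, so the special case is spelled out explicitly, and the sketch does not distinguish between positive and negative zero as a divisor.

```python
import math


def divide_value(x, divisor):
    """Divide a single value, mapping division by zero to +Inf, -Inf, or NaN."""
    if divisor != 0:
        return x / divisor
    if math.isnan(x) or x == 0:
        return math.nan
    return math.inf if x > 0 else -math.inf


def divide_histogram(h, divisor):
    """Divide count, sum, zero bucket, and regular buckets by a float."""
    result = {
        "count": divide_value(h["count"], divisor),
        "sum": divide_value(h["sum"], divisor),
        "zero_count": divide_value(h["zero_count"], divisor),
        "buckets": {},
    }
    if divisor != 0:
        # Regular buckets only survive a division by a non-zero value.
        result["buckets"] = {k: v / divisor for k, v in h["buckets"].items()}
    return result
```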
Equality (==) and inequality (!=) work between two histograms, both in their filtering version as well as with the bool modifier. They compare the schema, the custom values, the zero threshold, all bucket populations, and the sum and count of observations. Whether the histograms have counter or gauge flavor is irrelevant for the comparison. (A counter histogram could be equal to a gauge histogram.)
The logical/set binary operators (and, or, unless) work as expected even if histogram samples are involved. They only check for the existence of a vector element and don't change their behavior depending on the sample type or flavor of an element (float or histogram, counter or gauge).
The “trim” operators >/ and </ were introduced specifically for native histograms. They only work for a histogram on the left hand side and a float sample or a scalar on the right hand side. (They do not work for float samples or scalars on both sides. An info-level annotation is returned in this case.) These operators remove observations from the histogram that are greater or smaller than the float value on the right side, respectively, and return the resulting histogram. The removal is only precise if the threshold coincides with a bucket boundary. Otherwise, interpolation within the affected buckets has to be used, as described above. The counter vs. gauge flavor of the histogram is preserved. (TODO: These operators are not yet implemented and might also change in detail, see tracking issue.)
The following aggregation operators work in the same way with float and histogram samples (for the reason stated in parentheses):
group (The result of this aggregation does not depend on the sample values.)
count (The result of this aggregation does not depend on the sample values.)
count_values (The text representation as produced by the Go FloatHistogram.String method is used as the value of histograms.)
limitk (The sampled elements are returned unchanged.)
limit_ratio (The sampled elements are returned unchanged.)
The sum aggregation operator works with native histograms by summing up the histograms to be aggregated (in the same way as described for the + operator above). The avg aggregation operator works in the same way, but divides the sum by the number of aggregated histograms (in the same way as described for the / operator above). Both aggregation operators remove elements from the output vector that would require the aggregation of float samples with histogram samples. Such a removal is flagged by a warn-level annotation.
All other aggregation operators do not work with native histograms. Histograms in the input vector are simply ignored, and an info-level annotation is added for each ignored histogram.
The following functions operate on range vectors of native histograms by applying the usual float operation individually to matching buckets (including the zero bucket) and the sum and count of observations, resulting in a new native histogram:
delta() (For gauge histograms.)
increase() (For counter histograms.)
rate() (For counter histograms.)
idelta() (For gauge histograms.)
irate() (For counter histograms.)
idelta and irate are not yet implemented for histograms.
These functions SHOULD be applied to either gauge histograms or counter histograms as noted above. They all work with both flavors, but if at least one histogram of an unsuitable flavor is contained in the range vector, a warn-level annotation is added to the result.
All these functions return no result for series that contain a mix of float samples and histogram samples within the range. A warn-level annotation is added for each output element missing for that reason.
All these functions return gauge histograms as results.
avg_over_time() and sum_over_time() work with native histograms in a way that corresponds to the respective aggregation operators. In particular, if a series contains a mix of float samples and histogram samples within the range, the corresponding result is removed entirely from the output vector. Such a removal is flagged by a warn-level annotation.
The changes() and the resets() functions work with native histogram samples in the same way as with float samples. They even work with a mix of float samples and histogram samples within the same series. In this case, a change from a float sample to a histogram sample and vice versa counts as a change for changes() and as a reset for resets(). A change in flavor from counter histogram to gauge histogram and vice versa does not count as a change for changes(). resets() SHOULD only be applied to counter floats and counter histograms, but the function still works with gauge histograms, applying explicit counter reset detection in this case. Furthermore, a change from counter histogram to gauge histogram and vice versa is counted as a reset. (TODO: Not implemented yet.)
The histogram_quantile() function has a very special role as it is the only function that treats a specific “magic” label specially, namely the le label used by classic histograms. histogram_quantile() also works for native histograms in a similar way, but without the special role of the le label. The function keeps treating float samples in the known way, while it uses the new “native” way for native histogram samples.
An example for a typical query for classic histograms (including rate and aggregation):
histogram_quantile(0.9, sum by (job, le) (rate(http_request_duration_seconds_bucket[10m])))
This is the corresponding query for a native histogram:
histogram_quantile(0.9, sum by (job) (rate(http_request_duration_seconds[10m])))
As with classic histograms, an estimation of the maximum and minimum observation in a histogram can be performed using 1 and 0, respectively, as the first parameter of histogram_quantile. However, native histograms with standard schemas enable much more useful results, not only because of the usually higher resolution of native histograms, but even more so because native histograms with standard schemas sustain the same resolution across the whole range of float64 numbers. With a classic histogram, the odds are that the maximum observation is in the +Inf bucket, so that the estimation simply returns the upper limit of the last bucket before the +Inf bucket. Similarly, the minimum observation will often be in the lowest bucket.
histogram_quantile treats observations of value NaN (which SHOULD NOT happen, see above) effectively as observations of +Inf. This follows the rationale that NaN is never less than any value that histogram_quantile returns and is consistent with how classic histograms usually treat NaN observations (which end up in the +Inf bucket in most implementations). (TODO: The correct implementation of this behavior still needs to be verified by tests.)
The following functions have been introduced specifically for native histograms:
histogram_avg()
histogram_count()
histogram_fraction()
histogram_sum()
histogram_stddev()
histogram_stdvar()
All these functions silently ignore float samples as input. Each function returns a vector of float samples.
histogram_count() and histogram_sum() return the count of observations or the sum of observations, respectively, that are contained in a native histogram. As they are normal functions, their result cannot be used in a range selector. Instead of using sub-queries, the recommended way to calculate a rate of the count or the sum of observations is to first rate the histogram and then apply histogram_count() or histogram_sum() to the result. For example, the following query calculates the rate of observations (in this case corresponding to “requests per second”) from a native histogram:
histogram_count(rate(http_request_duration_seconds[10m]))
Note that the special counter reset detection for native histograms doesn't apply when using a sub-query on the result of histogram_sum(), i.e. negative observations may result in spurious counter resets.
histogram_avg() returns the arithmetic average of the observed values in a native histogram. (This is notably different from applying the avg aggregation operator to a number of native histograms. The latter returns an averaged histogram.)
Similarly, histogram_stddev() and histogram_stdvar() return the estimated standard deviation or standard variance, respectively, of the observations in a native histogram. For this estimation, all observations in a bucket are assumed to have the value of the geometric mean of the bucket boundaries.
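The estimation can be sketched for a histogram with positive regular buckets only. The function names and the bucket representation are hypothetical. Taking the mean from the sum and count of observations is an assumption of this sketch, and the zero bucket and negative buckets are left out.

```python
import math


def histogram_stdvar_estimate(total_sum, total_count, buckets):
    """Estimate the variance of the observations in a histogram.

    buckets is an iterable of (lower, upper, population) with 0 < lower < upper.
    Every observation in a bucket is assumed to sit at the geometric mean of
    the bucket boundaries.
    """
    mean = total_sum / total_count
    variance = 0.0
    for lower, upper, population in buckets:
        representative = math.sqrt(lower * upper)  # geometric mean
        variance += population * (representative - mean) ** 2
    return variance / total_count


def histogram_stddev_estimate(total_sum, total_count, buckets):
    """Estimate the standard deviation as the square root of the variance."""
    return math.sqrt(histogram_stdvar_estimate(total_sum, total_count, buckets))
```

For two buckets from 1 to 4 and from 4 to 16 with two observations each, the representative values are 2 and 8, so a sum of 20 and a count of 4 give a mean of 5 and an estimated variance of 9.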
histogram_fraction(lower, upper, histogram) returns the estimated fraction of observations in histogram between the provided boundaries, the scalar values lower and upper. The error of the estimation depends on the resolution of the underlying native histogram and how closely the provided boundaries are aligned with the bucket boundaries in the histogram. +Inf and -Inf are valid boundary values and useful to estimate the fraction of all observations above or below a certain value. However, observations of value NaN are always considered to be outside of the specified boundaries (even +Inf and -Inf). (TODO: Verify the correct implementation of this behavior with tests.) Whether the provided boundaries are inclusive or exclusive is only relevant if the provided boundaries are precisely aligned with bucket boundaries in the underlying native histogram. In this case, the behavior depends on the precise definition of the schema of the histogram.
The following functions do not interact directly with sample values and therefore work with native histogram samples in the same way as they work with float samples: (TODO: Still need to verify that this is true for all of the functions below. Need to update the documentation for some of them.)
absent()
absent_over_time()
count_over_time()
info()
label_join()
label_replace()
last_over_time()
present_over_time()
sort_by_label()
sort_by_label_desc()
timestamp()
All remaining functions not mentioned in this section do not work with native histograms. Histogram elements in the input vector are silently ignored. (TODO: Make sure this is the case, rather than treating histogram samples as floats with value 0.) For deriv(), double_exponential_smoothing(), predict_linear(), and all the <aggregation>_over_time() functions not mentioned before, native histogram samples are removed from the input range vector. In case any series contains a mix of float samples and histogram samples within the range, the removal of histograms is flagged by an info-level annotation.
Recording rules MAY result in native histogram values. They are stored back into the TSDB as during normal ingestion, including whether the histogram is a gauge histogram or a counter histogram. In the latter case, a counter reset explicitly marked by the counter reset hint is also stored, while a new counter reset detection is initiated during ingestion otherwise.
TSDB implementations MAY convert the float histograms created by recording rules to integer histograms if this conversion precisely represents all the float values in the original histogram.
Alerts work as usual with native histograms. However, it is RECOMMENDED to avoid native histograms as output values for alerts. If native histogram samples are used in templates, they are rendered in their simple text form (as produced by the Go FloatHistogram.String method), which is hard to read for humans.
The PromQL testing framework has been extended so that both PromQL unit tests as well as rules unit tests via promtool can include native histograms. The histogram sample notation is complex and explained in the documentation for rules unit testing.
There is an alternative load command called load_with_nhcb, which converts classic histograms to NHCBs and loads both the float series of the classic histogram as well as the NHCB series resulting from the conversion.
Not specific to native histograms, but very useful in their context, are the eval_info and eval_warn keywords that expect the evaluation to result in at least one info-level annotation or at least one warn-level annotation, respectively. It is currently neither possible to test for the presence of annotations of both levels nor to test for specific annotations.
As usual, PromQL implementations MAY apply any optimizations they see fit as long as the behavior stays the same. Decoding native histograms can be quite expensive with the potentially many buckets. Similarly, deep-copying a histogram sample within the PromQL engine is much more expensive than copying a simple float sample. This creates a huge potential for optimization compared to a naive approach of always decoding everything and always copying everything.
Prometheus currently tries to avoid needless copies (TODO: but a proper CoW like approach still has to be implemented, as it would be much cleaner and less bug prone) and skips decoding of the buckets for special cases where only the sum and count of observations is required.
The query API documentation includes native histogram support. This section focuses on the parts relevant for native histograms and provides a bit of context not part of the API documentation.
To return native histograms in the JSON response of instant (query endpoint) and range (query_range endpoint) queries, both the vector and the matrix result types need an extension by a new key.
The vector result type gets a new key histogram at the same level as the existing value key. Both these keys are mutually exclusive, i.e. each element in a vector has either a value key (for a float result) or a histogram key (for a histogram result). The value of the histogram key is structured similarly to the value of the value key (a two-element array), with the difference that the string representing the float sample value is replaced by a specific histogram object described below.
The matrix result type gets a new key histograms at the same level as the existing values key. These keys are not mutually exclusive. A series may contain both float values and histogram values, but for a given timestamp, there must be only one sample, either a float or a histogram. The value of the histograms key is structured similarly to the value of the values key (an array of n two-element arrays), with the difference that the strings representing float sample values are replaced by specific histogram objects described below.
Note that a better naming of the keys would be float/histogram and floats/histograms because both float values and histogram values are values. The current naming has historical reasons. (In the past, there was only one value type, namely floats, so calling the keys simply value and values was the obvious choice.) The intention here is to not break existing consumers that do not know about native histograms.
The histogram object mentioned above has the following structure:
{
  "count": "<count_of_observations>",
  "sum": "<sum_of_observations>",
  "buckets": [ [ <boundary_rule>, "<left_boundary>", "<right_boundary>", "<count_in_bucket>" ], ... ]
}
count and sum directly correspond to the histogram fields of the same name. Each bucket is represented explicitly with its boundaries and count, including the zero bucket. Spans and the schema are therefore not part of the response, and the structure of the histogram object does not depend on the used schema. The <boundary_rule> placeholder is an integer between 0 and 3 with the following meaning:
For standard schemas, positive buckets are “open left”, negative buckets are “open right”, and the zero bucket (with a negative left boundary and a positive right boundary) is “closed both”. For NHCBs, all buckets are “open left” (mirroring the behavior of classic histograms). Future schemas might utilize different boundary rules.
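A consumer of the API could decode a vector element roughly as follows. The function names and the returned dictionary layout are hypothetical. The mapping of the four <boundary_rule> values (0 for “open left”, 1 for “open right”, 2 for “open both”, 3 for “closed both”) follows the query API documentation. The input is assumed to be an already JSON-decoded element.

```python
# rule -> (name, left boundary inclusive, right boundary inclusive)
BOUNDARY_RULES = {
    0: ("open left", False, True),
    1: ("open right", True, False),
    2: ("open both", False, False),
    3: ("closed both", True, True),
}


def parse_histogram(obj):
    """Convert the histogram object of the API response into floats."""
    buckets = []
    for rule, left, right, count in obj["buckets"]:
        name, left_inclusive, right_inclusive = BOUNDARY_RULES[rule]
        buckets.append({
            "rule": name,
            "left": float(left),
            "right": float(right),
            "left_inclusive": left_inclusive,
            "right_inclusive": right_inclusive,
            "count": float(count),
        })
    return {
        "count": float(obj["count"]),
        "sum": float(obj["sum"]),
        "buckets": buckets,
    }


def parse_vector_element(element):
    """Return (timestamp, sample) for an element with a value or histogram key."""
    if "histogram" in element:
        timestamp, raw = element["histogram"]
        return timestamp, parse_histogram(raw)
    timestamp, raw = element["value"]
    return timestamp, float(raw)
```

Because boundaries and counts are transmitted as strings, values such as "+Inf" or "NaN" survive the JSON encoding and are turned back into floats by the conversion.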
For the series
endpoint, series containing native histograms are included in
the same way as conventional series containing only floats. The endpoint does
not provide any information what sample types are included (and in fact, any
series may contain either or both sample types). Note in particular that a
histogram exposed by a target under the name request_duration_seconds
will
lead to a series called request_duration_seconds
if it is exposed and
ingested as a native histogram, but if it is exposed and ingested as a classic
histogram, it will lead to a set of series called
request_duration_seconds_sum
, request_duration_seconds_count
, and
request_duration_seconds_bucket
. If the histogram is ingested as both a
native histogram and a classic
histogram, all of the series
names above will be returned by the series
endpoint.
The target and metric metadata (endpoints targets/metadata and metadata) work a bit differently because they act on the original name as exposed by the target. This means that a classic histogram called request_duration_seconds will be represented by these metadata endpoints only as request_duration_seconds (and not request_duration_seconds_sum, request_duration_seconds_count, or request_duration_seconds_bucket). A native histogram request_duration_seconds will also be represented under this name. Even in the case where request_duration_seconds is ingested as both a classic and a native histogram, there will be no collision as the metadata returned is actually the same (most notably the returned type will be histogram). In other words, there is currently no way of distinguishing native from classic histograms via the metadata endpoints alone. An additional look-up via the series endpoint is required. There are no plans to change this, as the existing metadata endpoints are severely limited anyway (no historical information, no metadata for metrics created by rules, limited ability to handle conflicting metadata between different targets). There are plans, though, to improve metadata handling in Prometheus in general. Those efforts will also take into account how to support native histograms properly. (TODO: Update as progress is made.)
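The additional look-up mentioned above could be sketched as follows. This is only a heuristic under stated assumptions: the caller has already learned from the metadata endpoints that the metric is of type histogram and has collected the set of series names returned by the series endpoint. The function `classify_histogram` is hypothetical.

```python
from typing import Iterable


def classify_histogram(metric_name: str, series_names: Iterable[str]) -> str:
    """Guess how a metric of metadata type `histogram` was ingested, based on
    the series names found for it. Returns 'native', 'classic', 'both', or
    'unknown'. This is a heuristic, not an authoritative check."""
    found = set(series_names)
    # Classic histograms show up with the three well-known suffixes.
    has_classic = all(
        metric_name + suffix in found for suffix in ("_sum", "_count", "_bucket")
    )
    # Native histograms show up under the name exposed by the target.
    has_native = metric_name in found
    if has_classic and has_native:
        return "both"
    if has_native:
        return "native"
    if has_classic:
        return "classic"
    return "unknown"
```

Since the series endpoint does not reveal sample types, a result of 'native' only means that a series with the unsuffixed name exists.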
This section describes the rendering of histograms by Prometheus's own UI. This MAY be used as a guideline for third party graphing frontends.
In the Table view, a histogram data point is rendered graphically as a bar graph together with a textual representation of all the buckets with their lower and upper limit and the count and sum of observations. Each bar in the bar graph represents a bucket. The position of each bar on the x axis is determined by the lower and upper limit of the corresponding bucket. The area of each bar is proportional to the population of the corresponding bucket (which is a core principle of rendering histograms in general).
The graphical histogram allows a choice between an exponential and a linear x axis. The former is the default. It is a good fit for the standard schemas. (TODO: Consider linear as a default for non-exponential schemas.) Conveniently, all regular buckets of an exponential schema have the same width on an exponential x axis. This means that the y axis can display actual bucket populations without violating the above principle that the area (not the height) of a bar is representative of the bucket population. The zero bucket is an exception to that. Technically, it has an infinite width on an exponential x axis. Prometheus simply renders it with the same width as the regular exponential buckets (which in turn means that the x axis is not strictly exponential around the zero point). (TODO: How to do the rendering for non-exponential schemas.)
With a linear x axis, the buckets generally have varying width. Therefore, the y axis displays the bucket population divided by its width. The Prometheus UI does not render values on the y axis as they would be hard to interpret for humans anyway. The population can still be inspected in the text representation.
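The relationship between bar area, bar height, and bucket width can be sketched as below. Both helper functions are hypothetical and are not taken from the Prometheus UI code; they merely restate the rendering principle in executable form for positive buckets.

```python
import math
from typing import List, Tuple

# Each bucket is (lower boundary, upper boundary, population).
Bucket = Tuple[float, float, float]


def bar_heights_linear(buckets: List[Bucket]) -> List[float]:
    """Bar height on a linear x axis: population divided by bucket width,
    so that the bar area stays proportional to the population."""
    return [population / (upper - lower) for lower, upper, population in buckets]


def bar_widths_exponential(buckets: List[Bucket]) -> List[float]:
    """Bar width on an exponential (log2) x axis for positive buckets.
    For one exponential schema, all regular buckets get the same width."""
    return [math.log2(upper / lower) for lower, upper, _ in buckets]
```

For two buckets with the boundaries 1, 2, and 4 and equal populations, the linear heights differ by a factor of two, while the exponential widths are identical, which is why the populations can be displayed directly in the latter case.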
In the Graph view, Prometheus displays a heatmap (TODO: not yet, see below), which could be seen as a series of histograms over time, rotated by 90 degrees and encoding the bucket population as a color rather than the height of a bar. The typical query to render a counter-like histogram as a heatmap would be a rate query. A heatmap is an extremely powerful representation that allows humans to easily spot characteristics of distributions as they change over time.
Native histograms work in template expansion. They are rendered in a text representation inspired by the mathematical notation of open and closed intervals. (This is generated by the FloatHistogram.String method in Go.) As native histograms can have a lot of buckets and bucket boundaries tend to have a lot of decimal places, the representation isn't necessarily very readable. Use native histograms in template expansion judiciously.
Example for the text representation of a float histogram:
{count:3493.3, sum:2.349209324e+06, [-22.62741699796952,-16):1000, [-16,-11.31370849898476):123400, [-4,-2.82842712474619):3, [-2.82842712474619,-2):3.1, [-0.01,0.01]:5.5, (0.35355339059327373,0.5]:1, (1,1.414213562373095]:3.3, (1.414213562373095,2]:4.2, (2,2.82842712474619]:0.1}
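To illustrate how the interval notation maps to inclusive and exclusive boundaries, the following sketch parses the example above. The parser is hypothetical, handles only the shape shown in this example, and is not a port of any Prometheus code.

```python
import re
from typing import List, NamedTuple, Tuple


class Bucket(NamedTuple):
    lower: float
    upper: float
    lower_inclusive: bool  # True for "[", False for "("
    upper_inclusive: bool  # True for "]", False for ")"
    count: float


# Matches e.g. "[-16,-11.3):123400" or "(1,1.41]:3.3".
_BUCKET_RE = re.compile(r"([\[(])([^,\[\]()]+),([^,\[\]()]+)([\])]):([^,}\s]+)")
# Matches the leading "count:..." and "sum:..." fields.
_FIELD_RE = re.compile(r"\b(count|sum):([^,}\s]+)")

EXAMPLE = (
    "{count:3493.3, sum:2.349209324e+06, [-22.62741699796952,-16):1000, "
    "[-16,-11.31370849898476):123400, [-4,-2.82842712474619):3, "
    "[-2.82842712474619,-2):3.1, [-0.01,0.01]:5.5, (0.35355339059327373,0.5]:1, "
    "(1,1.414213562373095]:3.3, (1.414213562373095,2]:4.2, (2,2.82842712474619]:0.1}"
)


def parse_histogram_text(text: str) -> Tuple[float, float, List[Bucket]]:
    """Parse the text representation into (count, sum, buckets)."""
    fields = {key: float(value) for key, value in _FIELD_RE.findall(text)}
    buckets = [
        Bucket(float(lo), float(hi), left == "[", right == "]", float(count))
        for left, lo, hi, right, count in _BUCKET_RE.findall(text)
    ]
    return fields["count"], fields["sum"], buckets
```

Applied to the example, the negative buckets come out with an inclusive lower boundary, the positive buckets with an inclusive upper boundary, and the zero bucket with both boundaries inclusive.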
The protobuf specs for remote write & read were extended for native histograms as an experimental feature. Receivers not capable of processing native histograms will simply ignore the newly added fields. Nevertheless, Prometheus has to be configured to send native histograms via remote write (by setting the send_native_histograms remote write config setting to true).
In remote write v2, native histograms are a stable feature.
It might appear tempting to convert classic histograms to NHCBs while sending or receiving them. However, this does not overcome the known consistency problems classic histograms suffer from when transmitted via remote write. Instead, classic histograms SHOULD be converted to NHCBs during scraping. Similarly, explicit OTel histograms SHOULD be converted to NHCBs during OTLP ingestion already. (TODO: See tracking issue.)
Federation of native histograms works as expected, provided the federation scrape uses the protobuf format. A federation via OpenMetrics text format will be possible, at least in principle, once native histograms are supported in that format, but federation via protobuf is preferred for efficiency reasons anyway.
The OTLP receiver built into Prometheus converts incoming OTel exponential histograms to Prometheus native histograms utilizing the compatibility described above. The resolution of a histogram using a schema (“scale” in OTel lingo) greater than 8 will be reduced to match schema 8. (In the unlikely case that a schema smaller than -4 is used, the ingestion will fail.)
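The resolution reduction can be sketched as follows. This is not the code of the OTLP receiver. It assumes Prometheus-style bucket indices, where the bucket with index i has the upper boundary base^i (OTel uses a different index offset), and represents one side of the histogram as a hypothetical mapping from index to population.

```python
from typing import Dict, Tuple

MAX_SCHEMA = 8   # highest resolution accepted, per the text above
MIN_SCHEMA = -4  # lowest resolution accepted, per the text above


def reduce_to_supported_schema(
    schema: int, buckets: Dict[int, float]
) -> Tuple[int, Dict[int, float]]:
    """Merge buckets so that the resulting schema is at most MAX_SCHEMA.
    Raises ValueError for schemas below MIN_SCHEMA."""
    if schema < MIN_SCHEMA:
        raise ValueError(f"schema {schema} is below the supported minimum")
    if schema <= MAX_SCHEMA:
        return schema, dict(buckets)
    shift = schema - MAX_SCHEMA
    merged: Dict[int, float] = {}
    for index, population in buckets.items():
        # Each step down in schema merges two neighboring buckets into one.
        # With upper boundary base^i, index i maps to ceil(i / 2^shift).
        target = ((index - 1) >> shift) + 1
        merged[target] = merged.get(target, 0) + population
    return MAX_SCHEMA, merged
```

Under these assumptions, going from schema 10 to schema 8 merges the buckets with the indices 1 to 4 into the bucket with index 1, while the bucket with index 5 ends up in the bucket with index 2.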
Explicit OTel histograms are the equivalent of Prometheus's classic histograms. Prometheus therefore converts them to classic histograms by default, but optionally offers direct conversion to NHCBs. (TODO: Not implemented yet, see tracking issue.)
Native histogram support has been gradually added to the Pushgateway. Full support was reached in v1.9. The Pushgateway has always been based on the classic protobuf format as its internal data model, which made the necessary changes easy (mostly UI concerns). Combined histograms (with classic and native buckets) can be pushed and will be exposed as such via the /metrics endpoint. (However, the query API, which can be used to query the pushed metrics as JSON, will only be able to return one kind of bucket and will prefer native buckets if present.)
promtool
This section describes promtool commands added or changed to support native histograms. Commands not mentioned explicitly do not directly interact with native histograms and require no changes.
The promtool query ... commands work with native histograms. See the query API documentation to learn about the output format.
A new command promtool query analyze was specifically added to analyze classic and native histogram usage patterns returned by the query API.
The rules unit testing via promtool test rules works with native histograms, using the format described above.
promtool tsdb analyze and promtool tsdb list work normally with native histograms. The --extended output of the former has specific sections for histogram chunks.
promtool tsdb dump uses the usual text representation of native histograms (as produced by the Go method FloatHistogram.String).
promtool tsdb create-blocks-from rules works with rules that emit native histograms.
The promtool promql ... commands support all the PromQL features added for native histograms.
While promtool tsdb bench write could in principle include native histograms, such support is not planned at the moment.
The following commands depend on the OpenMetrics text format and therefore cannot support native histograms as long as there is no native histogram support in OpenMetrics:
promtool check metrics
promtool push metrics
promtool tsdb dump-openmetrics
promtool tsdb create-blocks-from openmetrics
prom2json
prom2json is a small tool that scrapes a Prometheus /metrics endpoint and converts the metrics to a bespoke JSON format, which it dumps to stdout. This is convenient for further processing with tools handling JSON, for example jq.
prom2json v1.4 added support for native histograms. If a histogram in the exposition contains at least one bucket span, prom2json will replace the usual classic buckets in the JSON output with the buckets of the native histogram, following a format inspired by the Prometheus query API.
When migrating from classic to native histograms, there are three important sources of issues to consider:
To address (3), it is of course possible to not migrate the classic histogram in question and leave things as they are. Another option is to leave the instrumentation the same but convert classic histograms to NHCBs upon ingestion. This leverages the increased storage performance of native histograms, but still requires addressing (1) and (2) in the same way as for a full migration to native histograms (see next paragraphs).
The conservative way of addressing (1) and (2) is to allow a long transition period, which comes at the cost of collecting and storing classic and native histograms in parallel for a while.
The first step is to update the instrumentation to expose classic and native histograms in parallel. (This step can be skipped if the plan is to stick with classic histograms in the instrumentation and simply convert them to NHCBs during scraping.)
Then configure Prometheus to scrape both classic and native histograms, see the section about scraping both classic and native histograms above. (If needed, also activate the conversion of classic histograms to NHCBs.)
The existing queries involving classic histograms will continue to work, but from now on, users can start working with native histograms and start to change queries in dashboards, alerts, recording rules, … As already mentioned above, it is important to pay attention to queries with longer range vectors like histogram_quantile(0.9, rate(rpc_duration_seconds[1d])). This query calculates the 90th percentile latency over the last day. However, if native histograms haven't been collected for at least one day, the query will only cover that shorter period. Thus, the query should only be used once native histograms have been collected for at least 1d. For a dashboard that displays the daily 90th percentile latency over the last month, it is tempting to craft a query that correctly switches from classic to native histograms at the right moment. While that is in principle possible, it is tricky. If feasible, the transition period during which classic and native histograms are collected in parallel can be quite long to minimize the necessity to implement tricky switch-overs. For example, once classic and native histograms have been collected in parallel for a month, any dashboard not looking farther back than a month can simply be switched over from a classic histogram query to a native histogram query without having to consider the right switch-over moment.
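The timing consideration above can be turned into a small calculation. The helper `earliest_safe_switch` is hypothetical and deliberately conservative: it adds both the range of the query and the look-back period of the dashboard to the time at which the collection of native histograms started.

```python
from datetime import datetime, timedelta


def earliest_safe_switch(
    native_since: datetime,
    query_range: timedelta,
    dashboard_lookback: timedelta = timedelta(0),
) -> datetime:
    """Return the earliest point in time at which a query with the given
    range, shown on a dashboard with the given look-back, is fully covered
    by native histogram data collected since `native_since`."""
    return native_since + query_range + dashboard_lookback
```

For a query with a range of 1d and native histograms collected since the start of a 31-day month, the sketch yields the second day of that month for an instant evaluation and the first day of the following month for a dashboard looking back 30 days.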
Once there is confidence that all queries have been migrated correctly, configure Prometheus to only scrape native histograms (which is the “normal” setting). (It is also possible to incrementally remove classic histograms with relabel rules in the scrape config.) If everything still works, it is time to remove classic histograms from the instrumentation.
The Grafana Mimir documentation contains a detailed migration guide following the same philosophy as described in this section.