Implementing Custom Service Discovery

Prometheus contains built-in integrations for many service discovery (SD) systems, such as Consul, Kubernetes, and public cloud providers like Azure. However, we can’t provide integration implementations for every service discovery option out there. The Prometheus team is already stretched thin supporting the current set of SD integrations, so maintaining an integration for every possible SD option isn’t feasible. In many cases the current SD implementations have been contributed by people outside the team and then not well maintained or tested. We want to commit to only providing direct integration with service discovery mechanisms that we know we can maintain and that work as intended. For this reason, there is currently a moratorium on new SD integrations.

However, we know there is still a desire to be able to integrate with other SD mechanisms, such as Docker Swarm. Recently a small code change plus an example was committed to the documentation directory within the Prometheus repository for implementing a custom service discovery integration without having to merge it into the main Prometheus binary. The code change allows us to make use of the internal Discovery Manager code to write another executable that interacts with a new SD mechanism and outputs a file that is compatible with Prometheus' file_sd. By co-locating Prometheus and our new executable we can configure Prometheus to read the file_sd-compatible output of our executable, and therefore scrape targets from that service discovery mechanism. In the future this will enable us to move SD integrations out of the main Prometheus binary, as well as to move stable SD integrations that make use of the adapter into the Prometheus discovery package.

Integrations using file_sd, such as those implemented with the adapter code, are listed here.

Let’s take a look at the example code.

Adapter

First we have the file adapter.go. You can just copy this file for your custom SD implementation, but it's useful to understand what's happening here.

// Adapter runs an unknown service discovery implementation and converts its target groups
// to JSON and writes to a file for file_sd.
type Adapter struct {
    ctx     context.Context
    disc    discovery.Discoverer  // our custom SD implementation
    groups  map[string]*customSD  // the target groups most recently written out
    manager *discovery.Manager
    output  string                // path of the file_sd-compatible output file
    name    string                // name of this SD provider
    logger  log.Logger
}

// Run starts a Discovery Manager and the custom service discovery implementation.
func (a *Adapter) Run() {
    go a.manager.Run()
    a.manager.StartCustomProvider(a.ctx, a.name, a.disc)
    go a.runCustomSD(a.ctx)
}

The adapter makes use of discovery.Manager to actually start our custom SD provider’s Run function in a goroutine. Manager has a channel that our custom SD will send updates to. These updates contain the SD targets. The groups field contains all the targets and labels our custom SD executable knows about from our SD mechanism.

type customSD struct {
    Targets []string          `json:"targets"`
    Labels  map[string]string `json:"labels"`
}

This customSD struct exists mostly to help us convert the internal Prometheus targetgroup.Group struct into JSON for the file_sd format, which is just a list of entries like {"targets": ["10.0.0.1:9090"], "labels": {"job": "example"}}.

When running, the adapter will listen on a channel for updates from our custom SD implementation. Upon receiving an update, it will parse the targetgroup.Groups into another map[string]*customSD, and compare it with what’s stored in the groups field of Adapter. If the two are different, we assign the new groups to the Adapter struct, and write them as JSON to the output file. Note that this implementation assumes that each update sent by the SD implementation down the channel contains the full list of all target groups the SD knows about.
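
Condensed, that loop looks roughly like this (simplified from adapter.go; generateTargetGroups and writeOutput are helpers defined in the same file):

func (a *Adapter) runCustomSD(ctx context.Context) {
    updates := a.manager.SyncCh()
    for {
        select {
        case <-ctx.Done():
            return
        case allTargetGroups, ok := <-updates:
            // The channel is closed if the SD provider exits before the context is done.
            if !ok {
                return
            }
            // Convert the manager's target groups into the file_sd-friendly
            // customSD representation.
            tempGroups := generateTargetGroups(allTargetGroups)
            // Only rewrite the output file when the set of targets has changed.
            if !reflect.DeepEqual(a.groups, tempGroups) {
                a.groups = tempGroups
                if err := a.writeOutput(); err != nil {
                    level.Error(a.logger).Log("msg", "writing output failed", "err", err)
                }
            }
        }
    }
}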

Custom SD Implementation

Now we want to actually use the Adapter to implement our own custom SD. A full working example is in the same examples directory here.
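
At a high level, main() in that example just wires the pieces together: it builds our Consul-backed Discoverer and hands it to the adapter. Condensed, with flag parsing and imports elided (names like sdConfig and newDiscovery come from the example and may have changed since):

func main() {
    ctx := context.Background()
    logger := log.NewSyncLogger(log.NewLogfmtLogger(os.Stdout))

    // newDiscovery constructs our custom Discoverer, covered below.
    disc, err := newDiscovery(sdConfig{
        Address:         "localhost:8500",
        RefreshInterval: 30,
    })
    if err != nil {
        fmt.Println("err: ", err)
        return
    }

    // The adapter runs the Discoverer via the discovery manager and writes
    // its target groups to custom_sd.json in file_sd format.
    sdAdapter := adapter.NewAdapter(ctx, "custom_sd.json", "exampleSD", disc, logger)
    sdAdapter.Run()

    // Block forever; the adapter keeps the output file up to date.
    <-ctx.Done()
}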

Here you can see that we’re importing the adapter code "github.com/prometheus/prometheus/documentation/examples/custom-sd/adapter" as well as some other Prometheus libraries. In order to write a custom SD we need an implementation of the Discoverer interface.

// Discoverer provides information about target groups. It maintains a set
// of sources from which TargetGroups can originate. Whenever a discovery provider
// detects a potential change, it sends the TargetGroup through its channel.
//
// Discoverer does not know if an actual change happened.
// It does guarantee that it sends the new TargetGroup whenever a change happens.
//
// Discoverers should initially send a full set of all discoverable TargetGroups.
type Discoverer interface {
    // Run hands a channel to the discovery provider (Consul, DNS, etc.) through
    // which it can send updated target groups.
    // Must return if the context gets canceled. It should not close the update
    // channel on returning.
    Run(ctx context.Context, up chan<- []*targetgroup.Group)
}

We really just have to implement one function, Run(ctx context.Context, up chan<- []*targetgroup.Group). This is the function the manager within the Adapter code will call in a goroutine. The Run function makes use of a context to know when to exit, and is passed a channel for sending its updates of target groups.
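
Before looking at the Consul-specific code, here is a bare-bones, hypothetical sketch of what satisfying the interface involves (myDiscovery and its refresh helper are invented for illustration; the example's own discovery type follows the same shape):

package customsd

import (
    "context"
    "time"

    "github.com/prometheus/common/model"
    "github.com/prometheus/prometheus/discovery/targetgroup"
)

// myDiscovery is a hypothetical Discoverer that polls some SD mechanism.
type myDiscovery struct {
    refreshInterval time.Duration
}

// refresh stands in for querying the actual SD mechanism. It must return
// every target group the mechanism knows about, not just the changed ones.
func (d *myDiscovery) refresh() ([]*targetgroup.Group, error) {
    return []*targetgroup.Group{{
        Source:  "example",
        Targets: []model.LabelSet{{model.AddressLabel: "10.0.0.1:9100"}},
    }}, nil
}

// Run implements discovery.Discoverer: send full sets of target groups until
// the context is canceled, and never close the update channel ourselves.
func (d *myDiscovery) Run(ctx context.Context, up chan<- []*targetgroup.Group) {
    ticker := time.NewTicker(d.refreshInterval)
    defer ticker.Stop()
    for {
        if groups, err := d.refresh(); err == nil {
            up <- groups
        }
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}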

Looking at the Run function within the provided example, we can see a few key things happening that we would need to do in an implementation for another SD. We periodically make calls, in this case to Consul (for the sake of this example, assume there isn’t already a built-in Consul SD implementation), and convert the response to a set of targetgroup.Group structs. Because of the way Consul works, we have to first make a call to get all known services, and then another call per service to get information about all the backing instances.

Note the comment above the loop that’s calling out to Consul for each service:

// Note that we treat errors when querying specific consul services as fatal for this
// iteration of the time.Tick loop. It's better to have some stale targets than an incomplete
// list of targets simply because there may have been a timeout. If the service is actually
// gone as far as consul is concerned, that will be picked up during the next iteration of
// the outer loop.

With this we’re saying that if we can’t get information for all of the targets, it’s better to not send any update at all than to send an incomplete update. We’d rather have a list of stale targets for a small period of time and guard against false positives due to things like momentary network issues, process restarts, or HTTP timeouts. If we do happen to get a response from Consul about every target, we send all those targets on the channel. There is also a helper function parseServiceNodes that takes the Consul response for an individual service and creates a target group from the backing nodes with labels.
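
In simplified form, parseServiceNodes looks something like the following (the CatalogService struct here is a trimmed-down stand-in for the one the example decodes Consul's catalog response into, and error handling is elided):

// CatalogService is a trimmed-down stand-in for the struct the example
// decodes Consul's /v1/catalog/service/<name> responses into.
type CatalogService struct {
    Node           string
    Address        string
    ServiceAddress string
    ServicePort    int
}

// parseServiceNodes converts the nodes backing a single Consul service
// into one target group.
func parseServiceNodes(nodes []*CatalogService, name string) *targetgroup.Group {
    tgroup := &targetgroup.Group{
        Source: name,
        Labels: model.LabelSet{
            model.LabelName("consul_service"): model.LabelValue(name),
        },
    }
    tgroup.Targets = make([]model.LabelSet, 0, len(nodes))
    for _, node := range nodes {
        // Prefer the service address over the node address: the service may
        // be registered remotely through a different node.
        host := node.Address
        if node.ServiceAddress != "" {
            host = node.ServiceAddress
        }
        addr := net.JoinHostPort(host, strconv.Itoa(node.ServicePort))
        tgroup.Targets = append(tgroup.Targets, model.LabelSet{
            model.AddressLabel: model.LabelValue(addr),
        })
    }
    return tgroup
}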

Using the current example

Before starting to write your own custom SD implementation it’s probably a good idea to run the current example after having a look at the code. For the sake of simplicity, I usually run both Consul and Prometheus as Docker containers via docker-compose when working with the example code.

docker-compose.yml

version: '2'
services:
  consul:
    image: consul:latest
    container_name: consul
    ports:
      - 8300:8300
      - 8500:8500
    volumes:
      - ${PWD}/consul.json:/consul/config/consul.json
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - 9090:9090

consul.json

{
  "service": {
    "name": "prometheus",
    "port": 9090,
    "checks": [
      {
        "id": "metrics",
        "name": "Prometheus Server Metrics",
        "http": "http://prometheus:9090/metrics",
        "interval": "10s"
      }
    ]
  }
}

If we start both containers via docker-compose and then run the example main.go, the example will query the Consul HTTP API at localhost:8500 and write a file_sd-compatible file to custom_sd.json. We can then configure Prometheus to pick up this file via the file_sd config:

scrape_configs:
  - job_name: "custom-sd"
    scrape_interval: "15s"
    file_sd_configs:
    - files:
      - /path/to/custom_sd.json