Realtime Data in Apache Druid: Choosing the Right Strategy


Storing data from real-time streams has always been difficult. The right solution depends on the application. If you want to store data for daily or monthly analysis, you can use a distributed file system and run Hive or Presto on top of it. If you only need simple real-time analytics, you can keep the latest data in Elasticsearch and build charts on it with Kibana.

Apache Druid was created to handle both of these use cases. It can serve as a persistent data store for daily or monthly analysis. It also acts as a fast, queryable data store that lets you ingest and query data in real time.

The problem with earlier versions of Apache Druid, however, was ingesting data from streams into the database. Let's take a look at the problems developers have run into.

Tranquility

Tranquility is a package provided with Apache Druid for ingesting real-time data. It is not an exact equivalent of a JDBC or Cassandra driver. It handles partitioning, replication, service discovery, and schema rollover for you. Users only need to care about the data and the datasource it goes into.

Real-time ingestion with Tranquility

Tranquility solves many problems that users may encounter.
However, it comes with its own set of challenges.

Not Exactly-once

In some cases, Tranquility creates duplicate records. It does not provide an exactly-once guarantee. When a POST request times out or no acknowledgement is received, Tranquility can generate duplicate records.

In this situation, deduplication is the responsibility of the user. If duplicates are not handled, they can lead to incorrect charts in Apache Superset.
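To illustrate what user-side deduplication means, here is a minimal sketch. It assumes each event carries a unique identifier (the `event_id` field is hypothetical, not something Tranquility adds) and that the rows being cleaned fit in memory, for example rows read back before charting. It shows the idea only and is not a drop-in fix.

```python
from typing import Dict, Iterable, Iterator, Set


def deduplicate(events: Iterable[Dict], key: str = "event_id") -> Iterator[Dict]:
    """Yield each event once, keyed on a unique id (hypothetical field name)."""
    seen: Set[str] = set()
    for event in events:
        event_key = event[key]
        if event_key in seen:
            continue  # this id was already emitted, e.g. after a retried batch
        seen.add(event_key)
        yield event


events = [
    {"event_id": "a1", "value": 10},
    {"event_id": "b2", "value": 20},
    {"event_id": "a1", "value": 10},  # duplicate produced by a retry
]
unique_events = list(deduplicate(events))
```

The set of seen ids grows without bound here, so a real pipeline would need to cap it, for instance by keeping ids only for a limited time range.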

Data drops

The biggest problem with Tranquility is dropped data. There are a number of situations, some by design and some caused by failures, in which data silently fails to be ingested. Some examples listed in the official documentation are:

Events with timestamps outside the configured windowPeriod are dropped.

If more Druid Middle Manager failures occur than the configured number of replicas, some partially indexed data may be lost.

If a persistent problem prevents communication with the Druid indexing service and the retry policy is exhausted during that time, or if the problem lasts longer than the windowPeriod, some events are dropped.

If a problem prevents Tranquility from receiving an acknowledgement from the indexing service, it retries the batch, which can lead to duplicated events.

Worst of all, in most cases you do not know that data has been dropped until you query it.
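The first case in the list above can be sketched as a simple time check. This is a simplified model, assuming an event is accepted only when its timestamp lies within the window period of the current time. The function name `is_within_window` is made up for illustration, and Tranquility's actual logic also involves segment granularity.

```python
from datetime import datetime, timedelta, timezone


def is_within_window(event_time: datetime, now: datetime,
                     window_period: timedelta) -> bool:
    """Return True if the event timestamp is within window_period of now."""
    return abs(now - event_time) <= window_period


now = datetime(2019, 10, 1, 12, 0, tzinfo=timezone.utc)
window_period = timedelta(minutes=10)

# An event that is three minutes old falls inside the window.
on_time = is_within_window(now - timedelta(minutes=3), now, window_period)

# An event that is two hours old falls outside the window and would be dropped.
too_late = is_within_window(now - timedelta(hours=2), now, window_period)
```

Running a check like this on the producer side and counting the rejected events is one way to make otherwise silent drops visible.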

Error Handling

Tranquility runs inside the application's JVM, so handling errors such as timeouts is the responsibility of the application. For applications like Apache Flink, failing to handle one of these errors properly can lead to unexpected restarts.

On top of all these problems, Tranquility was built against Druid 0.9.2. Using it with the current Druid version, 0.16.0, can lead to unknown problems.

Kafka Indexer

To fix all of the above problems, Apache Druid added the Kafka Indexer in version 0.9.1. The indexer remained experimental until version 0.14.

The Kafka Indexing Service first starts a supervisor based on the spec you submit. The supervisor then periodically spawns new indexing tasks that consume data from Kafka and publish it to Druid.
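As a rough sketch of what submitting such a spec looks like, the snippet below builds a minimal supervisor spec and posts it to the Overlord's supervisor endpoint. The spec layout follows the parser-based format used around Druid 0.16, and newer versions use a different layout. The datasource, topic, column names, and host addresses are placeholders, and the helper names are made up for illustration, so check everything against the documentation for your Druid version.

```python
import json
import urllib.request


def build_supervisor_spec(datasource: str, topic: str, bootstrap_servers: str) -> dict:
    """Build a minimal Kafka supervisor spec (Druid 0.16-style layout, assumed)."""
    return {
        "type": "kafka",
        "dataSchema": {
            "dataSource": datasource,
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["page", "user"]},
                },
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "NONE",
            },
        },
        # A task publishes a segment once this many rows have been read.
        "tuningConfig": {"type": "kafka", "maxRowsPerSegment": 5000000},
        "ioConfig": {
            "topic": topic,
            "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            "taskCount": 1,
            "replicas": 1,
            "taskDuration": "PT1H",
        },
    }


def submit_supervisor(spec: dict, overlord_url: str = "http://localhost:8090") -> dict:
    """POST the spec to the Overlord; the URL is a placeholder for your cluster."""
    request = urllib.request.Request(
        overlord_url + "/druid/indexer/v1/supervisor",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


spec = build_supervisor_spec("page_views", "page_views", "localhost:9092")
payload = json.dumps(spec, indent=2)
# submit_supervisor(spec)  # uncomment when a Druid cluster is reachable
```

Keeping the spec in a builder function makes it easy to review the payload or put it under version control before anything is sent to the cluster.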

Unlike Tranquility tasks, Kafka indexing tasks can run for a long time. They can publish multiple segments, one each time the maximum number of rows or bytes is reached, without having to start a new task.

Kafka Indexer aims to solve various problems that its predecessor faced.

Exactly-Once Semantics

Kafka Indexer provides an exactly-once ingestion guarantee. Since Kafka 0.11.x supports these semantics out of the box, the guarantee comes naturally to Kafka users.

Publish delayed data

Kafka Indexer is able to publish delayed data. Tranquility's windowPeriod restriction does not apply. This lets you freely backfill data from a specific Kafka offset into Druid.

Schema Update

Tranquility also supports schema updates, but they are easier to do with Kafka Indexer. When you submit a POST request with a new schema, the supervisor spawns new tasks with the updated schema. You don't need to make any changes on the producer side.

If you add a new column, older rows show an empty value in that column, but you can still query them.

The Kafka Indexing Service addresses many of the issues developers face when using Tranquility. To get started with the Kafka Indexing Service, see Apache Kafka Ingestion in the official Druid documentation.
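A schema update of this kind can be sketched as a small change to the spec before it is posted again. The helper `add_dimension` is hypothetical. It also assumes the parser-based spec layout used around Druid 0.16, so the path to the dimension list may differ in other versions.

```python
import copy


def add_dimension(spec: dict, dimension: str) -> dict:
    """Return a copy of the supervisor spec with one extra dimension."""
    updated = copy.deepcopy(spec)  # leave the original spec untouched
    dimensions = updated["dataSchema"]["parser"]["parseSpec"]["dimensionsSpec"]["dimensions"]
    if dimension not in dimensions:
        dimensions.append(dimension)
    return updated


# A trimmed-down spec containing only the part this example touches.
old_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "page_views",
        "parser": {
            "parseSpec": {
                "dimensionsSpec": {"dimensions": ["page", "user"]},
            },
        },
    },
}

new_spec = add_dimension(old_spec, "country")
new_dimensions = new_spec["dataSchema"]["parser"]["parseSpec"]["dimensionsSpec"]["dimensions"]
```

The updated spec would then be submitted with a POST request as described above. Rows ingested before the change simply have no value for `country`.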