More and more enterprise projects generate, consume, analyze, and respond to data in real time. This lets technology stay aware of what is happening in and around it and make pragmatic tactical decisions on its own. We see this at play in transportation, telecommunications, healthcare, security and law enforcement, finance, manufacturing, and nearly every other industry.
Before this evolution, the insights contained in data were derived long after the event that produced the data. Now, we can use technology to capture, analyze, and act on what is happening right now.
This type of data goes by many names: streaming, messaging, real-time feeds, real-time, and event-driven. Several popular technologies are in use for streaming data and message queuing, including Apache Kafka and Apache Pulsar™.
In January, DataStax, known for its commercial support, software, and cloud database as a service for Apache Cassandra™, launched a new streaming product line called Luna Streaming. DataStax Luna Streaming is a subscription service based on open source Apache Pulsar. In April, DataStax launched a private beta of streaming Pulsar as a service for data engineers, software engineers, and enterprise architects.
We recently conducted a performance test comparing Luna Streaming (Pulsar) and Kafka clusters running on Kubernetes. We wanted to see whether Pulsar’s inherent architectural advantages (tiered storage, decoupled compute and storage, and multi-tenancy) translate into a more efficient architecture and, in turn, into tangible performance advantages in real-world scenarios.
We deployed the Kubernetes clusters on Amazon Web Services EC2 instances and used the OpenMessaging Benchmark (OMB) test tool for evaluation, specifically the Confluent fork of OpenMessaging Benchmark on GitHub. We used the same instance type and hardware configuration for the Kafka brokers, and co-located the Pulsar brokers with the BookKeeper nodes, to take advantage of two large (2.5TB), fast, locally attached NVMe solid state drives.
For Kafka, we spanned the persistent volume storage across both disks. For Pulsar, we created persistent volumes and used one local drive for the BookKeeper ledgers and the other for the ranges. For the BookKeeper journal, we provisioned a 100GB gp3 AWS Elastic Block Store (EBS) volume with 4,000 IOPS and 1,000 MB/s throughput. Apart from these storage configurations, we made no other specific tuning to either platform, preferring to use their “out of the box” configurations as deployed through their respective Docker images and Helm charts.
Our performance tests show that Luna Streaming had a higher average throughput in every OMB test workload we ran. In terms of broker node equivalence, we found that:
3 Luna Streaming nodes ≈ 5 Kafka nodes
6 Luna Streaming nodes ≈ 8 Kafka nodes
9 Luna Streaming nodes ≈ 14 Kafka nodes
For our first cost scenario, we assume that the enterprise’s streaming data demand simply grows linearly over three years: a “small” cluster (3x Luna Streaming or 5x Kafka) in year 1, a “medium” cluster (6x Luna Streaming or 8x Kafka) in year 2, and a “large” cluster (9x Luna Streaming or 14x Kafka) in year 3. Using the node equivalents found in our tests above, choosing Luna Streaming instead of Kafka results in a 33% savings in infrastructure costs.
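The 33% figure can be sanity-checked from the node counts alone. The snippet below is a minimal sketch, not the cost model used in the report: it assumes every broker node costs the same per year on both platforms, so that cost is proportional to the number of node-years. The names `NODES` and `node_year_savings` are ours, chosen for illustration.

```python
# Broker node counts per cluster size, taken from the equivalences above.
NODES = {
    "small": {"luna": 3, "kafka": 5},    # year 1
    "medium": {"luna": 6, "kafka": 8},   # year 2
    "large": {"luna": 9, "kafka": 14},   # year 3
}


def node_year_savings(nodes):
    """Return the fraction of node-years saved by Luna Streaming.

    Assumption (ours, not the report's): a node costs the same per year
    on both platforms, so total cost is proportional to total node count.
    """
    luna_total = sum(tier["luna"] for tier in nodes.values())
    kafka_total = sum(tier["kafka"] for tier in nodes.values())
    return 1 - luna_total / kafka_total


savings = node_year_savings(NODES)
print(f"{savings:.0%}")  # 18 vs. 27 node-years, prints 33%
```

Under that equal-cost assumption, 18 Luna Streaming node-years against 27 Kafka node-years gives the one-third savings quoted above. Actual savings would shift with per-node pricing and any storage or networking costs not captured here.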
In our second cost scenario, which focuses on “peak period” workloads, we found a savings of about 50%, depending on the percentage of time that the peak period lasts.
For our third cost scenario, we focus on organizations that may have significant complexity but limited raw throughput requirements, resulting in an environment that requires a large number of topics and partitions to handle the wide-ranging needs of the entire enterprise. In this case, we found that using Luna Streaming saves 75% of infrastructure costs compared to Kafka.
You can download the report here, which contains a complete description of the tests and their results.