Kafka

What Is Kafka?

Kafka is a publish-subscribe based messaging system designed from the ground up to provide high throughput, fast performance, scalability, and high availability.

Kafka was originally created at LinkedIn, which open sourced it in 2011 and donated it to the Apache Software Foundation (ASF). The ASF is a decentralized open-source community of developers. The software it produces, in this case Apache Kafka, is free and open-source software distributed under the terms of the Apache License. Kafka is one of the five most active projects of the Apache Software Foundation and has quickly evolved from a messaging queue into a full-fledged event streaming platform used to collect, process, and store streaming event data, that is, data that has no discrete beginning or end. Kafka abstracts away the details of files and presents log or event data as a clean stream of messages.

Kafka is published by the Apache Software Foundation


To help with your Kafka introduction, here are some of the advantages of Kafka:

Open Sourced via Apache
Free software license
Fast, low latency system dedicated to high performance
Scalable, distributed, and robust design
Runs as a cluster of servers, each of which is called a Broker and can handle large volumes of data
Highly scalable storage system

Kafka records a log of messages defining exactly what event happened and when, and stores that information as record streams. This log is called an immutable commit log: it can be appended to, but not otherwise changed. Streaming real-time applications, as well as other systems, can subscribe to a Topic (access/read the data) and also publish to it (write/add more data). Terms that Kafka beginners should know include:

Messages – an array of bytes (record).
Topics – a collection of Messages that relate to a single category.
Producers (Publishers) – publish Messages on one or more Kafka Topics.
Consumers (Subscribers) – subscribe to Topics and read the Messages already published to them. A Consumer can subscribe to more than one Topic.
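
To make these terms concrete, here is a minimal in-memory sketch of an append-only topic log with a publisher and a subscriber. It is a toy model only: the names ToyBroker and ToyConsumer are hypothetical, nothing here is the Kafka client API, and no networking, partitioning, or persistence is modeled.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only list per topic."""

    def __init__(self):
        self._logs = defaultdict(list)  # topic name -> list of byte messages

    def publish(self, topic: str, message: bytes) -> int:
        """Append a message to the topic's log and return its offset."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic: str, offset: int) -> list:
        """Return a copy of every message at or after the given offset."""
        return list(self._logs[topic][offset:])


class ToyConsumer:
    """Tracks its own read position for each subscribed topic."""

    def __init__(self, broker: ToyBroker, topics):
        self._broker = broker
        self._offsets = {topic: 0 for topic in topics}

    def poll(self) -> dict:
        """Fetch the messages not yet read from each subscribed topic."""
        batch = {}
        for topic, offset in list(self._offsets.items()):
            messages = self._broker.read(topic, offset)
            self._offsets[topic] = offset + len(messages)
            batch[topic] = messages
        return batch


broker = ToyBroker()
broker.publish("logins", b"alice")
broker.publish("orders-shipped", b"order-17")

consumer = ToyConsumer(broker, ["logins", "orders-shipped"])
first = consumer.poll()           # everything published so far
broker.publish("logins", b"bob")
second = consumer.poll()          # only what arrived since the last poll
```

Because the log is only ever appended to, the consumer needs nothing more than an offset per topic to remember where it left off.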

Who Uses Kafka?

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies to facilitate data integration, high-performance data pipelines, mission-critical applications, and streaming analytics.

From health care, financial services, and government to computer software (commercial and open source) and transportation, Kafka is used by organizations in every industry. Over 80% of Fortune 100 companies, including LinkedIn, Netflix, Microsoft, Box, Goldman Sachs, Target, Cisco, Intuit, and more, use Apache Kafka. Indeed, the utilization rates of Kafka in the largest organizations by industry are remarkably high.

Kafka is implemented in:

10 out of 10 largest insurance companies
10 out of 10 largest manufacturing companies
10 out of 10 largest information technology and services companies
8 out of 10 largest telecommunications companies
8 out of 10 largest transportation companies
7 out of 10 largest retail companies
7 out of 10 largest banks and finance companies
6 out of 10 largest energy and utilities organizations

Kafka in Financial Services

There are so many ways banks, fin-techs, mortgage companies, and traditional financial institutions are utilizing Apache Kafka to enhance their operations that we could fill dozens of web pages with examples.


IT Modernization

From a volume perspective, it is likely that IT modernization projects lead the way across all sectors. The two most obvious types of IT modernization work using Kafka are 1) messaging modernization, with event streaming replacing older, less capable and less scalable technologies like MQ and TIBCO; and 2) retooling application development and data access using microservices to standardize and decouple distributed platforms and resources. Other major project areas include streaming ETL, high-availability distributed operations, mainframe offloading, replication for disaster recovery, data warehouse modernization, migration of on-premises data to the cloud, and hybrid cloud connection strategies. Here is one example of a modernization project. Goldman Sachs, well known in the financial services industry, developed a platform called ‘Core’ to handle its data. The Core Platform uses Apache Kafka as a pub-sub messaging platform. That platform has allowed Goldman Sachs to achieve a high data loss prevention rate, minimize downtime, and improve disaster recovery.

Customer Experience

Every industry is looking for better and more effective strategies to give its customers a seamless experience. The retail and corporate banking, mortgage, and wealth management sectors are using Kafka’s high-performance capabilities to meet the increasing need for better customer experiences. Real-time clickstream analytics, recommendations and visualizations, and predictions via machine learning applications are significantly increasing customers’ digital engagement – especially through mobile applications. The ability to simultaneously access many channels of information via Kafka event streams allows organizations to provide ever more personalization for their alerts, notifications, and special offers. For example, Rabobank, a Dutch multinational banking and financial services company, used Apache Kafka for Rabo Alerts, one of its important customer services. The service uses push notifications to notify customers about events such as transactions in their account and suggestions on future investments based on their credit score. This service application is built on Kafka Streams.

Trades and Reporting

Event-based activities are being captured and transformed by Kafka for all sorts of capital market and wealth management applications. Of course, trade data capture is high on the list, as are clearing and settlement improvements. In the area of regulatory reporting, Kafka is being used extensively to capture data for Order Audit Trail System (OATS), Consolidated Audit Trail (CAT), and MiFID II transaction reporting. New applications based on Kafka technologies, coupled with machine learning, are driving next-generation solutions for advisory services and pricing.

Illegal Event Detection

Kafka is increasingly showing up in retail, commercial, and corporate banking apps as well as in mortgage processing. Event streaming from diverse sources, combined with machine learning and distributed workflow and decision management systems, is helping improve credit analytics, fraud detection, and the detection of money laundering and illegal payments. With the rise of online banking and mobile payment hubs, the need to stream events in a common and secure way at scale is a match for Kafka’s performance and architecture.

Market and Credit Risk

Kafka is also having an impact on market and credit-risk processing including data consolidation across diverse risk systems and the real time capture of market data. It has become more crucial than ever to measure risks and no institutions can afford to make a costly mistake, so risk modeling and predictions are now financial sector business mainstays. Organizations that handle trades are leveraging Apache Kafka as part of analytics tools that detect even slight manipulations and immediately alert authorities to take action.

Cyber Security

Last, but certainly not least, the Kafka impact on cyber security has been significant. At the heart of this is fast, high-volume log ingestion from the varied systems and platforms associated with large-scale IT deployments. Consolidated fraud detection and anomaly reporting across systems have also been significantly improved. Finally, the serious and growing emphasis on cyber security has seen the advent of a new market sector: security information and event management (SIEM). Kafka is central to many SIEM modernization and optimization projects around the world.

Kafka in Healthcare

The healthcare industry has been notoriously slow in its digital transformation, but new technologies like streaming data services, microservices architectures, and machine learning (ML) are starting to be adopted on a wide scale, and Kafka is playing a significant role in deploying them.


The way other industries are attacking their problems is serving as a template for healthcare organizations. Streaming data and event-driven architectures are revolutionizing the way organizations everywhere share data between systems, and medical personnel are starting to use streaming data to improve and speed up workflows and decision making. That in turn enhances patient experiences, improves (and can often predict) patient outcomes, and adds to patients’ overall quality of life.

Kafka is Now Everywhere in Healthcare

Healthcare companies were not the initial customers for these kinds of technologies, but the healthcare industry has long struggled to share data across physician offices, hospitals, and insurance companies. Today, a large segment of the healthcare industry is using Kafka technologies to improve everything from connected health and IoT applications to automated diagnostics and the development of real-time mobile applications. Use cases for organizations like Cerner, Bayer, Express Scripts, Babylon, and Humana are easy to find on the internet. As in financial services, infrastructure projects such as IT modernization and cyber security are also visible in every major healthcare organization. Kafka is also helping to manage complex event processing, such as drug interactions during clinical trials or patients receiving care from multiple specialists. Because administrative tasks like claims processing and scheduling and booking appointments often require input data from many sources, they are also being improved using Kafka event streaming and topics. As one administrative example of how improved data sharing and interoperability using Kafka can save time, Humana cited improving pre-authorizations using Kafka. According to them, “It used to take 20-30 minutes to do a pre-auth, now it is down to a minute.”

Consolidating Disconnected Data Points

The healthcare industry generates enormous amounts of data about every patient. That data originates from many locations and systems and comes in an extremely diverse set of formats, both structured and unstructured. Those attributes make it difficult to store and access in conventional databases.

Data sources include electronic health records (EHRs), payor records, lab results, medical devices, medical imaging, the pharmacy record, the home healthcare visit, the specialist’s recommendations, and life science research, just to name a few. Disconnected data points make it hard for providers to see a comprehensive picture and for patients to get coordinated care. Providers often don’t even have the latest updates from one another, so it’s often up to the patient to help figure things out. This problem is annoying at best, and at worst it can be life-threatening. Kafka event recording, streaming, and topic consolidation schemes are connecting varied data points in new and powerful ways.

Real Time Monitoring

Today, Apache Kafka is increasingly providing real-time messaging at scale from the diverse sources healthcare requires. For example, Kafka streaming is communicating data between systems such as lab results, prescriptions, billing information, ML warnings for infectious diseases, and changes in patient conditions in real time. Kafka streaming technologies also allow providers to monitor chronic conditions such as heart and lung functions, blood sugar levels for diabetics, breathing functions for asthmatics, epilepsy indicators, seizure information, and hypertension data. Based on real time streaming data, health providers can use workflows, decision management, and machine learning to detect and alert about early signs of health issues, thus delivering personalized care.

Regulatory Requirements

Healthcare is one of the most highly regulated industries in the world, especially in the United States. Individual states also have rigid requirements regarding the accuracy of information displayed online. The proliferation of mobile applications and portals that display provider and patient data makes safeguarding information in motion more difficult than it was in the walled gardens of the past. Data privacy under HIPAA and GDPR requires patient consent and rigid attention to security in the transmission and storage of data. By standardizing and centralizing how data flows using Kafka, control and auditing from all sources become much more manageable.

Streaming data using Kafka not only improves healthcare delivery, but also helps diverse healthcare providers comply with complex regulations. For example, health payers may work with many thousands of providers, which means that to process claims they need quick access to accurate and current provider information, like addresses, specialties, phone numbers, hours of operation, and National Provider Identifier (NPI) numbers. Kafka platforms can help update providers’ information instantaneously and avoid hefty regulatory fines associated with inaccurate information. Streaming platforms can communicate changes to data in near real time, helping to ensure compliance through better workflows and decision making while also improving patient care.

Data-driven Insights

The healthcare industry is on the road to real transformation through the use of technologies such as big data and advanced analytics. Feeding data to these technologies using microservices architectures and Kafka event processing streams is driving new applications and providing data-driven insights from the myriad sources of healthcare information. For example, one class of new apps is taking advantage of remote monitoring through biometric devices and mobile data collection apps to monitor patients’ medications, glucose levels, blood pressure, and activities. Apache Kafka and streaming data will continue to improve the way healthcare organizations collect, store, and share data while improving operations, privacy, and cyber security.

The Kafka Framework

Initially conceived as a messaging queue, Kafka, written in the Scala and Java programming languages, is based on an abstraction of a distributed commit log. Generally speaking, Kafka initially fell into the category of message-oriented middleware (MOM) which is typically software infrastructure for sending and receiving messages between distributed systems. Historically, there has been a lack of standards governing the use of message-oriented middleware and that caused problems. Most of the major MOM vendors have their own implementations, each with its own application programming interface (API) and management tools.

Apache Kafka Components

Comprehensive Apache Kafka documentation is available on the Apache site but as a Kafka introduction, let’s have a brief look at its components and APIs. Apache Kafka is comprised of eight major components. Each of these components has a specific role to play in the Kafka system.

Kafka Messages or Records: Kafka Records are immutable and can have a key (optional), value (e.g., an array of bytes, JSON objects, Strings etc.) and timestamp.
Kafka Topics: A collection of Messages that relate to a single category. Think of a Topic as a feed name for a stream of records (“/logins”, “/orders-shipped”, etc.)
Kafka Topic Log: Each Topic has a Log which is the Topic’s storage on disk. Kafka appends records from a Producer(s) to the end of a Topic Log.
Kafka Producers: Kafka Producers publish streams of data records (Messages) to Topics.
Kafka Consumers: Kafka Consumers consume (read) streams of records (Messages) from Topics.
Kafka Broker: A Broker is a Kafka server that runs in a Kafka Cluster. These are the systems that manage the published information. “Broker” is also often used to refer to the logical system or to Kafka as a whole.
Kafka Cluster: A Kafka cluster is made up of multiple Kafka Brokers on many servers. Kafka replicates partitions to many nodes to provide failover.
Kafka ZooKeeper: Kafka originally required Apache ZooKeeper, so deploying a Kafka cluster also meant deploying, managing, and monitoring ZooKeeper. Kafka 2.8.0 introduced an early-access mode (KRaft) that runs without ZooKeeper, and ZooKeeper is no longer needed with recent releases of Kafka.
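
As a rough illustration of two of these components, the sketch below models an immutable record (optional key, value, timestamp) and a cluster that copies each record to several brokers so reads survive a broker failure. The names Record and ToyCluster are hypothetical, and the replication shown is deliberately naive: real Kafka replicates per partition using leader and follower replicas, which this toy does not attempt to model.

```python
import time
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Immutable record: a value, an optional key, and a timestamp."""
    value: bytes
    key: Optional[bytes] = None
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


class ToyCluster:
    """Copies every appended record to each live broker's log."""

    def __init__(self, broker_ids):
        self.logs = {broker_id: [] for broker_id in broker_ids}
        self.alive = set(broker_ids)

    def append(self, record: Record) -> None:
        for broker_id in self.alive:
            self.logs[broker_id].append(record)

    def fail(self, broker_id) -> None:
        """Simulate a broker going offline."""
        self.alive.discard(broker_id)

    def read_all(self) -> list:
        """Serve reads from any surviving broker."""
        if not self.alive:
            raise RuntimeError("no brokers available")
        survivor = min(self.alive)
        return list(self.logs[survivor])


cluster = ToyCluster([1, 2, 3])
cluster.append(Record(value=b"shipped", key=b"order-17"))
cluster.fail(1)  # broker 1 goes down; brokers 2 and 3 still hold the data
cluster.append(Record(value=b"delivered", key=b"order-17"))
values = [record.value for record in cluster.read_all()]

# Records cannot be changed after they are created.
try:
    cluster.read_all()[0].value = b"changed"
    mutated = True
except FrozenInstanceError:
    mutated = False
```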

Apache Kafka Core APIs

The Apache Kafka framework consists of five core APIs for Java and Scala – Admin API, Producer API, Consumer API, Streams API, and Connect API. There is also command line support for admin and management tasks.

Admin API manages and inspects Topics, Brokers, and other Kafka objects.
Producer API publishes (writes) a stream of Messages/events to one or more Kafka Topics.
Consumer API subscribes to (reads) one or more Topics to process the stream of events produced for them.
Streams API acts as a stream processor for implementing stream processing applications and microservices. It provides functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event-time, and more. By reading input from one or more topics to generate output to one or more topics, the API can effectively change the input flows (streams) to modified output flows (streams).
Connect API builds and runs reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications that can integrate with Kafka. The Kafka community provides hundreds of ready-to-use source and sink connectors to databases, CRMs, and other commercial systems and applications.
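
The kind of processing the Streams API description refers to (reading an input stream, aggregating statefully over event-time windows, and emitting an output stream) can be sketched in plain Python. This is not the Kafka Streams API, which is a Java/Scala library; the function windowed_counts, the sample events, and the 60-second tumbling window are all assumptions chosen for illustration.

```python
from collections import Counter


def windowed_counts(events, window_ms):
    """Count events per (key, window start) using each event's own timestamp."""
    counts = Counter()
    for key, timestamp_ms in events:
        # Align the event's timestamp to the start of its tumbling window.
        window_start = timestamp_ms - (timestamp_ms % window_ms)
        counts[(key, window_start)] += 1
    return counts


# Stand-in for an input topic: (key, event time in milliseconds).
input_topic = [
    ("alice", 1_000),
    ("bob", 2_500),
    ("alice", 4_000),
    ("alice", 61_000),
]

counts = windowed_counts(input_topic, window_ms=60_000)

# Stand-in for an output topic: one (key, window start, count) row per group.
output_topic = sorted((key, start, n) for (key, start), n in counts.items())
```

The Counter plays the role of the state that a stateful operation has to keep between events; a real stream processor would also need to persist that state and handle events that arrive late.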

Kafka-as-a-Service

Apache Kafka is one of the most popular and powerful tools for microservice architectures, and it solves a variety of problems. When implementing a Kafka-based event streaming architecture, organizations now face a “make or buy” decision. Deciding between an in-house and a managed cloud approach depends on a multitude of factors including budget, resources, timeline, and any special requirements. Apache Kafka requires its own ecosystem, so managed cloud products provide a wide variety of features and functionality, including production monitoring and network analytics. Today, both major IT companies and startups are offering Kafka software and hardware products, as well as development, integration services, and consulting/contracting. Some of these organizations are Confluent, Instaclustr, CloudKarafka, Aiven, Microsoft, IBM, Amazon (AWS), and Red Hat.

Trisotech and Kafka

The Trisotech Digital Enterprise Suite (DES) can both produce/emit (Publish) and consume (Subscribe to) events via Kafka connectors, through Events (both catching and throwing) and Tasks (both send and receive) in BPMN, and through Listeners (catching) in CMMN. DES exposes different ways of registering to receive asynchronous notifications when system events occur in the suite. Event publication is divided into hierarchical topics where each topic contains a small number of message types. Each topic also defines security constraints regarding which users can receive a given message sent on that topic.

The Trisotech Digital Enterprise Suite can publish events to Kafka topics using its emitter system. The Trisotech Digital Automation Suite can also publish messages on Kafka topics using service tasks and use incoming messages to trigger start or intermediate events in workflows.


Subscribing

There are multiple mechanisms to subscribe to events:

Kafka Receive Message Trigger Events – a special BPMN 2.0 Trisotech value-added extension connector Event that subscribes to Kafka messages from a topic.
BPMN 2.0 Receive Task.
BPMN 2.0 Intermediate Events.
CMMN 1.1 Listener.

Message Correlation

When subscribing to event streams it is useful to be able to filter incoming messages quickly and easily. DES supports a Message Correlation feature that allows modelers to define a FEEL expression that determines if an event message will be accepted by the process event.
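
The idea behind correlation can be sketched as a predicate applied to each incoming message before a process event accepts it. In the sketch below a Python function stands in for the FEEL expression; the accept helper, the orderId field, and the sample messages are hypothetical and do not reflect how DES evaluates correlation internally.

```python
def accept(message: dict, correlation) -> bool:
    """Return True when the correlation predicate matches the message."""
    try:
        return bool(correlation(message))
    except (KeyError, TypeError):
        # A message missing the expected fields is simply not accepted.
        return False


# Hypothetical process instance waiting for events about one order.
waiting_for_order = "A-100"


def correlation(message: dict) -> bool:
    """Stand-in for a modeler-defined correlation expression."""
    return message["orderId"] == waiting_for_order


incoming = [
    {"orderId": "A-100", "status": "shipped"},
    {"orderId": "B-200", "status": "shipped"},
    {"status": "heartbeat"},
]
accepted = [message for message in incoming if accept(message, correlation)]
```

Only the first message is accepted here; the second belongs to a different order and the third lacks the field the predicate relies on.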

Publishing

There are multiple ways to publish events:

Publish Apache Kafka Message Task – a special Trisotech BPMN 2.0 value-added extension connector task that publishes Kafka messages by topic to the Admin-specified Broker.
BPMN 2.0 Send Task
Kafka Emitter* – a secure user-configured emitter where events can be sent to user-selected Kafka Topics.

*Emitters

Emitters are DES-provided functions that can be used to generate asynchronous messages for event logs, including Kafka support, and other purposes.
