Reining In The Smart & Fast Data Pipeline With Avro
Author - Venki Iyer
Dear Friends
The first time I heard the term "Apache Avro" from a fellow logician was from Gaurav Taksale, who is currently on deputation at a customer location in Latin America. Needless to say, I was pleasantly surprised. In today's scenario, where most people are not even aware of big-data streaming tools like Apache Spark and Kafka, someone from our team bringing up the next level of big-data pipeline implementation, viz. Apache Avro, is very encouraging.
Nobody has worked on and contributed more to Kafka and related solutions than the company called Confluent. In this blog, I will make a small attempt to educate fellow logicians on Apache Avro and the Confluent Schema Registry: we will understand how they work, the problems they solve, and the typical target big-data architecture.
If you’re not using Schemas in Apache Kafka, you’re missing out big time!
Let us first analyze how Kafka works.
- Kafka does not look at your data.
- Kafka takes bytes as input and sends bytes as output. That constraint-free protocol is what makes Kafka powerful.
- Kafka works as a pure publisher/subscriber model: it takes bytes in and redistributes them without ever parsing them (no parsing overhead and no built-in intelligence about the data)
Obviously, your data has meaning beyond raw bytes, hence the need to parse it and, later on, interpret it. When all goes well, there are no complaints. When it doesn’t go as planned, that’s when you hit the panic button.
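To make this concrete, here is a minimal sketch (assuming the confluent-kafka Python client, a broker on localhost:9092 and a hypothetical topic named "events") of how the producer and consumer only ever exchange raw bytes; Kafka itself never inspects them.

from confluent_kafka import Producer, Consumer

# Producer side: whatever we serialize, Kafka only sees bytes.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", value=b'{"user_id": 42, "action": "login"}')
producer.flush()

# Consumer side: the same bytes come back and it is on us to parse them.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    raw_bytes = msg.value()  # bytes, exactly as produced; Kafka never parsed them
    print(raw_bytes)
consumer.close()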
There’s nothing worse than parsing exceptions in a big-data stream.
They mainly occur in these two situations:
- The field you’re looking for doesn’t exist anymore
- The data type of the field has changed (e.g. what used to be a String is now an Integer)
What are our options to prevent and overcome these issues?
The most convenient (and most common) solution - catch exceptions on parsing errors.
Result - Your code becomes ugly and very hard to maintain.
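As an illustration (a hypothetical consumer-side snippet, not from any real project, with a made-up "amount" field), this is what the "catch everything" approach tends to look like once both of the situations above have bitten you:

import json
import logging

def parse_event(raw_bytes: bytes):
    """Defensive parsing that papers over missing fields and changed types."""
    try:
        event = json.loads(raw_bytes)
    except (UnicodeDecodeError, ValueError):
        logging.warning("Not even valid JSON, skipping: %r", raw_bytes)
        return None
    try:
        amount = event["amount"]  # the field may not exist anymore
    except KeyError:
        logging.warning("Field 'amount' is missing, defaulting to 0")
        amount = 0
    try:
        amount = int(amount)  # what used to be an Integer may now be a String
    except (TypeError, ValueError):
        logging.warning("Field 'amount' has an unexpected type: %r", amount)
        return None
    return {"amount": amount}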
Commandment #1 - Never ever change the data producer, and ensure, by means of some custom validation logic, that your producer code never forgets to send a field.
That’s what most architects do to contain these data-exception problems at the publishing source. But after a few key people quit the organization, all your “safeguards” and "checks and balances" are gone!!
Pertinent question: What is the best way to solve these issues (if not by adopting the common solution above)?
Answer: Adopt a data format and enforce rules that allow you to perform schema evolution while guaranteeing not to break your downstream applications. (Sounds too good to be true?)
That data format (part of the answer to the above question) is Apache Avro.
Going forward, I’ll discuss why you need Avro and why it’s very well complemented by the Confluent Schema Registry.
The most commonly used data formats in all big-data projects have flaws.
Okay — all data formats have flaws, nothing is perfect. But some are better suited for data streaming than others. If we take a brief look at commonly used data formats (CSV, XML, Relational Databases, JSON), here’s what we can find.
Common Data Formats used in most big-data projects.
CSV
Probably the worst data format for streaming and an all-time favourite of everyone who doesn’t deal with data on a daily basis, CSV is something we all know and have to deal with one day or another.
1/3/06 18:46,6,6A,7000,38.53952458,-121.464
1/5/06 2:19,6,6A,1042,5012,38.53952458,-121.4647
1/8/06 12:58,6,6A,1081,5404,38.53182194,-121.464701
Pros:
- Easy to parse with Excel
- Easy to read with Excel
- Easy to make sense of with Excel
Cons:
- The data types of the elements have to be inferred and are not guaranteed
- Parsing becomes tricky when the data contains the delimiting character, i.e. when the comma itself is part of the data (see the short sketch after the conclusion below)
- Column names (header) may or may not be present in your data
Conclusion: CSV creates more problems than it will ever address. You may save on data storage space with it, but you lose in data safety and accuracy (please note these statements are purely from a data-streaming perspective, not for static data).
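As a quick illustration of the delimiter problem (a contrived example, not taken from the sample data above): a single free-text value containing a comma silently shifts every subsequent column unless the producer remembered to quote it.

import csv
import io

# The third field is free text; an unquoted comma inside it breaks the layout.
good_line = '1/3/06 18:46,6,"BROADWAY, SUITE 2",7000'
bad_line = '1/3/06 18:46,6,BROADWAY, SUITE 2,7000'

print(next(csv.reader(io.StringIO(good_line))))  # 4 columns, as intended
print(next(csv.reader(io.StringIO(bad_line))))   # 5 columns -- data has silently shifted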
Rule #1 - Don’t ever use CSV for data streaming!
XML
XML is heavyweight (believe me, that is an understatement), CPU-intensive to parse and completely outdated, so don’t use it for data streaming. Sure, it has schema support, but unless you absolutely love dealing with XSD (XML Schema Definition) files, XML is not worth considering. Additionally, you would have to send the XML schema with each payload, which is very redundant and wasteful of resources.
Rule #2 - Don’t use XML for data streaming!
The relational database format
CREATE TABLE distributors (
    did integer PRIMARY KEY,
    name varchar(40)
);
Looks kind of nice, has schema support and data validation. You can still have runtime parsing errors in your SQL statements if someone decides to drop a column, but hopefully, that won’t happen very often.
Pros:
- Data is fully typed
- Data fits in a table format
Cons:
- Data has to be flat
- Data is stored in a database, and data definition, storage, and serialization will be different for each database technology.
- No schema evolution protection mechanism. Evolving a table can break applications
Conclusion: Relational databases have a lot of concepts we desire for our streaming needs, but the showstopper is that there’s no “common data serialization format” across databases. You will have to convert the data to another format (like JSON) before inserting it into Kafka. The concept of “Schema” is great though, and that, my friends, is worth a mention.
JSON — The blue-eyed boy of all developers
The JSON data format has grown tremendously in popularity. It is omnipresent in every language, and almost every modern application uses it.
Pros:
- Lightweight
- Data can take any form (arrays, nested elements)
- JSON is a widely accepted format on the web
- JSON can be read by pretty much any computer language
- JSON can be easily shared over a network
Cons:
- JSON has no native schema support
- JSON objects can be quite big and cumbersome in size because of repeated keys
- No support for comments, metadata or documentation
Conclusion: JSON is a popular data choice in Kafka, but it is also the best example of how, by indirectly giving your producers too much flexibility and zero constraints, you end up with changed data types and deleted fields.
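To see how little stands in the way, consider this tiny sketch (with hypothetical payloads): nothing in JSON itself stops a producer from changing a type or dropping a field between two messages, and both payloads parse without complaint.

import json

# Two messages from the "same" producer, a few releases apart.
v1 = json.loads('{"order_id": 1001, "amount": 25.5, "currency": "USD"}')
v2 = json.loads('{"order_id": "1002", "amount": "25.50"}')  # types changed, field gone

print(type(v1["order_id"]), type(v2["order_id"]))  # <class 'int'> <class 'str'>
print(v2.get("currency"))  # None -- and downstream code breaks much later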
Summary
As we have seen, all these data formats have advantages and flaws; their usage may be justified in many cases, but they are not necessarily well suited for data streaming. We’ll see how Avro can make this better. A big reason why all these formats are popular, though, is that they’re human-readable. As we’ll see, Avro isn’t, because it’s binary.
Apache Avro — Schemas you can trust
Avro has grown in popularity in the Big Data community. It has also become the favorite fast-data serialization format, thanks to a big push by Confluent (via the Confluent Schema Registry).
Question: How does Avro solve our problem?
Answer: Schema as a first-class citizen.
Similar to how, in a SQL database, you can’t add data without creating a table first, you can’t create an Avro object without first providing a schema.
There’s no way around it. A huge chunk of your work and project effort will be to define an Avro schema (keep that in mind while doing the effort estimation).
Avro Features & goodies
Avro has support for primitive types (int, long, string, bytes, etc.), complex types (enums, arrays, maps, unions, and thereby optional fields), logical types (date, timestamp-millis, decimal), and named records (with a name and a namespace). All the types you’ll ever need.
Avro has support for embedded documentation. Although documentation is optional, in my workflow I will reject any Avro Schema PR (pull request) that does not document every single field, even if obvious. By embedding documentation in the schema, you reduce data interpretation misunderstandings, you allow other teams to know about your data without searching a wiki, and you allow your devs to document your schema where they define it. It’s a win-win for everyone.
Most importantly, Avro schemas are defined using JSON. Because every developer knows or can easily learn JSON, there’s a very low barrier to entry.
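Here is a minimal example of what such a schema looks like (a hypothetical "Transaction" record, showing the documentation, primitive, logical, complex and optional-union types mentioned above):

{
  "type": "record",
  "name": "Transaction",
  "namespace": "com.example.payments",
  "doc": "A single payment transaction flowing through the pipeline.",
  "fields": [
    {"name": "id", "type": "long", "doc": "Unique transaction id."},
    {"name": "amount",
     "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2},
     "doc": "Transaction amount."},
    {"name": "created_at",
     "type": {"type": "long", "logicalType": "timestamp-millis"},
     "doc": "Creation time in epoch milliseconds."},
    {"name": "tags", "type": {"type": "array", "items": "string"}, "default": [],
     "doc": "Free-form labels."},
    {"name": "comment", "type": ["null", "string"], "default": null,
     "doc": "Optional free-text comment."}
  ]
}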
An Avro object contains both the schema and the data. The data without the schema is an invalid Avro object. That’s a big difference from, say, CSV or JSON.
You can make your schemas evolve over time. Apache Avro has a concept of projection which makes evolving schemas seamless to the end user.
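A minimal sketch of that evolution, assuming the fastavro library and a hypothetical "User" record: data written with the old (v1) schema is read back through the new (v2) reader schema, which adds a field with a default, so old records resolve cleanly instead of breaking the consumer.

import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema_v1 = parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# v2 adds an optional field with a default -- a backward-compatible evolution.
schema_v2 = parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

# The producer wrote with v1...
buf = io.BytesIO()
schemaless_writer(buf, schema_v1, {"id": 1, "name": "Alice"})

# ...the consumer reads with v2 as the reader schema; the new field gets its default.
buf.seek(0)
record = schemaless_reader(buf, schema_v1, schema_v2)
print(record)  # {'id': 1, 'name': 'Alice', 'email': None}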
Other Features of Avro...
Avro data serialization is efficient in space, can be read from any language, and is lightweight (a small footprint) on the CPU. You can even apply a compression algorithm such as Snappy on top of it to reduce the size of your payloads further.
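Here is a small sketch of that in practice, again assuming fastavro (plus the python-snappy package for the codec) and a hypothetical "Reading" record: the container file stores the schema once, the records as compact binary, and compresses the data blocks with Snappy.

from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record", "name": "Reading", "fields": [
        {"name": "sensor_id", "type": "int"},
        {"name": "value", "type": "double"},
    ],
})

records = [{"sensor_id": i, "value": i * 0.5} for i in range(1000)]

# The schema is written once into the file header; the blocks are Snappy-compressed.
with open("readings.avro", "wb") as out:
    writer(out, schema, records, codec="snappy")

# Any reader gets the schema back from the file itself -- no guessing.
with open("readings.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema["name"], next(avro_reader))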
Is it all hunky-dory or are there any drawbacks?
Avro Drawbacks
It takes longer in your development cycle to create your first Avro object, and some developers will reject it because it’s not as straightforward as writing plain JSON. Regardless, the small initial up-front investment (a longer dev cycle) is typically far outweighed by the time savings later on, when you do not have to troubleshoot data and data-format problems.
Typically, the one hour spent agreeing with all stakeholders on the Avro schema is a much smaller level of effort than the one week spent figuring out why your data pipeline suddenly broke overnight.
Avro is a binary format. In that regard, you cannot just open an Avro file with a text editor and view its contents as you would with JSON.
It will take some time to learn Apache Avro. There’s no free lunch.
Inference & Conclusion
Apache Avro is a great data format to adopt for your fast-data pipeline before the true value-add and benefits of Kafka kick in. In conjunction with the Confluent Schema Registry, you will have a killer combo.
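For completeness, here is a minimal sketch of that combo (assuming the confluent-kafka Python client with its Schema Registry support, a registry at localhost:8081, a broker at localhost:9092 and a hypothetical "transactions" topic): the serializer registers the schema once in the registry and then ships only a small schema id with every message, instead of the full schema.

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record", "name": "Transaction", "namespace": "com.example.payments",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "amount", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})
payload = serializer({"id": 1, "amount": 99.99},
                     SerializationContext("transactions", MessageField.VALUE))
producer.produce("transactions", value=payload)
producer.flush()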
Here's to a Happy & efficient fast big-data pipeline !!!