Voices in Data Storage – Episode 22: A Conversation with Noam Shendar of Zadara


Enrico speaks with Noam Shendar about containers inside storage systems, their benefits, and how they can be leveraged to solve some particular infrastructure and application needs.

Guest

Noam Shendar brings over 20 years of experience in the technology industry in a variety of senior management positions. At LSI Corporation, he was Sr. Director of Business Planning and Product Management in the Engenio Storage Group, where he founded and led an internal startup, took it to revenue, and successfully handed it off to NetApp when the latter acquired Engenio. Prior to that, Noam was LSI’s Director of Corporate Strategy. Prior to LSI he was Director of Strategic Marketing at MIPS Technologies, Inc., where he was responsible for the company’s efforts to penetrate new markets and expand its presence in existing ones. Prior to MIPS, Noam was VP of Applications and Director of Engineering at iBlast, Inc., an entertainment technology startup. Prior to that, Noam held research and engineering positions at Intel Corporation’s Microprocessor Products Group. Noam has a B.Sc. (with Honors) in Electrical Engineering from the University of Wisconsin-Madison and an Executive MBA from Santa Clara University.

Transcript

Enrico Signoretti: Welcome to a new episode of Voices in Data Storage, brought to you by GigaOm. I'm your host, Enrico Signoretti, and today we'll talk about storage and containers, but not storage for containers as you might think, as in storage for Kubernetes. We're talking about containers inside storage systems, their benefits, and how they can be leveraged to solve some particular infrastructure and application needs. To help me with this topic, I have invited Noam Shendar, VP of Solutions Architecture at Zadara. Hi, Noam, how are you?

Noam Shendar: I'm really well. How are you, Enrico?

I'm fine. Thank you very much for joining me today, and also thank you for bringing this topic to the show. We had a chat a few days ago where you were briefing me on the latest on the company, and I noticed this feature, which I had missed in the previous briefings, and I thought: wow, this is great.

I also had an experience in the past with another start-up in the storage space that was integrating a sort of serverless approach into their storage, and it was pretty similar. I always found it compelling for a lot of infrastructures, and you also walked me through some business cases. Maybe we can start with a little bit of introduction about yourself and Zadara before going into this really, really interesting topic today.

Great, thank you for the opportunity to be here. I'm Noam Shendar, and I've been in the technology industry in Silicon Valley for 20 years now, actually more than 20 years; time flies. I started my career at Intel and worked at a number of other semiconductor companies as well, like MIPS Technologies and LSI Logic. Zadara is my second start-up. We started Zadara in 2011 with an idea. The idea didn't have a name, didn't have a category, so we came up with the title 'enterprise storage as a service.' What we meant by this is that we thought there could be a way to take everything that is great about the existing enterprise storage solutions from the big companies we all know, EMC and NetApp and IBM and HPE, but we recognized that the model with which it was sold was lacking, and also that the technology didn't support any kind of new business model.

The 'as a service' piece of enterprise storage as a service is providing the storage to the customer flexibly, wherever the customer needs it, with consumption pricing only, so they only pay for what they use, not for what they have, and with the ability to change at any time. We all know that life has surprises. We all know we need to be able to react to changes. Traditional storage is not good at that. Traditional storage is rigid. We thought of a way to keep all of the capabilities of traditional storage but offer them in a flexible way that allows growing or shrinking, increasing or decreasing performance, adding or subtracting features, and moving the data, whether it be from on-prem to the cloud, back out of the cloud to on-premises, or even cloud to cloud. Anywhere the customer wants to be, we continue to be with them there if they want us. It's a subscription, and if they don't want us, they stop.

Okay, so just to recap very, very quickly: your solution can be consumed on premises as well as in the cloud, and you provide all the resources with which your customer can build a virtual storage array that has all the characteristics of a primary storage array, also offering block, file, and object storage protocols, if I remember well, but they pay only for what they consume. They just choose how many resources they want and how they want to configure the array. It's like a traditional array, but actually it's virtual and it's cloud. Is this correct?

That's exactly right, Enrico. We have all the capabilities of a traditional storage array, like block and file and, as you mentioned, also object, in addition to what traditional arrays do. There are traditional capabilities like snapshots, remote mirroring, deduplication, compression; all of those things you expect in a traditional array are there. We've extended the capabilities, like adding object and adding the Docker container capability that we'll talk about, and yet we did it in a completely different architecture in order to enable the flexibility that we're famous for.

The architecture is very flexible in terms of the ability to grow, shrink, or change the arrays as needed. The architecture is very flexible in terms of the location of the array. The array can go to the cloud, come back out of the cloud, or be a hybrid array with part of it in the cloud and part of it on premises. It can even be multi-cloud.

Under the hood, the architecture is very different from traditional arrays because it was not possible to use the traditional architecture to make these things possible, to provide the flexibility I talked about, to provide the scale that we need to provide, and to provide also multi-tenancy. We have to think of this like a cloud rather than like an array, and what we did was we created an architecture that has what we call VPSAs. Those are virtual private storage arrays, and each is a standalone software-defined array. Customers can have as many of those as they want, and they're each completely standalone. They're isolated from each other and they can be managed separately. They can each be scaled up or down as needed, and they can all coexist without interfering with each other. All of this is built into the architecture.

Then the last great thing about this is we manage all of this on behalf of our customers. We're the ones who designed it. We're the ones in the best position to properly operate this, and to do it remotely at more than 200 locations around the globe, part of them public cloud locations, part of them private on-premises locations; regardless, we do all of this with our team, who know best how to run it. Therefore, we can guarantee the uptime to our customers, and our track record has been an impressive five and a half nines of availability over years and years of operation.

Sounds incredible. What I want to understand better is this container functionality. When we talked a few weeks ago, I asked you about Kubernetes support for containers. You told me that there is a CSI plug-in ready for your customers, but actually there is another feature that is even more compelling, a differentiator compared to other solutions, because you can run containers directly in your system. You can take some compute power and allocate it to containers. Is this correct?

It's exactly correct. If I go back to the architecture for a moment, the way that we guarantee the perfect multi-tenancy, the performance isolation, is by providing each VPSA its own dedicated CPU and memory, so that its storage stack, whether it's file, block, or object, runs on dedicated CPU cores and memory, and each of these different arrays doesn't interfere with any of the other ones because they each have their own dedicated resources. That's true for the drives, too. The SSDs and hard drives are dedicated as well.

This architecture allows us to do what you just described. We can take additional CPU cores and additional memory and assign those to what we call the ‘container engine.’ A container engine is a place for customers to run any code that they want that is containerized using Docker with a performance level that they can set because the container engine can be sized up or down as needed to add or subtract cores. There's even a free tier of a very small container. It's very good for testing, for example, and those containers, because they're running inside the VPSA, have extremely low latency access to the storage. There's no need to go over a fabric or a network in order to access the storage.
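To make the "dedicated cores and memory" idea concrete, plain Docker exposes the same kind of knobs. Here is a minimal sketch using the docker Python SDK, as an analogy only, not Zadara's actual control plane; the image, workload, and mount path are placeholders:

```python
# Analogy only: pin a container to specific cores and cap its memory,
# the same "dedicated resources" idea the container engine applies.
# Assumes a running Docker daemon and the docker SDK (pip install docker).
# Image, command, and paths are placeholders.
import docker

client = docker.from_env()
output = client.containers.run(
    "alpine:latest",                 # placeholder image
    "md5sum /data/sample.bin",       # placeholder workload
    cpuset_cpus="0-1",               # pin to two dedicated cores
    mem_limit="2g",                  # fixed memory budget
    volumes={"/mnt/share": {"bind": "/data", "mode": "ro"}},
    remove=True,                     # clean up when the job finishes
)
print(output.decode())
```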

This is probably the question that everybody has in mind at the moment: why would somebody want such a functionality?

There are a number of reasons for this. One is what the industry generally calls hyper-convergence, which is normally thought of as adding storage functionality into servers. We've turned that on its head and added compute functionality into storage. Where the customer wants to simplify some operations by having everything run within the same infrastructure, we enable that in a hyper-converged way. Looking at what our customers are doing with this, though, that is not the most common reason. The most common reason is automation of repetitive tasks. Having the compute so closely coupled to the storage enables two things: event-driven operations, which I'll get back to in a second, and the low latency I described, which I'll start with.

When it comes to repetitive operations, latency can really add up. If I'm doing something a million times and I have to deal with one millisecond of latency each time, then I have a million milliseconds, about a thousand seconds, of non-productivity while I perform these operations. If my math is right, that's roughly 16 minutes of waiting. If I could take that one millisecond and reduce it by a factor of ten, let's say, to a hundred microseconds, now those 16 minutes become well under two minutes. That's one very good reason to do it, and it's common among our customers who are doing things like transcoding or any kind of format conversion where a bunch of files need to be converted, or who are doing integrity checks or MD5 checksumming. Those are repetitive operations, and if they have millions of files, the latency becomes a problem.
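The arithmetic here is easy to verify; a quick sketch using the numbers from the conversation:

```python
# One million repetitive operations, each paying a fixed latency cost.
ops = 1_000_000

for latency_ms in (1.0, 0.1):          # ~1 ms over a network vs. ~100 µs in-array
    total_s = ops * latency_ms / 1000  # time spent purely waiting, in seconds
    print(f"{latency_ms} ms/op -> {total_s:,.0f} s (~{total_s / 60:.1f} min) of waiting")

# Output:
# 1.0 ms/op -> 1,000 s (~16.7 min) of waiting
# 0.1 ms/op -> 100 s (~1.7 min) of waiting
```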

The other piece that I mentioned was the automation. Wouldn't it be nice if something happened each time a file was added or modified? Take the MD5 checksumming: wouldn't it be great if, every time a file was uploaded into, say, an object store, an MD5 checksum was generated for it automatically, or vice versa? Maybe every time a file was uploaded, its MD5 checksum was verified to confirm that it was uploaded correctly. What if every file was automatically virus-checked? What if every file was automatically transcoded? For example, a media streaming service uploads in one format and then the additional formats are generated automatically, MP3 into AAC, for example. Those are common use cases among our customers.
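As a minimal sketch of that pattern, here is what an event-driven checksum job might look like; on_file_uploaded() is hypothetical stand-in wiring for whatever hook the storage system provides, while the hashing itself is just the Python standard library:

```python
# Sketch of the event-driven checksum idea: when a file lands in storage,
# compute its MD5 right away, next to the data. on_file_uploaded() is a
# hypothetical hook; the real trigger would come from the storage system.
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so huge files don't exhaust memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def on_file_uploaded(path: str, expected: str | None = None) -> bool:
    """Generate a checksum for a new file, or verify one supplied by the uploader."""
    actual = md5_of(path)
    print(f"{path}: {actual}")
    return expected is None or actual == expected
```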

Right, so we are thinking about serverless storage here. I mean, you are coupling serverless, one of the most interesting features at the very top of the stack, with storage, and you're making it available to all your customers. Of course, this is not available only for object but also for file as well as block, right?

Absolutely.

At the end of the day, I have a solution that's totally different from the usual array that I have in production, but it's much more oriented to the application. As you said, there are several use cases. The end user can deploy these containers directly on the storage system.

They upload the container or containers into our system. We have a GUI and a corresponding API, so all functionality is available both ways: 100% API coverage and 100% GUI coverage. The containers, once uploaded, are available to be run either on command or, as I mentioned, in an event-driven fashion. They can be triggered by actions like file uploads or file changes.

So you catch all the events that happen in your storage system, and you have a messaging system or something similar to pass them to the container.

Exactly, and this can scale very, very large. We have customers running millions and millions of containers because they're scaling the system up to serve a lot of end users, for example.

Do you ever measure the savings that these customers have by adopting this kind of architecture?

The savings, depending on the use case, can be productivity savings, so things get done faster; the 10x number that I mentioned earlier around latency is a real number, based on customer experience. Then the other savings can be on infrastructure. Here I can tell you about a customer who's a manufacturer; you can put this story under the heading of smart factories or smart manufacturing. They manufacture a component that goes into a very, very large number of smartphones, which means that the volume of manufacturing is very, very high. The vendors, the smartphone manufacturers, need very, very high quality. They don't want to deal with returns of defective products, so quality assurance is really important.

What this customer does is take very, very detailed pictures of every single article, and use those pictures both for quality assurance, so they can run analytics in real time and improve the manufacturing process based on what the pictures show, and for failure analysis later. If there is an issue, rather than having to recall physical specimens, they can look at these super-high-resolution pictures and analyze what went wrong.

Both of those things are productivity enhancements and cost savings for the customer. Being able to run these analytics on the storage means that there's less infrastructure needed. You don't need physical servers. You need less rack space. You need less power to do all of these things. The factory floor is not typically a place that has a lot of IT, so space and power matter. By the way, the fact that we remotely manage it also helps, because typically there aren't IT people in the factory. Then the ability to archive the data, in this case to AWS S3, is also a cost savings, because it's a very affordable storage medium. It's easy to work with, it again doesn't require IT on that side, and because it's off-site, that means, again, less space and less power being used up in the factory itself.

You talk about use cases that involve millions of containers, and you also mentioned that you are able to scale up to these kinds of numbers. How do you do that? As far as I remember, you have a limited set of cores per controller in your solution, so can you scale out? Can you add additional resources on top of the traditional controllers you use for I/O operations?

The way to scale is by VPSA and then by the size of the cluster, which is made up of what we call 'storage nodes.' Actually, as you mentioned, we can also add cores, so let me start from the beginning. The first way to scale up is by adding cores, and of course there's an upper limit to that. The next way is by VPSA, so each VPSA can have its own container engine of a custom size. Then from there, you can scale out by adding storage nodes, which means that the entire storage cloud is growing. Those customers who are running millions and millions of containers have a big cloud. They have a lot of nodes and a lot of capacity. They have a lot of VPSAs and a lot of container engines. In total, they have enough compute power to do these gargantuan tasks.

At the end of the day, the solution is very simple. You provide support for containers, so the end user can develop the application in any language they like, because you just run the containers with a few parameters when they launch. There is no limitation for the end user in terms of code or type of container, right?

That's correct. They are Linux-based containers, so that's the only constraint that the user needs to be aware of.

Yes, but probably when you talk about containers, 99.99% of the containers are Linux-based, right?

As expected, they're Linux-based containers. There could be arbitrary code in there, which can be either a custom process that the customer needs to run, like the image analysis at the factory, or a new feature that the customer wants to add to the storage that we didn't include. A simple example is FTP. A customer asked, "Do you support FTP?" We said, "Sadly, we do not. However, go to Docker Hub, download an FTP container, use that, and voila, now the storage is able to serve FTP."
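For the curious, that trick looks roughly like the following with the docker Python SDK; the image name is just a well-known public FTP image and the mount path is a placeholder, not a tested Zadara configuration:

```python
# Add FTP to a storage share by running a stock Docker Hub image next to it.
# Image name and paths are placeholders; a real deployment would also map
# the passive-mode port range and configure users.
import docker

client = docker.from_env()
ftp = client.containers.run(
    "stilliard/pure-ftpd",           # example public FTP server image
    detach=True,
    ports={"21/tcp": 21},            # expose the FTP control port
    volumes={"/mnt/share": {"bind": "/home/ftpusers", "mode": "rw"}},
)
print("FTP container started:", ftp.short_id)
```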

Nice. Do you think customers will ask for GPUs over time?

We have not heard this request to date. It's conceivable, especially since I mentioned analytics. You can guess that the analytics contain some AI or machine learning. It's reasonable to expect that at some point. I wouldn't say that we wouldn't support it, but I will say it's not on the near-term road map because there haven't been any requests, and there are a lot of really exciting things we have heard our customers ask us for, so we are working on those.

Like what, for example?

I generally don't like to talk about things we haven't released, but an obvious thing that I'm comfortable talking about is support for Kubernetes, especially as we talk about scaling up to a very, very large number of containers. A management paradigm is necessary, and the de facto standard for that is Kubernetes, so this is something we're going to be adding to the Zadara container services to make it easier and more streamlined for our customers to take advantage of this pretty cool capability.

Okay, and I don't want you to spill all the beans today, but what can we expect next for the entire Zadara system?

Again, many cool things are coming down the pike that I won't talk about. To give you something to work with, I'll mention one that many customers are asking us about and that we are therefore implementing: NVMe over Fabrics. I think it's a necessary improvement, and I think it's really beneficial in terms of the internal latency of the system. If you look at our history, it's pretty cool.

We started with an iSCSI interconnect, which is very well understood but does not have great latency. In order to improve the latency while maintaining iSCSI compatibility, we went to iSER, which is iSCSI Extensions for RDMA, and that reduced our latency by a factor of three, which is really amazing, while using exactly the same Ethernet switches that we were using before. So, using native Ethernet and iSCSI primitives, we still improved latency by a factor of three. We expect NVMe over Fabrics to do that again, to squeeze another 3x of latency improvement out of the system.

By the way, in case any of our listeners are asking themselves "Do they support NVMe drives?" since I mentioned NVMe over Fabrics, the answer is yes. We do that in our all-flash array. NVMe drives are the high-speed tier that we use for both caching and metadata. As you know, for deduplicated systems, metadata access is in the critical path, and therefore really fast access is necessary in order to provide high performance. That is why we chose NVMe for that tier of storage, whereas standard NAND flash with SAS or SATA connectors is what we use for the flash-based capacity.

Fantastic. Again, just to wrap up a little, I really loved chatting with you about storage for containers or even better, containers into the storage system in a serverless fashion. Where can our listeners find more information on the web about this functionality?

A number of places. I will list a number of them. Our homepage is Zadara.com. Zadara is spelled with all As; Z-A-D-A-R-A. If you're listening in the UK, Zed-A-D-A-R-A. The next place to go is our LinkedIn. You can simply search for Zadara on LinkedIn. We have a Twitter feed, @Zadara, and if you want to see my personal Twitter feed, it's @noamshen. My first name is N-O-A-M and the first four letters of my last name, S-H-E-N, together: noamshen. All of those are good sources of information about what we're doing. We have a nice blog on our website, and we have a lot of resources, real-world case studies from actual customers, whitepapers, technical briefs, and a lot of webinars. We record a lot of webinars, and we have them nicely organized on our website, including 15-minute tech tips–I think we recorded 40 of them–to really focus in on specific features rather than talk too much. If you're looking for little tidbits, that's the place to go.

Is there any possibility to try your solution?

Absolutely. We love having our customers try before they buy. They can do this through our own website; there's a free trial button that leads to registration. We just need your legitimate email address, and then you can go ahead and set up your trial. You can also do this through the AWS Marketplace. Search for Zadara on the AWS Marketplace and you will find our solution there. This links our solution to AWS's billing system, in case you want to consolidate your billing, and we enable a free trial there for you as well. On the Google Cloud Marketplace you can find us too, with a very similar set-up.

Okay, many places then. Thank you very much again for your time today and bye-bye.

Interested in sponsoring one of our podcasts? Have a suggestion for a great guest? Please contact us and let us know.