OpenTelemetry with Austin Parker

In this episode, Austin Parker, Principal Developer Advocate at Lightstep, talks about OpenTelemetry, an observability framework for cloud-native software made up of a collection of tools, APIs, and SDKs. You use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) for analysis in order to understand your software's performance and behavior.

Transcript

CHARLES: Hello everybody, and welcome to The Frontside podcast. My name is Charles. I’m a developer here at The Frontside and with me today also from The Frontside is Taras. Hello, Taras.

TARAS: Hello, Charles. And today we are here with Austin Parker, a Principal Developer Advocate at LightStep. Welcome, Austin.

AUSTIN: Hello. Thanks for having me. It's great to be here.

CHARLES: Yeah. And we are going to talk about a project called OpenTelemetry, and I guess we should start by talking about Telemetry in general. And then we'll talk about how you come to be working on OpenTelemetry at LightStep and what kind of relationship is there. But maybe we could just start by filling people in on what Telemetry with the capital T is, oh sorry, with the little t, and what the OpenTelemetry project is.

AUSTIN: So it is Telemetry with a capital T. I think the easiest way to explain it is something we're all familiar with, logging. How many times is the first thing you do when you sit down and it's like, I'm going to write a program, or I'm going to write something? You console.log("Hello, World!"). That's really the basis of how we understand what our programs are doing. And those logs are a type of Telemetry signal. They're a signal from our process, from our application that says what it's doing. So obviously when you get a more complex application and you add more services, the scale of your application increases, and you have more components and more things you need to be aware of both the quantity and the type of Telemetry that you use will increase. So a lot of times you're going to sit there and say, “Hmm, yeah. I have all these logs and that's great, but I need to know a number. I want to know how many requests per second do I have, or what's the latency of a request.” So that's a time-series metric, so you can express that. You write some instrumentation code or you use a library that does it for you, and it emits a number that is p50 or average latency. And then you keep adding services. You keep adding all this stuff. And you're like, ah, now I have 20 or 30 or 2000 different services running and a single request is going to hop through 15 or 30 or 500 of them, and gosh, that's really hard to understand by just metrics and logs. And so you can go in and add a distributed tracing, and that's a way to look at the entire request as it flows from all of these services.
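
To make the latency metric described above concrete, here is a minimal sketch of reducing raw request timings to the kind of numbers a metric reports (a count, an average, and a p50). The function name `summarize_latency` and the sample values are made up for illustration; this is not OpenTelemetry's metrics API.

```python
import statistics


def summarize_latency(samples_ms):
    """Reduce raw request latencies (in ms) to aggregate numbers a metric might report."""
    return {
        "count": len(samples_ms),
        "avg_ms": statistics.mean(samples_ms),
        # p50 is the median: half of the requests were faster than this value.
        "p50_ms": statistics.median(samples_ms),
    }


# Hypothetical latencies for five requests; one of them is a slow outlier.
latencies_ms = [12, 15, 11, 240, 14]
summary = summarize_latency(latencies_ms)
print(summary)
```

With these sample values, the single slow request pulls the average far above the median, which is one reason people often look at percentiles rather than averages alone.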

And those three concepts, logs, metrics, and traces, are sometimes referred to as the three pillars of observability, and that's what OpenTelemetry is about; it's an observability framework for cloud-native applications that really seeks to make all of those things I just talked about, metrics, logs, and tracing, a built-in integrated feature of cloud-native software. So, how does it do this? The basic idea behind the project is -- let's actually go back to logs because it's the simplest one. You probably have a logging framework you use. And there are a lot of options, and it depends on your language. If you're using Java, then you probably have Log4j, or if you're using the Spring Framework, then there are logging things there. If you're using Go, then there might be Zap or other types of structured loggers. If you're using Python, there are loggers there. If you’re using C#, there are loggers there. And the cool thing about cloud-native is that it lets us ignore, you know, I don’t have to have a system comprised just of Java or just of Go. I can have multiple different types of applications and application frameworks and languages and all this stuff playing together abstracted from me.

But it can be really challenging to integrate all those different log sources together. The same thing happens with metrics, and the same thing happens with traces. You need some neutral piece in the middle. You need some sort of API that can be used by not only people developing applications but also library authors also people writing infrastructure components, things like queue libraries or SQL clients or whatever. If they want to provide Telemetry to their users, to people that are integrating their software, you want a standardized way to express these logs and traces and metrics. And to get value out of those, it's not enough to just have them all go to a file or go to standard out. You want to put them somewhere. You want to put them in a tool that can help you analyze that data and give you some insight about what's going on. You also need a way to have a common exposition format for them so that you're not writing endless levels of translators. So OpenTelemetry provides all of these things. It's an API right now creating distributed traces and metrics. It's also an SDK that you can integrate that will produce those traces and metrics. And then it's also a set of tools that can integrate traces, metrics, and logs that are created by all sorts of different existing libraries, tie them together, and put them into a common format that your analysis tools can understand. So if I have a bunch of Apache logs and I've got Prometheus metrics and maybe I've got Jaeger traces, I can throw all those together in an OpenTelemetry collector. And then it can translate them to a bunch of different things. It can send them to proprietary vendors. It can send them to other open-source tools. It can transform them into a common format called OTLP and write them to object storage so that I can go back later and do queries across them using SQL dialect. It’s a Swiss Army Knife almost for all this different Telemetry data. 
And the eventual goal is that this will all be integrated into the rest of the cloud-native ecosystem so that you, as an end-user, you're not having to go implement a bunch of this stuff. It’s there; it's a toggle you turn on.
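
The collector idea described above (many input formats in, one common format in the middle, many destinations out) can be sketched roughly as follows. This is a toy model, not the OpenTelemetry Collector or OTLP: the `Record` shape, the `from_access_log` and `from_metric_sample` translators, and the `ListExporter` class are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A stand-in for a common telemetry format (hypothetical, not OTLP)."""
    signal: str  # "log", "metric", or "trace"
    name: str
    attributes: dict = field(default_factory=dict)


def from_access_log(line):
    # Translate a simplified access-log line such as "GET /cart 200".
    method, path, status = line.split()
    return Record("log", "http.access",
                  {"method": method, "path": path, "status": int(status)})


def from_metric_sample(name, value):
    # Translate a simplified (name, value) metric sample.
    return Record("metric", name, {"value": value})


class ListExporter:
    """Pretend backend that just remembers what it was sent."""
    def __init__(self):
        self.received = []

    def export(self, record):
        self.received.append(record)


class Collector:
    """Fan every normalized record out to all configured exporters."""
    def __init__(self, exporters):
        self.exporters = exporters

    def receive(self, record):
        for exporter in self.exporters:
            exporter.export(record)


backend_a, backend_b = ListExporter(), ListExporter()
collector = Collector([backend_a, backend_b])
collector.receive(from_access_log("GET /cart 200"))
collector.receive(from_metric_sample("http.requests", 1))
print(len(backend_a.received), len(backend_b.received))
```

The point of the design is that each source needs only one translator into the common format, and each destination needs only one exporter out of it, instead of one translator per source and destination pair.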

CHARLES: It’s just there. So how does this compare and contrast to more traditional approaches? I mean, this is certainly outside my personal domain of expertise, but how would this compare to tools like Splunk and Logstash? And are they in a similar domain and if so, what would be the differentiator?

AUSTIN: That's a good question. There are a lot, I mean, there's obviously Splunk, like you said. And what I think you see is that there's -- maybe if I say metrics you think Prometheus or you think Nagios or you think Datadog.

CHARLES: Nagios, now there is a name I haven’t heard in years. [laughter] Of all the podcasts in the world, Nagios just comes walking into mine.

AUSTIN: Yeah, how many are name-dropping Nagios in 2021?

CHARLES: [Laughs]

AUSTIN: But if you think about it, traditionally, we've thought of this very much like a one-to-one mapping where you have a tool for logs. Maybe you're using Splunk for your logs and you've got a bunch of Splunk log collectors deployed everywhere. Technology changes, new things come along and you say, oh, well, now I want to add traces. I want to add an APM. So what am I going to do? Either I'm stuck with my existing vendor, or I want to get a new one. But if I want to get a new one, now I have to integrate a whole new thing. I have to integrate a whole new library, a whole new agent. I have to send data off to this other place, and now I've got two different tabs to click through. And you can be in this situation where you have a bunch of different really disconnected experiences and tools because you're using all these different proprietary solutions.

The advantage of OpenTelemetry is that it takes that middle layer, the layer between your application code and your infrastructure, and at the top of this, you have your analysis platform, your observability tools, that's your Splunks, that's your Grafanas, that’s your LightStep or your Datadogs. OpenTelemetry sits between those, and it collects the data from your application code, translates it, and then sends it up to whatever you're going to use to analyze it. And because it's open-source, it's all out there in the open. If something doesn't work right, you can see what's going on. You don't have to be limited or restricted by what your vendor offers you. Maybe you're using a version of a library that isn't instrumented by your vendor, by their proprietary agent. With OpenTelemetry, you can go see: did someone in the community already contribute this plugin so that my particular version of Express has instrumentation for metrics and traces?

TARAS: We had this experience recently on a project where we had to integrate logging and tracing into a GraphQL application. And it's tricky because if you've never done it before, it's a bit of like what am I doing here? [laughs] How are we actually going to go about doing this?

AUSTIN: Yeah, it can be very challenging.

TARAS: Yeah. It's definitely not trivial. One of the solutions that we ended up going with in addition to -- so in a stack, it's a cloud-native platform that's being created internally and one of the challenges that we've seen is that the data ends up in a data lake. The logging gets collected and sent over, but we've found that a tool like Sentry provided better information about the logging that we have or the way that we could consume that information was much more detailed in Sentry, so we ended up actually integrating Sentry directly. But now we have a situation, we have a platform that is designed to do this that has Prometheus and has a bunch of things. But the vendor, an external vendor, actually offers a better developer experience. Is the granularity or the kind of data -- like, is anything getting lost in the process? So if you use something like OpenTelemetry between, for example, Sentry so if we could write to OpenTelemetry and then have that collected and sent over to Sentry, would you expect the quality of the data to be lost in the process? How transparent do you think that would be?

AUSTIN: So I'm not 100% familiar with the instrumentation Sentry does. I know broadly what it does, but I don't know the details, so I don't want to speak poorly of it. There's nothing technically that would stop it from being just as good because at the end of the day, what we're hoping to encourage with this is that -- there are two parts like you said; there's the quality of the data that's coming out of the application process and then there's the analysis piece of that data that's being done by the proprietary tool. So since it's all open source, since the part where you're instrumenting the code to collect that data and you're building all these hooks and stuff like that that's all open source, there's no reason that the data quality can't be the same. The advantage of OpenTelemetry is that as the community improves these plugins and improves these integrations, then the benefits of that improvement are applied equally. So now it's not just like oh, you have to use Sentry because Sentry gives you a plugin that gives you the best data quality. You can use OpenTelemetry, and OpenTelemetry will naturally get that improved data quality through contributions and through people and vendor partners contributing their code back to it. So the real question then becomes which tool is going to work best for me? Maybe Sentry is going to do since it has some whiz-bang collection or sampling algorithm or a way to do analysis that sets them apart from other vendors and then they can compete on that.

What I see happening right now a lot is people get forced into using things they don't want to use because of this problem, because it's like, well, I have to use this because this is what -- this might not be the best tool for me, but it's what IT approved, or it's what we already have a contract for. And we can't use this new open source thing that might work better for us, or we can't use this other proprietary thing as a point solution to fix or to give us a really good analysis about this one set of our services. OpenTelemetry makes it very easy to take the same data and then tee it off to different places. So you can actually do stuff like, hey, I have stuff from the web. I have a web app and then I have a back-end app and all that data is going to OpenTelemetry collectors. And then maybe I take all the web stuff and I forward it to Sentry and then all the back end stuff and I forward it to Jaeger. So there are a lot of options. It’s like Kubernetes in that way. Right now OpenTelemetry works at a pretty low level, which is good and bad. It's good in the sense that it's very easy to make it do what you want it to do because we're giving you access to a lot of stuff really down in the weeds. It's maybe bad because it's a little hard to get started with because of that. You have to know what you're doing. So that's one of the things that I think the project is working on. What we’re really focused on over the next few months is trying to make that getting started experience a lot easier.
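
The routing described above (web data to one tool, back-end data to another) might look something like this in miniature. It is a sketch only: the `source` field, the `ROUTES` table, and the destination labels are assumptions for illustration and are not real collector configuration.

```python
from collections import defaultdict

# Hypothetical routing table: where telemetry from each kind of source should go.
ROUTES = {"web": "sentry", "backend": "jaeger"}


def route(spans, routes):
    """Group spans into per-destination outboxes based on their 'source' field."""
    outboxes = defaultdict(list)
    for span in spans:
        destination = routes.get(span["source"])
        if destination is not None:  # spans with no matching route are dropped here
            outboxes[destination].append(span)
    return dict(outboxes)


spans = [
    {"source": "web", "name": "page load"},
    {"source": "backend", "name": "SELECT cart"},
    {"source": "web", "name": "click checkout"},
]
outboxes = route(spans, ROUTES)
print({destination: len(batch) for destination, batch in outboxes.items()})
```

Because the routing lives in one table, switching a destination means editing one entry rather than re-instrumenting the applications that produce the data.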

CHARLES: Could I ask a little bit? Because this is the question that was brewing in my mind as I was listening and then you started talking about okay, we've got a front end, we've got a back end, and maybe we've got an app. What does the developer experience look like now if I want to start collecting metrics with OpenTelemetry? Do I install an SDK? Do I install some middleware into my app? What does it look like if my back end is, you know, I've got Node services on the back end, and I've got an iPhone app and a web app, how would I work in each of those different environments?

AUSTIN: That's a good question. The overall process is pretty similar in each language. You are going to install an API, and you're going to install an SDK, and then you're going to install the appropriate integrations. Now, in some languages, this is a lot easier than others; probably the easiest is Java. So let's say I have a Spring Boot application. I can go download a repository; it’s OpenTelemetry-Java auto-instrumentation. I go there, they publish a JAR file. I download that JAR, I add the Java agent flag to my command line and I give it some config properties. And then when the application starts up, that code will look at all the libraries. It'll see what has instrumentation available, what has plugins available. It'll hook into all those and then it'll start collecting data and sending it off wherever I told it. So that's the easiest. And honestly, for languages that support that integration, that's what we would love to see everything get to. For other stuff, it's a little trickier right now because one of the good things and bad things about a project like this is that it's designed to be very broad. We're not trying to make a really good thing for Java and just Java. We're trying to make a really good thing for pretty much every language, every modern language, at least, which includes stuff like Swift and PHP, even though I just said modern language. But we use PHP, right?

CHARLES: [laughs] PHP is a postmodern language.

AUSTIN: Exactly. But a lot of people use it. [laughter] It's not abandoned, and it's not dead. It's actually like there's a ton of PHP in production. The problem is that you have different levels of -- there are just different amounts of support in the community for those languages. There are a lot of people that know Java and Go really well and JavaScript and Node and things like that that are working on it, and so that's where a lot of heat and light is. Maybe the Swift repo is a little behind. It's not quite there. One way we've helped with this as a project or it will help you as an end-user understand the maturity of things is that when you go in, you go and look at one, there's a maturity matrix. This is all being done through -- there's a specification and our big recent thing is that specification hit 1.0 for the distributed tracing part of this. And then the next thing will be metrics, and then the next thing will be logs. So those metrics and logs are both technically in an alpha/experimental stage. It really just means they can change, but we're not throwing out the entire thing and starting over. But you can see oh, Swift is maybe a little bit behind. Swift doesn't support everything. It’s not completely up-to-date to the spec and maybe supports spec version 0.5 rather than spec 1.0. But you would go and you would install the API; you would install the SDK and then if integrations existed, you would import those based on however it works. Like in Node, if you have a Node app, there are instrumentation plugins for Express. And so you can import that and then register that as a middleware and then that will start tracing your incoming requests. There's an instrumentation plugin for HTTP and HTTPS, so you can import that and you can have that hook into your outgoing HTTP requests. There are plugins for gRPC. So if you're using gRPC for RPCs, you can bring that in, configure your gRPC server and client with those as middleware and go from there.
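
As a rough picture of what an instrumentation plugin registered as middleware does, the sketch below wraps a request handler so that a span is recorded when the request starts and finishes. Everything here (`span`, `tracing_middleware`, `finished_spans`, the request and response dictionaries) is a hypothetical stand-in written for this example, not the OpenTelemetry API or the Express plugin.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for wherever completed spans would be exported


@contextmanager
def span(name, **attributes):
    """Record the name, attributes, duration, and any error for one unit of work."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        finished_spans.append(record)


def tracing_middleware(handler):
    """Wrap a handler so every incoming request produces one span."""
    def wrapped(request):
        with span("HTTP " + request["method"], path=request["path"]) as current:
            response = handler(request)
            current["attributes"]["status"] = response["status"]
            return response
    return wrapped


def get_cart(request):
    # The application code itself knows nothing about tracing.
    return {"status": 200, "body": "3 items"}


app = tracing_middleware(get_cart)
response = app({"method": "GET", "path": "/cart"})
print(finished_spans[0]["name"], finished_spans[0]["attributes"])
```

The handler is untouched; all of the tracing lives in the wrapper, which is what makes it possible to ship this kind of instrumentation as a separate plugin.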

It's all designed to be somewhat conventional to the language it's in. Each language SIG is supposed to hew to whatever the conventions of their language are, for lack of a better term. So we want the Java stuff to look and feel like Java stuff. We want the C# stuff to look and feel like C# stuff. We want the Node stuff to look and feel like Node stuff. The advantage, though, is that because all of these are running off the same specification, in the default state, everything should work together. So if I have my front end single-page app in React and I go and I install the React instrumentation and I install the XMLHttpRequest instrumentation to my web app and then I have my Express server that's running the Express plugin, when I send a request from the single-page app back to my API server running in Express, that should emit a single trace. Then it'll handle all the tricky things like injecting context, extracting context. All the things that used to be really hard in terms of inter-op between multiple systems, out of the box, most of that stuff should just work. And as the project matures, it'll get from it should just work to it will just work. [laughter]
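
The "injecting context, extracting context" step can be sketched with the W3C Trace Context `traceparent` header, whose version-00 layout is `00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`. The helper names `inject` and `extract` below are chosen for this illustration and are not the signatures of any OpenTelemetry SDK; the sketch only shows how a trace ID can ride along on an outgoing request so the server's span joins the same trace.

```python
import re
import secrets

# Layout of a version-00 traceparent header value.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Client side: write the current trace context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Server side: read the caller's trace context, or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }


# Browser side: start a trace and attach it to the outgoing request.
client_trace_id = secrets.token_hex(16)  # 16 random bytes, 32 hex characters
client_span_id = secrets.token_hex(8)    # 8 random bytes, 16 hex characters
request_headers = inject({}, client_trace_id, client_span_id)

# Server side: continue the same trace with a new child span.
incoming = extract(request_headers)
server_span = {
    "trace_id": incoming["trace_id"],
    "span_id": secrets.token_hex(8),
    "parent_span_id": incoming["parent_span_id"],
}
print(server_span["trace_id"] == client_trace_id)
```

Because both sides agree on the header format, the browser span and the server span share one trace ID, which is what lets a backend stitch them into a single trace.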

TARAS: And it does just work.

AUSTIN: The spec is 1.0 but the implementations aren't quite there yet, but in Java, Go, Node, Python probably by the time this is out most of those will have their first release candidate for 1.0. And that'll be for tracing and context and stuff. The metrics stuff should be 1.0 by this fall. But one other part of it also is we're not trying to sit here and say like, oh, you have to change everything. Because a big part of this is playing nicely with stuff that already exists in the cloud-native ecosystem so Prometheus and OpenMetrics, for example. We have our own metrics API that exists for a lot of reasons and I think solves a lot of problems that people writing metrics have identified in the past. But we also want to make sure that, hey, you're using Prometheus? Cool, that will still work with OpenTelemetry, and you can integrate that and you can have your Prometheus stuff play nicely in this ecosystem. We're not writing our own logging API. There are already a million logging APIs, and people are generally pretty happy with whatever they use. So we don't want to come in and reinvent the wheel. What we want to say is that okay, cool, your logging API can integrate with this idea of OpenTelemetry like the context layer of it so that you can correlate your existing logs with the traces or the metrics that OpenTelemetry is generating.
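
One way to picture correlating existing logs with traces is to have the logging setup stamp every log line with the current trace ID, leaving the application's logging calls unchanged. The sketch below uses only Python's standard `logging` and `contextvars` modules; the `current_trace_id` variable and `TraceIdFilter` class are hypothetical names for this example, and a real integration would read the ID from the tracing SDK's context rather than a hand-rolled variable.

```python
import contextvars
import io
import logging

# Stand-in for the tracing context: holds the trace ID of the request being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record passing through the handler."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # capture output in memory so the example is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# While a request is being handled, its trace ID is set in the context.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("charging card")
current_trace_id.reset(token)

# Outside of any request, the default placeholder is used.
logger.info("idle")

print(stream.getvalue(), end="")
```

The call sites still just say `logger.info(...)`; the correlation is added at the handler, which matches the goal of working with the logging API people already use.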

We want to be a good -- I don't want to say we want to be invisible, but we don't want people to have to change the way they do everything. We want to work with what you're already doing. A huge part of this project was making sure that it's composable. You can reimplement parts of it if they don't work for you. Or if you already have a legacy tracing system, because a lot of people do, they might have a more basic one. They might have just simple correlation IDs that are sent through logs. But we want to make sure that hey, you've got some system that you wrote five years ago and it works well for you and you're thinking oh, I want to do something more advanced? Well, cool, we designed this in a way that you should be able to without a ton of extra work stitch that old stuff and this new stuff together and have everything work in harmony.

TARAS: I can see this being really powerful, especially now, in the context of cloud-native platforms, with companies creating their own internal cloud-native platforms. And quite often, when you're building a cloud-native platform from scratch, no matter how many people you have, you are probably going to be understaffed. So having something like OpenTelemetry as an API, like a documented API that you could use right off-the-shelf so you don't have to document your own solution, is valuable, because I think often the biggest challenge with integrating with cloud-native platforms is that the infrastructure that a developer is supposed to use is usually not documented enough. And so having something like OpenTelemetry where you could just take it right off-the-shelf and then have the platform essentially say just use OpenTelemetry in whatever language you're using to write your service, I think that's a really big value. So, developers don't need to learn or reverse engineer their particular platform's setup for logging and maybe how that thing works with Jaeger and all that stuff. I think it would be really helpful there.

AUSTIN: I think one other thing as well by offering an open-source standard for this, something that is very strictly being done under the auspices of the CNCF -- and so there's a governing structure that does prevent this quite deliberately from ever being taken over by one vendor, which was something that I think a lot of people were concerned about. And I'm concerned about it. I work for a pretty small company compared to a lot of people in this space. Nobody wants this to just become the FAANG show. Nobody really wants there to be a situation where the big five or the big three cloud vendors of the world are calling the shots and everyone else just has to follow along without at least having the ability to influence things. So the governing structure of this project is set up to ensure that there is representation from multiple companies on the governance and technical committees. You can't just have one thing take it and run. But I think what that helps with is that since that does make it truly agnostic, it allows these big cloud players to say like, okay, well, here's something we haven't done before. Like if you think about Amazon, let's assume you're running everything on Amazon, there are tools in Amazon to give you -- there are things like X-Ray and then CloudTrail and CloudWatch and stuff where it's like, okay, you can get visibility into metrics and traces that maybe go into managed services. You can see a trace about S3 or about Aurora or whatever. OpenTelemetry by existing outside of this a little bit lets them say like, “Okay, we'll also make this data available in OpenTelemetry format,” and that way, you can take it. Stuff that previously you could only use within the confines of a proprietary vendor in their tools, now you can take that and export it and send it somewhere else. So you can send it to your own open-source, maybe you built your own observability stack, cool. 
You can take that data that's coming to you in OTLP, and you can put it into your stuff. You can take it and put it into another proprietary vendor. That gives you I think a lot of freedom and flexibility as an end-user to build what works for you.

I don't think everyone needs the same thing is maybe the best way to put it. Some people don't have either the operational complexity or the organizational maturity or whatever for a big hotshot observability solution. Not everyone needs all the features of every single platform out there either. Some people are going to be fine using completely off-the-shelf open-source thing, or they're going to want to do their own weird custom stuff. Maybe they're providing Telemetry back to their end-users in some way so they need something that lets them split-brain this and send some stuff back out to the internet and to who knows. So the advantage of OpenTelemetry now is because it acts as this neutral agnostic standard in the middle, suddenly vendors that before wouldn't provide Telemetry data to you or couldn't provide Telemetry data to you now can say in the future, “Oh yeah, we offer OTLP. We offer this Telemetry, these traces, these metrics, these logs, or whatever, and you just give us an endpoint to write to, and we'll send you this stuff in OpenTelemetry format. And then it'll integrate with whatever else you got as long as it's all talking OpenTelemetry as well.”

CHARLES: So who else is involved in the Telemetry project/consortium? You mentioned it was part of the cloud-native...was it Cloud Native Foundation?

AUSTIN: Yeah. So it's like a Cloud Native Computing Foundation sandbox project right now. We're hoping to move into incubating this year. We're actually one of the most popular CNCF projects by contributor so second or third only to Kubernetes. It depends on when you look at the stats. But we have contributors and governing members from everyone from Google to Amazon, Splunk, Datadog, New Relic, Honeycomb, LightStep, Dynatrace, AppDynamics is in there a little bit. Any sort of observability company you can think of anyone that's selling dashboards and traces and metrics and all that stuff they're involved, Grafana, Elastic is involved. So there's an extremely broad base of support. It's been really exciting to see all of these different vendors that have their own agendas come together and work on this. I think, for the most part, it's all with very pure motives.

TARAS: Yeah. The cloud-native community to me is very inspiring because it's really great to see how this vendor-neutral approach they've taken has worked out. It seems to be working. I'm sure it's not perfect. I'm sure it has some rough edges, but it's like what you're saying it's different companies that are operating in the same space that are coming together to have something that does bring value and does provide flexibility. That's really great to see especially for something like cloud-native space where we are talking about building huge platforms that require huge investments over a long time. So being able to have flexibility and be able to have evolution is really important.

AUSTIN: Well, a lot of it also especially, in this case, let's go back to earlier talking about Sentry and the data quality issue. The question of instrumenting all these different existing pieces of code was actually really silly when you think about it because maybe you were using New Relic. Oh, I'm sorry. I forgot to shout-out New Relic. I don't remember if I shouted-out New Relic, but they're also very involved in OpenTelemetry. Before, you would say like okay, maybe I'm using New Relic, and I have this proprietary agent that's going to do this instrumentation and send data to their APM endpoints and then Datadog says, “Well, we want to do an APM tool.” So Datadog has to go in and they have to reimplement all this stuff. And SignalFx, before they [inaudible 30:12] by Splunk, it's like oh, we have an APM tool, so we have to reimplement all this stuff and so on and so forth. And it's silly because there's only really a layer of software there. That layer of instrumentation for Express or whatever is completely undifferentiated. It's just like I have a hook on something starting, and there’s something finishing.

CHARLES: Right. The value is in what you do with the data not in actually saying here's the data.

AUSTIN: Right. These were incredibly undifferentiated, incredibly commodified instrumentation libraries. But all these different vendors had to provide and maintain roughly the same code base duplicated across 5 or 6 or 10 or however many different companies. And so I think a lot of the inspiration behind OpenTelemetry is simply that one, wouldn't it be nice if we didn't have to do that? There are a bunch of individual vendors -- I'm not going to put anything on anyone's motive here, but there's certainly a story in open source about socializing maintenance costs. But I think the bigger thing is probably it's not so much about the today; it's about the tomorrow. And the tomorrow looks a lot more like well, if there is an actual open standard that everyone agrees on, there's an API that everyone accepts and there's an exposition format for this data that is neutral and that we can all work with, then why not just support that in the hopes that upstream maintainers will integrate it natively into their libraries? So instead of having to install a plugin to Express to get it to do tracing, why doesn't Express just have OpenTelemetry support?

And so I think that's what you're going to see over the next couple of years as this API hits 1.0, because we have stability guarantees on this for three years. So once we get an actual 1.0, I think there's going to be a bit more of like oh well, we'll just take this code that was previously this third-party plugin that you installed. And instead of having this external middleware package, what if there was just a built-in tracing middleware for Express, so that when you create your Express server you just say, “express.use(tracing).” And then it'll just look for OpenTelemetry, an OpenTelemetry SDK, and ta-da! Now, this is not an OpenTelemetry problem; this is an upstream framework maintainer problem. And that also helps everyone too because that means that the people that actually write the software, the people that actually write the frameworks and libraries can start to have better control over the instrumentation and can say, “This is what we think is really important. We can provide a recommended dashboard or a recommended troubleshooting guide that says, hey, this is what you should watch for. These are the attributes that are important for you to be aware of. This is maybe a good metrics query that you can run or an alert or whatever that we can tell you about because now there's this common language for expressing those metrics, there’s a common language for expressing those traces, for expressing the semantic attributes of a request.” And I think that could be a pretty big revolution in the way that we, as an industry, think about performance and talk about performance and how to optimize performance and understand incidents and things like that just because now we have this lingua franca of OpenTelemetry to discuss these sorts of problems in.

TARAS: The topic of Telemetry, so is there overhead to -- I don't really know how to ask this question, but I’m curious what is the standard that you measure for performance of any of these tools? Is there an acceptable impact that you would see? In theory, I would imagine there'd be like no impact at all, but is there overhead that's introduced when you start to do tracing that is measurable?

AUSTIN: I mean, it depends on the perspective. So what I would say is this: we have standards and specifications. We've specified benchmarking suites and things like that. That part of this journey to GA includes these full benchmarking suites and testing and being able to give a quantified answer to this question. What I would say is from an end-user perspective, generally, no, there's no real end-user impact or at least not a ton. So if you think about how a span is created, it's an allocation, and you have a span for every single request, if you're doing 100% sampling, you're creating spans for every single request that moves through your server. So there is some overhead because every single request is now going to have the additional overhead of oh, I had to create a span and that span needs all these attributes and so on and so on and so forth. But the actual act of exporting that data and doing that stuff is done off the main threads, so to speak. So that part is pretty snappy, and we've done, I think, the best we can, and we can always do better. But certainly, when we developed these SDKs, they were done with an eye towards performance. So where you can, avoid unnecessary allocations, where you can, avoid expensive operations.

The place where you're most likely going to see it is on the export side of things because you have to write this data somewhere. And so you're either writing over a network link or you’re writing into a file. If you're writing into a file, then that's pretty snappy. But over a network link, then you obviously have the additional overhead of creating requests out to this other service through gRPC or through HTTP or whatever, and then compression and TLS negotiation and all that. That said, in my experience, most of the overhead people incur with distributed tracing isn't necessarily user-facing impact. It's not like oh, I've added 10 milliseconds round trip. It's mostly memory pressure or CPU overhead on my process goes up because I'm just doing a little extra work for every single request and a little extra work for every single request will add up. So that just means you need to horizontally scale more.
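To make the split between the request path and the export path concrete, here is a minimal sketch of buffering finished spans and exporting them in batches, similar in spirit to the batch span processors found in the OpenTelemetry SDKs but not their actual implementation. The `FinishedSpan` shape, the `BatchingBuffer` class, and its parameters are all hypothetical.

```typescript
// Hypothetical finished-span record; real SDK spans carry much more.
interface FinishedSpan {
  traceId: string;
  name: string;
  durationMs: number;
}

type ExportFn = (batch: FinishedSpan[]) => void;

class BatchingBuffer {
  private queue: FinishedSpan[] = [];
  public dropped = 0;

  constructor(
    private readonly exportBatch: ExportFn,
    private readonly maxBatchSize: number,
    private readonly maxQueueSize: number,
  ) {}

  // Called on the request path: only an array push, no I/O.
  onEnd(span: FinishedSpan): void {
    if (this.queue.length >= this.maxQueueSize) {
      this.dropped++; // shed load rather than grow memory without bound
      return;
    }
    this.queue.push(span);
  }

  // Called away from the request path (for example, from a timer).
  // Returns the number of batches handed to the exporter.
  flush(): number {
    let batches = 0;
    while (this.queue.length > 0) {
      this.exportBatch(this.queue.splice(0, this.maxBatchSize));
      batches++;
    }
    return batches;
  }
}
```

The bounded queue reflects the memory-pressure trade-off mentioned above: the cost on each request stays small, and if the exporter falls behind, spans are dropped instead of accumulating in the process.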

In my opinion, the benefit of tracing far outweighs the trade-off in headroom just because having that ability to zoom in on one individual request and see what happened at every single step is invaluable when you're dealing with distributed system problems and having 100% -- this is one thing that we do at LightStep. We encourage you to use 100% unsampled data and send that to us and then we use fancy dynamic sampling stuff to make it manageable, but the ability to just take this entire firehose of data and look and drill all the way down to one individual request and one individual service out of potentially millions is really unimaginable. And this is a tangent, and this is not a shill, I promise, but Spotify uses us. And so I've talked to some of their engineers, and you can imagine the request load of Spotify at every second of every day, the amount of traffic moving through that system, and the ability that they have to look into those billions and billions of requests and narrow it down and just say oh, I want to find this one error or this error that's localized to this zip code is just incredibly powerful, and that's something that's really hard to do I think with traditional -- certainly, it's hard to do with just metrics. It's annoying to do with metrics and logs, but when you have metrics and logs and traces and you’ve integrated all of these things, it becomes easier to do. And it also lets you fix problems a lot faster and also understand how problems are impacting actual users rather than aggregate groups of users.

CHARLES: Yeah, that actually sounds amazing.

TARAS: Do you encourage people to capture all the traces and then provide tooling to find a needle in a haystack?

AUSTIN: To be a little more clear here, OpenTelemetry lets you pick how you sample, and we offer tools to help you do things like tail-based sampling where you send all of your traces to a pool of OpenTelemetry collectors and then you ensure that all the spans for one trace wind up on the same collector, and then you can make a tail-based decision about hey, does this have an error, or does this meet some criteria? Now, LightStep also supports OpenTelemetry. We actually natively support ingesting the OpenTelemetry protocol and our unique architecture allows us to take 100% of that data with no upfront sampling and then do dynamic sampling on it where we can create trace aggregates, and we can do a lot of really cool stuff to not only give you that high-level picture but also pick the important stuff out to keep for longer for that long-term analysis. I didn't necessarily come here wanting to talk about that too much, but if you're interested, lightstep.com, we have a bunch of very fun marketing things that can tell you about how great we are.
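The routing step Austin mentions, making sure every span of a trace lands on the same collector so that a decision can be made once the whole trace is visible, can be sketched as follows. This is a simplified illustration under stated assumptions, not how the OpenTelemetry Collector is implemented: the `SpanRecord` shape, the `TailSamplingCollector` class, and the choice of an FNV-1a hash are all hypothetical.

```typescript
interface SpanRecord {
  traceId: string;
  name: string;
  isError: boolean;
}

// FNV-1a: a simple, deterministic 32-bit string hash (chosen only for illustration).
function hashTraceId(traceId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// The same trace ID always maps to the same collector index.
function collectorFor(traceId: string, collectorCount: number): number {
  return hashTraceId(traceId) % collectorCount;
}

// One collector: groups spans by trace, then decides per trace.
class TailSamplingCollector {
  private traces = new Map<string, SpanRecord[]>();

  receive(span: SpanRecord): void {
    const spans = this.traces.get(span.traceId) ?? [];
    spans.push(span);
    this.traces.set(span.traceId, spans);
  }

  // Tail-based decision: keep only traces containing at least one error span.
  decide(): string[] {
    const kept: string[] = [];
    this.traces.forEach((spans, traceId) => {
      if (spans.some((s) => s.isError)) kept.push(traceId);
    });
    return kept;
  }
}
```

Because routing depends only on the trace ID, a span that arrives late or from a different service still reaches the collector that already holds the rest of its trace.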

TARAS: Well, it's interesting. Part of the reason why I wanted to talk to you about this is because one of the challenges that Frontside has been dealing with is we’re building this brand new testing platform. It's essentially a really complex distributed system that you run locally. I mean, from the developer perspective or from the user perspective, a lot of the complexity of the implementation is to make it really easy for developers and make it really fast. So we have all these perks, but one of the trade-offs you make when you have a distributed system is that you have different processes that are communicating over HTTP and in some cases over web sockets. And then when something breaks down, you're like, I have no idea what the hell is going on. [laughs] And that's the situation that we're dealing with. And one of the things I was thinking about is actually integrating OpenTelemetry into the library that we use to create the testing tool so we can have a Jaeger trace that shows us a structured log of the particular operation. So you run a test, and you can actually see the entire structured trace of the test. And then the coolest thing about this is that when you have this as a starting point, then you can actually extend this with the test results that reach all the way into the system. So the root is the initiation of the execution of the test, but your trace can include the data that is coming from essentially one long trace going all the way into the infrastructure. So you can actually get to a place where you see the output. You can actually see information about what's deep inside of the microservice that is providing the data for the test result. That seems like a really appealing thing to be able to do.
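The structure Taras describes, a root span for the test run with child spans reaching down into the services behind it, comes down to spans linked by parent IDs. Here is a minimal sketch of that data model; the `TestSpan` shape, the `renderTrace` helper, and the span names are hypothetical and are not taken from Frontside's tooling or from Jaeger.

```typescript
interface TestSpan {
  spanId: string;
  parentId: string | null; // null marks the root span
  name: string;
}

// Renders spans as an indented tree, one line per span, depth-first.
function renderTrace(spans: TestSpan[]): string[] {
  const lines: string[] = [];
  const visit = (parentId: string | null, depth: number): void => {
    for (const span of spans) {
      if (span.parentId === parentId) {
        lines.push("  ".repeat(depth) + span.name);
        visit(span.spanId, depth + 1);
      }
    }
  };
  visit(null, 0);
  return lines;
}

// Hypothetical spans from one test run; the names are illustrative only.
const testRun: TestSpan[] = [
  { spanId: "1", parentId: null, name: "test: user can log in" },
  { spanId: "2", parentId: "1", name: "websocket: send command to agent" },
  { spanId: "3", parentId: "1", name: "http: POST /login" },
  { spanId: "4", parentId: "3", name: "auth-service: verify credentials" },
];
```

Calling `renderTrace(testRun)` yields the test at the root, the two transport operations beneath it, and the service-side work nested under the HTTP call, which is the "one long trace" view of a single test.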

AUSTIN: Totally. I think that’s a really good example of when you would want to do no upfront sampling because -- and this is a challenge with the way -- I think people have this idea they’re like oh, sampling is...there's maybe two camps here. There's you can't do this without sampling, and then there's you sample everything. You can't lose anything. And the truth is in the middle because, in any production system, 90%, 99% of the data is honestly not that useful. The reason why is because a system at rest tends to stay at rest, or you tend to get very similar stuff. And a system where everything is broken, everything is going to be broken in a similar way. With OpenTelemetry, we give you the ability, I think, to separate the sampling decisions and move them to a different layer. So instead of saying, “I have to make these sampling decisions in my process,” it’s more okay, you can exfiltrate your data to a collector or to a pool of collectors and then you can do the sampling there, which I think will unlock in the future more advanced, sound sampling algorithms and the ability to make better decisions about your data and what gets sampled and what gets sampled out.
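One way to picture a sampling decision that lives in a collector instead of in the application process is a small policy function applied to each completed trace: keep everything unusual, and keep only a fraction of the routine traffic. The sketch below assumes hexadecimal trace IDs; the `TraceSummary` and `SamplingPolicy` shapes, the thresholds, and the `shouldKeep` function are hypothetical and not an OpenTelemetry API.

```typescript
interface TraceSummary {
  traceId: string; // assumed to be a hex string
  hasError: boolean;
  durationMs: number;
}

interface SamplingPolicy {
  slowThresholdMs: number;
  baselineRatio: number; // fraction of unremarkable traces to keep, 0..1
}

// Maps the last 8 hex digits of the trace ID onto [0, 1), so the same trace
// always gets the same baseline decision. A non-hex ID yields NaN, which
// fails the comparison below and is therefore dropped.
function traceIdFraction(traceId: string): number {
  const tail = traceId.slice(-8);
  return parseInt(tail, 16) / 0x100000000;
}

function shouldKeep(trace: TraceSummary, policy: SamplingPolicy): boolean {
  if (trace.hasError) return true; // always keep failures
  if (trace.durationMs >= policy.slowThresholdMs) return true; // and slow requests
  return traceIdFraction(trace.traceId) < policy.baselineRatio; // sample the rest
}
```

Since the policy is just data handed to the collector, it can be changed (for instance, set `baselineRatio` to 1 for a test environment where every run matters) without touching or redeploying the instrumented services.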

For something like you were describing, actually, you'd probably want to err on the side of yeah, keep everything because you're not running tests every single minute of every single day. And you do want to have that full visibility into everything that's going on because, from run to run, there can be a high level of variance in what actually happens because the code is changing all the time, right? You're not generally running the exact same tests every time. I would definitely say OpenTelemetry would be a perfect thing to integrate there to get that visibility into what's going on.

TARAS: Well, we'll give it a shot, and we'll let you know how it goes. [laughs]

AUSTIN: Hey, poke me on Twitter if you have any questions.

TARAS: Yeah, awesome. And where can people find you on Twitter?

AUSTIN: Yeah, I'm @austinlparker. I tweet regularly. It's like the only social media I pay attention to.

CHARLES: All right. Well, thank you so much for coming by, Austin L. Parker. We already mentioned you work for LightStep, lightstep.com. Where can people go to find out more about the OpenTelemetry project? Is there a URL there?

AUSTIN: Check out our website opentelemetry.io and that has links to our GitHub, to our chat on the CNCF Slack. You can find meetings for all the SIGs, and you can start there, best place.

CHARLES: Fantastic. The longer that I'm a developer, I love just how in many ways it becomes a more and more comfortable experience, because all of these things that were hard to do, or that you didn't even know you should be doing, all of a sudden you just wait a few years and they just happen for free, and you're like, oh, wow.

AUSTIN: Yeah.

CHARLES: It's like you're Neo in The Matrix where you just keep learning all these different kungfus, and all you have to do is just flutter your eyes for a little bit. So it definitely sounds like this is one of those things that is in the works and that people are going to get to benefit from with very little effort going forward.

AUSTIN: Yeah, definitely. Like I said, check it out. By the time this goes up, you should see a lot more 1.0s starting to pop up out there. So we definitely want user feedback.

CHARLES: Well, fantastic. Thank you again and thank you, everybody, for listening, and we will see everybody next time.

Thank you for listening. If you or someone you know has something to say about building user interfaces that simply must be heard, please get in touch with us. We can be found on Twitter at @thefrontside or over just plain, old email at contact@frontside.io. Thanks and see you next time.


Please join us in these conversations! If you or someone you know would be a perfect guest, please get in touch with us at contact@frontside.io. Our goal is to get people thinking on the platform level which includes tooling, internationalization, state management, routing, upgrade, and the data layer.

This show was produced by Mandy Moore, aka @therubyrep of DevReps, LLC.
