How to Build a Data-Driven Company: From Infrastructure to Insights

Companies like Buffer, SeatGeek, and Asana aren’t just talking about the value of data, they’re building data infrastructure that can actually deliver it. Join this 45-minute webinar to learn why these companies are investing in data and what you need to know to keep up.


Shaun: So, we actually have a lot on the agenda today. It only looks like there's three points, but there's a lot packed into those three points. The core of our presentation is going to focus on how companies like yours are solving their data infrastructure challenges. We're going to cover the challenges engineers should expect around data integration, why Amazon Redshift is quickly becoming the data warehouse of choice, and culture barriers around building a data engineering company, plus a lot more.

So, my name is Shaun McAvinney, I am a sales engineer here at RJMetrics. What that means is that I spend most of my time talking to IT people, CTOs, and data architects about their data infrastructure. My co-presenter today is Dillon, from Looker. Dillon, can you introduce yourself and tell us a little bit about Looker?

Dillon: Hi, there. Yeah, thanks, Shaun. I'm a manager of product marketing and analytics here at Looker. But for about the past year, I was also a sales engineer. So, very similar experience to you, there, Shaun. Just to give a quick overview of Looker, we'll certainly elaborate on this as we go throughout the presentation. Looker is a modern BI tool that represents the next step in the natural progression of BI tools.

Looker has a fundamentally different architecture than most tools, which ultimately provides a centralized location for all users, be they technical or non-technical, to access consistent analytics and pull any data they need from their company's database. And all this can be done on the fly, which empowers organizations to make truly data-driven decisions. And again, I'll elaborate on this a little bit more as we go throughout the presentation. But yeah, would you like to share a little bit about Pipeline as well, Shaun? Hello?

Shaun: Oh, sorry about that. I think I was actually on mute, so I'll say the exact same thing I just said again. So, quick plug for RJMetrics Pipeline, I was wondering, I was like, "This sounds weird even to me." We just released Pipeline in early October, it's a seamless data infrastructure that consolidates all of your data and streams it to Redshift. So from there, you can connect Redshift to Looker, and you're ready to go. If you want to learn more about either of these products and how they work together, stick around for the demo at the end. And with that, I'm going to hand it over to Dillon, who won't be on mute to kick us off.

Dillon: Thanks, Shaun. All right, cool. So the first thing I'm going to talk about is data infrastructure, or the actual architecture of legacy and modern data pipelines. So this highlights the traditional approach. For about the last 30 years or so, really since the inception of modern databases, data warehousing has been the standard model to aggregate data and provide business-directed analytics. Data is extracted from various sources, be they databases, third-party applications, flat files, etc., then transformed into a pre-defined model, then loaded into a data warehouse. This ETL process results in data cubes and data silos, where analytics are separated by key groupings for various departments, such as marketing, product sales, etc.

They would probably all have their own data cubes or data silos that they're working off of. This results in a few issues that are fundamentally prohibitive to creating a data-driven organization. First, it's very resource intensive and expensive to manage all of the transformations and data loading. Second, it results in latency in the analytics process. So end users only have access to pre-defined metrics, which are typically either too broad or too inflexible to guide really nimble decision making. This means that end users aren't really getting any actionable insights from their metrics. They're just looking at more high-level analysis.

This also restricts the ability to drill down, and figure out what is driving certain trends, or patterns or anomalies that people are noticing in their data. These are commonly recognized patterns, or excuse me, commonly recognized problems, so nowadays, modern tech companies have reworked this process.

So, nowadays, companies are collecting more data than ever before, and database technology has witnessed significant advances in the last several years. Databases themselves are now capable of performing sophisticated analysis very quickly on this hugely increased data load that companies usually have. This removes the need for data silos and data cubes. So, all analytics can be performed directly in the central database. What that means is that now, it makes sense to shift the burden of the complex transformations to the front of the pipeline, to the BI tool, where transformations can be performed on the fly at query time. So, what are the benefits of this approach? Well, there are several benefits, some of which I just mentioned, but first, you no longer require this huge, resource-intensive engineering or ETL team to move all of your data. So, it's much cheaper on the resource side.

Secondly, technical users can pull data in a language they're used to, which is just SQL. And if you have a modeling layer like Looker provides, then you can actually query the data directly from the UI without having any technical knowledge of SQL. Transformations aren't being done by engineers on the back end. They're being performed as the user pulls data. So they're much easier to repeat, and much easier to understand. This also allows you to audit transformations much more easily, so your users will actually understand the components behind your analysis. They'll understand how a metric is actually defined, uniformly across the organization. And, Shaun, I think you have a few examples of this in practice.
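To make the idea of transforming at query time concrete, here is a minimal sketch, assuming a hypothetical `raw_orders` table and using Python's built-in `sqlite3` as a stand-in for the warehouse. The table, column names, and sample rows are all invented for illustration; nothing here is Looker's or Redshift's actual interface.

```python
import sqlite3

# In-memory database standing in for the central warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw data is loaded as-is, with no upfront transformation.
    CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER,
                             amount_cents INTEGER, status TEXT);
    INSERT INTO raw_orders VALUES
        (1, 10, 2500, 'complete'),
        (2, 10, 1000, 'refunded'),
        (3, 11, 4000, 'complete');

    -- The transformation lives in one readable place and runs at query time.
    CREATE VIEW customer_revenue AS
    SELECT customer_id, SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY customer_id;
""")

revenue_by_customer = conn.execute(
    "SELECT customer_id, revenue_dollars FROM customer_revenue ORDER BY customer_id"
).fetchall()
print(revenue_by_customer)
```

Because the view is just SQL over the raw table, anyone auditing the revenue number can read the one definition everyone else is using.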

Shaun: Yup, exactly. Thanks, Dillon. So, just like Dillon was saying, in the process of data engineering going from that data mart centric, back end heavy, multi-year project, it's actually gained some geek cred. Over the past year, we've watched one company after the other share their "How we built our data infrastructure," and even Looker put out a blog post about this. So, at some point, the data infrastructure gained some geek cred. And we were really interested in the details behind all these projects, so we did a meta-analysis, where we looked at how these companies solved their core data engineering analysis.

We looked at Zulily, we looked at Spotify, we looked at SeatGeek, Buffer, Asana, and many more. And some of these companies, like Netflix and Spotify, are actually building data products, which are recommendation engines. That stack can look slightly different, so for this event, we're really going to focus on companies who are building a data infrastructure for analytics. And for these companies, what we saw is that the process looks very much like what Dillon was just describing in the new school, or the new world, of data infrastructure.

First, they extract the data from a variety of sources, and then they load it into the data warehouse, and then they do transformations on top of that. So, more of an ELT model than an ETL model. Let's start at the first part of the conversation, extract and load, or more simply put, data integration. Just to clarify, the reason why this data integration piece is so important is because all future insights depend on it. Here are some of the use cases that the Asana team laid out. Are marketing campaigns delivering quality users? Are our customer success programs successfully driving revenue expansion? Did our most important customer just hit a bad bug, and do we need to reach out? All of these questions require you to join disparate data sets that currently live in data silos. They're not talking to each other from an analytics perspective. And what the Asana team said, regarding joining this data together, was, "It's difficult work, but an absolute requirement of great intelligence."
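The extract, load, then transform flow can be sketched in a few lines. This is an illustration only, assuming two hypothetical raw tables (`ad_spend` and `orders`) with invented rows, and again using `sqlite3` in place of a real warehouse. It is not how Asana or any other company mentioned here built its pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Extract and load: each source lands in its own raw table, untransformed.
    CREATE TABLE ad_spend (campaign TEXT, spend_dollars REAL);
    CREATE TABLE orders (order_id INTEGER, campaign TEXT, amount_dollars REAL);
    INSERT INTO ad_spend VALUES ('spring_sale', 100.0), ('newsletter', 50.0);
    INSERT INTO orders VALUES (1, 'spring_sale', 120.0),
                              (2, 'spring_sale', 180.0),
                              (3, 'newsletter', 25.0);
""")

# Transform: join the two previously siloed sources when the question is asked.
return_on_ad_spend = conn.execute("""
    SELECT s.campaign,
           SUM(o.amount_dollars) / s.spend_dollars AS roas
    FROM ad_spend AS s
    JOIN orders AS o ON o.campaign = s.campaign
    GROUP BY s.campaign, s.spend_dollars
    ORDER BY s.campaign
""").fetchall()
print(return_on_ad_spend)
```

Neither table can answer "which campaign pays for itself?" on its own; the answer only exists once both live in the same place and can be joined.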

So, here are the most common data sources we saw companies connecting to. Our analysis of how companies built their data infrastructure was based largely on blog posts and some conversations on the topic. One limitation here is that engineers typically tend to write these pieces fairly soon after completion of the project, and there's often the understanding that more data sources will be added later on. The Asana team built connections to the most sources, but there's an enormous amount of insight that can be derived just from connecting ad spend to purchase history, which would be living in production databases. So, for some audience participation, if you guys could grab your mouse and get ready to fill out a poll, we're wondering which five data sources are a top priority for you to integrate and keep integrated. So, you should all be seeing this poll now, and I'll be showing the results in just a moment when we get some responses.

While you're filling out your answers, let me just say that data consolidation comes with its own special challenges. When Asana first began building their data infrastructure, they did it using Python scripts and MySQL, pretty straightforward. And if you're just starting out, that can pretty much work for you too, but you're probably going to outgrow it eventually. I'm going to say much more on that in a second, but first, let's take a look at the results. Looks like, wow, we got a lot of responses here. And they'll still keep filling out if you guys are still filling them in. So it looks like, yup, so the data that's in production databases is one of the top sources, then email marketing, with CRM a close third. Dillon, for people that are using Looker and using these sources as a data source, does this kind of fall in line with the type of data that you'd expect to see coming into a data warehouse?

Dillon: Yeah, yeah, absolutely. It's pretty funny, it's actually lining up in almost the exact order I would have thought. This is definitely very representative of what we see of our customers.

Shaun: Okay, yeah, great. And we see similar things for Pipeline as well: databases, and then you get CRMs hooked up, some email marketing data coming in. The one thing that's surprised me is that events isn't one of the top ones, like on-site events. And also, I don't want to skew the results at all by saying that. Okay, thanks everyone for filling out the poll. Okay, so moving on, in the Asana team's own words, here are some of the challenges they faced during consolidation.

They had doubts about data integrity due to a lack of monitoring and logging, there were questions around was this an actual insight or was it a bug that just makes us think it's an insight, and then of course, there were always urgent fires, when systems went down. And then, this is from Medi Markets. Braintree's team said, "Deletes are nearly impossible to keep track of, regarding database deletes and pumping that over to a data warehouse, you have to keep track of all the data that changed, batch updates are slow, and it's pretty difficult to know how long they're going to take."
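One common way to approach the delete-tracking problem is to compare primary keys between the source and the warehouse and soft-delete whatever has disappeared. The sketch below assumes hypothetical in-memory lists in place of real tables, and it is not the method any of the teams quoted above describe. A full key comparison like this also gets expensive as tables grow, which is part of why the problem is hard.

```python
def find_deleted_ids(source_ids, warehouse_ids):
    """Return ids still present in the warehouse but gone from the source."""
    return sorted(set(warehouse_ids) - set(source_ids))


def apply_soft_deletes(warehouse_rows, deleted_ids):
    """Flag deleted rows instead of dropping them, so history is preserved."""
    deleted = set(deleted_ids)
    return [{**row, "is_deleted": row["id"] in deleted} for row in warehouse_rows]


# Hypothetical state: row 3 was deleted upstream after the last sync.
source_ids = [1, 2, 4]
warehouse_rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]

deleted_ids = find_deleted_ids(source_ids, [row["id"] for row in warehouse_rows])
synced_rows = apply_soft_deletes(warehouse_rows, deleted_ids)
print(deleted_ids, synced_rows)
```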

So, a big part of my job, again, involves talking to people every day about their data infrastructure. These posts that we just discussed touched on some of the problems you can expect, but keep in mind, those people are the successful ones. I've been on calls with many a frustrated engineer throwing in the towel on their data infrastructure project after one year or more at the task. Data consolidation is hard, and here are the seven core challenges.

Connections: every API is a unique and special snowflake. You're pulling data from a bunch of SaaS products that you're already using, but they can change. And they can be different for every SaaS product you're trying to pull data from. I actually built our first Salesforce connector for our sales team here to get data into RJMetrics, and let me tell you, it's not easy to deal with all of the custom objects while making sure you're not maxing out your Salesforce API limits.

Accuracy: ordering data on a distributed system. What happens when data point A goes to server one, data point B goes to server two, and server two is faster than server one, and now the data points are out of order? You have to account for all of that on distributed systems. Latency: large object data stores like S3 and Redshift are optimized for batches and not streams, so if all of your infrastructure is streaming, you then have to figure out how to do batches into those other systems. It just adds another layer of complexity.
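The ordering and batching problems can be illustrated together with a small micro-batching sketch. It assumes each event carries a hypothetical `seq` number assigned at the source, and the `MicroBatcher` class is invented for this example; it does not describe how Pipeline or any specific product works. Sorting within a batch only repairs reordering inside that batch, so an event that arrives later than its whole batch would still need separate handling.

```python
class MicroBatcher:
    """Buffers streamed events and flushes them as ordered batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.flushed_batches = []  # stands in for bulk loads into a warehouse

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Restore source order using the sequence number on each event.
        batch = sorted(self.buffer, key=lambda event: event["seq"])
        self.flushed_batches.append(batch)
        self.buffer = []


batcher = MicroBatcher(batch_size=2)
for seq in [2, 1, 4, 3]:  # events arrive out of order from different servers
    batcher.add({"seq": seq, "payload": f"event-{seq}"})
batcher.flush()  # flush any remainder, for example at shutdown

ordered_seqs = [[e["seq"] for e in batch] for batch in batcher.flushed_batches]
print(ordered_seqs)
```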

Scale: I think that's one everybody kind of understands. Data will grow exponentially as your company grows. So if you build a data infrastructure today, it might not really work that well tomorrow. And then, flexibility: you're interacting with systems you can't control. That goes in line with the connections piece. You're pulling data that they've architected to run their product very well, and it might change; things just happen. And then another one is monitoring, and this is something that is sometimes the last thing to get built. But notifications for expired credentials and errors, notifications of disruptions, are actually really important.

So that when your CEO logs into their dashboard, and they see that revenue has flat-lined for today, it could just be that someone changed their password in Salesforce and needs to log in with Salesforce again, something as simple as that. Without knowing that that is the problem, it can cause a lot of frustration to narrow down the pain point. And then finally, maintenance, and that kind of wraps up all this. You have to justify the investment in ongoing maintenance and improvement. So, data infrastructure is not really a one-off project, it's an ongoing project for improvements, adding new connections, and committing to the investment in the resources that will be dedicated to that service.
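The flat-lined revenue scenario suggests a simple health check per data source: alert on a recorded sync error first, then on staleness. The sketch below is one hypothetical shape for that check; the function name, arguments, messages, and the timestamps used in the example are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def check_source_health(last_success_at, last_error, now, max_staleness):
    """Return a status string for one data source."""
    if last_error is not None:
        # A known failure (e.g. expired credentials) is the most useful signal.
        return f"ALERT: sync failing: {last_error}"
    if now - last_success_at > max_staleness:
        return "ALERT: no successful sync within the allowed window"
    return "OK"


now = datetime(2015, 11, 1, 12, 0)  # arbitrary fixed time for the example
one_hour = timedelta(hours=1)

healthy = check_source_health(now - timedelta(minutes=10), None, now, one_hour)
expired = check_source_health(now - timedelta(hours=5), "credentials expired", now, one_hour)
stale = check_source_health(now - timedelta(hours=5), None, now, one_hour)
print(healthy, expired, stale, sep="\n")
```

With a check like this running, a flat revenue line on the dashboard arrives alongside a message saying the Salesforce sync has been failing, rather than as a mystery.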

Early last month, we released a SaaS product designed to solve all those problems, called Pipeline. It takes data from any number [inaudible 00:15:22] data flows into data warehouse with super low latency. We're aggressively releasing new integrations each month. So if you need an integration and you don't see it here today, please let us know. Again, if you want to learn more about this, stick around at the end for a demo. So the next step in that process is data warehousing. Hands-down, the top pick for warehousing is Redshift. Among the companies that we looked at, Redshift was the most popular choice for an analytics warehouse. The most common reason: speed. People are seeing dramatic improvements in query time using Redshift.

Some said that queries that were taking hours, now only take a few seconds. Similarly, SeatGeek had a critical query that was taking 20 minutes and now takes half a minute in Redshift. So, here are the results of Airbnb tests, that show performance in both query time and cost. They were loading billions of rows of data. In Hive, it took about 28 minutes, in Redshift, 6 minutes. They had a query that was doing two joins with millions of rows of data, it was taking 182 seconds in Hive, and in Redshift, it takes only 8 seconds. And then, something to not discount here, is cost. Hive was a little bit more expensive, so $1.29 per hour per node, while Redshift was clocking in around 85 cents per hour per node.
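As a quick piece of arithmetic on the Airbnb figures just quoted, the ratios work out as follows. The numbers are the approximate ones from the talk, and the per-node prices are compared directly without accounting for cluster size, which the talk does not give.

```python
# Figures as quoted in the talk (approximate).
hive_load_minutes, redshift_load_minutes = 28, 6
hive_query_seconds, redshift_query_seconds = 182, 8
hive_cost_per_node_hour, redshift_cost_per_node_hour = 1.29, 0.85

load_speedup = hive_load_minutes / redshift_load_minutes
query_speedup = hive_query_seconds / redshift_query_seconds
cost_ratio = hive_cost_per_node_hour / redshift_cost_per_node_hour

print(f"load speedup:  {load_speedup:.2f}x")
print(f"query speedup: {query_speedup:.2f}x")
print(f"Hive cost per node-hour relative to Redshift: {cost_ratio:.2f}x")
```

So the load is a bit under 5 times faster, the two-join query is nearly 23 times faster, and Hive's quoted node-hour price is about one and a half times Redshift's.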

And here's some great research from Periscope, showing Redshift versus Postgres. And it shows similar performance gains. So, query time is much lower in Redshift versus RDS Postgres. And more research, this is from Diamond Stream, and it's showing how much better their internal dashboards perform when built on Redshift versus Microsoft SQL Server. Redshift takes less than half of the time that Microsoft SQL Server would take to load their dashboard. And I think this is really the final reason why Looker is such a big fan of Redshift and recommends it to its clients. Would you say that's true, Dillon?

Dillon: Yeah, yeah, absolutely, Shaun. Thank you. So, I'd actually like to elaborate on that a bit. So earlier I talked a bit about the structural differences between legacy architecture and modern architecture, the MySQLs versus the Redshifts. So now I will elaborate a bit on how that architecture can impact business intelligence and analytics workflows. So, this slide shows workflows with the legacy architecture I was describing earlier. Quick reminder: with legacy architecture, each department is working in silos, and they're all serviced by the central IT team, or the analyst team. And again, this is prohibitive to creating a truly data-driven culture for a few reasons.

First, as Shaun was mentioning, manual centralization is difficult, so it's extremely resource intensive for the central data team to not only centralize all the data, but then service all the data needs of their business users. So this creates a bottleneck in the analytics process. You'll see that the arrows on this slide are flowing away from the central data team, and that's for a specific reason. The data team will provide these predetermined metrics for various departments, then re-run and re-distribute those metrics periodically. Again, those metrics are typically overly broad or not very actionable. And if a user has further questions about the analysis, and that is often the case, they need to submit a request to the data team, who will take maybe a few days to turn it around.

This latency just fundamentally restricts the end users from making quick, informed business decisions based on their data. Plus, in most companies, there's typically a hierarchy to who receives this data, so the executive team can get all the data they want, while requests from sales reps, marketing managers, etc., are usually pushed to the back of the line. These people in the back of the line rarely have the ability to make strategic decisions based on the analysis that they're requesting.

Lastly, this model can often result in disparate reporting. So, if five different departments request the same metrics from five different DBAs, it's highly likely that those analysts will have differing ideas about the appropriate way to calculate a metric. So maybe when you're talking about something like gross margin, revenue minus cost, that's a little bit more straightforward, but when you get to more sophisticated things, things like affinity analysis for example, if I buy item X, what's the likelihood I'll buy item Y, there are a few different statistically defensible ways you could go about calculating that metric. And, in practice, it's very common for large organizations to have non-unified definitions around these metrics, which, I could tell you anecdotally, leads to a lot of headaches, chaos, and an inability to really make decisions based on the data.

So, one of the factors that contributes to these workflow issues is the last point I touched on, the difficulty of consistently defining metrics across the company. Part of this is because of the nature of SQL, which as everyone knows is the de facto language for querying databases. SQL can be very easy to write, but often difficult to read or audit. And again, if you have 5 different analysts with the same metric, you'll very likely get 10 different queries that the analysts would write. Some of which might yield the same result, some of which might not. In practice, this often results in data analysts recycling and slightly modifying old queries, without ever really understanding the inner workings of the query. So, they might just be adding another date, or another WHERE clause, or something like that.
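To make the affinity analysis point concrete, here is a minimal sketch of two standard ways to score "if I buy X, will I buy Y?" on the same data. Confidence and lift are both common association measures; the five baskets are invented for illustration, and the talk does not say which definitions any particular company used.

```python
# Hypothetical purchase baskets, invented for illustration.
baskets = [
    {"X", "Y"},
    {"X"},
    {"Y"},
    {"X", "Y"},
    {"Y"},
]

total = len(baskets)
with_x = sum(1 for basket in baskets if "X" in basket)
with_y = sum(1 for basket in baskets if "Y" in basket)
with_both = sum(1 for basket in baskets if {"X", "Y"} <= basket)

# Definition 1, confidence: share of X buyers who also bought Y.
confidence = with_both / with_x

# Definition 2, lift: confidence relative to how popular Y is overall.
lift = confidence / (with_y / total)

print(f"confidence(X -> Y) = {confidence:.3f}")
print(f"lift(X -> Y)       = {lift:.3f}")
```

On this data the first definition says about two thirds of X buyers also buy Y, which sounds like a strong affinity, while the second comes out below 1, meaning X buyers are slightly less likely than the average shopper to buy Y. Two analysts could each defend their number and still tell the business opposite stories.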

What this does, though, is it then jeopardizes the actual integrity of the data, if people don't really understand the inner workings of the query, and that makes it difficult to consistently interpret the results, or actually derive business value out of those insights. So, how do we solve for this? How do we solve for the issues of knowledge of one-off queries, but also the siloed data reporting? We create a data model as an intermediary. All definitions of metrics and data transformations are defined in one place, where all users can then access and understand them. So now, you don't need those 5 or 10 different analysts, you need maybe 1 or 2 who monitor the modeling layer. And you can then be confident all users are working off the same definitions and interpretations of the results. In this modeling layer, you can also take care of some of that heavy centralization load that you were talking about earlier, Shaun. So, you can take and link together data from different sources, things like Salesforce, Marketo, Zendesk, Stripe or Recurly, to get a really comprehensive view of your customer.
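The idea of a modeling layer as an intermediary can be sketched as a single registry of metric definitions that every query is generated from. This is a toy illustration only, not LookML or Looker's actual implementation; the `METRICS` registry, `build_query` helper, and `orders` table are all hypothetical names made up for this example.

```python
import sqlite3

# One shared place where each metric is defined exactly once (hypothetical).
METRICS = {
    "gross_margin": "SUM(revenue) - SUM(cost)",
    "order_count": "COUNT(*)",
}
ALLOWED_DIMENSIONS = {"region"}  # allow-list, since names are interpolated into SQL


def build_query(metric, dimension):
    """Generate SQL from the shared definitions instead of hand-writing it."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    expression = METRICS[metric]  # raises KeyError for an undefined metric
    return (
        f"SELECT {dimension}, {expression} AS {metric} "
        f"FROM orders GROUP BY {dimension} ORDER BY {dimension}"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, revenue REAL, cost REAL);
    INSERT INTO orders VALUES ('east', 100, 60), ('east', 50, 30), ('west', 200, 150);
""")

margin_by_region = conn.execute(build_query("gross_margin", "region")).fetchall()
print(margin_by_region)
```

Every user who asks for gross margin gets the same expression, and changing the definition means editing one entry rather than hunting down every analyst's copy of the query.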

And this allows us to maintain what is commonly referred to as data governance, which is a term you've probably heard. So, how does this modeling layer impact workflows? This slide depicts BI and analytics workflows with modern architecture. It creates a truly data-driven environment. All users have equal access to the data through the UI, so they don't need to know SQL. So now marketing, sales, finance, customer success, these teams that previously could not directly access the data, have the ability to explore their database in full detail. And since everyone is looking at the same numbers and the same reports, business users can collaborate and facilitate actually meaningful conversations based on these shared insights.

This allows business users to make really informed, strategic decisions on the fly, which results in tangible and significant competitive advantages. And, I think I would like to do a quick customer example that really highlights the competitive advantages that you can get here. So, Looker has one customer, Infectious Media, who offers digital advertising for a whole host of different companies. With Looker, their sales optimization team has the ability to see, in real time, how various advertising campaigns are performing across every website and every publisher. If a certain type of website is driving the most clicks or conversions, the optimization team can immediately determine why, and then redirect future campaign efforts toward those specific websites or those specific publishers, or perhaps new and similar ones.

In this digital advertising space, where advertisements sometimes only last maybe a week, a month, or so, the ability to consistently iterate on and refine the strategy of that campaign will result in tangible differences in top-line revenue or top-line sales. This represents one of the most significant competitive advantages a company can have in this space. And they really wouldn't be able to do this unless they have this centralized modeling layer and equal access to the data for all of their employees.

So, the next question is, what can you do to set up an infrastructure like this? Now that we understand the benefits, I'll explain how setting this kind of infrastructure up is actually easier than ever nowadays. And, I'll illustrate this using the example of RJ Pipeline and Looker. So, say you're a company that collects data from a number of various sources; here we have highlighted some third-party applications. Rather than needing to perform complex transformations like you would with legacy architecture, you can dump all of your data directly into a centralized location, using a middleware tool like RJ Pipeline. This completely centralizes all of your data and prepares it for analytics, with just a few clicks.

You don't need that engineering team or that heavy ETL load. Once the data is centralized, you can quickly add a tool with a modeling layer to help distribute the data to all of your end users. And again, the modeling layer is really key here. Working with a tool like Looker, for example, we actually have an offering called Looker Blocks, which is essentially pre-templated code for your modeling layer, for all sorts of third-party applications and common types of analysis across a number of various verticals. These blocks can be copied directly into your data model. So, now even the actual development of the data model to distribute data to all of your end users, the majority of that is actually taken care of for you already. And so the result of this is a company going from having siloed data in these several disparate applications, or databases, or whatever the case might be, with unequal access for all their users, to having data centralized in a modern database, like Redshift, with a full analytics suite on top of it that can be accessed by any user.

So, what would have taken, quite literally, probably months of intensive engineering effort (and as you were mentioning, Shaun, sometimes you've seen engineers do this for years), this entire infrastructure can be set up in maybe one, two, or three weeks, which is pretty astounding. That time to value from your data is something we've never really seen before in the data space. And it's only made possible by these new tools like RJ and like Looker.

Quickly, I just want to show a few sample screenshots of some of the analytical outputs that these types of blocks provide. Again, you're just plugging these blocks directly into the data that RJ has centralized for us. Here's a sales and marketing dashboard which links Marketo and Salesforce data. Here's another one with Salesforce, we have some event analytics from various event collectors, here's another quick sample screenshot. And again, as you mentioned Shaun, I'd be happy to elaborate on any of these specific points, or give a more full-length demo separately to anyone who's interested. But unless you have anything else to add, I think we can probably jump into the Q&A.

Shaun: Excellent, yeah, so for everybody playing along at home, thanks so much. That wraps up our presentation for today. We're going to kick off the Q&A. And just as a reminder, if you want to keep learning more about Looker and Pipeline, hang out for another five minutes after the Q&A, we're going to be demoing both products after the Q&A, and announcing the cupcake winner, which is probably the most important part of this whole thing.

So, question from Matt, "Is Looker or RJ looking to support the Azure SQL data warehouse that's currently in preview?" So, I can speak from the RJMetrics point of view, and then, Dillon, if you want to take it from there on the Looker side. So, our current go-to-market endpoint is Redshift for data warehousing. We're always going to be looking for more endpoint integrations, and it's really going to be market driven. So, when we were doing market research on developing this product, it really seemed like Redshift was becoming the de facto spot for all of your data. That's what we decided to support right out of the gate, but of course, we're [going to be supporting] other ones, but currently, as far as I know, no plans for any [inaudible 00:30:03] now.

Dillon: Cool. Thanks, Shaun. So yeah, from the Looker side, Looker actually works with any SQL compatible database, so we do work with Azure, we're actually formal partners with Microsoft, we even work with more modern SQL and Hadoop type interfaces, things like Spark, Hive, and then of course all of your classic SQL warehouses as well.

Shaun: Great. Another question, from Effie, is just asking us to clarify a little bit on the production database comment that we had in the poll. When we mentioned production databases, this is the type of data that lives in your product database. So, the database that is, that you're using for all of your users, your events that people are taking in your product, things like usually orders are stored in there, things like that. And there's a lot of rich data that's just being stored in your product database, that you're probably already running queries on now. And, by the way, it doesn't have to be your production product database that you connect to Pipeline or Looker, a lot of clients are spinning up slave databases, which are essentially clones of that production database, but won't affect your website if some intense query is being run on them.

Great, so another question is, "So what are the most interesting analyses you have seen from your customers?" This is for you, Dillon. What data sources do they usually compare?

Dillon: Oh, good one. So I think the really interesting one, and one that I spoke to a bit with our example, is when we've seen customers link several different SaaS applications together, to really get a comprehensive, or as you probably hear a lot of times, 360-degree view of the customer. You have some customers that are taking their Salesforce, excuse me, their Marketo, their Salesforce, their Zendesk, and then their Stripe or their Recurly, whatever payment system they're using. They'll link all of these different applications together in our modeling layer, or they'll link all the data from these applications together, and then it allows that user to really get a comprehensive view of all of their customers. So you can see from the second you first interacted with them, with Marketo, what campaign hit them, to who made the sale in Salesforce, to monitoring their health in Zendesk, to actually determining ROI from these payment systems. At any point along that customer's lifecycle with your company, you can determine things like, again, ROI, or where certain methods are working well, all sorts of that cool stuff. And it's a really wide world once you get into there, but that's one of the cooler ones I've seen.

Shaun: Great. Question from Uki [SP], "Why is the modern ETL method that you were describing earlier, Dillon, more interactive than the model that silos data?" And I just want to throw something in there real quick, and you might have a follow-up to this also. So, the legacy method that Dillon was describing is like taking data from data silos, doing transformations on them, and then putting them back into other silos. Like, where they start is usually very deep, but they're not very wide data sets, so you have a lot of data joined together.

And then the, let's call them data marts, that the transformation puts them into, are usually very wide, but they're not as deep as they need to be for the various teams that are accessing them. And to get them any deeper, you need to go back to the engineer and ask them to do a different type of transformation on them, or add something new into the transformation layer. That's why it's helpful to have all the data in one place, instead of having data marts, it's more like extract and load, and then transform, instead of extract, transform, and then load into various data marts. Dillon, do you have any follow ups on that?

Dillon: No, I think you actually described that really well. I think it's just what's important to note here as well is that those data silos existed previously, just because of the nature of databases. Previously, 10, 15 years ago, databases couldn't service the analytical needs of the companies quickly enough. It just took too long to return those queries. Now that you have these modern databases like Redshift, etc., these are quick enough to handle the needs of all the business users. Now it's much easier to centralize and serve it from one location. I hope that answered the question.

Shaun: Yeah, I think it did, but we'll find out later. Another question is, "How much of a time saver has this shift in infrastructure been to data teams?" On the Pipeline side, I mentioned earlier that I've had conversations with people whose projects have ranged from 6 months to 15 [inaudible 00:35:29], and either they're still working on it, or it's over budget, or both. And they're not happy with how it's going.

So, with Pipeline, you could literally get set up in a matter of hours, just connecting to databases and the various cloud services that you're already using. For those cloud services, you just click "Connect to Salesforce," and it'll connect over to Salesforce. Then, you just connect to a Redshift data warehouse, and we take care of pretty much everything else in piping the data over there. That's one thing that could take six months to a year and a half that is already done and automated for you with Pipeline. Then, for the BI analysis part, Dillon, do you want to take that?

Dillon: Yeah, absolutely. I think there's one fundamental theme here, and the time savings are really just going to grow exponentially as your company grows. But a good example is we'll go into a lot of organizations who maybe would need...just have a series of analysts that exist just to write SQL and respond to different data requests that users have across an organization. Once you have Looker, all of those different end users are then able to access all of that data without knowing SQL and without needing to go through that analyst.

So depending on how large your organization is, we've had some companies, or some customers, who have 500, or thousands, of different users who are all issuing queries against this database. And I doubt they would have previously allowed all those thousands of people to make consistent data requests of a centralized team, but you can imagine if they did, that would be days, weeks, and months of time to just respond to all the queries that are in that analyst's queue. So, it's really going to...the time savings are really going to accelerate as your data needs accelerate.

Shaun: Excellent. Another question for you, Dillon. For the modeling layer, they were wondering basically if it's a software tool they can buy off the shelf or if it has to be built. And I think they're wondering if there's other options out there, other than what you described.

Dillon: Good question. Sorry, I can't speak to general open source tools. I don't know that there are very many, I really doubt it. The modeling layer that I'm describing is something that is native within our product, Looker. So there's going to be some other BI tools out there, I'm sure, that also have modeling layers, and they're going to do it, I'm sure, very differently than we would. But the modeling layer that we have, LookML, is really just an abstraction of SQL, where you define all this business logic and all of the relationships between the components of your database. You just define those one time in this modeling layer, and then that tells Looker how to translate our UI into the appropriate queries to execute against your database. I've never seen anyone who really understands SQL have any trouble picking up LookML.

Shaun: Okay, great. Another question about Pipeline, it looks like there's a couple of these around this. For Pipeline, what if we need to integrate with more services that you don't have connectors for, currently? What other services are we working on now? I can tell you again, we're aggressively building out new data connections. We're going to be rolling them out on a monthly basis. I can't announce any now, but stay tuned for more information on that. And we're also working on publicizing our import API, where you'll be able to just push raw JSON data to that, and it will end up in Redshift as well, in a structured format.
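
[Illustration, not part of the webinar: the speakers do not say how the import API turns raw JSON into a structured format. The sketch below shows one common approach, flattening nested objects into column-like keys, purely as an assumption. The `flatten` helper, the `__` separator, and the sample payload are hypothetical and are not the RJMetrics API.]

```python
import json


def flatten(record, prefix=""):
    """Turn nested JSON objects into a flat dict of column name -> value."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects become extra columns, e.g. user.name -> user__name.
            flat.update(flatten(value, prefix=f"{column}__"))
        else:
            flat[column] = value
    return flat


# Hypothetical payload an engineer might push.
payload = json.loads('{"id": 7, "user": {"name": "Ann", "plan": "pro"}, "total": 19.5}')
row = flatten(payload)
print(row)
```

[A real loader would also need to decide how to handle arrays and changing schemas, which this sketch leaves untouched.]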

Two sides to the same coin, right? If you're engineers and you want to push data to RJMetrics, we're going live with that soon. And then also for business users that just want to click a button and get Facebook data over there, we're going to be rolling out more of those types of integrations as well. Okay, great. Let's see here. Looks like we're getting a lot more questions around just how the products work, so it's probably a great time for us to go into the demo. Before I do that, I do want to announce the cupcake winner. And that cupcake winner is Matt Rollins. So, congratulations, Matt, you are the envy of all of us, including myself and Dillon. I've seen those cupcakes before, and they always look really good. Congratulations, Matt. Just reach out to claim your prize and they'll make sure you get them, I promise.

Now, we're going to go into the demo portion, and just kind of quickly run through and show you guys how the products work from a high level, and then hopefully whet your appetite for a more in-depth demo with our team members after this. Let me see if I can share my screen here.

Great, so you guys should be able to see my screen now. This is RJMetrics Pipeline. It's pretty straightforward: right now you can see the current status, all systems are go. We've got green on integrations and green on our data warehouse. Again, with the monitoring piece, if there were any issues with either one of these, you'd see either an orange or a red, depending on which one's having issues, so you can dig in deeper. And then, pretty importantly, you can see the rows that have been replicated this month from the various services we have connected.

Speaking of integrations, these are the ones we currently have listed that are out of beta. MySQL was our go-to-market for database replicators. We're going to be, again, expanding on replicators in the next couple of months, and weeks. And then, surrounding that are a bunch of SaaS products and SaaS tools that you guys are probably using, and we're going to again be expanding on that, so there's things like advertising channels, customer support centers, e-commerce shopping carts, payment providers, things like that. So, all this can flow directly into RJMetrics at the click of a button.

Once you have those data sets connected, let's take this Vandalay database for example. Again, some more information around stats, around replication, and making sure that all systems are go, we can go into the database and choose which tables we want to sync over to our data warehouse. In your databases, there's a ton of information, and some of that information might not be relevant for analysis. You still want to choose what data you want to sync over into your data warehouse. So let's say for instance, I've got this Innotech [SP] [leads], I want to pull that over. We have a couple different replication methods, here. Incremental replication is recommended, and this allows Pipeline to just check for changes, or additions to this table, versus doing a full table replication or full refresh of that table every time.
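
[Illustration, not part of the webinar: a minimal sketch of incremental replication as opposed to a full table refresh. It assumes the source table has an `updated_at` column that increases on every change and uses in-memory SQLite for both sides. The table, columns, and `incremental_sync` helper are hypothetical and do not describe how Pipeline is implemented.]

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical leads table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leads (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    return conn


def incremental_sync(source, target, last_seen):
    """Copy only rows changed since last_seen; return the new high-water mark."""
    changed = source.execute(
        "SELECT id, name, updated_at FROM leads WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # INSERT OR REPLACE upserts by primary key, so edited rows overwrite old copies.
    target.executemany("INSERT OR REPLACE INTO leads VALUES (?, ?, ?)", changed)
    target.commit()
    return changed[-1][2] if changed else last_seen


source, target = make_db(), make_db()
source.executemany(
    "INSERT INTO leads VALUES (?, ?, ?)", [(1, "Ann", 100), (2, "Bob", 110)]
)
mark = incremental_sync(source, target, last_seen=0)     # first run copies both rows
source.execute("UPDATE leads SET name = 'Bobby', updated_at = 120 WHERE id = 2")
mark = incremental_sync(source, target, last_seen=mark)  # second run copies only row 2
```

[A full refresh would instead re-copy every row on every run. Note that this simple high-water-mark approach does not notice deleted rows.]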

I'm going to choose that here, and now we'll start syncing that data over to Redshift. And again, it's pretty straightforward to connect Redshift, you just go into the warehouse settings and fill in the required fields for Redshift to connect. We have one, an IP address you can whitelist in your VPC, and then once that's authorized, we'll start syncing the data over. And you can start seeing the data over there in minutes, like half an hour or so, you'll start seeing data arrive in Redshift. So, that's a pretty quick demo, and if you want to see more or have any questions you can reach out to me after this call to go into this a little bit more in depth.

And now, Dillon, do you want to show us what you can do with all that data once it is in a centralized data warehouse?

Dillon: Yeah, absolutely. This will be probably the quickest demo I've ever given, these are usually about a half hour, so I'm going to try to burn through this as quickly as I can. But again, more than happy to give a more thorough demo to anyone who wants afterwards. Can you see my screen all right?

Shaun: Yup.

Dillon: Awesome. Okay, so, we're going to start with a dashboard with Looker. We're going to start by looking at some fictitious Salesforce data here. What we've done is gone into our underlying modeling layer here, we have all this data, our Salesforce data, centralized into a Redshift instance. What we've done is gone into our modeling layer, defined all of the relationships between the various components of our database, and described all the business logic, and then created a dashboard with some high level KPIs. With Looker, we typically like to start with a dashboard, because we consider a dashboard to be the entry point for data exploration, rather than the end results. What I mean by that is dashboards are often great for allowing you to view things like certain higher level trends or patterns.

They often don't change anyone's day-to-day decision making process. What does change that day-to-day decision making process is when users can drill down into, say, any value, any entire visualization, or any, say, peak on a chart, to figure out what is really driving that trend, or driving that pattern, or anomaly that they're noticing. The dashboards are also very interactive, so we have filters, and you can render them on your phone, your tablet, whatever browser you're using, all that sort of cool stuff. But say we actually do find something we want to investigate further. Say we're looking at our won opportunities by business segment over time, and we want to find out more information about this.

Sounds like small business has really been driving much of our sales, so let's see what sales reps are involved in those sales. Clicking on a visualization or any single data point is going to bring up what's called our explore view. And the explore view is really where all the cutting and slicing of your data is going to get done. So you'll notice on the left hand side here, we have a number of various fields grouped into two categories, dimensions and measures. Dimensions would be entities, measures would be aggregates over those entities. So, this is how users can just drag and drop different dimensions from our UI. These are all either columns that are pre-existing in our database, or new columns that we've created, or new measures that we've created, just within our modeling layer. To show you how this works really quickly, let's say I want to see who our sales reps are that are making these sales over time.

I could, say, remove our business segment, run this, and this will just show us our sales over time. Now, if I wanted to look at this by our sales rep, I can go look at our sales rep name. I'll go ahead and pivot on that. Hit run, and to show you what's going on underneath here, Looker's actually dynamically producing SQL, based on the different dimensions and measures that we're selecting, to execute the appropriate and optimized query against your database. So you'll see here, say I get rid of this sales rep name, get rid of that, you'll see the SQL code will update. And again, that's because we've defined all of these underlying dimensions and measures just one time in our modeling layer, and Looker forever understands how to translate them to the appropriate SQL code. And of course, we support visualizations as well, so now any user can go in here and cut and slice information, look at any component of their database, and run and perform self-service analytics on it.
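
[Illustration, not part of the webinar: a toy sketch of the idea that dimensions and measures are defined once and then translated into SQL on demand. This is not LookML and not Looker's actual query generator. The `MODEL` dictionary, the `build_query` function, and every table and column name are hypothetical.]

```python
# Hypothetical model: each field name maps to a SQL expression, defined once.
MODEL = {
    "table": "opportunities",
    "dimensions": {
        "close_month": "DATE_TRUNC('month', closed_at)",
        "sales_rep": "sales_rep_name",
        "segment": "business_segment",
    },
    "measures": {
        "total_amount": "SUM(amount)",
        "deal_count": "COUNT(*)",
    },
}


def build_query(dimensions, measures, model=MODEL):
    """Translate selected field names into a single GROUP BY query."""
    select = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {model['table']}"
    if dimensions:
        # Group by the positions of the selected dimensions (1-based).
        group_by = ", ".join(str(i + 1) for i in range(len(dimensions)))
        sql += f" GROUP BY {group_by}"
    return sql


# Selecting sales rep adds a column and a grouping; removing it regenerates the SQL.
by_rep = build_query(["close_month", "sales_rep"], ["total_amount"])
by_month = build_query(["close_month"], ["total_amount"])
print(by_rep)
print(by_month)
```

[Adding or removing a field in the selection changes only the arguments to `build_query`, which loosely mirrors how the generated SQL updates as fields are added and removed in the demo.]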

And once they have some analytics that they're happy with, or some metrics that they want to share, and Looker's web-based, so it's really easy to save, share, send that information, add this to dashboards, all that sort of cool stuff. You also can download results to various formats, or you can schedule this to get reported out on a periodic basis or based on trigger values. So say a certain metric didn't hit by the end of the month, then maybe we want to send an alert to our whole sales team to continue to push sales, or whatever the case is. It's a very interactive tool, and ultimately really provides technical and non-technical users the ability to perform completely self-service analytics. So, I'll stop there because if you let me go, I'm sure I'll keep going on for another half hour. So I'll stop it. Again, I'm more than happy to elaborate afterwards.

Shaun: Excellent. Great. That pretty much concludes our presentation. If you want to continue this conversation with either Dillon or me offline, please let us know. Let me actually pull up this slide here. Please let us know here, and we'll have someone follow up with you today with the next step for getting started with either product. Thank you, Dillon, so much for joining us, this was a blast, hope we can do it again soon. For Dillon and myself, I'm Shaun McAvinney, thank you all for joining.