The data scientist is another persona that takes advantage of the simplicity of accessing the data lake via Qubole. In particular, the data scientist will likely leverage the notebook feature. I’ve got a number of notebooks here, and actually, before I jump into it, there are two different ways you can use notebooks in Qubole. There are the original Qubole notebooks, which are based on Apache Zeppelin, and we now support Jupyter integration as well, with Jupyter notebooks out of the box. I can show you both of those, but to carry along the same ecommerce insights story we’ve been working with, let’s say the analyst has done the analysis, and now I’ll show you how a data scientist might look at this.
If I’m a data scientist, I’m using notebooks. I’ve got some paragraphs here, and I can rerun them to show you that they’re basically doing the same thing. What’s really important to a data scientist is not just having access to a notebook interface; the reason that matters is that it gives them a number of things.
One of those things is the ability to leverage a mix of interfaces or languages. In this case, I’m using PySpark, which is Python. Within my Python I’ve got SQL embedded, and that allows me to run queries against the data sets and explore the data. Again, this is the same example; I think this is the top ten products by quantity ordered. I might also look at raw data. This particular table, the access logs table, holds log data, so it’s a little less structured. I can see here I’ve got a query that gives me the ability to break this down by views. I might parse that URL and pick out the actual product to make it more presentable, but I’ve cut some corners here for this demo. Here’s an example where I’m using Scala, right? So I had Python, I had SQL, and I’ve got some Scala in here.
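To give a flavor of what such a notebook paragraph looks like, here is a minimal PySpark sketch of the “SQL embedded in Python” pattern described above. The table and column names (an `orders` table with `product_name` and `quantity`) are hypothetical stand-ins for the demo’s ecommerce data set, not the actual schema used in the video.

```python
from pyspark.sql import SparkSession

# In a Qubole (Zeppelin or Jupyter) notebook a SparkSession is typically
# already provided; getOrCreate() reuses it if so.
spark = SparkSession.builder.appName("ecommerce-insights").getOrCreate()

# SQL embedded inside Python: top ten products by quantity ordered.
# Hypothetical table/column names for illustration only.
top_products = spark.sql("""
    SELECT product_name,
           SUM(quantity) AS total_quantity
    FROM   orders
    GROUP  BY product_name
    ORDER  BY total_quantity DESC
    LIMIT  10
""")

top_products.show()
```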
What I’m doing with this is using the machine learning library FPGrowth, which is built into Spark. This is a frequent pattern mining algorithm. In this particular paragraph I’m building a model and actually running it against my data set. This frequent pattern model then allows me to do things like market basket analysis; for example, I can ask what the common combinations of products bought together are. So I’ll let this particular model run. It takes about 30 to 40 seconds to execute, a bit faster this time. I can take the resulting data frame, the object where I collected all this information, and start to sort it by the number of times products are bought together. As you can see, I’ve got two-product combinations, and then I’ve got another one that shows three-product combinations.
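The demo paragraph uses Scala, but the same FPGrowth model is available from PySpark, so here is a comparable sketch in Python for consistency with the earlier example. The `order_items` table, its columns, and the support/confidence thresholds are all hypothetical and only illustrate the market basket analysis flow described above.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("market-basket").getOrCreate()

# Hypothetical order-lines table: one row per (order_id, product_name).
# FPGrowth expects one array-of-items column, so collect each order's
# products into a "basket".
baskets = (
    spark.table("order_items")
         .groupBy("order_id")
         .agg(F.collect_set("product_name").alias("items"))
)

# Fit the frequent-pattern-growth model; thresholds are illustrative.
fp = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.2)
model = fp.fit(baskets)

# Frequent product combinations (two-item, three-item, ...),
# sorted by how many times they were bought together.
model.freqItemsets.orderBy(F.desc("freq")).show(truncate=False)

# Association rules, e.g. "customers who bought X also bought Y".
model.associationRules.orderBy(F.desc("confidence")).show(truncate=False)
```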
Again, the idea is really to show you the different types of things I can use Qubole for as a data scientist. Not only do I have access to SQL, I have access to Spark using Scala as well as Python, and I can do machine learning, as you’ve seen here. Finally, the other thing I might do is publish some of this work so that a business user can simply run it and view what’s happening via a dashboard. Qubole allows me to take this notebook and create a dashboard out of it; I’ve actually got one here. The dashboard, as you can see, will effectively hide all of the code, so that a business user can execute those paragraphs without having to worry about the code. Again, this dashboard feature can be published as an output of the data scientist’s work.