094: Machine Learning with Katharine Beaumont

Hosted by Charles Lowell

January 25th, 2018.

Katharine Beaumont: @katharinecodes

Show Notes:

In this episode, we hit the topic of machine learning from a 101 perspective: what it is, why it is important for us to know about it, and what it can be used for.

Transcript:

CHARLES: Hello everybody and welcome to The Frontside Podcast, Episode 94. My name is Charles Lowell, a developer here at The Frontside and your podcast host-in-training. Today I’m going to be flying it alone, but that’s okay because we have a fantastic guest who’s going to talk about a subject that I’ve been dying to learn about. But you know, given the number of things in the world, I haven’t had a chance to get around to it. But with us today is Katharine Beaumont who is a machine learning consultant. And she’s going to talk to us, not surprisingly, about machine learning. So welcome, Katharine.

KATHARINE: Hello. Thank you very much for having me.

CHARLES: No, no, it’s our pleasure. So, I guess my first question is, because I’m very much approaching this from first principles here, is what is machine learning as a discipline and how does it fit into the greater picture of technology?

KATHARINE: Okay. Well, if you think about artificial intelligence which is one of those slightly undefinable fields because it encompasses so much, so it encompasses elements of robotics, linguistics, math, probability, philosophy, it has six main elements. So, a really basic definition of machine learning is getting, and this comes from Arthur Samuel in 1959, it’s about getting computers to learn without being explicitly programmed. And that’s hugely paraphrasing. But machine learning is an element that sits under the wider discipline of artificial intelligence. Artificial intelligence is one of those tricky to define fields because people have different opinions about what it is. And obviously philosophers can’t agree what intelligence is, which makes it slightly complicated.

But artificial intelligence as a broad brush is a discipline that borrows from philosophy, math, probability, statistics, linguistics, robotics, and spawned subfields like natural language processing, knowledge representation, automated reasoning, computer vision, robotics, and machine learning. Machine learning is, in a sense, the mathematical component of artificial intelligence in that from a basic point of view, even though you’re looking at it from the perspective of computer science, you’re utilizing algorithms that a lot of mathematicians will say, “Look, we’ve been doing this for years. And you’ve just stolen that from us,” that try and find patterns in data. And that pattern could be as basic as mapping, say, the square footage of a house to the price that it will sell at and making a prediction based on that for future examples, or it could be looking for patterns in images.

CHARLES: Okay. You mentioned something that I love to do. I love stealing ideas from other disciplines. It feels great.

KATHARINE: Who doesn’t?

CHARLES: Yeah. It’s like free stuff. And the best part of ideas is the person who had it still has it after you’ve lifted it off of them.

KATHARINE: Yeah. You just have to reference and then it’s not plagiarizing.

CHARLES: Yeah. So, how did you actually get into this?

KATHARINE: Well, a few years ago, I was desperately bored in my job.

CHARLES: So, what was that job that you were working on that was so desperately boring? You don’t have to name a company.

KATHARINE: Oh, I won’t name the company but I will – I have to make a confession now which links back to something that we were saying off recording earlier, which was that it was doing web development. So, I’m sorry. And that’s not to say that web development is boring at all. It’s just that I wasn’t particularly engaged, which is not a reflection on web development.

CHARLES: No, no, no. I actually came – I was doing, before I got into web development, I was actually doing backend stuff for years. That was all I did.

KATHARINE: Yeah, me too. I would have described myself as a server-side Java developer who then cross-trained into Ruby. And I thought I’d be doing exciting backend things in Ruby. But unfortunately, it was more, “We’d like you to move this component from this part of the page to this part of the page.” And I didn’t really connect with that. And I started to wonder if I even should be a developer.

CHARLES: Wow.

KATHARINE: Larger forces than myself were at work to try and push me into management or analysis. And as happens, I think, after a few years. So, I started doing, in my spare time, looking at a website (and I’m sure you’ve heard of it) Coursera.

CHARLES: Yeah.

KATHARINE: So, this is the birthplace of the massive online, I can’t remember what the second O is, MOOCs. Massive Online something learning. Maybe a Q in there. I’m not sure what. Do you know what the acronym is?

CHARLES: I actually don’t know.

KATHARINE: Well, MOOCs anyway. Massive online learning courses. And there was one offered by Andrew Ng from Stanford on machine learning. So, I took that and I just loved it. I really enjoyed it. And I really connected with the programming. I really enjoyed the programming. It was very fulfilling. So, it grew from that, really. And now, I’ve decided to go back to university. So, I’m a mature postgraduate student and I’m just currently weighing up my PhD options. So, whether to sacrifice four years for the greater good and the pursuit of knowledge or go back into employment. So, we’ll see. We’ll see. And I’m quite enjoying not being employed, I have to admit. Or being employed on a freelance basis. It’s wonderful.

CHARLES: Right, right, right. Now, a couple of things struck me when you were talking about – so obviously, you’re studying a lot of the mathematics behind it. And you said that machine learning involves a lot of the – it’s the mathematical component of artificial intelligence. But what strikes me is learning, to me, implies a lot of statefulness where you’re accumulating state. Whereas my experience with mathematics is usually you're solving equations. You start from some set of facts and whether it’s a dataset or some other thing, and you derive, boom, boom, boom, boom, boom, you get your answer. Whereas with learning, at least when I think about school learning, like spelling or, I don’t know, paleontology or something, you’re accumulating facts over a very long time. And the inferences that you make are not necessarily – they’re drawn from all of the sources that you got over all that period of time rather than some one set of facts that then you make this logical argument and poof, presto, you’ve got your answer. How does that square? I guess it’s just a little bit off from my experience with mathematics.

KATHARINE: So, I am being a bit reductionist. So probably, one way to explain it is that essentially, behind a lot of the machine learning algorithms, you’re inputting numbers. And that might be the percentage of red, green, blue in a pixel for example. Or it might be the diameter of a wheel, for example, if you’re looking at a component of a car. Or it might be a binary configuration if you want to input the configuration of a control panel, for example, and you’re looking for anomalies. And you’re running these numbers through an algorithm. And what you’re getting out is either a continuous value, if you’re looking at a problem with continuous data like house prices, or you’re getting a probabilistic output like 60% certain these pixels together make a cat, for example. So, I am simplifying by saying it’s math because what you're really doing is looking for patterns in data but a way to get a computer to understand it is to somehow input it as numbers, essentially, and to get numbers out of it.

CHARLES: Oh.

KATHARINE: Yeah, it’s more algorithms, really. And I shouldn’t have said that it was essentially math because I’m sure I’m going to get shouted at on the internet.

CHARLES: Well, I certainly don’t want to get you in trouble. But maybe that’s a point that we should shy away a little bit, the high theoretical stuff, and bring it back. If I’m excited, not even if I’m excited, why should I be excited about it? I’ve heard that it’s a hot topic. I’ve heard that a lot of people are excited about it. Is there a way that I as someone who has no specialization in this might actually be able to bring some of these techniques to bear on the problems that I’m working on? Perhaps even without understanding them first, like understanding how it works. What are some problems that I might be able to attack with these techniques?

KATHARINE: Yeah, absolutely. So actually, and one of the things about machine learning that I should say is don’t think, “Oh, it’s not for me. I’m rubbish at math. I don’t understand these concepts. I’m not willing to get my head around an algorithm,” because there are so many pre-configured APIs available from big companies like Google and Amazon and Microsoft and IBM, and I’m sure many, many more. And I’m not paid by any of them, I should say. So, you don’t need to understand the inner workings of an algorithm to use it.

So, one example is speech-to-text. So, if you imagine that you’re working on a website and you want to make it accessible, maybe you could have a component to your navigation bar that allows users to record their voice and say, “I want to navigate to the shopping cart,” for example. And machine learning would be behind that processing. So maybe, behind it you’d have an API, I’ve used a few of them before just to play with, where you make a call to, say, an IBM service and it returns you the text. And in your program you match on keywords like shopping cart and then change the menu bar for them. So, that’s one really simple way you could do it.
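A minimal sketch of the keyword-matching step described above, assuming the speech-to-text service has already returned plain text. The `ROUTES` table, its paths, and the `route_for` function are hypothetical names invented for illustration and are not part of any real API.

```python
# Hypothetical keyword-to-page table; the paths are invented for illustration.
ROUTES = {"shopping cart": "/cart", "checkout": "/checkout", "home": "/"}


def route_for(transcript):
    """Return the first page whose keyword appears in the transcribed text, else None."""
    text = transcript.lower()
    for keyword, path in ROUTES.items():
        if keyword in text:
            return path
    return None


print(route_for("I want to navigate to the shopping cart"))
```

The machine learning lives entirely inside the speech service; the site itself only needs plain string matching on whatever text comes back.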

Another more complicated way to do it is to implement something like a recommender system. So, say you have a website where you offer customers products of some description. And the most famous example of this is Amazon and Netflix. Amazon, the shopping site, rather than now the big, big corporation. And you see what other customers like you bought, or Netflix, what you might enjoy. And that’s based on taking your information, comparing your viewing habits to other people’s viewing habits, and then drawing some kind of correlation between the programs you watch and trying to find programs that other people have watched that you haven’t, that you might enjoy. That’s more complicated, to be honest.

CHARLES: But that is an example. Machine learning is what underlies all that.

KATHARINE: Absolutely. And at the heart of some recommender systems, the mathematics behind it is finding a way to quantify people’s preferences and measuring distances between them. But you don’t need to understand that to understand the basics of how a recommender system works.

CHARLES: Okay. And so, how does a recommender system work?

KATHARINE: So, imagine me, yourself, and Mandy each read four books. And we rate them. But I read four books, you read three of the same books, and one different one, and Mandy reads three of the same books as me and one different one, for example. So, we’ve got a little gap but we’re not really sure what the other person will think. And we know the genres of the books. And you can compare the genres and the ratings. So, you might rate sci-fi 6 out of 10 and romance 7 out of 10. And I might rate sci-fi and romance in equal ways. So then you might say, okay, there’s a similarity between our preferences. So for this book, that Charles read, Katharine might like it.

CHARLES: That makes so much sense.

KATHARINE: Yeah. And maybe Mandy only likes romance, only rates it 0.3 for example. So we think, “Okay, well Mandy might not be able to recommend a book to Katharine and Charles.”
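One way to sketch "quantifying preferences and measuring distances between them" is to treat each reader's genre ratings as a point and compare points. The ratings below are invented, and Euclidean distance is only one possible choice of measure, assumed here for simplicity.

```python
import math

# Hypothetical genre ratings out of 10, as (sci-fi, romance); the numbers are invented.
ratings = {
    "Charles": (6.0, 7.0),
    "Katharine": (6.0, 6.0),
    "Mandy": (1.0, 9.0),
}


def distance(a, b):
    """Euclidean distance between two preference vectors; smaller means more alike."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(person):
    """Return the other reader whose ratings are closest to `person`'s."""
    others = (name for name in ratings if name != person)
    return min(others, key=lambda name: distance(ratings[person], ratings[name]))


print(most_similar("Katharine"))
```

A book that the most similar reader rated highly, and that the person has not read yet, would then be a candidate recommendation.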

CHARLES: Right, I see. Implicit in this though is there’s this step of the actual learning, I guess. Or the actual teaching. How do you actually teach? Again, and this is kind of me trying to wrap my head around the concept, is I’ve got these set of facts and I’m inferring and I’m pattern-matching and I’m trying to draw conclusions with some certainty from this set of data. But is there this distinct actual teaching phase where you have to actually teach the computer and then it takes new facts and gives stuff? How does that work? How does it incorporate the different – I guess what I’m saying is, are there distinct phases? Or is it…

KATHARINE: Yes, and it’s not the same for every algorithm. So, I’m going to try and give you two examples of training. So, there’s something called online learning where as a new example comes in, for example it might be – I’ll just explain what a classifier is. So, this is a brief diversion. So, a classifier is a machine learning process where you’re trying to put information in and you’re trying to get discrete information out. So, discrete meaning like it’s a cat or it’s a dog or a weasel or a minion or something like that. Or, it’s cancer or it’s not cancer. Whereas continuous output might be the price of a car. So, in a classifier you’re trying to work out what type something is. So, a really good example is there’s a Google project where they’ve clustered artworks. So, they’ve taken lots of different artwork and their algorithms, which I won’t explain now, have determined, “This is a ballet dancer. So, we’re going to group all of these ballet dancer paintings together. This is a landscape, so we’re going to group all the landscapes together.” So, online learning, you might get a bit of information in, like a picture, and you will classify it and you add it to your existing information. Whereas another type of learning is you take all of the information you have, you train the algorithm, and then you make predictions. So, you either make predictions as you go along with online learning, or you do all of the work upfront. So, one algorithm – have you heard of decision trees?

CHARLES: No, I haven’t.

KATHARINE: So, do you ever read those rubbish teen magazines where you have a flowchart and it starts at the top like, “Do you like cats? Yes or no?”

CHARLES: Oh right, yeah.

KATHARINE: “Do you like dogs? Yes or no,” and it tells you what kind of a person you are or what makeup you should wear or something like that.

CHARLES: Right, right, right, yeah.

KATHARINE: Yeah, so decision trees are kind of like that. One example is you might get information about, it’s a famous toy dataset, information about passengers on the Titanic about gender and age. And we all know the techniques on the Titanic, women and children first. I’ve lost my use of normal English. I’m sorry.

CHARLES: Right. Maybe like a trope?

KATHARINE: Yeah. So, women and children first. So, you put this data in a decision tree. And what happens is at the beginning you have all of this data. So, person A is a male. They’re in their 50s. This is the type of ticket they had. And this is their income. I don’t think income is one of them, but just as an example. And then you have person B, person C. So, you might have a hundred people. And the decision tree algorithm goes, “Okay, if I just looked at one of these features like gender, would that differentiate the people the most?” So, it already has the answer as to whether or not they survived or did not survive. And it’s looking for the one feature that gives it the most information. And then it will split on that feature. So, you go from your thing at the top and the first question might be, “Were they male or female?” And then the decision tree will split down a level. And then your algorithm will go, “Alright, what’s the next feature?” Maybe the next feature is, “Were they under the age of 30?” for example. And it works down. And you end up with this sort of flowchart. And once it’s trained, you then get a new example you have, person X, and you basically just work your way through the decision tree to make the prediction of whether they survived or did not survive.
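The step of "looking for the one feature that gives it the most information" can be sketched with entropy and information gain, which is one common splitting criterion and is assumed here rather than taken from the conversation. The rows are invented toy data, not the real Titanic dataset, and the feature names are hypothetical.

```python
import math
from collections import Counter

# Invented toy rows, NOT the real Titanic data: (sex, under_30, survived).
rows = [
    ("female", True, True),
    ("female", False, True),
    ("female", True, True),
    ("male", True, False),
    ("male", False, False),
    ("male", True, True),
]
FEATURES = {"sex": 0, "under_30": 1}  # feature name -> column index


def entropy(labels):
    """Uncertainty in a list of labels, in bits (0 when every label is the same)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """How much splitting on `column` reduces uncertainty about survival."""
    remainder = 0.0
    for value in {r[column] for r in rows}:
        subset = [r[-1] for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder


def best_feature(rows):
    """The feature the tree would split on first."""
    return max(FEATURES, key=lambda name: information_gain(rows, FEATURES[name]))


print(best_feature(rows))
```

A full tree would repeat the same choice on each resulting subset until the groups are pure or no features remain.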

CHARLES: I see. And so, do you do it with some sort of certainty? Because there’s going to be variation, right? You’re going to have some people who are a poor fellow in his 70s who survived. Like, it’s not certain that he went down but there’s some probability at the end?

KATHARINE: Yes. Because you’re using it to make a prediction, there is always a probabilistic element. And it depends on the decision tree algorithm. So, there are some decision tree algorithms that really don’t work with contradictory data, for example.

CHARLES: I see.

KATHARINE: There’s an element of picking your algorithm. If you’re approaching a machine learning problem, you have three elements. The first one is choosing your feature and your algorithm. Then you have evaluating it, so you need a way of saying how good or bad the algorithm’s doing on your data, how accurate is it for example. And then you have optimizing it, which is the dark art of machine learning.

CHARLES: Right. That’s the rap across the knuckles. It’s coming up with wrong numbers.

KATHARINE: Yeah. That’s – oh no, this took two weeks to run and I need it to take 20 seconds. Or, this is only 60% accurate and I need it to be more accurate. And it’s easy, just as an example I’m just working on a course that I’m giving in a few weeks’ time. And I just took a dataset off a website called Kaggle, K-A-G-G-L-E, for wine quality. So, it’s got acidity, citric acid, residual sugar, chlorides. They’re the features. They're the components that you’re going to put in. And it has a quality score. So, what you’re trying to do is you’re trying to find a relationship between the features and the quality. And it’s very easy to get it to work. So, I now have it working. But I have it working with a really low accuracy. So, it took maybe five minutes to get it to work and it’s going to take me about half an hour to make it more accurate, and that’s the optimization element.

CHARLES: I see. How stable are these processes? Is it finicky and fragile so that if you get new types of features it just throws things way off? Or are there ways you could control for that?

KATHARINE: Well, that’s a particular type of problem and that all links in with the evaluation and optimization. And that’s to do with something called overfitting. So, decision trees is an algorithm notorious for overfitting, depending on the data, that is. So, overfitting is when you get the algorithm to perform really well on the training data. And then you feed it in a new example and it might grossly misclassify it because it hasn’t learned to generalize beyond the examples that you’ve given it.

CHARLES: I see. Okay. So, it’s just too concrete. It hasn’t recognized deep patterns. It’s only recognized something superficial.

KATHARINE: Yes. So typically, if you have a finite amount of data, you only train your algorithm on a certain percentage. And then you test it on the rest.
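A minimal sketch of holding data back, assuming a simple random split. The function name, the 25% test fraction, and the fixed seed are illustrative choices, not anything specified in the conversation.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples, then hold back the last `test_fraction` for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(20))
print(len(train), len(test))
```

The algorithm only ever sees `train`; how it scores on `test` is a rough check on whether it has learned to generalize or has overfitted.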

CHARLES: I see.

KATHARINE: So, you hold back some data. But back to our conversation earlier about when does the training happen? Another example is something called K-nearest neighbors. You could just imagine that means three nearest neighbors, for example. So, we’re trying to find who we’re most similar to in a room. So, it’s a room full of people. And all the people are standing next to already similar people, for example. So, you might have a room where marketing’s in one corner and the software developers are in another corner and the project managers are in another corner. And you go in and you’re looking for the three people who are the most similar to you and you’re going to go and stand in that group, for example. So, in that type of machine learning, the training is happening as each new sample comes in, rather than upfront. And there are drawbacks to both methods and there are positives to both methods. And really, in a horrible, unsexy way, it’s to do with the data. And that’s normally where most people switch off because it’s the most boring part when you’re talking about machine learning.
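The room analogy can be sketched as a three-nearest-neighbors vote. The positions, the two-dimensional "feature space", and the group labels below are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical people already in the room: (position in a 2-D feature space, group).
room = [
    ((1.0, 1.0), "marketing"),
    ((1.5, 0.5), "marketing"),
    ((0.5, 1.5), "marketing"),
    ((8.0, 8.0), "developers"),
    ((8.5, 7.5), "developers"),
    ((7.5, 8.5), "developers"),
    ((1.0, 8.0), "project managers"),
    ((1.5, 8.5), "project managers"),
]


def nearest_group(newcomer, k=3):
    """Find the k closest people and join whichever group most of them belong to."""
    by_distance = sorted(room, key=lambda person: math.dist(newcomer, person[0]))
    votes = Counter(group for _, group in by_distance[:k])
    return votes.most_common(1)[0][0]


print(nearest_group((7.0, 7.0)))
```

Nothing is computed ahead of time here: all the work happens when the newcomer walks in, which is the contrast with the upfront training of a decision tree.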

CHARLES: Is the data?

KATHARINE: Yeah. It’s this backlash from data scientists, from the golden age of data scientists where it was the hottest job on the internet to now everyone cringing going, “Oh, I really don’t want to deal with my data.” But with any machine learning problem, you can’t just go, “Okay. Here you go, Charles. Here’s a dataset. Learn something from it.” You need context. You need to understand it. You need to have an idea of what you’re looking for. So, you’re getting the machine to learn but you're using it as a tool to complement your knowledge, really. And you’re feeding in your knowledge to it. And part of that are the decisions that you make on the algorithms to use.

CHARLES: Okay.

KATHARINE: And what you’re looking for.

CHARLES: So, here’s something that I’m wondering is related, and again I have no idea – for some reason I always associate when people talk about neural nets as being related to teaching a computer something. Is that part of the discipline of machine learning? Or no?

KATHARINE: I would say it is. But it has its own cool and trendy title of Deep Learning. But it’s very much powerful machine learning. So, let’s go back to this. This is a classic example of machine learning. It’s probably the first example you’ll come across if you do any course. House prices. So, imagine that you have a piece of paper and you're going to draw one line at one side and that is the price of a house, and you’re going to draw a line at the bottom which is the size in square feet, and you’re going to plot examples that you have. And you might find that there’s a linear correlation between the two. So, you’ll draw a line and that line of best fit is the human equivalent of doing linear regression on a computer, for example, where you’re just trying to find a linear correlation between two things.

And a lot of the principles in linear regression, which is a very simple learning algorithm, are found in some examples of neural networks, so some basic examples of neural networks. But instead of having one input, the size in square feet, you might have 20. And you might be repeating that process of trying to find the line of best fit with different combinations of features in different places again and again. And it scales up in complexity very quickly. But it’s very similar basic principles. I’m hesitating to say ‘very similar’ because they’re notoriously more complicated.
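The line of best fit for the house price example can be sketched with the ordinary least-squares formulas for a single input. The data points are invented, and `fit_line` is a hypothetical helper name.

```python
# Invented example data: (size in square feet, price in thousands).
houses = [(1000, 200.0), (1500, 290.0), (2000, 410.0), (2500, 500.0)]


def fit_line(points):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(houses)
print(slope * 1800 + intercept)  # predicted price for an 1800 sq ft house
```

With one input the answer has a closed form; with 20 inputs and many layers, as in a neural network, the same idea has to be solved iteratively.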

CHARLES: Yeah. I guess I’m not really divining what exactly, what makes it – why is it called a neural net? What makes it special? It sounds like if I’m just comparing the regressions of house prices, I’m comparing those datasets over and over again, how is that different from just a loop? What are you getting out of it?

KATHARINE: Historically, neural networks come from a very simplified idea about how the brain works. So, in the early 20th century people had performed autopsies and divined the inner workings of kidneys and hearts and livers. And the brain was still a bit of a mystery. And then 2 men jointly won the Nobel Prize for Physiology, and I’m going to pronounce these names wrong, so I’m sorry. I think it’s Santiago Ramón y Cajal is one and Camillo Golgi is another. And they completely disagreed about the brain but they used a staining technique from Golgi to look, using silver nitrate, at the cells in the brain. And the idea of the neuron doctrine was borne out of that, that the simplest unit to look at the brain at in order to understand it is the level of the neuron, this cell in the brain.

And from there, several – well, everybody was a polymath really, back then, so I don’t want to say computer scientists. So, you had people like Frank Rosenblatt with perceptrons, Marvin Minsky and so many other people looking at a very simple idea which is that a neuron either fires or doesn’t. And then you’re linking boolean algebra to this cell. So, you’re saying it either fires or it doesn’t. And from that principle, people started drawing similarities between neurons and a basic function machine.

And when I say function machine, I mean imagine when you were in elementary school and you’re learning how adding up works. You might have a box with a plus on it and your teacher says, “I want you to put a three in a box and a four in a box and I want you to add them together. And what do you get out?” And obviously the answer is seven. And you can think about that little box with a plus as a function machine. So, now you could think of a little mathematical function machine where you put in some inputs and then something happens in the box. And then you’ll either get a one or a zero out of it.

CHARLES: And so, that’s like your neuron, is the little box?

KATHARINE: Yes.
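A minimal sketch of the "function machine" neuron: inputs go in, something happens in the box, and a one or a zero comes out. The weighted sum and the threshold are assumptions about what happens inside the box, in the spirit of an early perceptron, and the specific values are invented.

```python
def neuron(inputs, weights, threshold):
    """Fire (1) if the weighted sum of the inputs reaches the threshold, else stay quiet (0)."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# With these hypothetical weights and threshold, the box behaves like boolean AND.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, neuron((a, b), weights=(1, 1), threshold=2))
```

Lowering the threshold to 1 would turn the same box into boolean OR, which hints at why the link to boolean algebra was attractive.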

CHARLES: Okay.

KATHARINE: Yes. So since then, the neuron doctrine is pretty much contested. There are several other elements of the brain that compose thinking and the circuitry. So, any cognitive scientist listening to this will say, “That’s really not how the brain works.” You have to say it with a lot of disclaimers. But the whole idea of neural networks was borne out of this idea of thinking of a neuron as like a function machine.

CHARLES: Right. And also, it doesn’t actually discount the usefulness of neural networks. There are a lot of things where people didn’t find what they set out to find but what they found was useful.

KATHARINE: Absolutely. Yeah, and they are incredibly powerful, especially with multilayer networks which are deep networks or deep learning.

CHARLES: Okay. So, I didn’t want to derail you from your explanation. So, you’ve got these little function boxes and those are the kind of neurons inside the neural network?

KATHARINE: Yes. So, what happens is you might have a layer of 10 of them and you might have another layer after that of another 10. And each of them are connected in a simple neural network. And then you might have an output layer of five because you’re looking to classify, I don’t know, an apple into five different types of apple, for example.

CHARLES: Now, when you’re talking about a layer, you’re talking about, I’ve got the outputs of one layer of the network are the inputs to the next layer.

KATHARINE: Yes. And in different neural network architectures they’ll be connected differently. But in a simple neural network, you can assume that every neuron is connected to every neuron in the next layer. So, if you have two 10-layer, two layers each with 10 neurons in, the top neuron in one layer is going to have 10 connections going out of it into the next layer.

CHARLES: Oh really? Wow, that’s interesting.

KATHARINE: Yes. And each connection has its own sort of configuration.

CHARLES: So, you’re like cross-wiring all the – okay. Wow, that’s kind of…

KATHARINE: And yeah, that’s where the complexity comes in.

CHARLES: Yeah. I was thinking it was like a simple exponential fan-out. But it’s even more complicated, the number of combinations you can get.

KATHARINE: Yes. But in theory, it’s very simple because each neuron is like this function machine with inputs coming in and something going out. It just might go out to several different locations. But it scales up in complexity very quickly. So say we’re classifying apples. So, we have five different attributes of an apple like its color, its texture, its weight, acidity. I don’t know how else you would measure an apple. How shiny it is, for example. And we’re putting all of that information in. And each one of those things that I’ve just listed is a feature. So, we might have an input per feature and we’ll link them all up to the neurons in a layer and we’ll move that information onto the next layer. And the really important thing is what’s happening on those connections between them. Because on the connections you have weights, which is a way of changing the input from one neuron to another. So, the weight might be like 0.5 for example. So, it squashes the input. Or it might magnify it. And then you get the output.

Now, once you have the output you can compare the output with what you know the right answer is. And then you have this idea of an error. So, you might be like, “You got this so wrong. It’s not a Braeburn apple at all. It’s actually a Granny Smith apple.” And what you do is with each example that you train in, you use your information about the error to train the network. Because what you’re trying to do is get the error to be as small as possible. And one of the techniques of that is called back propagation. And it’s notoriously difficult to understand because it involves partial derivatives and a large element of calculus. But essentially, what you’re doing is comparing the right answer with the answer that the network gave it and asking it to go back and change it, change those connection weights.
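A rough sketch of the forward half of this process: features flow through fully connected layers, each connection carries a weight, and the output is compared with the right answer to get an error. The feature values, the weights, the layer sizes, the sigmoid squashing function, and the squared-error measure are all assumptions made for illustration. Back propagation itself, the step that adjusts the weights, is not shown.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0..1 (an assumed choice of activation)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights):
    """Fully connected layer: weights[j][i] sits on the connection from input i to neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


# Hypothetical apple features scaled to 0..1: color, texture, weight, acidity, shininess.
apple = [0.9, 0.4, 0.5, 0.7, 0.8]
hidden_weights = [[0.5, -0.2, 0.1, 0.4, 0.3], [-0.3, 0.8, 0.2, -0.5, 0.1]]  # 2 neurons
output_weights = [[1.0, -1.0], [-1.0, 1.0]]  # 2 apple types, to keep the sketch small

outputs = layer(layer(apple, hidden_weights), output_weights)
target = [0.0, 1.0]  # the right answer, e.g. "it is the second type of apple"
error = sum((t - o) ** 2 for t, o in zip(target, outputs))
print(outputs, error)
```

Training would repeat this for every example and nudge each weight in whatever direction makes `error` smaller.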

CHARLES: And so, do you make them fluctuate at random or is there some – is there a method to the madness of changing the weights?

KATHARINE: There is a method to the madness and it’s called back propagation. And the reason I linked linear regression in earlier, so our really simple map of house size in square feet, and the price, is because it uses a similar technique called gradient descent which is an algorithm for looking at the error and changing the weight, those numbers on the connections, to try and get it to a minimal point. So, if we go back to our house price problem, I just want you to imagine in your mind that we’ve got this one axis going up which is the price, and one going across which is the size in square feet. And you’ve got a line drawn, a diagonal line. If you just imagine now in your mind moving that line down till it’s completely flat at the bottom, and then moving it up so it’s vertical, so it’s aligned with either axis in every point in between, what you could do is you could take the error on each of those lines.

So, if we imagine we have these two axes, we have the price of a house on one side and we have the size in square feet on the other. And we’re going to draw a line from the top axis and we’re going to sweep it down and draw a line at each point as it goes down until it’s aligned with the bottom axis. And at each point we draw that line, we measure the error. So, what we’d do for that is all of the little points, all of the example data, we’d measure the difference between them and the line and we’re going to, say, add them all up. So, what you’d end up with is you’d end up with a graph mapping the error against the gradient at all of those different points. And it would look like a bowl. You’d have a lowest point. You’d have a point for some gradient where the error is the lowest.
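The sweep described above can be sketched by trying many gradients and recording the total error for each. The data points are invented, the line is assumed to pass through the origin so that only its gradient varies, and the differences are squared before adding them up, which is an assumed choice of error measure.

```python
# Invented points lying near price = 2 * size (units are arbitrary).
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def total_error(gradient):
    """Sum of squared vertical gaps between each point and the line price = gradient * size."""
    return sum((y - gradient * x) ** 2 for x, y in points)


gradients = [g / 10 for g in range(0, 41)]  # sweep the line: 0.0, 0.1, ..., 4.0
errors = [total_error(g) for g in gradients]
best = gradients[errors.index(min(errors))]
print(best)
```

Plotting `errors` against `gradients` would give the bowl shape: high at both ends of the sweep, with a single lowest point in between.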

CHARLES: Right, yeah. Okay. I’m seeing it. I think I’m seeing it. So, you want to take that error function and you want to, what is it? Now this is – boy, I’m going back to high school math. You would take the derivative and find out where the tangent is and that’s your min point? That’s the root of the equation and that’s the point where your error is lowest?

KATHARINE: Yeah, exactly. Or there’s an algorithm called gradient descent that does it automatically by taking little steps. So, it looks at the tangent of the gradient at a point and says, “If I move the gradient down, is the error going to decrease? And if so, move in that direction.” So automatically, it tries to take steps to get to the bottom. An optimization problem with that is that you can configure the step size. So, you could take tiny, tiny steps and take forever to get to the bottom or you could take massive steps and completely miss the bottom. So, you can imagine it like walking down a hill. If you’re a minion then it will take a really long time because you’re tiny. And if you’re a giant, you might never get to the valley.
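A minimal sketch of gradient descent on a one-parameter line, showing how the step size changes the outcome. The data, the squared-error measure, the line through the origin, and the three step sizes are all assumptions chosen for illustration.

```python
# Invented points lying near price = 2 * size (units are arbitrary).
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def slope_of_error(gradient):
    """Derivative of the summed squared error with respect to the line's gradient."""
    return sum(-2 * x * (y - gradient * x) for x, y in points)


def descend(step_size, steps=100, start=0.0):
    """Repeatedly step downhill on the error curve and return the final gradient."""
    gradient = start
    for _ in range(steps):
        gradient -= step_size * slope_of_error(gradient)
    return gradient


print(descend(0.01))    # a sensible step size settles near the bottom of the bowl
print(descend(0.0001))  # "minion" steps: still far from the bottom after 100 steps
print(descend(0.05))    # "giant" steps: overshoots further each time and never settles
```

Picking the step size is part of the optimization work mentioned earlier: too small wastes time, too large never arrives.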

CHARLES: Right. You might just leap right across the chasm.

KATHARINE: Yeah, just miss it completely. So, that’s linear regression where we’re looking. And that’s really, even though we have a two-dimensional graph that’s a one-dimensional problem because we just have the one input feature. Now, when we’re looking at neural networks and we’re looking at gradient descent in neural networks, each one of those connections is something that we’re trying to configure to reduce the error. So suddenly, you have maybe a hundred-dimension landscape and you’re trying to get to the bottom of a hill. And there might be several local optima and one deepest valley, but you might have lots of other valleys that you could get stuck in. So, it becomes a very difficult problem. Does that make sense?

CHARLES: Yes. No, that does make sense. I’m just trying to let it sink in.

KATHARINE: It’s completely impossible to visualize a hundred dimensions, yeah.

CHARLES: I actually had to sit back and kind of close my eyes and stare up at the ceiling.

KATHARINE: I think the trick is not to think about it. I heard someone say, there’s another fantastic course on Coursera by a famous computer scientist studying neural networks called Geoffrey Hinton. And his advice in one of the videos on visualizing these multidimensional landscapes, say it’s 15 dimensions, is to close your eyes and shout, “15!” And that hasn’t worked for me, but I’m sure it’s worked for some people.

CHARLES: But it certainly probably makes you feel better.

KATHARINE: Yeah. I think it’s one of those things that’s just beyond comprehension. But we can just quietly accept it.

CHARLES: Right. It’s just – yeah, what’s nice about I guess math is you just don’t have to understand it. I mean, you do. You just understand that there really is no mapping to our physical experience. And that’s okay. And just let that go. It’s just like, this is just some…

KATHARINE: Yeah.

CHARLES: We had some numbers that existed in the domain of understanding, that we can understand. And there are just some rules that we follow and if you look at the intermediate steps, well the points don’t really exist in that domain of physical experience and understanding. And that’s okay. We just accept and let it go. We just hope that at some point, we can translate that model back into the domain of ‘we can understand it’.

KATHARINE: Oh, completely. There’s a lot of faith.

CHARLES: Yeah.

KATHARINE: And also, I remember coming into machine learning and thinking, “This is like magic. It’s amazing.” And then you study it a bit and you’re like, “This is so easy. This is just basic – this is just functions. These are just glorified function machines.” And then you look into it some more and you’re like, “Nope. It’s definitely magic.”

CHARLES: Yeah. It’s a phase of, every time you come up against the wall, right? And then you realize, “Oh no, it’s actually something that I can close my mind over,” until you come to the next hurdle of magic.

KATHARINE: Yeah. You think you’ve got a grasp on things and you think, “I know the landscape,” and then you suddenly realize how much more there is to learn. And that sinking feeling that you’ll never learn it all.

CHARLES: Yeah, yeah. Yup. Unfortunately, it seems like in tech that’s like, that’s just the condition.

KATHARINE: Yeah, like all of my sad, unread books.

CHARLES: If I wanted to get into just really start experimenting with this stuff and start saying, “Maybe I can utilize some of these techniques for some of these problems that I’m encountering,” Where would be a good place to get started? What libraries? What online resources? What people are good to follow and ask questions of?

KATHARINE: Okay. Let’s start with websites. So for a start, this website called Kaggle which I mentioned earlier, and that is K-A-G-G-L-E dot com. And that has a lot of dataset resources. It also has a community of people discussing how they use the datasets. It has competitions. And it has a lot of links. I discovered recently as well a really good blog. It’s on Medium and it is called ‘Machine Learning for Humans’. And that’s really well-written. I really like it, actually, and it has a good section on resources called ‘The Best Machine Learning Resources’. And I should probably plug my own blog, but this one’s so much better.

CHARLES: But please do.

KATHARINE: No, no. I have to actually write stuff for it. But there’s a lot of things there about, well, if you want to learn linear algebra, what if you want to learn probability and statistics, calculus, and then just go straight to machine learning and pick up the math on the way. I would say go on Coursera, because there are courses like Andrew Ng’s course on machine learning from Stanford. And Geoffrey Hinton from the University of Toronto. But there are also courses there on things like calculus and probability and statistics if you want to level up your math. If you don’t want to do anything to do with the math, I would say go to the vendor websites, like AWS, Google. If you’re into Java, go on Deeplearning4j. And a lot of them have tutorials that complement their products. Deeplearning4j is one of my favorites at the moment. It’s an open source Java library. It’s pretty plug and play, actually. You don’t need to understand a lot of it to get started with it. But it helps. And obviously then there’s TensorFlow although personally, I find just other Python libraries like SciPy a lot easier than using TensorFlow.

And I think naturally you’ll find the resources and the people to follow on Twitter from that. But the crucial thing I’d say is don’t get hung up on which language to start playing around with. So, a lot of people say, “Oh, I must need to know Python. I must need to know math.” And really, you don’t. It just depends on what level you want to approach things at. So, if you want to write your own gradient descent algorithms, then Python is probably more for you, or MATLAB or R or something like that. But there are libraries where you can do it in Java. I’ve heard rumors that there’s a JavaScript library and I wouldn’t be surprised. So, I would just have a look at what’s out there. But try and get a grasp of the fundamentals just from an intuition point of view, because it will make your life so much easier. You might for example realize that you’re using the completely wrong algorithm for the problem that you’re looking at. And that’s invaluable.

CHARLES: Yeah. Knowing what not to do certainly is. Alright. Well, thank you so much for that, Katharine. Thank you for being on the show. Thank you for curing us of at least a small portion of our ignorance. And if people want to get in touch with you perhaps and continue the conversation, or follow you, what’s a good place to get in touch?

KATHARINE: Sure. Probably tweet me on Twitter. I’m @KatharineCodes but it’s spelled like Katharine Hepburn. It’s K-A-T-H-A-R-I-N-E codes. Because when I joined Twitter, I didn’t have much of an imagination. I still don’t. So, it’s not particularly clever. But it’s there.

CHARLES: Alright. Well, fantastic. And for everybody listening at home, you can also get in touch with us. We’re @TheFrontside on Twitter. Or you can just drop us a line at info@frontside.io. Thanks for listening and we will see you all next time.
