I’m happy to report that I’ve made a start on gathering the data I need for my Neural Network betting application. Since my last post on the subject I’ve looked at the racing in the UK, and the weather has written off a lot of the meetings, so we’ve been left with mainly poor All Weather events.
My initial thoughts were to create a network for Flat racing on turf as opposed to AW so until the season starts that’s on the back burner. If it turns out any good then it wouldn’t be too much of an issue to create one for Jump racing but I don’t want to try and run before I can walk.
I have a good idea of the data I need and the places I can find it. Consistency is the key here. When you undertake something like this your input has to come from exactly the same source each and every time. You can’t take all your info from the Racing Post one week then use info from, say, the Sporting Life site the next. The issue I have is that much of the data I need can only be found in the racecards as printed in the Racing Post, plus some of the Trainer and Jockey stats found on the Sporting Life site. It would be a tremendous ball ache to try and find past racecards and past Trainer/Jockey stats for those previous races, and things like forecast SPs would be a nightmare to dig up. What I’m trying to say is that I don’t think there’s a ton of historical data I could just pull and use to create my network, as some of the key stuff just won’t be available. Right now though there’s no ‘proper’ racing to be had in the UK, so rather than sit and twiddle my thumbs until the end of March I thought I’d strike out and do something for the US racing.
So the way I will proceed is the way I did it before. I’ll be using current racecard info taken from the free cards available at http://www.brisnet.co.uk. Everything I need is there on those cards. I started with yesterday’s meeting at Tampa and will just keep going until I think I have enough in my spreadsheet to start training the network. It’s a lot of work though. Each horse in each race has 18 different bits of info that have to be manually entered into a spreadsheet, including whether or not it won. When I think I have enough races entered in, I feed that data into a Neural Network created using software that’s available either as a free 30 day trial or a one off payment of $70 (or thereabouts).
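To give a flavour of the data entry, here’s a rough Python sketch of how one row per horse might be stored. The field names are invented placeholders for illustration (the real cards carry 18 bits of info per horse, and I’m only showing a handful here, not the actual Brisnet fields):

```python
import csv
import io

# Hypothetical column names -- illustrative only, not the real card fields.
# The real spreadsheet has 18 data columns plus the won/lost result.
FIELDS = [
    "speed_fig", "class_rating", "days_since_run", "draw",
    "jockey_win_pct", "trainer_win_pct",
    "won",  # target: 1 if the horse won, 0 otherwise
]

def write_rows(rows, fh):
    """Write one row per horse to a CSV the network can later be trained from."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_rows([{"speed_fig": 92, "class_rating": 88, "days_since_run": 14,
             "draw": 3, "jockey_win_pct": 0.18, "trainer_win_pct": 0.12,
             "won": 1}], buf)
```

Nothing clever going on, but it shows the shape of the job: one labelled row per horse, every row from the same source.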
The idea here is that the network is ‘trained’ to spot patterns that exist among the many data sets it sees and analyses… patterns that might otherwise be missed or ignored by a human trying to pick a winner from a mountain of form and other data presented in your average racing trade paper (or online if you prefer). Once the initial training has been performed and the network stops the learning process, it can be tested using previously unseen race data sets whose outcomes we know but don’t reveal… we give it the same bits of info as in the training cases but we don’t tell it which horse won. The network is tasked with forecasting the winner based upon what it has learned during the training phase.
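The training idea can be boiled down to a toy sketch. This is a single ‘neuron’ with made-up numbers, not the commercial package I mentioned, but it shows the principle of using known outcomes to nudge the weights:

```python
import math

def train(examples, epochs=200, lr=0.5):
    """Toy 'training phase': repeatedly nudge the weights so the output
    moves closer to the known result for each example."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted chance of winning
            err = p - y                      # how far off we were
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Forward pass only: same sum-and-squash, no weight adjustment."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Made-up toy data: (features, won). A higher first feature tends to win here.
data = [([0.9, 0.2], 1), ([0.3, 0.8], 0), ([0.8, 0.4], 1), ([0.2, 0.5], 0)]
w, b = train(data)
```

After training, `predict` gives a high score for horses that look like the past winners and a low one for the rest, which is exactly the pattern-spotting described above, just on a laughably small scale.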
The testing phase can be as long or as short as a piece of string. By that I mean it’s entirely up to me to decide whether the network is ‘ready’ to be given live race data for races that have not yet been run. If, during the testing phase, the network’s predictions are generally incorrect then more training is required. If it forecasts a result incorrectly during testing then I simply tell it that it got it wrong. If it forecasts correctly I tell it that it got it right. All of this new info presented to the network during testing is used to internally adjust the weights it has previously assigned to each piece of data it sees. This iterative process continues until I am satisfied that the network’s current strike rate is acceptable. Only then do I present it with live data for races that have not yet been run.
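That “am I satisfied?” decision can be sketched too. The 60% cut-off below is just an invented placeholder; the real threshold is whatever I judge acceptable at the time:

```python
def strike_rate(predictions, actual_winners):
    """Fraction of test races where the network's pick matched the real winner.
    A prediction of None counts as 'no selection' and is left out entirely."""
    picks = [(p, a) for p, a in zip(predictions, actual_winners) if p is not None]
    if not picks:
        return 0.0
    return sum(1 for p, a in picks if p == a) / len(picks)

def ready_for_live_data(predictions, actual_winners, threshold=0.6):
    """Hypothetical rule of thumb -- the real cut-off is a judgement call."""
    return strike_rate(predictions, actual_winners) >= threshold

# Example: 5 test races, one no-selection, three correct out of four picks.
preds   = [3, 7, None, 2, 5]   # horse numbers the network picked
winners = [3, 7, 1,    2, 8]   # horses that actually won
```

Here the strike rate works out at 3 correct from 4 actual selections, so this imaginary network would pass the made-up 60% bar.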
Obviously as more and more data is input the network will continue to make subtle tweaks here and there, but the aim is to have a piece of software that can objectively analyse a bunch of racing information and predict winners with some degree of accuracy overall. The drawback is that it’s a tremendous amount of work, and unless the network has a good strike rate it’s not worth pursuing over the longer term. You might as well just sign up with the latest hot shot tipping service and save yourself the effort.
The thing that spurs me on to do this is the performance of one particular network I created back in 1997 for my final year honours project, Neural Networks For Horserace Prediction. As with many honours projects students tend to do little about them until they realise that the pressure is on and they’d better get their arses in gear and do some work. I was no different. In a relatively short period of time, only a few months out of the whole year, I managed to get it done start to finish. Lots of midnight oil burned and tons of stress. The outcome though was worth it.
During that period I created 8 different networks, each with a differing ‘topology’, to see how the various networks fared given the same data sets. Would more neurons in a hidden layer be better? Would more than one hidden layer of neurons yield a better result? A network topology basically describes how the network looks on paper: how many inputs, how many neurons in each hidden layer, and how many outputs. The following shows a simple network topology… in this case 4:5:1
For example in my project I had 9 individual bits of data that were the inputs to the 8 networks. Those 9 bits of data were fed into 8 different network topologies. The best performing network had not one but two hidden layers…so the topology was 9:4:3:1…nine inputs going into a hidden layer of 4 neurons, the output of which is fed into a further hidden layer of 3 neurons, the output of which gives us our final output.
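To make the notation concrete, here’s a minimal sketch of a feed-forward pass for any topology written as a list of layer sizes. The weights are just random numbers; it’s purely to show the shape of a 9:4:3:1 net, not a trained one, and it isn’t the software I actually used back then:

```python
import math
import random

def make_network(topology, seed=0):
    """Build random weights for a feed-forward net given as layer sizes,
    e.g. [9, 4, 3, 1] for the 9:4:3:1 topology described above."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(topology, topology[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1, 1) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def forward(layers, inputs):
    """Feed the inputs through each layer in turn (sigmoid activations)."""
    a = inputs
    for weights, biases in layers:
        a = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, a)) + b)))
             for row, b in zip(weights, biases)]
    return a

net = make_network([9, 4, 3, 1])   # nine inputs -> 4 neurons -> 3 neurons -> 1 output
out = forward(net, [0.5] * 9)
```

The nice thing about writing the topology as a plain list is that comparing 9:4:3:1 against, say, 9:5:1 is just a matter of changing one argument, which is essentially what creating those 8 rival networks amounted to.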
When it came to interrogating my networks with live data for races that hadn’t yet been run, only one of them yielded a loss. Three failed to give a prediction either way (no selection given) but the other 4 all returned a profit. Every network was interrogated using the same live data, so each got the same races fed in. As I said previously I didn’t spend all year on this project, and in total I only entered 125 races’ worth of data: 70 races to train the networks, 35 races during the testing phase and only 20 during the live interrogation phase. Hardly a drop in the ocean of all the racing info that’s now recorded as history, but that’s not the point. The point is that several of the networks, even with the small amount of data they were exposed to, still managed to make a profit to level stakes.
As I mentioned above the best performing network had two hidden layers as opposed to one, but then again so did four of the others, so it’s not as straightforward as simply settling on one particular topology and calling it done. You have to create many different ones and compare them to see which provides the best results. But I digress… the best performing network back in 1997, out of a total of 20 previously unseen data sets for races that hadn’t been run, gave 5 winners, 2 losers, and in the other 13 races failed to give a selection at all… that’s a 71.42% strike rate.
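Those numbers are easy to check: 5 winners from the 7 races where it actually made a selection gives the strike rate quoted. Working out the level-stakes profit as well needs the winners’ odds, which I’ve had to invent below since I’m not listing the real ones:

```python
winners, losers, no_selection = 5, 2, 13

races = winners + losers + no_selection   # 20 live races in total
bets = winners + losers                   # no-selection races cost nothing
strike = winners / bets * 100             # 5 of 7 selections = 71.42...%

# Hypothetical decimal starting prices for the 5 winners, 1pt level stakes.
# These odds are invented for illustration only -- not the 1997 results.
hypothetical_odds = [3.0, 2.5, 4.0, 2.0, 3.5]
profit = sum(o - 1 for o in hypothetical_odds) - losers
```

The key point the arithmetic makes is that no-selection races don’t hurt the bank; they only hurt your patience.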
Now I’m not saying that if the network was exposed to ten times the amount of data it would continue with such a great performance… no-one will ever know the answer to that… but the results were extremely encouraging. The only drawback, as I’ve pointed out before, is that there’s a hell of a lot of work that goes into getting these things working, and when it returns no selection after you’ve spent ages inputting race data for say 10 horses in the 2.30 at Donny it tends to piss you off a bit. That’s the only downside, but I believe it will happen less as the network matures and is exposed to more and more input.
I think that’s where I’ll call it a day for this post. I just wanted to get something out there for those that might be interested. There’s something really satisfying about doing a project like this, and even more so if it bears fruit, but right now it’s in its infancy and there’s lots to do. Every journey begins with that first step… who knows where this will end? Bye for now.