wtorek, 20 marca 2018

Solving for Quadratic Equation with TensorFlow and Custom C++ Code

Let's just start with what is our problem. The problem we want to solve is to find quadratic equation's $Ax^2 + Bx + C$ coefficients $A$, $B$ and $C$ that best approximate a given set of 2D points $(x, y)$. This is a linear problem that can easily be solved with Least Squares Fitting [1]. However, here we will solve it using Gradient Descent (GD), which is a general algorithm that lets you solve for both linear and non-linear problems. This is a basic algorithm that all neural networks frameworks are based on.

Some background for this blog post. Initally I started doing my first neural networks work using TensorFlow and found it very difficult to 1) work with that framework (I find PyTroch much more intuitive to work with) and 2) to force it to use GPU. Because of that I thought that since GD is not a hard algorithm to implement I thought that I could write my own implementation of it using DirectCompute. But before doing that I had to see if I correctly understand what TensorFlow is doing under the hood. I was able to verify that. I wrote a simple quadratic equation solver with TensorFlow and then a simple quadratic solver in C++ using GD. Now, the huge difference between the two is that TensorFlow computes derivatives (which are base for GD) symbolically whereas my C++ solver does so numerically. Obviously, the first approach - symbolic differentiation - is the preferred way as it provides more stable results. But for quick verification numerical differentation was enough for me.

Now a little bit on GD itself. The idea is that we have some cost function as a starting point. This cost function could be the sum of distances of all points to the quadratic curve. Usually our aim is to minimize that cost function, that is, find input arguments for that function (in our case $A$, $B$ and $C$ of the quadratic equation) such that the cost function has value as small as possible (i.e. minimized). We also know there this mathematical entity called gradient which, when computed, tells us what values we should add to our input arguments so that the value of the cost function increases (yes, increases, not decreases) by more than it would have increased if we had chosen different values to add. If we repeat this step a few times we might eventually reach the top of the function, i.e. such input arguments where the value of the cost function is the biggest. Now since we're interested in finding the bottom of the function, not the top, we just need to subtract the gradient from our input arguments, not add it.

As stated, we'll be finding $A$, $B$ and $C$ of the quadratic equation given a set of points $(x, y)$ that the quadratic equation is supposed to approximate. We will use here a short, predefined list of points: $(1, 0), (2, -1), (3, -2), (4, -3)$. Yeah, you've probably noticed that they all lie on a straight line so straight line would be what best approximates that set. But we want a quadratic. Because that's why :).

TensorFlow code that solves this comes here:
import tensorflow as tf
import numpy

# TensorFlow model

tf_a = tf.Variable([0.0], dtype=tf.float32)
tf_b = tf.Variable([0.0], dtype=tf.float32)
tf_c = tf.Variable([0.0], dtype=tf.float32)

tf_x = tf.placeholder(tf.float32)
tf_y = tf.placeholder(tf.float32)

tf_cost = tf.reduce_sum(
  tf.square(tf_a*tf_x*tf_x + tf_b*tf_x + tf_c - tf_y))
tf_optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(tf_cost)

# run the model

points = [ [1, 2, 3, 4], [0, -1, -2, -3] ]

sess = tf.Session()
for i in range(1000):
    { tf_x: points[0], tf_y: points[1] })

a, b, c, cost = sess.run(
  [tf_a, tf_b, tf_c, tf_cost],
  { tf_x: points[0], tf_y: points[1] })
print("A: ", a, "  B: ", b, "  C: ", c, "  cost: ", cost);
As you can see, we're first building a TensorFlow model of the computational graph and then run it with data provided (1000 iterations, learning rate set to $0.001$). We define the cost function as the sum of differences between the quadratic equation with $x$ provided for all points, and actual $y$ provided for all points. For anyone who's worked at least a little bit with TensorFlow this code should be pretty self-explanatory. For those of you for which it is not I highly recommend going to [2].

Okay. So there was the magic of TensorFlow. We will now see how to write code that does exactly the same calculations but with C++ (but using numerical differentation for ease of implementation and generalization). Let's just see the whole code:
float Cost(const SVector2& pt, float a, float b, float c)
  float value = a*pt.x*pt.x + b*pt.x + c - pt.y;
  return value * value;


vector< SVector2 > points;
points.push_back(VectorCustom(1.0f, 0.0f));
points.push_back(VectorCustom(2.0f, -1.0f));
points.push_back(VectorCustom(3.0f, -2.0f));
points.push_back(VectorCustom(4.0f, -3.0f));

const float epsilon = 0.0001f;

float a = 0.0f;
float b = 0.0f;
float c = 0.0f;
float cost = 0.0f;
for (uint i = 0; i < 1000; i++)
  cost = 0.0;
  for (uint j = 0; j < points.size(); j++)
    cost += Cost(points[j], a, b, c);

  float dCost_dA = 0.0f;
  float dCost_dB = 0.0f;
  float dCost_dC = 0.0f;
  for (uint j = 0; j < points.size(); j++)
    dCost_dA += ( Cost(points[j], a + epsilon, b, c) -
      Cost(points[j], a - epsilon, b, c) ) /
      (2.0f * epsilon);
    dCost_dB += ( Cost(points[j], a, b + epsilon, c) -
      Cost(points[j], a, b - epsilon, c) ) /
      (2.0f * epsilon);
    dCost_dC += ( Cost(points[j], a, b, c + epsilon) -
      Cost(points[j], a, b, c - epsilon) ) /
      (2.0f * epsilon);

  a -= 0.001f * dCost_dA;
  b -= 0.001f * dCost_dB;
  c -= 0.001f * dCost_dC;

cout << "A: " << a << "  B: " << b << "  C: " << c << "  cost: " << cost << endl;
We start with defining a cost function for a single point (note that TensorFlow, as it operates on tensors, computes the error for all points in single expression). This cost function is further used in iterations (also 1000) loop where costs for all points are summed together. The next step is doing GD using numerical differentiation. Values dCost_dA, dCost_dB and dCost_dC represent partial derivatives, the gradient. In the points loop we are using well-known two-point differentation formula ([3], see the one with $2h$ in the denominator) to calculate the partial derivatives $\frac{dCost}{dA}$, $\frac{dCost}{dB}$ and $\frac{dCost}{dC}$.
Once we have looped through all the points we have a joint gradient for all the points that tells us the direction of the greatest rate of increase of the cost function. Since we want to go towards the bottom of the function, not the top, we need to subtract the gradient (using learn rate of $0.001$, just like in TensorFlow example) from the model parameters $A$, $B$ and $C$.

And that's it. Really. I can't be sure if TensorFlow doesn't do anything more sophisticated with the derivatives but the results I got from both TensorFlow code above and the C++ code above are very similar.
The reason that the results are not identical could be that TensorFlow uses doubles whereas I used floats. Another reason could be that TensorFlow uses symbolic differentiation while I used numerical.

Now one mind wonder... Since as input we gave points that lie on a straight line why didn't we end up with $A=0$ and $0$ cost? A straight line should approximate these points best, right? That is correct. And in fact we can end up with $0$ cost once we have performed a sufficient number of iterations. In both cases above we made a thousand of them. But what coefficients we would end up with if we did 100k iterations? Here's the C++ code's answer:
We got almost straight line equation $-x + 1$.
For such a simple linear problem least squares algorithm would be a much better (cheaper and more numerically stable) option. Using GD for that is like using heavy artillery to shoot a mosquito. But its great advantage is that you can shoot... solve many other kinds of functions. That's what it is for.

Both the C++ code and TensorFlow code process all of the points and run gradient descent on the whole set of them. In machine learning that is called Batch Gradient Descent (BGD). There are alternative approaches. One of them, called Stochastic Gradient Descent (SGD), instead of processing the whole set of points picks one of the points at random and runs gradient descent only on that one sample. There is also some middle ground between BGD and SGD and it's called Minibatch Gradient Descent (MGD). The idea is to take a batch of $n$ random points and do exactly the same things as with regular BGD. There are different approaches here that you might take. You might take $n$ completely random points, where some of them might be included in a minibatch more than once. You can also make sure there are always different $n$ points in a minibatch. It's up to you. I recommend book [4] as it has a lot of nice pictures showing differences between BGD, SGD and MGD. It's a great book about machine learning in general.

The C++ code I wrote for GD was to be the first step towards my custom GPU-based neural network implementation. For the time being I have ditched that idea because I found it surprisingly easy to work with PyTorch and that is what I'm exploring now with regard to GPU-based neural nets.

[1] http://www.mn.uio.no/ifi/english/services/knowledge/cs/forelesn.kap5.pdf
[2] https://www.youtube.com/watch?v=wuo4JdG3SvU
[3] https://en.wikipedia.org/wiki/Numerical_differentiation
[4] https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291