Tuesday, March 20, 2018

Solving for Quadratic Equation with TensorFlow and Custom C++ Code

Go to Index

Let's just start with what is our problem. The problem we want to solve is to find quadratic equation's $Ax^2 + Bx + C$ coefficients $A$, $B$ and $C$ that best approximate a given set of 2D points $(x, y)$. This is a linear problem that can easily be solved with Least Squares Fitting [1]. However, here we will solve it using Gradient Descent (GD), which is a general algorithm that lets you solve for both linear and non-linear problems. This is a basic algorithm that all neural networks frameworks are based on.

Some background for this blog post. Initally I started doing my first neural networks work using TensorFlow and found it very difficult to 1) work with that framework (I find PyTroch much more intuitive to work with) and 2) to force it to use GPU. Because of that I thought that since GD is not a hard algorithm to implement I thought that I could write my own implementation of it using DirectCompute. But before doing that I had to see if I correctly understand what TensorFlow is doing under the hood. I was able to verify that. I wrote a simple quadratic equation solver with TensorFlow and then a simple quadratic solver in C++ using GD. Now, the huge difference between the two is that TensorFlow computes derivatives (which are base for GD) symbolically whereas my C++ solver does so numerically. Obviously, the first approach - symbolic differentiation - is the preferred way as it provides more stable results. But for quick verification numerical differentation was enough for me.

Now a little bit on GD itself. The idea is that we have some cost function as a starting point. This cost function could be the sum of distances of all points to the quadratic curve. Usually our aim is to minimize that cost function, that is, find input arguments for that function (in our case $A$, $B$ and $C$ of the quadratic equation) such that the cost function has value as small as possible (i.e. minimized). We also know there this mathematical entity called gradient which, when computed, tells us what values we should add to our input arguments so that the value of the cost function increases (yes, increases, not decreases) by more than it would have increased if we had chosen different values to add. If we repeat this step a few times we might eventually reach the top of the function, i.e. such input arguments where the value of the cost function is the biggest. Now since we're interested in finding the bottom of the function, not the top, we just need to subtract the gradient from our input arguments, not add it.

As stated, we'll be finding $A$, $B$ and $C$ of the quadratic equation given a set of points $(x, y)$ that the quadratic equation is supposed to approximate. We will use here a short, predefined list of points: $(1, 0), (2, -1), (3, -2), (4, -3)$. Yeah, you've probably noticed that they all lie on a straight line so straight line would be what best approximates that set. But we want a quadratic. Because that's why :).

TensorFlow code that solves this comes here:

import tensorflow as tf
import numpy


# TensorFlow model

tf_a = tf.Variable([0.0], dtype=tf.float32)
tf_b = tf.Variable([0.0], dtype=tf.float32)
tf_c = tf.Variable([0.0], dtype=tf.float32)

tf_x = tf.placeholder(tf.float32)
tf_y = tf.placeholder(tf.float32)

tf_cost = tf.reduce_sum(
  tf.square(tf_a*tf_x*tf_x + tf_b*tf_x + tf_c - tf_y))
tf_optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(tf_cost)


# run the model

points = [ [1, 2, 3, 4], [0, -1, -2, -3] ]

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(1000):
  sess.run(tf_optimizer,
    { tf_x: points[0], tf_y: points[1] })

a, b, c, cost = sess.run(
  [tf_a, tf_b, tf_c, tf_cost],
  { tf_x: points[0], tf_y: points[1] })
print("A: ", a, "  B: ", b, "  C: ", c, "  cost: ", cost);
As you can see, we're first building a TensorFlow model of the computational graph and then run it with data provided (1000 iterations, learning rate set to $0.001$). We define the cost function as the sum of differences between the quadratic equation with $x$ provided for all points, and actual $y$ provided for all points. For anyone who's worked at least a little bit with TensorFlow this code should be pretty self-explanatory. For those of you for which it is not I highly recommend going to [2].

Okay. So there was the magic of TensorFlow. We will now see how to write code that does exactly the same calculations but with C++ (but using numerical differentation for ease of implementation and generalization). Let's just see the whole code:
float Cost(const SVector2& pt, float a, float b, float c)
{
  float value = a*pt.x*pt.x + b*pt.x + c - pt.y;
  return value * value;
}

...

vector< SVector2 > points;
points.push_back(VectorCustom(1.0f, 0.0f));
points.push_back(VectorCustom(2.0f, -1.0f));
points.push_back(VectorCustom(3.0f, -2.0f));
points.push_back(VectorCustom(4.0f, -3.0f));

const float epsilon = 0.0001f;

float a = 0.0f;
float b = 0.0f;
float c = 0.0f;
float cost = 0.0f;
for (uint i = 0; i < 1000; i++)
{
  cost = 0.0;
  for (uint j = 0; j < points.size(); j++)
    cost += Cost(points[j], a, b, c);

  float dCost_dA = 0.0f;
  float dCost_dB = 0.0f;
  float dCost_dC = 0.0f;
  for (uint j = 0; j < points.size(); j++)
  {
    dCost_dA += ( Cost(points[j], a + epsilon, b, c) -
      Cost(points[j], a - epsilon, b, c) ) /
      (2.0f * epsilon);
    dCost_dB += ( Cost(points[j], a, b + epsilon, c) -
      Cost(points[j], a, b - epsilon, c) ) /
      (2.0f * epsilon);
    dCost_dC += ( Cost(points[j], a, b, c + epsilon) -
      Cost(points[j], a, b, c - epsilon) ) /
      (2.0f * epsilon);
  }

  a -= 0.001f * dCost_dA;
  b -= 0.001f * dCost_dB;
  c -= 0.001f * dCost_dC;
}

cout << "A: " << a << "  B: " << b << "  C: " << c << "  cost: " << cost << endl;
We start with defining a cost function for a single point (note that TensorFlow, as it operates on tensors, computes the error for all points in single expression). This cost function is further used in iterations (also 1000) loop where costs for all points are summed together. The next step is doing GD using numerical differentiation. Values dCost_dA, dCost_dB and dCost_dC represent partial derivatives, the gradient. In the points loop we are using well-known two-point differentation formula ([3], see the one with $2h$ in the denominator) to calculate the partial derivatives $\frac{dCost}{dA}$, $\frac{dCost}{dB}$ and $\frac{dCost}{dC}$.
Once we have looped through all the points we have a joint gradient for all the points that tells us the direction of the greatest rate of increase of the cost function. Since we want to go towards the bottom of the function, not the top, we need to subtract the gradient (using learn rate of $0.001$, just like in TensorFlow example) from the model parameters $A$, $B$ and $C$.

And that's it. Really. I can't be sure if TensorFlow doesn't do anything more sophisticated with the derivatives but the results I got from both TensorFlow code above and the C++ code above are very similar.
TensorFlow:
C++:
The reason that the results are not identical could be that TensorFlow uses doubles whereas I used floats. Another reason could be that TensorFlow uses symbolic differentiation while I used numerical.

Now one mind wonder... Since as input we gave points that lie on a straight line why didn't we end up with $A=0$ and $0$ cost? A straight line should approximate these points best, right? That is correct. And in fact we can end up with $0$ cost once we have performed a sufficient number of iterations. In both cases above we made a thousand of them. But what coefficients we would end up with if we did 100k iterations? Here's the C++ code's answer:
We got almost straight line equation $-x + 1$.
For such a simple linear problem least squares algorithm would be a much better (cheaper and more numerically stable) option. Using GD for that is like using heavy artillery to shoot a mosquito. But its great advantage is that you can shoot... solve many other kinds of functions. That's what it is for.

Both the C++ code and TensorFlow code process all of the points and run gradient descent on the whole set of them. In machine learning that is called Batch Gradient Descent (BGD). There are alternative approaches. One of them, called Stochastic Gradient Descent (SGD), instead of processing the whole set of points picks one of the points at random and runs gradient descent only on that one sample. There is also some middle ground between BGD and SGD and it's called Minibatch Gradient Descent (MGD). The idea is to take a batch of $n$ random points and do exactly the same things as with regular BGD. There are different approaches here that you might take. You might take $n$ completely random points, where some of them might be included in a minibatch more than once. You can also make sure there are always different $n$ points in a minibatch. It's up to you. I recommend book [4] as it has a lot of nice pictures showing differences between BGD, SGD and MGD. It's a great book about machine learning in general.

The C++ code I wrote for GD was to be the first step towards my custom GPU-based neural network implementation. For the time being I have ditched that idea because I found it surprisingly easy to work with PyTorch and that is what I'm exploring now with regard to GPU-based neural nets.

[1] http://www.mn.uio.no/ifi/english/services/knowledge/cs/forelesn.kap5.pdf
[2] https://www.youtube.com/watch?v=wuo4JdG3SvU
[3] https://en.wikipedia.org/wiki/Numerical_differentiation
[4] https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

Monday, February 26, 2018

DirectX 11, HLSL, GatherRed

Go to Index

Every once in a while I am in need to use one of those Gather functions from DirectX 11's HLSL library. GatherRed in this case. This function is useful because it allows you take four samples with just one instruction and store them all in float4. As the name indicates, of the four texels that are sampled simultaneously, only the values from the red channels will be stored in float4. If you need data from other channels you can use respective functions. It is really worth using this function(s) if you only need data from one channel as calling one gather is faster than taking four samples individually.

If instead of using your original UV coordinates to take one regular sample with Sample or SampleLevel you call GatherRed which four samples (their red channels) exactly will be taken? This is something the DirectX's documentation doesn't specify so this short blog post is here to fill this gap. You can also stumble on that information in various DirectX 11 presentations.
 
Take a look at the picture:
 
 
 
The grey pixel is the one whose UV coordinates we have in the shader (the very center of that texel, to be more specific). If you call GatherRed you will get the four labeled samples (again, only their red channel's values). Probably a little bit counter-intuitively the value of the "base" sample is not stored in return value's $x$ component but $w$ as the image above shows. For better picture the two following snippets are equivalent:
float r1 = myTexture.Sample( mySampler, uv, int2(0, 0) ).x;
float r2 = myTexture.Sample( mySampler, uv, int2(1, 0) ).x;
float r3 = myTexture.Sample( mySampler, uv, int2(0, 1) ).x;
float r4 = myTexture.Sample( mySampler, uv, int2(1, 1) ).x;
and:
float4 samples = myTexture.GatherRed( mySampler, uv + float2(0.5f, 0.5f)/myTextureDim );
float r1 = samples.w;
float r2 = samples.z;
float r3 = samples.x;
float r4 = samples.y;

And these as well:
float myValueR = myTexture.Sample( mySampler, uv ).x;
float myValueG = myTexture.Sample( mySampler, uv ).y;
float myValueB = myTexture.Sample( mySampler, uv ).z;
float myValueA = myTexture.Sample( mySampler, uv ).w;
and:
float myValueR = myTexture.GatherRed( mySampler, uv + float2(0.5f, 0.5f)/myTextureDim ).w;
float myValueG = myTexture.GatherGreen( mySampler, uv + float2(0.5f, 0.5f)/myTextureDim ).w;
float myValueB = myTexture.GatherBlue( mySampler, uv + float2(0.5f, 0.5f)/myTextureDim ).w;
float myValueA = myTexture.GatherAlpha( mySampler, uv + float2(0.5f, 0.5f)/myTextureDim ).w;
You probably noticed that when using Gather there is a half-texel offset applied. That is because this instruction does not blindly sample the texture at the specified uv and its right/bottom/right-bottom neighbors. Instead, it picks uvs that would have been chosen if we wanted to perform custom bilinear texture filtering. Have a look at the image below:
 

Here we do not apply the half-texel offset. As a result Gather picks different samples than Sample. It picks the samples that would have been used for bilinear filtering. In order to "invalidate" that, and make sure that Gather will always have the upper-left sample under the current/sampled uv we need to apply the half-texel offset.

There. I hope you won't have to wonder anymore the order of samples returned by gathers :). At least I know I won't.
 
ACKNOWLEDGEMENTS: I'd like to thank Klaudiusz Zych for pointing out to me the need to apply the half-texel offset. I missed it in the first version of this post. Klaudiusz also drew the image that explains graphically the need to use the half-texel offset.
Also thanks to @xi@g@me for pointing out the same mistake in the comments.