wtorek, 20 marca 2018

Solving for Quadratic Equation with TensorFlow and Custom C++ Code

Let's just start with what is our problem. The problem we want to solve is to find quadratic equation's $Ax^2 + Bx + C$ coefficients $A$, $B$ and $C$ that best approximate a given set of 2D points $(x, y)$. This is a linear problem that can easily be solved with Least Squares Fitting [1]. However, here we will solve it using Gradient Descent (GD), which is a general algorithm that lets you solve for both linear and non-linear problems. This is a basic algorithm that all neural networks frameworks and based on.

Some background for this blog post. Initally I started doing my first neural networks work using TensorFlow and found it very difficult to 1) work with that framework (I find PyTroch much more intuitive to work with) and 2) to force it to use GPU. Because of that I thought that since GD is not a hard algorithm to implement I thought that I could write my own implementation of it using DirectCompute. But before doing that I had to see if I correctly understand what TensorFlow is doing under the hood. I was able to verify that. I wrote a simple quadratic equation solver with TensorFlow and then a simple quadratic solver in C++ using GD. Now, the huge difference between the two is that TensorFlow computes derivatives (which are base for GD) symbolically whereas my C++ solver does so numerically. Obviously, the first approach - symbolic differentiation - is the preferred way as it provides more stable results. But for quick verification numerical differentation was enough for me.

Now a little bit on GD itself. The idea is that we have some cost function as a starting point. This cost function could be the sum of distances of all points to the quadratic curve. Usually our aim is to minimize that cost function, that is, find input arguments for that function (in our case $A$, $B$ and $C$ of the quadratic equation) such that the cost function has value as small as possible (i.e. minimized). We also know there this mathematical entity called gradient which, when computed, tells us what values we should add to our input arguments so that the value of the cost function increases (yes, increases, not decreases) by more than it would have increased if we had chosen different values to add. If we repeat this step a few times we might eventually reach the top of the function, i.e. such input arguments where the value of the cost function is the biggest. Now since we're interested in finding the bottom of the function, not the top, we just need to subtract the gradient from our input arguments, not add it.

As stated, we'll be finding $A$, $B$ and $C$ of the quadratic equation given a set of points $(x, y)$ that the quadratic equation is supposed to approximate. We will use here a short, predefined list of points: $(1, 0), (2, -1), (3, -2), (4, -3)$. Yeah, you've probably noticed that they all lie on a straight line so straight line would be what best approximates that set. But we want a quadratic. Because that's why :).

TensorFlow code that solves this comes here:
import tensorflow as tf
import numpy

# TensorFlow model

tf_a = tf.Variable([0.0], dtype=tf.float32)
tf_b = tf.Variable([0.0], dtype=tf.float32)
tf_c = tf.Variable([0.0], dtype=tf.float32)

tf_x = tf.placeholder(tf.float32)
tf_y = tf.placeholder(tf.float32)

tf_cost = tf.reduce_sum(
  tf.square(tf_a*tf_x*tf_x + tf_b*tf_x + tf_c - tf_y))
tf_optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(tf_cost)

# run the model

points = [ [1, 2, 3, 4], [0, -1, -2, -3] ]

sess = tf.Session()
for i in range(1000):
    { tf_x: points[0], tf_y: points[1] })

a, b, c, cost = sess.run(
  [tf_a, tf_b, tf_c, tf_cost],
  { tf_x: points[0], tf_y: points[1] })
print("A: ", a, "  B: ", b, "  C: ", c, "  cost: ", cost);
As you can see, we're first building a TensorFlow model of the computational graph and then run it with data provided (1000 iterations, learning rate set to $0.001$). We define the cost function as the sum of differences between the quadratic equation with $x$ provided for all points, and actual $y$ provided for all points. For anyone who's worked at least a little bit with TensorFlow this code should be pretty self-explanatory. For those of you for which it is not I highly recommend going to [2].

Okay. So there was the magic of TensorFlow. We will now see how to write code that does exactly the same calculations but with C++ (but using numerical differentation for ease of implementation and generalization). Let's just see the whole code:
float Cost(const SVector2& pt, float a, float b, float c)
  float value = a*pt.x*pt.x + b*pt.x + c - pt.y;
  return value * value;


vector< SVector2 > points;
points.push_back(VectorCustom(1.0f, 0.0f));
points.push_back(VectorCustom(2.0f, -1.0f));
points.push_back(VectorCustom(3.0f, -2.0f));
points.push_back(VectorCustom(4.0f, -3.0f));

const float epsilon = 0.0001f;

float a = 0.0f;
float b = 0.0f;
float c = 0.0f;
float cost = 0.0f;
for (uint i = 0; i < 1000; i++)
  cost = 0.0;
  for (uint j = 0; j < points.size(); j++)
    cost += Cost(points[j], a, b, c);

  float dCost_dA = 0.0f;
  float dCost_dB = 0.0f;
  float dCost_dC = 0.0f;
  for (uint j = 0; j < points.size(); j++)
    dCost_dA += ( Cost(points[j], a + epsilon, b, c) -
      Cost(points[j], a - epsilon, b, c) ) /
      (2.0f * epsilon);
    dCost_dB += ( Cost(points[j], a, b + epsilon, c) -
      Cost(points[j], a, b - epsilon, c) ) /
      (2.0f * epsilon);
    dCost_dC += ( Cost(points[j], a, b, c + epsilon) -
      Cost(points[j], a, b, c - epsilon) ) /
      (2.0f * epsilon);

  a -= 0.001f * dCost_dA;
  b -= 0.001f * dCost_dB;
  c -= 0.001f * dCost_dC;

cout << "A: " << a << "  B: " << b << "  C: " << c << "  cost: " << cost << endl;
We start with defining a cost function for a single point (note that TensorFlow, as it operates on tensors, computes the error for all points in single expression). This cost function is further used in iterations (also 1000) loop where costs for all points are summed together. The next step is doing GD using numerical differentiation. Values dCost_dA, dCost_dB and dCost_dC represent partial derivatives, the gradient. In the points loop we are using well-known two-point differentation formula ([3], see the one with $2h$ in the denominator) to calculate the partial derivatives $\frac{dCost}{dA}$, $\frac{dCost}{dB}$ and $\frac{dCost}{dC}$.
Once we have looped through all the points we have a joint gradient for all the points that tells us the direction of the greatest rate of increase of the cost function. Since we want to go towards the bottom of the function, not the top, we need to subtract the gradient (using learn rate of $0.001$, just like in TensorFlow example) from the model parameters $A$, $B$ and $C$.

And that's it. Really. I can't be sure if TensorFlow doesn't do anything more sophisticated with the derivatives but the results I got from both TensorFlow code above and the C++ code above are very similar.
The reason that the results are not identical could be that TensorFlow uses doubles whereas I used floats. Another reason could be that TensorFlow uses symbolic differentiation while I used numerical.

Now one mind wonder... Since as input we gave points that lie on a straight line why didn't we end up with $A=0$ and $0$ cost? A straight line should approximate these points best, right? That is correct. And in fact we can end up with $0$ cost once we have performed a sufficient number of iterations. In both cases above we made a thousand of them. But what coefficients we would end up with if we did 100k iterations? Here's the C++ code's answer:
We got almost straight line equation $-x + 1$.
For such a simple linear problem least squares algorithm would be a much better (cheaper and more numerically stable) option. Using GD for that is like using heavy artillery to shoot a mosquito. But its great advantage is that you can shoot... solve many other kinds of functions. That's what it is for.

Both the C++ code and TensorFlow code process all of the points and run gradient descent on the whole set of them. In machine learning that is called Batch Gradient Descent (BGD). There are alternative approaches. One of them, called Stochastic Gradient Descent (SGD), instead of processing the whole set of points picks one of the points at random and runs gradient descent only on that one sample. There is also some middle ground between BGD and SGD and it's called Minibatch Gradient Descent (MGD). The idea is to take a batch of $n$ random points and do exactly the same things as with regular BGD. There are different approaches here that you might take. You might take $n$ completely random points, where some of them might be included in a minibatch more than once. You can also make sure there are always different $n$ points in a minibatch. It's up to you. I recommend book [4] as it has a lot of nice pictures showing differences between BGD, SGD and MGD. It's a great book about machine learning in general.

The C++ code I wrote for GD was to be the first step towards my custom GPU-based neural network implementation. For the time being I have ditched that idea because I found it surprisingly easy to work with PyTorch and that is what I'm exploring now with regard to GPU-based neural nets.

[1] http://www.mn.uio.no/ifi/english/services/knowledge/cs/forelesn.kap5.pdf
[2] https://www.youtube.com/watch?v=wuo4JdG3SvU
[3] https://en.wikipedia.org/wiki/Numerical_differentiation
[4] https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

poniedziałek, 26 lutego 2018

DirectX 11, HLSL, GatherRed

Every once in a while I am in need to use one of those Gather functions from DirectX 11's HLSL library. GatherRed in this case. This function is useful because it allows you take four samples with just one instruction and store them all in float4. As the name indicates, of the four texels that are sampled simultaneously, only the values from the red channels will be stored in float4. If you need data from other channels you can use respective functions. It is really worth using this function(s) if you only need data from one channel as calling one gather is faster than taking four samples individually.

If instead of using your original UV coordinates to take one regular sample with Sample or SampleLevel you call GatherRed which four samples (their red channels) exactly will be taken? This is something the DirectX's documentation doesn't specify so this short blog post is here to fill this gap. You can also stumble on that information in various DirectX 11 presentations.

Take a look at the picture:

The grey pixel is the one whose UV coordinates we have in the shader (the very center of that texel, to be more specific). If you call GatherRed you will get the four labeled samples (again, only their red channel's values). Probably a little bit counter-intuitively the value of the "base" sample is not stored in return value's $x$ component but $w$ as the image above shows. For better picture the two following snippets are equivalent:
float myValueR = myTexture.Sample(mySampler, uv).x;
float myValueG = myTexture.Sample(mySampler, uv).y;
float myValueB = myTexture.Sample(mySampler, uv).z;
float myValueA = myTexture.Sample(mySampler, uv).w;
float myValueR = myTexture.GatherRed(mySampler, uv).w;
float myValueG = myTexture.GatherGreen(mySampler, uv).w;
float myValueB = myTexture.GatherBlue(mySampler, uv).w;
float myValueA = myTexture.GatherAlpha(mySampler, uv).w;
And these as well:
float r1 = myTexture.Sample(mySampler, uv).x;
float r2 = myTexture.Sample(mySampler, uv, int2(1, 0)).x;
float r3 = myTexture.Sample(mySampler, uv, int2(0, 1)).x;
float r4 = myTexture.Sample(mySampler, uv, int2(1, 1)).x;
float4 samples = myTexture.GatherRed(mySampler, uv);
float r1 = samples.w;
float r2 = samples.z;
float r3 = samples.x;
float r4 = samples.y;
There. I hope you won't have to wonder anymore the order of samples returned by gathers :). At least I know I won't.

czwartek, 15 września 2016

Conversion Between std::string and std::wstring

While developing my brand new (being over a year old at least...) framework I decided I want to use ANSI chars (std::string or char) as often as possible, instead of more complex encodings like Unicode (stored in std::wstring or wchar_t). As file names for instance. And for tons of other stuff. However, many functions, most notably from Windows SDK, expect wide chars instead of pure chars. This is the case with D3DCompileFromFile function that I used. My custom function called CompileShaderFromFile calls this function but takes as input file name to a shader file stored in std::string, which cannot be passed to D3DCompileFromFile. At least not without conversion. So I started to search Google for conversion functions between the two, chars and wide chars. To my surprise I found a lot of code that did not work straight away, was very complex or was platform-specific. Eventually, I thought that since built-in STL library is so packed with various conversion functions then maybe there is something to convert between string and wstring. Turns out there is. You can add a wchar_t to string and a char to wstring, and the append function (+= operator in my code) will perform proper conversion.
Here goes the code:
inline wstring StringToWString(const string& s)
    wstring temp = L"";

    for (uint i = 0; i < s.length(); i++)
        temp += (wchar_t)s[i];

    return temp;

inline string WStringToString(const wstring& s)
    string temp = "";

    for (uint i = 0; i < s.length(); i++)
        temp += (char)s[i];

    return temp;
Or equivalently:
inline wstring StringToWString(const string& s)
    return wstring(s.begin(), s.end());

inline string WStringToString(const wstring& s)
    return string(s.begin(), s.end());
And that's it. With these two functions (or actually only the first one) I was able to pass string to all my custom functions and convert it to wstring when necessary. Simple, clean and elegant solution.

IMPORTANT EDIT (17.09.2016):

In a GameDev.net thread (http://www.gamedev.net/topic/682185-conversion-between-stdstring-and-stdwstring/) it was pointed out to me that the code above will only work for ISO-8859-1 encoding and that I'm actually not converting chars and wide chars but simply cast them. This is fine for the just-mentioned encoding but might cause troubles in others.

Anyway, my problem has been solved with this simple solution so if you don't have funky chars in your paths and just want to pass wide chars which you know are ANSI (and stored in chars) then casting chars to wide chars will do the trick.

środa, 23 września 2015

Deferred Lighting on Android

This post is a bit not on time. I made a demo of deferred lighting running on Android (tested on Tegra 2, Tegra 3 and Adreno 220 devices) back three years ago when it could still be considered realtively fresh. It's probably not today but nevertheless I wanted to share it with you.

The demo is here; the package also contains README.txt file with steps needed to run it. Basically you will need Eclipse, Android NDK and possibly other stuff. To get everything up and running when developing Android apps I use this https://developer.nvidia.com/AndroidWorks.

Now a few words on how my implementation works.

The demo uses OpenGL ES 2.0. This API allows to barely render into one render target at a time and as far as I know does not provide floating point formats. How do you generate the g-buffer with that limitation? Well, you could go with a few passes but yeah, that is insane in terms of performance for Android devices. That's why I settled with one pass and fit everything into a single RGBA8 buffer. All I need for basic deferred lighting is position and normals. Instead of postition we store (linear) depth using 16 bits of the buffer (position is reconstructed later in the lighting pass). In case of normals I store $x$ and $y$ components in view-space and reconstruct $z$ in the lighting pass, hence only two (8-bit) channels suffice. All fits into a 32-bit buffer. Please examine shaders code to see how it all works.

The demo can use two common optimizations for the lights in the lighting pass. One is using scissor test and the other using stencil buffer. Both can be turned on/off in code using macros USE_SCISSOR and USE_STENCIL.

The demo renders 18 small-to-medium lights on Tegra 3 with 21 FPS which I consider not bad. Here goes a picture:

sobota, 19 września 2015

As Simple As Possible Explanation of Depth of Field in Ray Tracer

Recently I was working on a ray tracer and came to implement depth of field. I looked at a few "tutorials" on how to do it and must admit I had some difficulties understanding them. They often introduce the concept of a lens and try to explain how actual physics work. Sort of. Well, I didn't like these explanations so I decided to share my own.

Have a look at the picture:

It shows a side view of the view frustum. There is the eye along with the "plane" (let's call it eye plane) the eye is on, the near plane at near plane distance (you could also call it film plane) and the focal plane at focal plane distance.

The point of depth of field is to make the image appear blurry everywhere where the generated rays don't intersect scene at focal plane distance. To achieve this, instead of ray tracing only one regular pinhole ray (the black ray in the picture), we generate a bunch of rays (let's call them depth of field rays) that originate at random places on the eye plane (one of these rays is the red ray in the picture). So, we know where the new ray originates. Now what we need to know is its direction. Since our image cannot be blurry at the focal plane, all of the depth of field rays must meet at the same point on the focal plane. This intersection point is simply calculated by finding the intersection point of the regular pinhole ray (the black ray in the picture) with the focal plane. So now, for each depth of field ray you have its origin (random point on the eye plane; the bigger the variance of those rays the bigger the DOF effect) and point it passes through (on the focal plane) hence the ray's direction is easy to be calculated. Once you've traced all DOF rays and got their radiances you just average them and get a nice depth of field effect.

That was it!

wtorek, 19 maja 2015

Spherical Harmonic Lighting - Analytical Lights

Recently I decided to finally get the hang of spherical harmonics (in the context  of computer graphics, of course). I started with the standard reading on the subject, which is Robin Green's paper. It's nice but I found an even better reading in A Gentle Introduction to PRT (and this paper as a bit of basis for this note). Both papers nicely explain how to integrate numerically the lighting environment and project it into SH. What they don't talk about is how to project analytical, directional lights into SH basis. One article I found on this subject is chapter 2.15 in ShaderX3 but (to me) it's not thorough enough, or at least it doesn't discuss some details I was interested in. A lot of light was shed on me here and here.

SH Formulas

Here come the first three bands of SH functions (both in spherical and cartesian coords, with Condon-Shortley phase). I've seen them in various places but often with typos. The formulas here are taken from chapter 3.2 from GPU Pro 2.
Y_0^0 &=& \frac{1}{2} \sqrt{\frac{1}{\pi}} &=& \frac{1}{2} \sqrt{\frac{1}{\pi}} \cr
Y_1^{-1} &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} \sin(\theta) \sin(\phi) &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} y \cr
Y_1^0 &=& \frac{1}{2} \sqrt{\frac{3}{\pi}} \cos(\theta) &=& \frac{1}{2} \sqrt{\frac{3}{\pi}} z \cr
Y_1^1 &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} \sin(\theta) \cos(\phi) &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} x \cr
Y_2^{-2} &=& \frac{1}{2} \sqrt{\frac{15}{\pi}} \sin^2(\theta) \sin(2\phi) &=& \frac{1}{2} \sqrt{\frac{15}{\pi}} xy \cr
Y_2^{-1} &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} \sin(\theta) \cos(\theta) \sin(\phi) &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} yz \cr
Y_2^0 &=& \frac{1}{4} \sqrt{\frac{5}{\pi}} (3 \cos^2(\theta) - 1) &=& \frac{1}{4} \sqrt{\frac{5}{\pi}} (3z^2 - 1) \cr
Y_2^1 &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} \sin(\theta) \cos(\theta) \cos(\phi) &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} zx \cr
Y_2^2 &=& \frac{1}{4} \sqrt{\frac{15}{\pi}} \sin^2(\theta) \cos(2\phi) &=& \frac{1}{4} \sqrt{\frac{15}{\pi}} (x^2 - y^2) \cr
We'll work only with the first three bands as this is enough for accurately representing directional diffuse (Lambertian BRDF) lighting.

Projecting Light and Transfer Functions into SH

Remember we're dealing with directional lights acting on Lambertian surfaces. The light function $L$ is simply constant (light) color ($lc$), projected into SH in the direction of the light ($ld$). Thus, $L$ projected into SH gives $L_l^m$:
L_0^0 &=& \frac{1}{2} \sqrt{\frac{1}{\pi}} lc \cr
L_1^{-1} &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} ld_y lc \cr
L_1^0 &=& \frac{1}{2} \sqrt{\frac{3}{\pi}} ld_z lc \cr
L_1^1 &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} ld_x lc \cr
L_2^{-2} &=& \frac{1}{2} \sqrt{\frac{15}{\pi}} ld_x ld_y lc \cr
L_2^{-1} &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} ld_y ld_z lc \cr
L_2^0 &=& \frac{1}{4} \sqrt{\frac{5}{\pi}} (3ld_z^2 - 1) lc \cr
L_2^1 &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} ld_z ld_x lc \cr
L_2^2 &=& \frac{1}{4} \sqrt{\frac{15}{\pi}} (ld_x^2 - ld_y^2) lc \cr
This is just light. We also need to project into SH the transfer function $A$, which in case of our Lambertian BRDF function is simply $\cos(\theta)$. The formula for this function projected into SH is derived here; have a look at equations (19) (this equation has two typos which are fixed here, thanks to KriS) and (26) in particular. Formulas:
l = 1: & A_l &=& \frac{\sqrt{\pi}}{3} N_l \cr
l > 1, \mbox{odd}: & A_l &=& 0 \cr
l, \mbox{even}: & A_l &=& 2 \pi \sqrt{\frac{2l+1}{4 \pi}} \frac{(-1)^{l/2-1}}{(l+2)(l-1)} \frac{l!}{2^l ((n/2)!)^2} N_l \cr
\mbox{where} & N_l &=& \sqrt{\frac{4 \pi}{2l + 1}} \cr
Note that $A$ varies per SH band index. $A$ values for the first three SH bands are:
A_0 &=& \pi \cr
A_1 &=& \pi \frac{2}{3} \cr
A_2 &=& \frac{\pi}{4} \cr
The final formula for each SH coefficient of light and transfer functions, projected into SH and convolved is:
E_l^m = L_l^m A_l

Final Vertex Color

To finally calculate the lighting for a vertex, given its normal $n$, we need to project the vertex's color ($vc$) in the direction of the normal into SH. This is pretty much the same what we did for the light function:
V_0^0 &=& \frac{1}{2} \sqrt{\frac{1}{\pi}} vc \cr
V_1^{-1} &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} n_y vc \cr
V_1^0 &=& \frac{1}{2} \sqrt{\frac{3}{\pi}} n_z vc \cr
V_1^1 &=& \frac{-1}{2} \sqrt{\frac{3}{\pi}} n_x vc \cr
V_2^{-2} &=& \frac{1}{2} \sqrt{\frac{15}{\pi}} n_x n_y vc \cr
V_2^{-1} &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} n_y n_z vc \cr
V_2^0 &=& \frac{1}{4} \sqrt{\frac{5}{\pi}} (3n_z^2 - 1) vc \cr
V_2^1 &=& \frac{-1}{2} \sqrt{\frac{15}{\pi}} n_z n_x vc \cr
V_2^2 &=& \frac{1}{4} \sqrt{\frac{15}{\pi}} (n_x^2 - n_y^2) vc \cr
Convolved with $E_l^m$ it gives the final vertex color $C_l^m$ for each band:
C_l^m &=& V_l^m E_l^m
To retrieve the final vertex color $C$ all we need is to sum all SH components:
C = \sum\limits_{l=0}^2 \sum\limits_{m=-l}^l C_l^m

środa, 1 kwietnia 2015

Data-oriented Design and Fourier Transform

For some time now I've been thinking about this whole data-oriented design/programming idea that, among others, BitSquid guys have been preaching. To put it simple, the idea is to have data and functions that operate on this data declared separately. So it's closer to functional programming than to standard object-oriented programming. On the other end we have the just-mentioned object-oriented design. In this case we have some sort of an object that stores its state (data) and has methods (functions) that operate on that data. This is a simplification but should give you the general idea of the concept.

I'm not going to go into details of how to code in data-oriented manner. I'm a very adept of this technique. What I want is to share with you some of my old code (2010-2011 or something) and its refactored counter-part (2015); refactored with data-oriented design in mind. How I see it.

The code consists mostly of a rather simple implementation of math and image library. There is also a sample that performs Discrete Fourier Transform using the image library.

The math library code is in math/include/blossom_math/ and math/src/.
The general image code is in common/include/blossom_common/image.hpp and common/src/image.cpp.
The sample's code is in samples/fourier/src/main.cpp. Also, the Fourier functions can be found here.

The math code is in include/math/ and src/math/.
The image code is in include/image/ and src/image/. Also, the Fourier functions are here.
The sample's code is in samples/fourier/src/main.cpp.

Both archives come with binaries (for Windows). In samples/fourier/data/ there is run.bat file that will run the sample and perform Fourier transform on selected image using specified filter.

Feel free to express what you think about my data-oriented understanding :).