Monday, January 29, 2018

Flush Less Often

Here's a riddle:
Q: What do a two-year-old and OpenGL have in common? 
A: You never know when either of them is going to flush.*
In the case of a two-year-old, you can wait a few years and he'll find different ways to not listen to you; unfortunately, in the case of OpenGL, this is a problem of API design, so we have to use a fairly big hammer to fix it.

What IS the Flushing Problem?

Modern GPUs work by writing a command buffer (a list of binary encoded drawing instructions) to memory (using the CPU) and then "sending" that buffer to the GPU for execution, either by changing ownership of the shared memory or by a DMA copy.

Until that buffer of commands goes to the GPU, from the GPU's perspective, you haven't actually asked it to do anything - your command buffer is just sitting there, collecting dust, while the GPU is idle.

In modern APIs like Vulkan, Metal, and DX12, the command buffer is an object you build, and then you explicitly send it to the GPU with an API call.

With OpenGL, the command buffer is implicit - you never see it, it just gets generated as you make API calls. The command buffer is sent to the GPU ("flushed") under a few circumstances:
  1. If you ask GL to do so via glFlush.
  2. If you make a call that does an automatic flush (glFinish, a buffer swap, glClientWaitSync with GL_SYNC_FLUSH_COMMANDS_BIT).
  3. If the command buffer fills up due to you doing a lot of stuff.
This last case is the problematic one because it's completely unpredictable.
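
To make the three flush triggers concrete, here is a toy model of an implicit command buffer. It is only a sketch: ToyCommandBuffer, its capacity, and the "submitted" list are invented for illustration and do not correspond to any real driver's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of an implicit command buffer. The capacity and the
// "submitted" list are stand-ins for illustration, not real driver state.
class ToyCommandBuffer {
public:
    explicit ToyCommandBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Models any API call that encodes a command (e.g. a draw call).
    void record(const std::string& cmd) {
        pending_.push_back(cmd);
        if (pending_.size() == capacity_) {
            // Case 3: the buffer filled up, so it goes out right now,
            // whether or not the caller was ready for that.
            flush();
        }
    }

    // Models cases 1 and 2: an explicit or automatic flush.
    void flush() {
        if (pending_.empty()) return;
        submitted_.insert(submitted_.end(), pending_.begin(), pending_.end());
        pending_.clear();
        ++flush_count_;
    }

    std::size_t flush_count() const { return flush_count_; }
    std::size_t pending_count() const { return pending_.size(); }
    std::size_t submitted_count() const { return submitted_.size(); }

private:
    std::size_t capacity_;
    std::size_t flush_count_ = 0;
    std::vector<std::string> pending_;    // commands the "GPU" has not seen
    std::vector<std::string> submitted_;  // commands handed to the "GPU"
};
```

With a capacity of 3, the third record() call triggers a flush the caller never asked for. A real driver's capacity is not known to the application, which is what makes this case unpredictable.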

Why Do We Care?

Back in the day, we didn't care - you'd write commands and buffers would go out when they were full (ensuring a "goodly amount of work" gets sent to the GPU) and the last command buffer was sent when you swapped your back buffer.

But with modern OpenGL, calling the API is only a small fraction of the work we do; most of the work of drawing involves filling buffers with numbers. This is where your meshes and hopefully constant state are all coming from.

The flushing problem comes to haunt us when we want to draw a large number of small drawing batches. It's easy to end up with code like this:

// write some data to memory
// draw (glDrawElements)
// write some data to memory
// draw (glDrawElements)

Expanding this out, the code actually looks more like:

// map a buffer
// write to the buffer
// flush and unmap the buffer
// draw (glDrawElements)
// map a buffer
// write to the buffer
// flush and unmap the buffer
// draw (glDrawElements)

The problem is: even with glMapBufferRange and "unsynchronized" buffers, you still have to issue some kind of flush to your data before each drawing call.

The reason this is necessary is: glDrawElements might cause your command buffer to be sent to the GPU at any time! Therefore you have to have your data buffer completely flushed and ready to go after every drawing call.

How Do We Fix It?

You basically have two choices to make code like the above fast:

  1. If you are on a modern GL, use persistent coherent buffers. They don't need to be flushed - you can write data, call draw, and if the GL happens to send the command buffer down, your data is already visible. This is a great solution for UBOs on Windows.
  2. If you can't get persistent coherent buffers, defer all of your actual state and draw calls until every buffer has been built.

This second technique is a double-edged sword.

  • Win: it works everywhere, even on the oldest OpenGL.
  • Win: as long as you're accumulating your state change, you can optimize out stupid stuff - handy when client code tends to produce crap OpenGL call-streams.
  • Lose: it does require you to marshal the entire API, so it's only good for code that sits on a fairly narrow foot-print.
For X-Plane, we actually intentionally choose not to use UBOs when persistent-coherent buffers are not also available. It turns out the cost of flushing per draw call is really bad, and our fallback path (loose uniforms) is actually surprisingly fast, because the driver guys have tuned the bejeezus out of that code path.
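
The second option (defer all state and draw calls until every buffer has been built) can be sketched without any GL at all. This is a minimal sketch under assumptions: DeferredBatcher, DeferredDraw, and the upload/draw callbacks are hypothetical names, and the callbacks stand in for the real "map, write, flush, unmap" step and the real draw call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One recorded draw: a range inside the shared staging data.
struct DeferredDraw {
    std::size_t first;
    std::size_t count;
};

// Hypothetical helper: collects data and draws on the CPU, then issues
// one upload followed by all draws.
class DeferredBatcher {
public:
    // Append one batch's data to CPU-side staging and remember the draw.
    // Nothing is sent to the API yet.
    void add_batch(const std::vector<float>& data) {
        assert(!data.empty());
        draws_.push_back({staging_.size(), data.size()});
        staging_.insert(staging_.end(), data.begin(), data.end());
    }

    // Upload everything once, then replay the recorded draws.
    // upload(staging) stands in for map/write/flush/unmap;
    // draw(d) stands in for the real draw call.
    template <class UploadFn, class DrawFn>
    void submit(UploadFn upload, DrawFn draw) {
        if (draws_.empty()) return;
        upload(staging_);
        for (const DeferredDraw& d : draws_) draw(d);
        staging_.clear();
        draws_.clear();
    }

private:
    std::vector<float> staging_;
    std::vector<DeferredDraw> draws_;
};
```

Because the buffer is completely written before the first draw is issued, it no longer matters when the command buffer goes to the GPU. The cost is the one noted above: every call the client makes has to go through a recorder like this.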

* My two-year-old has figured out how to flush the toilet and thinks it's fascinating. What he hasn't figured out how to do is listen^H^H^H^H^H^Hwait until I'm done peeing. (And yes, non-parents, of course peeing is a group activity. Duh.)  The monologue went something like:

"Okay Ezra, wait until Daddy's done. No, not yet. It's too soon. Don't flush. Ezra?!  Sigh.  Wait, this is exactly like @#$@#$ glDrawElements!"

Saturday, January 13, 2018

Fixing Camera Shake on Single Precision GPUs

I've tried to write this post twice now, and I keep getting bogged down in the background information. In X-Plane 11.10 we fixed our long-standing problem of camera shake, caused by 32-bit floating point transforms in a very large world.

I did a literature search on this a few months ago and didn't find anything that met our requirements, namely:

  • Support GPUs without 64-bit floating point (e.g. mobile GPUs).
  • Keep our large (100 km x 100 km) mesh chunks.

I didn't find anything that met both of those requirements (the 32-bit-friendly solutions I found required major changes to how the engine deals with mesh chunks), so I want to write up what we did.

Background: Why We Jitter

X-Plane's world is large - scenery tiles are about 100 km x 100 km, so you can be up to 50 km from the origin before we "scroll" (i.e. change the relationship between the Earth and the primary rendering coordinate system so the user's aircraft is closer to the origin). At these distances, we have about 1 cm of precision in our 32-bit coordinates, so any time we are close enough to the ground that 1 cm is larger than 1 pixel, meshes will "jump" by more than 1 pixel during camera movement due to rounding in the floating point transform stack.

It's not hard to have 1 cm be larger than 1 pixel. If you are looking at the ground on a 1080p monitor (1920 pixels wide), you might have 1920 pixels covering 2 meters, for about 1 mm per pixel. The ground is going to jitter like hell.

Engines that don't have huge offsets don't have these problems - if we were within 1 km of the origin, we'd have almost 100x more precision and the jitter might not be noticeable. Engines can solve this by having small worlds, or by scrolling the origin a lot more often.
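
The precision figures above are easy to check. The helper below (float_spacing_at is a name made up for this sketch) measures the gap between a 32-bit float and the next representable value above it.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
// If x is in meters, this is the finest step a coordinate near x can take.
float float_spacing_at(float x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

For IEEE 754 single precision, float_spacing_at(50000.0f) is 2^-8 (about 3.9 mm) and float_spacing_at(1000.0f) is 2^-14 (about 0.06 mm), a factor of 64. That is the same order of magnitude as the rough figures quoted above, and the steps at 50 km are already several pixels wide at 1 mm per pixel.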

Note that it's not good enough to just keep the main OpenGL origin near the user. If we have a large mesh (e.g. a mesh whose vertices get up into the 50 km magnitude) we're going to jitter, because at the time that we draw them our effective transform matrix is going to need an offset to bring the 50 km offset back to the camera.  (In other words, even if our main transform matrix doesn't have huge offsets that cause us to lose precision, we'll have to do a big translation to draw our big object.)

Fixing Instances With Large Offsets

The first thing we do is make our transform stack double precision on the CPU (but not the GPU). To be clear, we need double precision:
  • In the internal transform matrices we keep on the CPU as we "accumulate" rotates, translates, etc.
  • In the calculations where we modify this matrix (e.g. if we are going to transform, we have to up-res the incoming matrix, do the calculation in double, and save the results in double).
  • We do not have to send the final transforms to the GPU in double - we can truncate the final model-view, etc.
  • We can accept input transforms from client code in single or double precision.
This will fix all jitter caused by objects with small offset meshes that are positioned far from the origin. E.g. if our code goes: push, translate (large offset), rotate (pose), draw, pop, then this fix alone gets rid of jitter on that model, and it doesn't require any changes to the engine or shader.
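
Here is a one-dimensional sketch of why accumulating in double and truncating only the final result helps. It is not the real transform stack: composed_offset is a hypothetical function that combines a camera translation and an object translation in the chosen precision, then converts to float as if handing the result to the GPU.

```cpp
#include <cassert>
#include <cmath>

// Combine the view translation (-camera_x) with a model translation
// (object_x) using Scalar for the intermediate math, then truncate the
// final value to float, as we would when uploading a matrix.
template <typename Scalar>
float composed_offset(double camera_x, double object_x) {
    assert(std::isfinite(camera_x) && std::isfinite(object_x));
    const Scalar cam = static_cast<Scalar>(-camera_x);  // rounds here if Scalar is float
    const Scalar obj = static_cast<Scalar>(object_x);   // rounds here if Scalar is float
    return static_cast<float>(cam + obj);               // only the small result is truncated
}
```

With the camera at 50000.123 m and the object at 50000.456 m, the true offset is 0.333 m. composed_offset<float> is off by a few millimeters because both inputs were rounded while they were still huge; composed_offset<double> is accurate to well under a micrometer because the rounding to float happens after the big numbers have canceled.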

We do eat the cost of double precision in our CPU-side transforms - I don't have numbers yet for how much of a penalty on old mobile phones this is, but on desktop this is not a problem. (If you are beating the transform stack so badly that this matters, it's time to use hardware instancing.)

This hasn't fixed most of our jitter - large meshes and hardware instances are still jittering like crazy, but this is a necessary pre-requisite.

Fixing Large Meshes

The trick to fixing jitter on meshes with large vertex coordinates is understanding why we have precision problems.  The fundamental problem is this: transform matrices apply rotations first and translations second. Therefore in any model-view matrix that positions the world, the translations in the matrix have been mutated by the rotation basis vectors. (That's why your camera location is not just items 12, 13, and 14 of your MV matrix.)

If the camera's location in the world is a very big number (necessary to get you "near" those huge-coordinate vertices so you can see them) then the precision at which they are transformed by the basis vectors is...not very good.

That's not actually the total problem. (If it was, preparing the camera transform matrix in double on the CPU would have gotten us out of jail.)

The problem is that we are counting on these calculations to cancel each other out:

vertex location * camera rotation + (camera rotation * camera location) = eye space vertex

The camera rotated location was calculated on the CPU ahead of time and baked into the translation component of your MV matrix, but the vertex location is huge and is rotated by the camera rotation on the GPU in 32-bits.  So we have two huge offsets multiplied by very finicky rotations - we add them together and we are hoping that the result is pixel accurate, so that tiny movements of the camera are smooth.

They are not - it's the rounding error of the cancelation of these calculations that is our jitter.

The solution is to change the order of operations of our transform stack. We need to introduce a second translation step that (unlike a normal 4x4 matrix operation) happens before rotation, in world coordinates and not camera coordinates.  In other words, we want to do this:

(vertex location - offset) * camera rotation + (camera rotation * (camera location - offset)) = ...

Here's why this can actually work: "offset" is going to be a number that brings our mesh roughly near the camera. Since it doesn't have to bring us to the camera, it can change infrequently and have very few low-place bits to get lost by rounding.  Since our vertex location and offset are not changing, this number is going to be stable across frames.

Our camera location minus this offset can be done on the CPU side in double precision, so the results of this will be both small (in magnitude) and high precision.

So now we have two small locations multiplied by the camera rotation that have to cancel out - this is what we would have had if our engine used only small meshes.

In other words, by applying a rounded, infrequently changing static offset first, we can reduce the problem to what we would have had in a small-world engine, "just in time".

You might wonder what happens if the mesh vertex is nowhere near our offset - in that case my claim that the result will be really small is wrong. But that's okay - since the offset is near the camera, mesh vertices that don't cancel well are far from the camera and too small/far away to jitter. Jitter is a problem for close stuff.

The CPU-side math goes like this: given an affine model-view matrix in the form of R, T (where R is the 3x3 rotation and T is the translation vector), we do this:

// Calculate C, the camera's position, by reverse-
// rotating the translation
C = transpose(R) * T
// Grid-snap the camera position in world coordinates - I used 
// a 4 km grid. smaller grids mean more frequent jumps but 
// better precision.
C_snap = grid_round(C)
// Offset the matrix's translation by this snap (moved back 
// to post-rotation coordinates), to compensate for the pre-offset.
T -= R * C_snap
// Pre-offset is the opposite of the snap.
O = -C_snap

In our shader code, we transform like this:

v_eye = (v_world - O) * modelview_matrix

There's no reason why the sign has to be this way - O could have been C_snap and we could have added in the shader; I found it was easier to debug having the offset be actual locations in the world.
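
The CPU-side steps and the shader transform above can be put together in a small sketch. Everything here is an assumption made for illustration: the type names, prepare(), to_eye(), and the column-vector convention v_eye = R * v + T. Under that convention transpose(R) * T is the negated camera position, which is why O = -C_snap comes out as an actual location in the world.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec3d = std::array<double, 3>;
using Vec3f = std::array<float, 3>;
using Mat3d = std::array<std::array<double, 3>, 3>;  // row-major 3x3
using Mat3f = std::array<std::array<float, 3>, 3>;

// m * v, in double precision.
Vec3d mul(const Mat3d& m, const Vec3d& v) {
    Vec3d out{};
    for (std::size_t r = 0; r < 3; ++r)
        out[r] = m[r][0] * v[0] + m[r][1] * v[1] + m[r][2] * v[2];
    return out;
}

// transpose(m) * v, in double precision.
Vec3d mul_transposed(const Mat3d& m, const Vec3d& v) {
    Vec3d out{};
    for (std::size_t c = 0; c < 3; ++c)
        out[c] = m[0][c] * v[0] + m[1][c] * v[1] + m[2][c] * v[2];
    return out;
}

// What we would upload: rotation, adjusted translation, pre-offset.
struct GpuTransform {
    Mat3f R;
    Vec3f T;
    Vec3f O;
};

// CPU side, all in double; only the final values are truncated to float.
GpuTransform prepare(const Mat3d& R, const Vec3d& T, double grid) {
    assert(grid > 0.0);
    const Vec3d C = mul_transposed(R, T);            // C = transpose(R) * T
    Vec3d C_snap{};
    for (std::size_t i = 0; i < 3; ++i)
        C_snap[i] = std::round(C[i] / grid) * grid;  // grid-snap
    const Vec3d RC = mul(R, C_snap);
    GpuTransform out{};
    for (std::size_t i = 0; i < 3; ++i) {
        out.T[i] = static_cast<float>(T[i] - RC[i]);  // T -= R * C_snap
        out.O[i] = static_cast<float>(-C_snap[i]);    // O = -C_snap
        for (std::size_t j = 0; j < 3; ++j)
            out.R[i][j] = static_cast<float>(R[i][j]);
    }
    return out;
}

// "Shader" side, all in float: subtract the pre-offset, then transform.
Vec3f to_eye(const GpuTransform& g, const Vec3f& v_world) {
    const Vec3f p = {v_world[0] - g.O[0], v_world[1] - g.O[1], v_world[2] - g.O[2]};
    Vec3f eye{};
    for (std::size_t r = 0; r < 3; ++r)
        eye[r] = g.R[r][0] * p[0] + g.R[r][1] * p[1] + g.R[r][2] * p[2] + g.T[r];
    return eye;
}
```

Algebraically, to_eye() returns R * v + T, the same result as the plain model-view transform. The difference is that every float operation now works on values within a few kilometers of zero instead of tens of kilometers.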

Fixing Hardware Instancing

There's one more case to fix. If your engine has hardware instancing, you may have code that takes the (small) model mesh vertices and applies an instancing transform first, then the main transform. In this case, the large vertex is the result of the instancing matrix, not the mesh itself.

This case is easily solved - we simply subtract our camera offset from the translation part of the hardware instance. This ensures that the instance, when transformed into our world, will be near the camera - no jitter.
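
A minimal sketch of that subtraction, assuming it is done on the CPU in double before the instance data is written out as floats (rebase_instance_translation and the type names are hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec3d = std::array<double, 3>;
using Vec3f = std::array<float, 3>;

// Subtract the snapped camera offset from an instance's world position
// in double, and only then truncate to float for the instance buffer.
Vec3f rebase_instance_translation(const Vec3d& instance_pos, const Vec3d& offset) {
    Vec3f out{};
    for (std::size_t i = 0; i < 3; ++i) {
        assert(std::isfinite(instance_pos[i]) && std::isfinite(offset[i]));
        out[i] = static_cast<float>(instance_pos[i] - offset[i]);
    }
    return out;
}
```

An instance at (40961.25, 28670.5, 12.0) with an offset of (40960, 28672, 0) is stored as (1.25, -1.5, 12.0), so the value that reaches the GPU is small and keeps its low-order bits.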

One last note: on some drivers I found the driver was very finicky about order of operations - if the calculation is not done by applying the offset before the transform, the de-jitter totally fails. The precise and invariant qualifiers didn't seem to help, only getting the code "just right" did.

Wednesday, June 07, 2017

How to Reset Steam VR When It Can't Talk to the Rift

Periodically in the course of writing an OpenVR app, I find that SteamVR can't talk to my HMD. One of the 500 processes that collaborate to make VR work has kicked the bucket. Here's the formula to fix it.

First, kill the process tree based on OVRServer_x64.  All the Oculus stuff should die and then immediately respawn. Minimize their portal thingie.

Kill every vrXXX process (vrserver, vrmonitor, vrcompositor, vrdashboard).  SteamVR should not look like it's running and will not auto-relaunch.

Now you're good - relaunch your game and SteamVR should restart and be able to communicate with the headset.

Friday, June 02, 2017

"Quick Recalibrate" Fixes the SteamVR Floor for the Oculus Rift

I have the same problem that a lot of users report on the web: while my Oculus Rift knows where my floor is when I am in the Oculus home room, SteamVR gets confused and moves the floor up a foot or two, which makes me 3 feet tall when using a "standing" game. You can tell this problem is caused by SteamVR because even the SteamVR loading screens and settings are wrong, not just games.

Turns out there is a quick fix! Go to SteamVR settings, and in the developer section, you can use "Quick Calibrate."  Put your HMD on the floor in the middle of your play area, click once, and everything is fixed.

The only down-side is: you need a play area where your sensors can see the floor.  For me, this means putting the center of the play area unnecessarily far back, since the Rift sensors are on my desk.

Now I just need to find a way to fix the condensation inside the Rift.

Saturday, March 18, 2017

Why Your C++ Should Be Simple

I have a lot of strong (meaning "perhaps stupid") opinions about C++ and how to use it (or not use it), and I've been meaning to write posts on the thinking behind these stupid^H^H^H^H^H^Hstrong opinions, particularly where they go against the "conventional wisdom" of the modern C++ community.

This one is sort of an over-arching goal:
Write C++ that is less sophisticated than you are capable of writing.
This idea is a rip off^H^H^H^H^H^H^Hriff on a Brian Kernighan quote that I think is perhaps the best advice about programming you'll ever hear:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it? 
-- Brian Kernighan, The Elements of Programming Style
First, that's not wrong, that's dead on. So for debugging alone, don't write the most complicated C++ you can. C++ is an extremely complex language in its spec and if you use it to its full capability, you can write something that is completely un-debuggable.

Limited Brain Bandwidth

Here's my take on the idea. You're welcome to dismiss this as the mad ramblings of a father who is over 40 and literally cannot remember to finish steeping my tea because it's been an entire four minutes since I left it in the kitchen.

There's a limit to your mental bandwidth. That mental concentration, focus, short term memory, and whatever else the squishy gray GPU between our ears does can be spent on a few different things:

  • Figuring out how to write code.
  • Figuring out how to make code faster.
  • Figuring out how to solve non-trivial algorithm problems.
  • Figuring out how to solve business problems and design the architecture of a large production system.
  • Finding and fixing bugs.
  • Multi-tasking.
  • Looking at cats on the internet.
  • Swearing at your customers.
  • Trying to maintain focus while your 2 year old repeatedly smashes your keyboard with your mouse.
  • Figuring out the fastest path through a series of coding tasks.
I'm sure there's more, but you see where I'm going with this. You have a lot of things you can use your brain for, and some are more useful for getting your job done than others. You have to decide where to spend your brain budget, and you can't spend it on everything. (If you feel like you're not tight on brain budget, it means you're not working on a problem that's hard enough yet! You can always make the problem harder by shortening the time frame. Shipping things sooner is pretty much always a win.)

My view is that "being clever with C++" isn't a fantastic investment. Using the most bleeding edge C++ doesn't have enough benefit in other areas (code is faster, code is more maintainable, code is easier to debug, code is easier to understand) to justify burning brain power on it. In some cases, it actually moves you in the wrong direction.

Just because C++ lets you do it doesn't mean you have to or that it's even a good idea. Here are some examples where I consider a C++ feature to add complexity and not provide returns:

  • Clever operator overloading. Your coworkers will thank you if you give your functions good names and don't make operator+ turn on the coffee maker. You could skip operator overloading entirely and your code will be fine. (I'm okay with a small amount of overloading for a few obvious cases: dereference on smart pointers and math on math types).
  • Template meta-programming. I make myself watch talks on this from cppcon and what I see is the smartest C++ programmers in the world spending huge amounts of brain power to accomplish really trivial tasks (writing a program to make a program) in a way that is almost completely impossible to understand. You don't have to use the compiler to do your meta-program generation.
  • Heavy Duty Templating. This is a matter of degree - simple use of templates is fantastic and I use them in my code all the time. But at some point, as you add layers, you go past an inflection point and the cure starts to hurt more than the disease. This one is a little bit like smut: I don't have a great definition for when you've jumped the shark with your templates, but I know it when I see it. A good rule of thumb: if templates are making it harder to debug, don't add more. (Debuggers are easily good enough to debug the STL and generic algorithms, so you don't have to drop that stuff. It's not 1998 anymore!)
  • Overly Complicated Class Hierarchies. This is also a matter of degree, but at some point, it's okay for your class hierarchy to be less pure, clean and perfect if it's simpler and creates less chaos in the rest of the program. For example, our UI framework has one view type - parents, children, leaf nodes, roots, none of these are specialized by type. I've used a lot of other frameworks, and my finding is that clever "factoring out" of aspects of a view hierarchy does nothing to make the code better, but it creates issues you have to work around. Just keep it simple.
I could go on, but you get the idea...I'm one of those old bastards who thinks it's okay to just use C++ as "a nice version of C". Because it is! In my time here in the matrix, I have found that in the past I have written code that was too complex, language-wise, and code that was too simple. Here's the difference:

  • For the code that was too simple (I wrote a C struct when I needed a class), the upgrade is simple, easy to write, easy to understand, doesn't produce a lot of bugs, and the compiler helps me.
  • For code that was too complex (I wrote something worthy of Boost when I should have made POD), ripping out that complexity is time consuming and goes slowly.
So why not err on the side of simplicity? Here are some other things you could do with your brain instead of writing C++ for the C++ elite:

  • Watch John Oliver on Youtube.
  • Put debugging facilities into your sub-system for more efficient debugging.
  • Put performance analysis data collection into your sub-system so you can tune it.
  • Document your sub-system so your coworkers don't poison your coffee.
  • Get the code to high quality faster and go on to the next thing!
If you are a professional software engineer, you are paid to solve people's business problems, and you use code to do that. Every Gang of Four pattern you use that doesn't do that is wasted brain cells.

Friday, February 17, 2017

Vroom, Vroom! I'm Going To Win This Race Condition!

Here's a riddle:
Q: Santa Claus, the Easter Bunny, a Benign Race Condition and Bjarne Stroustrup each stand 100 feet away from a $100 bill. They all run to the bill at the same time. Who gets the bill?

A: Bjarne Stroustrup, because the other three don't exist.
Okay, now that we have established that (1) I am a nerd and (2) race conditions are bad, we can talk about thread sanitizer (tsan), or as I call it, "the tool that reminds you that you're not nearly as clever as you thought you were when you wrote that multi-threaded widget reduplicator."

Address Sanitizer (asan) is fantastic because it catches memory scribbles and other Really Bad Behavior™ and it's fast - fast enough that you can run it in real time on your shipping product.  tsan is fantastic because it finds things that we just didn't have tools to find before.

tsan's design is quite similar to asan: a block of shadow memory records thread access to a real location, and an app's memory operations are in-place modified to work with shadow memory.

Let's See Some Dumb Code

Here is an example of tsan in action:

This is a back-trace of the first race condition I hit when starting X-Plane: our main thread is reading preferences and has just inited the texture resolution setting in memory.

As it turns out, background threads are already loading textures. Doh! I had no idea this race condition existed.

The Yeti of Race Conditions

A benign race condition may not be real, but we can still define what properties it would have if it did exist. It would have to be a case where:

  1. All actual racing reads-writes are relaxed atomics for this particular architecture and data alignment and 
  2. We can live with the resulting operation of the code in every ordering in which the code can actually run and
  3. Clang doesn't reorganize our code into something really dangerous, even though it has permission to do so.
All of these are sort of scary, and I think I just heard Herb Sutter vomit into his shoe, but x86 programmers often get item one for free, and we might be able to reason our way to item two.

Item three is a tall order and getting harder every day. While Clang is not yet Skynet, it is still a lot smarter than I am. Five years ago a C++ programmer could know enough about compilers and code to reason about the behavior of optimized code under undefined conditions and possibly live to tell about it; at this point the possible code changes are just too astonishing. I have begrudgingly admitted that I have to follow all of the rules or Clang and LLVM will hose me in ways that it would take weeks to even understand.

With that in mind, some of the first few race conditions I hit were, perhaps, benign:
  • In the case above, some texture loads have a flag to indicate "always max res, don't look at prefs" - the code was written to set a local variable to the preferences-based res first, then overwrite it; restructuring this code to not touch the preferences data if it was not going to be used silenced the "benign" race condition.
  • We had some unsynchronized stats counters - this is just wrong, and their data was junk, but this was a "don't care" - the stats counters weren't needed and didn't affect program operation. They could have been turned into relaxed atomics, but instead I just removed them.
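
For reference, this is roughly what the "relaxed atomics" alternative for a stats counter looks like. It is a sketch, not the code that was removed: StatsCounter and count_from_threads are made-up names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// A counter that many threads may bump without a data race.
// Relaxed ordering is enough because nothing else is published
// through the counter; we only need each increment to be atomic.
class StatsCounter {
public:
    void add(std::uint64_t n = 1) {
        count_.fetch_add(n, std::memory_order_relaxed);
    }
    std::uint64_t read() const {
        return count_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> count_{0};
};

// Bump one shared counter from several threads and return the total.
std::uint64_t count_from_threads(unsigned thread_count, unsigned per_thread) {
    assert(thread_count > 0);
    StatsCounter counter;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (unsigned i = 0; i < per_thread; ++i) counter.add();
        });
    }
    for (std::thread& w : workers) w.join();
    return counter.read();  // join() orders every increment before this read
}
```

Swapping std::atomic<std::uint64_t> for a plain std::uint64_t turns this into the kind of unsynchronized counter described above, which a build with clang's -fsanitize=thread should report as a data race.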

So That's What That Was

After cleaning up texture load and some stats counters, I finally hit my first real bug, and it was a big one. The background tasks that build airports sometimes have to add shader-layers (groups of meshes drawn in the same part of the rendering pass, sharing a common shader)  to the main scenery data structure; they bounce between worker thread land (where they build meshes slowly from raw data) and the main thread (where they add the shader layers into our global data structures between frames and thus don't crash us).

That "bouncing" idiom is simple, it works great, we use it everywhere. Except... turns out that all of the airport extruding tasks (and there are a bunch of them, 64 per scenery tile divided spatially) share a single vector of layers.

And while a given task is never doing async work at the moment it mutates the layer list on the main thread, another task might be! (The "bounce between threads" idiom only works if the thing bouncing around has exclusive access to its data structures.)

Every now and then, the vector gets resized and the memory recycled fast enough that the stale use of the vector blows up on the worker thread. This was visible in our automatic crash reporting as the airport extruder just randomly exploding on a worker thread for no reason.

Sure enough, tsan caught this, pointing to the vector resize on the main thread and the mutation on the worker thread. Because tsan catches races when they are unsynchronized and not just when that failure to synchronize results in Really Bad Data™, this race condition hit every time I booted the product, rather than every now and then. (While we have crash reports to know this happens in the field, I never saw this race condition on my own machine, ever.)

What Tsan Can Find

While tsan is about 1000x better than what we had before (e.g. staring at your code and going "is this stupid?"), it does have some limitations. Because the problem space is harder than asan, the tracking in tsan isn't perfect; scenarios with more than four threads can lead to evictions of tracking data, history can be purged, etc. I saw a number of issues in my first session with the tool, but having reviewed the docs and presentations again, I think a lot of these things can be worked around.

Because tsan looks for race conditions between synchronization points, it is great for finding race conditions without depending on the actual order of execution. That's huge. But there are still cases where your code's execution order will hide race conditions (and not just by not running the buggy code).

Our weather engine uses a double-buffering scheme to mutate the weather while background threads are producing particle systems for the clouds. The main thread, per frame, adjusts weather, then the buffer is switched.

As it turns out, this idiom is totally wrong, because we don't guarantee that worker threads finish their work and synchronize with the main thread in the same frame. In other words, the double buffer might swap twice while a worker runs, leaving the worker reading from the mutating copy of the weather.

The problem is: if the weather system is pretty fast (and we want it to be!!) and the particle system tasks finish within a frame, they will then synchronize with the main thread (to "check in" their finished work) before the main thread swaps the thread buffers.  Since everything on a single thread is "happens before" by program order (at least, if they cross sequence points in C++, and we do here), tsan goes "yep - the thread finished, checked in, and we're all good." What it can't realize is that had the timing been different, the synchronization path would have run after the racy code.

The result was that I had to let tsan run for almost 30 minutes before it caught a race condition so totally brain damaged that it should have been immediate.

Even if a clean short run with tsan doesn't guarantee your threading code is perfect, it's still a fantastic tool that moves a series of bugs from "nearly impossible to find" to "possible to find easily a lot of the time."

Monday, November 07, 2016

Terminal Voodoo for Code Signing

If you develop apps for the Mac or iOS, there may come a day when code signing fails mysteriously. When this day happens, mix in a boiling cauldron 2 tbsp Eye-of-Newt*, 1 lb Toe of Frog, a healthy pinch of Wool of Bat, and honestly you can skip the dog's tongue - have you seen the places that thing has licked?

Should this incantation neither fix your code signing problems nor make you King of Scotland, these terminal incantations are a good plan B.

(In all three cases, app_path can be a path to your bundle, not the executable deep inside - this way signing of the entire bundle is verified.)

codesign -vvv app_path

This tells you whether the app is signed, and it iterates sub-components.

spctl -a -t exec -vvv app_path

This tells you whether Gatekeeper is going to run your app, and if not, why not. For example, if you get an error that an app is damaged and should be thrown in the trash, this command tells you something more specific, like that your certificate is revoked.

pkgutil --check-signature app_path

This command outputs the fingerprints of the signatures used in the signing, as well as some thumbs up/down on the certificate chain. This is useful when an app's certificate is bad and you are trying to figure out which certificate the app went out the door with.

* The former Speaker of the House may be unamused by this, but sacrifices must be made!