
Multithreaded Rendering

Josh


After working out a thread manager class that stores a stack of C++ command buffers, I've got a pretty nice proof of concept working. I can call functions in the game thread and the appropriate actions are pushed onto a command buffer that is then passed to the rendering thread when World::Render is called. The rendering thread is where all the graphics code (currently OpenGL) is executed. When you create a context or load a shader, all it does is create the appropriate structure and send a request over to the rendering thread to finish the job.

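The underlying pattern is simple even though the engine's internal classes are more involved. Here is a minimal generic sketch of the idea (not the engine's actual thread manager class): the game thread pushes std::function commands onto a buffer, and the rendering thread swaps the buffer out under a mutex and executes everything in order.

#include <functional>
#include <mutex>
#include <vector>

//Generic sketch of the command buffer handoff described above.
class CommandBuffer
{
	std::mutex mutex;
	std::vector<std::function<void()>> commands;
public:
	//Called from the game thread to queue work for the renderer.
	void Push(std::function<void()> command)
	{
		std::lock_guard<std::mutex> lock(mutex);
		commands.push_back(std::move(command));
	}

	//Called from the rendering thread. Swapping the vector out keeps
	//the lock very short, so the game thread is never stalled.
	void Execute()
	{
		std::vector<std::function<void()>> pending;
		{
			std::lock_guard<std::mutex> lock(mutex);
			pending.swap(commands);
		}
		for (auto& command : pending) command();
	}
};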

Consequently, there is currently no way of detecting if OpenGL initialization fails(!), and in fact the game will still run along just fine without any graphics rendering! We obviously need a mechanism to detect this, but it is interesting that you can now load a map and run your game without ever creating a window or graphics context. The following code is perfectly legitimate in Leadwerks 5:

#include "Leadwerks.h"

using namespace Leadwerks;

int main(int argc, const char *argv[])
{
	auto world = CreateWorld();
	auto map = LoadMap(world,"Maps/start.map");

	while (true)
	{
		world->Update();
	}
	return 0;
}

The rendering thread runs at its own frame rate, independently of the game thread, and I have tested under some pretty extreme circumstances to make sure the threads never lock up. By default, I think the game loop will probably self-limit its speed to a maximum of 30 updates per second, giving you a whopping 33 milliseconds for your game logic. This frequency can be changed to any value, or removed entirely by setting it to zero (not recommended, as this can easily lock up the rendering thread with an infinite command buffer stack!). No matter the game frequency, the rendering thread runs at its own speed, either limited by the window refresh rate or an internal clock, or left free to run as fast as possible for those triple-digit frame rates.
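As a rough sketch of what that self-limiting loop amounts to (placeholder logic, not the actual engine API):

#include <chrono>
#include <thread>

//Sketch of a self-limiting 30 Hz game loop. Sleeping off the leftover
//time keeps the game thread from flooding the renderer with commands.
int main(int argc, const char *argv[])
{
	using clock = std::chrono::steady_clock;
	const std::chrono::milliseconds interval(33); //~30 updates per second

	while (true)
	{
		auto start = clock::now();

		//world->Update(); //game logic goes here

		auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start);
		if (elapsed < interval) std::this_thread::sleep_for(interval - elapsed);
	}
	return 0;
}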

Shaders are now loaded from multiple files instead of being packed into a single .shader file. When you load a shader, the file extension will be stripped off (if it is present) and the engine will look for .vert, .frag, .geom, .eval, and .ctrl files for the different shader stages:

auto shader = LoadShader("Shaders/Model/diffuse");
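The lookup itself presumably amounts to something like the following (a hypothetical helper, not the engine's implementation), probing for each stage file with std::filesystem:

#include <filesystem>
#include <string>
#include <vector>

//Hypothetical sketch of the stage lookup: strip any extension,
//then check for each of the known shader stage files.
std::vector<std::filesystem::path> FindShaderStages(const std::string& path)
{
	std::filesystem::path base(path);
	base.replace_extension(); //strip ".shader" etc. if present

	std::vector<std::filesystem::path> stages;
	for (auto ext : { ".vert", ".frag", ".geom", ".eval", ".ctrl" })
	{
		auto file = base;
		file.replace_extension(ext);
		if (std::filesystem::exists(file)) stages.push_back(file);
	}
	return stages;
}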

The asynchronous shader compiling in the engine could make a shader editor a little more difficult to handle, except that I don't plan on building a shader editor into the new editor at all! Instead I plan to rely on Visual Studio Code as the official IDE, and maybe add a plugin that tests whether shaders compile and link on your current hardware. I found that a pragma statement can be used to indicate include files (not implemented yet) and it won't trigger any errors in the VSCode IntelliSense.

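Since the include mechanism is not implemented yet, the following is only a guess at how such a preprocessor might work: scan each line for the pragma and splice in the referenced file recursively.

#include <fstream>
#include <sstream>
#include <string>

//Speculative sketch of a "#pragma include" preprocessor. The real
//feature does not exist yet, so the details here will surely differ.
std::string PreprocessShader(const std::string& path)
{
	std::ifstream file(path);
	std::stringstream output;
	std::string line;
	const std::string directive = "#pragma include";

	while (std::getline(file, line))
	{
		auto pos = line.find(directive);
		if (pos != std::string::npos)
		{
			//Extract the quoted filename and splice the file in.
			auto first = line.find('"', pos);
			auto last = line.find('"', first + 1);
			if (first != std::string::npos && last != std::string::npos)
			{
				output << PreprocessShader(line.substr(first + 1, last - first - 1)) << "\n";
				continue;
			}
		}
		output << line << "\n";
	}
	return output.str();
}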

Although restructuring the engine to work in this manner is a big task, I am making good progress. Smart pointers make this system really easy to work with. When the owning object in the game thread goes out of scope, its associated rendering object is also collected...unless it is still stored in a command buffer or otherwise in use! The ownership relationships I have worked out behave exactly as intended, and I have not run into any problems deciding what the hierarchy should be. For example, a context holds a shared pointer to the window it belongs to, but the window only holds a weak pointer to the context. If the context handle is lost, the context is deleted, but if the window handle is lost, the context keeps the window from being deleted. The capabilities of modern C++ and modern hardware are making this development process a dream come true.
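In outline, that window/context relationship looks something like this (a simplified sketch, not the engine's actual classes):

#include <memory>

class Context;

class Window
{
public:
	std::weak_ptr<Context> context; //does NOT keep the context alive
};

class Context
{
public:
	std::shared_ptr<Window> window; //DOES keep the window alive
};

int main(int argc, const char *argv[])
{
	auto window = std::make_shared<Window>();
	auto context = std::make_shared<Context>();
	context->window = window;
	window->context = context;

	context = nullptr; //the context is deleted; the window survives
	//Dropping 'window' instead would not delete it, because the
	//context's shared pointer still keeps the window alive.
	return 0;
}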

Of course, with forward rendering I am getting about 2000 FPS with a blank screen on Intel graphics, but the real test will be to see what happens when we start adding lots of lights into the scene. The only reason it might be possible to write a good forward renderer now is that graphics hardware has gotten a lot more flexible. Variable-length for loops and dependent texture reads (using the results of one texture lookup as the coordinates of another) were a big no-no when we first switched to deferred rendering, but it looks like that situation has improved.

The increased restrictions on the renderer and the total separation of internal and user-exposed classes are actually making it a lot easier to write efficient code. Here is my code for the index array buffer object (OpenGL's element array buffer) that lives in the rendering thread:

#include "../../Leadwerks.h"

namespace Leadwerks
{
	OpenGLIndiceArray::OpenGLIndiceArray() :
		buffer(0),
		buffersize(0),
		lockmode(GL_STATIC_DRAW)
	{}

	OpenGLIndiceArray::~OpenGLIndiceArray() {
		if (buffer != 0) {
#ifdef DEBUG
			Assert(glIsBuffer(buffer),"Invalid indice buffer.");
#endif
			glDeleteBuffers(1, &buffer);
			buffer = 0;
		}
	}

	bool OpenGLIndiceArray::Modify(shared_ptr<Bank> data) {

		//Error checks
		if (data == nullptr) return false;
		if (data->GetSize() == 0) return false;

		//Generate buffer
		if (buffer == 0) glGenBuffers(1, &buffer);
		if (buffer == 0) return false; //shouldn't ever happen

		//Bind buffer
		glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buffer);

		//Set data
		if (buffersize == data->GetSize() && lockmode == GL_DYNAMIC_DRAW) {
			glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0, data->GetSize(), data->buf);
		}
		else {
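			//A second upload at the same size suggests streaming, so switch to dynamic usage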
			if (buffersize == data->GetSize()) lockmode = GL_DYNAMIC_DRAW;
			glBufferData(GL_ELEMENT_ARRAY_BUFFER, data->GetSize(), data->buf, lockmode);
		}

		buffersize = data->GetSize();
		return true;
	}
	
	bool OpenGLIndiceArray::Enable() {
		if (buffer == 0) return false;
		if (buffersize == 0) return false;
		glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buffer);
		return true;
	}

	void OpenGLIndiceArray::Disable() {
		glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);
	}
}
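For context, usage from the rendering thread might look roughly like this. Note that the Bank creation call is an assumption on my part; the actual Leadwerks 5 function may differ:

//Hypothetical usage sketch; CreateBank is assumed, not confirmed API.
auto indices = CreateBank(sizeof(uint32_t) * indexcount);
//...fill the bank with index data...

auto buffer = std::make_shared<OpenGLIndiceArray>();
if (buffer->Modify(indices) && buffer->Enable())
{
	glDrawElements(GL_TRIANGLES, indexcount, GL_UNSIGNED_INT, nullptr);
	buffer->Disable();
}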

From everything I have seen, my gut feeling tells me that the new engine is going to be ridiculously fast.

If you would like to be notified when Leadwerks 5 becomes available, be sure to sign up for the mailing list here.



15 Comments

The increased isolation and simplification of the OpenGL code also means it is now much, much easier to write a custom renderer. It would be pretty simple to create an OpenGL 1 or DirectX renderer for the engine...or Vulkan support could be added without much trouble.


Would it be possible to simulate the world physics without OpenGL? That would be nice for multiplayer games that need to run on a server that doesn't have a graphics card.


Very cool, but this is still rendering on a separate thread rather than true multithreaded rendering. No matter how you cut it, in GL the heavy work can't be spread across multiple threads, so your GPU is always bored waiting for the under-used CPU to send it work. Although in GL this is as good as it's going to get, which is good enough. Still like your MoltenVK idea the best.

Either way, it is neat to be able to control the frame rate of physics and game logic separately from rendering.

3 hours ago, Crazycarpet said:

Very cool, but this is still rendering on a separate thread rather than true multithreaded rendering. No matter how you cut it, in GL the heavy work can't be spread across multiple threads, so your GPU is always bored waiting for the under-used CPU to send it work. Although in GL this is as good as it's going to get, which is good enough. Still like your MoltenVK idea the best.

Either way, it is neat to be able to control the frame rate of physics and game logic separately from rendering.

All the multithreaded graphics APIs actually just accumulate commands in a buffer and then execute them in a single thread. I plan to separate the culling and rendering threads so that there is absolutely no overhead in the rendering thread. The rendering thread may draw the same list of visible objects several times before the culling thread feeds it a new visibility list. This will introduce a fair amount of latency into the system, but the VR head and controller orientations will be read at the very last possible moment each frame, so the inputs where you would actually notice latency won't have any. I think it's going to be insanely fast.
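The handoff between the culling and rendering threads can be as simple as a shared pointer that the culling thread republishes; here is a generic sketch of that pattern (not engine code):

#include <memory>
#include <mutex>
#include <vector>

struct VisibilityList
{
	std::vector<int> visibleobjects; //stand-in for renderable handles
};

std::mutex listmutex;
std::shared_ptr<VisibilityList> published;

//Culling thread: publish a freshly built list.
void PublishVisibility(std::shared_ptr<VisibilityList> list)
{
	std::lock_guard<std::mutex> lock(listmutex);
	published = std::move(list);
}

//Rendering thread: grab the latest list. It may be the same list as
//last frame if the culling thread has not finished a new pass yet.
std::shared_ptr<VisibilityList> CurrentVisibility()
{
	std::lock_guard<std::mutex> lock(listmutex);
	return published;
}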

On 4/14/2018 at 3:24 AM, Josh said:

All the multithreaded graphics APIs actually just accumulate commands in a buffer and then execute them in a single thread

The benefit of the multithreaded APIs is that every thread has its own command pool and can write to its own command buffer, so you can use any available threads to record commands. They are submitted together in the end, yes, but getting to the point where all command buffers are good to go is way faster. That's why they were designed this way. In the end, less time is spent waiting for one CPU thread to write all the command buffers.

Nvidia has a great document about this: https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/munich/mschott_vulkan_multi_threading.pdf
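The per-thread recording those slides describe looks roughly like this in Vulkan (a fragment only; device and queue setup are omitted, and error checking is skipped for brevity):

#include <vulkan/vulkan.h>
#include <thread>
#include <vector>

//Each thread records into a command buffer from its own pool, so no
//locking is needed; submission still happens on a single thread.
void RecordInParallel(VkDevice device, uint32_t queuefamily, VkQueue queue, int threadcount)
{
	std::vector<VkCommandPool> pools(threadcount);
	std::vector<VkCommandBuffer> buffers(threadcount);
	std::vector<std::thread> threads;

	for (int i = 0; i < threadcount; ++i)
	{
		VkCommandPoolCreateInfo poolinfo = {};
		poolinfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
		poolinfo.queueFamilyIndex = queuefamily;
		vkCreateCommandPool(device, &poolinfo, nullptr, &pools[i]);

		VkCommandBufferAllocateInfo allocinfo = {};
		allocinfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
		allocinfo.commandPool = pools[i];
		allocinfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
		allocinfo.commandBufferCount = 1;
		vkAllocateCommandBuffers(device, &allocinfo, &buffers[i]);

		threads.emplace_back([&buffers, i]() {
			VkCommandBufferBeginInfo begininfo = {};
			begininfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
			vkBeginCommandBuffer(buffers[i], &begininfo);
			//...record draw commands here...
			vkEndCommandBuffer(buffers[i]);
		});
	}
	for (auto& t : threads) t.join();

	//All buffers are still submitted together from one thread.
	VkSubmitInfo submitinfo = {};
	submitinfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
	submitinfo.commandBufferCount = (uint32_t)buffers.size();
	submitinfo.pCommandBuffers = buffers.data();
	vkQueueSubmit(queue, 1, &submitinfo, VK_NULL_HANDLE);
}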


We will see. Doom (2016) ran about the same or even slower on Vulkan in the benchmarks I saw, on Nvidia hardware.

My final render stage is just a list of visible surfaces, and that part could easily be split across a bunch of different threads.

But we will see when I actually test it out.


Doom doesn't use a multithreaded renderer. Of course Vulkan isn't going to magically make things faster on its own; it gives you the ability to do it. In OpenGL you don't directly write to command buffers, so you can't split the work up between threads. Vulkan does not do any multithreading by itself; that is something you have to implement. Vulkan just gives you the tools to design fast multithreaded renderers that were not possible before.

I'm not saying this is necessary; your design will be great because the game loop does not have to wait for the renderer. I'm just saying that with Vulkan you could get maximum performance, and you could still keep the rendering separate from the game loop, ending up with rendering that is both faster and independent.

Just spit-balling ideas, because it sounds like you're trying to make LE as fast as possible, and this new API allows you to do what only DX12 could do before, without being locked to Windows.

This optimization would indisputably make LE's renderer way faster, which is perfect for VR. The only question is whether it is necessary: is LE fast enough without it in the situations it's designed for? There's no sense in writing a big, complex renderer if the engine is fast enough as is.

Edit:

Also keep in mind that Nvidia's OpenGL drivers are extremely fast and complex, while AMD's are not. On AMD cards Vulkan does "magically" make things faster just by implementing it, because their driver team went above and beyond on their Vulkan drivers.


I'm all for trying it, but at this point we don't know if it's just a fix for inefficient AMD OpenGL drivers or if it really does increase performance when used correctly. Here's someone claiming that Vulkan is half the speed of OpenGL:
https://steamcommunity.com/app/379720/discussions/0/142261352653876827/

OpenGL has always had low draw call cost compared to DirectX, so it will be interesting to see what the real results are.


Again, Doom doesn't do multithreading... why would it be faster than its OpenGL renderer? They've had years to optimize the OpenGL drivers; of course it'll be at least as fast in a single-threaded environment.

It's not magic, it's physics at that point... Vulkan can use multiple threads to generate command buffers; OpenGL can only generate one at a time. It would indisputably be faster; that's just the reality of it.

As time goes on and GPUs get more powerful, a Vulkan renderer that generates command buffers on multiple threads will only pull further ahead: you are sending more work to the GPU thanks to threaded command buffer generation, and the GPU can handle any work you throw at it. With high-end cards today you will see big performance gains; where you wouldn't is with integrated cards, but those shouldn't be a priority.

Furthermore, in Vulkan you can physically send draw calls from multiple threads, and they are not funneled to the main thread by the driver; this is one highlight of Vulkan that only DirectX 12 shares. Metal is planning this too; I have not read whether it is already the case in Metal or if it's just a future plan.


Doom's Vulkan results are probably better on AMD. It seems that Nvidia hardware does better on DX11 and OpenGL, the more traditional rendering methods.

You'll find AMD performing a bit better on Vulkan and DX12 in most cases, though.


I have my doubts about OpenGL commands being a significant bottleneck. That's kind of like saying you're going to make a sports car faster by removing the floor mats. Yes, it will be a small bit lighter but I don't think you will see any difference. The number one performance bottleneck we run into is pixel fillrate.

My guess is you will see a massive performance increase with my new architecture, and then Vulkan will produce a small improvement on Nvidia cards, and perhaps a 20-30% improvement on AMD cards. But let's see what the actual numbers turn out to be.


The reason you'd want to multithread the command process is for situations where big, new, powerful GPUs sit bored because the CPU's one thread can't send commands fast enough to utilize them to the fullest extent. That's not a fair analogy. :P So long as your GPU can handle it, why would you not want to throw more work at it? Modern GPUs (10 series, etc.) can certainly handle it.

A great GPU can handle anything a single core on your CPU can throw at it with ease, so you want to throw more at it. This is the most common bottleneck in games these days, with how powerful GPUs are getting. The better your GPU, the more these optimizations will help; it's planning for the future, because as time goes on you'll see more and more improvement from this type of multithreading. That's why DX12 and Vulkan moved toward it.

Anyway, like I said, it isn't strictly necessary, but it would be optimal; just food for thought so you consider this design if you move toward a Vulkan renderer. It'd be a shame to use Vulkan and just move all the rendering to one thread instead of using all available threads for command buffer generation.


