
Building a Zero-Overhead Renderer

Josh


The Leadwerks 4 renderer was built for maximum flexibility. The Leadwerks 5 renderer is being built first and foremost for great graphics with maximum speed. This is the fundamental difference between the two designs. VR is the main driving force for this direction, but all games will benefit.

Multithreaded Design

Leadwerks 4 does make use of multithreading in some places but it is fairly simplistic. In Leadwerks 5 the entire architecture is based around separate threads, which is challenging but a lot of fun for me to develop. I worked out a way to create a command buffer on the main thread that stores a list of commands for the rendering thread to perform during the next rendering frame. (Thanks for the tip on Lambda functions @Crazycarpet) Each object in the main thread has a simplified object it is associated with that lives in the rendering thread. For example, each Camera has a RenderCamera object that corresponds to it. Here's how changes in the main thread get added to a command buffer to be executed when the rendering thread is ready:

void Camera::SetClearColor(const float r,const float g,const float b,const float a)
{
	clearcolor.x = r; clearcolor.y = g; clearcolor.z = b; clearcolor.w = a;
#ifdef LEADWERKS_5
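	//Copy the render-side object and the new color into the command (member expressions cannot be captured directly)
	GameEngine::cullingthreadcommandbuffer.push_back( [rendercamera = this->rendercamera, clearcolor = this->clearcolor]() { rendercamera->clearcolor = clearcolor; } );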
#endif
}
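
As a minimal, self-contained sketch of the same pattern (not the engine's actual code): the main-thread object queues a lambda that captures the render-side object and the new value by copy, so the command never reads main-thread state when it finally runs. The Vec4, RenderCamera, and Camera types here are simplified stand-ins, ExecuteCommands() is a hypothetical helper, and everything runs on one thread for brevity.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Simplified stand-ins for the engine types (hypothetical, for illustration only)
struct Vec4 { float x, y, z, w; };
struct RenderCamera { Vec4 clearcolor{}; };

// Commands recorded by the main thread for later execution
std::vector<std::function<void()>> commandbuffer;

struct Camera
{
	Vec4 clearcolor{};
	std::shared_ptr<RenderCamera> rendercamera = std::make_shared<RenderCamera>();

	void SetClearColor(float r, float g, float b, float a)
	{
		clearcolor = Vec4{ r, g, b, a };
		// Init-captures (C++14) copy the shared_ptr and the color into the lambda,
		// so the command stays valid even if this Camera changes or is deleted later
		commandbuffer.push_back([rc = rendercamera, color = clearcolor]() { rc->clearcolor = color; });
	}
};

// Runs every recorded command in order, then empties the buffer
void ExecuteCommands()
{
	for (auto& command : commandbuffer) command();
	commandbuffer.clear();
}
```

Calling SetClearColor() only records the change. The RenderCamera keeps its old color until ExecuteCommands() runs, which is the point where the rendering thread would pick the change up.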

The World::Render() command is still there for conceptual consistency, but what it really does is add all the accumulated commands onto a stack of command buffers for the rendering thread to evaluate whenever it's ready:

void World::Render(shared_ptr<Buffer> buffer)
{
	//Add render call onto command buffer
	GameEngine::cullingthreadcommandbuffer.push_back(std::bind(&RenderWorld::AddToRenderQueue, this->renderworld));

	//Copy command buffer onto culling command buffer stack
	GameEngine::CullingThreadCommandBufferMutex->Lock();
	GameEngine::cullingthreadcommandbufferstack.push_back(GameEngine::cullingthreadcommandbuffer);
	GameEngine::CullingThreadCommandBufferMutex->Unlock();
  
	//Clear the command buffer and start over
	GameEngine::cullingthreadcommandbuffer.clear();
}

The rendering thread is running in a loop inside a function that looks something like this:

shared_ptr<SharedObject> GameEngine::CullingThreadEntryPoint(shared_ptr<SharedObject> o)
{
	while (true)
	{
		//Get the number of command stacks that are queued
		CullingThreadCommandBufferMutex->Lock();
		int count = cullingthreadcommandbufferstack.size();
		CullingThreadCommandBufferMutex->Unlock();

		//For each command stack
		for (int i = 0; i < count; ++i)
		{
			//For each command
			for (int n = 0; n < cullingthreadcommandbufferstack[i].size(); ++n)
			{
				//Execute command
				cullingthreadcommandbufferstack[i][n]();
			}
		}

		//Remove executed command stacks, keeping any that were queued while we were executing
		//(erase moves the remaining elements safely; memcpy is not valid for vectors of std::function)
		CullingThreadCommandBufferMutex->Lock();
		cullingthreadcommandbufferstack.erase(cullingthreadcommandbufferstack.begin(), cullingthreadcommandbufferstack.begin() + count);
		CullingThreadCommandBufferMutex->Unlock();

		//Render queued worlds
		for (auto it = RenderWorld::renderqueue.begin(); it != RenderWorld::renderqueue.end(); ++it)
		{
			(it->first)->Render(nullptr);
		}
	}
	return nullptr;
}
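
One caveat with the loop above: the stack is indexed while the mutex is unlocked, and a push_back from the main thread can reallocate it in the meantime. A common alternative, sketched here with standard-library types rather than the engine's own mutex class, is to swap the whole stack out under the lock and then execute the private copy with no lock held. The CommandQueue class and its Submit() and Drain() methods are hypothetical names, not part of the engine.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

using Command = std::function<void()>;
using CommandBuffer = std::vector<Command>;

// Hypothetical helper, not part of the engine API
class CommandQueue
{
	std::mutex mutex;
	std::vector<CommandBuffer> stack;
public:
	// Main thread: hand over a finished buffer; the caller's buffer is left empty
	void Submit(CommandBuffer& buffer)
	{
		std::lock_guard<std::mutex> lock(mutex);
		stack.push_back(std::move(buffer));
		buffer.clear();
	}

	// Rendering thread: take every queued buffer, then run them without holding the lock
	// Returns the number of commands executed
	int Drain()
	{
		std::vector<CommandBuffer> pending;
		{
			std::lock_guard<std::mutex> lock(mutex);
			pending.swap(stack);
		}
		int executed = 0;
		for (auto& buffer : pending)
		{
			for (auto& command : buffer)
			{
				command();
				++executed;
			}
		}
		return executed;
	}
};
```

Because the swap leaves the shared stack empty, there is nothing to erase afterwards, and buffers submitted during execution simply wait for the next Drain() call.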

I am trying to design the system for maximum flexibility with the thread speeds so that we can experiment with different frequencies for each stage. This is why the rendering thread goes through and executes all commands in all accumulated command buffers before going on to actually render any queued world. This prevents the rendering thread from rendering an extra frame when another one has already been received (which shouldn't really happen, but we will see).

As you can see, the previously expensive World::Render() command now does almost nothing before returning to your game loop. I am also going to experiment with running the game loop and the rendering loop at different speeds. So let's say previously your game was running at 60 FPS and 1/3 of that time was spent rendering the world. This left you with about 11 milliseconds per frame to execute your game code, or things would start to slow down. With the new design your game code could have up to 33 milliseconds to execute (a game loop running at 30 Hz) without compromising the framerate. That means your code could be three times more complex, and you would not have to worry so much about efficiency, since the rendering thread will keep blazing away at a much faster rate.
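
The arithmetic behind those figures, as a small sketch. It assumes the 33 millisecond figure corresponds to a game loop running at 30 Hz while rendering continues on its own thread; the function names are made up for illustration.

```cpp
// Milliseconds available per frame at a given update rate
double FrameTimeMs(double hz)
{
	return 1000.0 / hz;
}

// Time left for game code when rendering shares the same thread
double SingleThreadBudgetMs(double hz, double renderfraction)
{
	return FrameTimeMs(hz) * (1.0 - renderfraction);
}
```

SingleThreadBudgetMs(60.0, 1.0 / 3.0) comes to roughly 11.1 ms, FrameTimeMs(30.0) to roughly 33.3 ms, and the ratio between the two is 3.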

The game loop is a lot simpler now, with just two commands needed to update and render the world. Keeping them separate gives you a chance to adjust some objects after physics and before rendering. A basic Leadwerks 5 program is really simple:

#include "Leadwerks.h"

using namespace Leadwerks;

int main(int argc, const char *argv[])
{
	auto window = CreateWindow("MyGame");
	auto context = CreateContext(window);
	auto world = CreateWorld();
	auto camera = CreateCamera(world);

	while (true)
	{
		if (window->KeyHit(KEY_ESCAPE) or window->Closed()) return 0;
		world->Update();
		world->Render(context);
	}
}

This may cause problems if you try to do something fancy like render a world to a buffer and then use that buffer as a texture in another world. We might lose some flexibility there, and if we do I will prioritize speed over having lots of options.

Clustered Forward Rendering

Leadwerks has used a deferred renderer since version 2.1. Version 2.0 was a forward renderer with shadowmaps, and it didn't work very well. At the time, GPUs were not very good at branching logic. If you had an if / else statement, the GPU would perform BOTH branches (including expensive texture lookups) and take the result of the "true" one. To get around this problem, the engine would generate a new version of a shader each time a new combination of lights was onscreen, causing periodic microfreezes when a new shader was loaded. In 2.1 we switched to a deferred renderer, which eliminated these problems. Due to increasingly smart graphics hardware and more flexible modern APIs, a new technique called clustered forward rendering is now possible, offering flexibility similar to a deferred renderer with the increased speed of a forward renderer. Here is a nice article that describes the technique:
http://www.adriancourreges.com/blog/2016/09/09/doom-2016-graphics-study/
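
To give a rough idea of what "clustered" means: the view volume is split into cells, each cell stores a list of the lights that touch it, and the shader only loops over the lights in its own cell. The sketch below is a deliberately simplified CPU-side illustration that bins lights along the depth axis only, using exponential slices. Real implementations also tile the screen in x and y and run on the GPU, and nothing here is taken from the Leadwerks renderer. PointLight, DepthSlice(), and BinLights() are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical point light in view space: depth is the distance along the view direction
struct PointLight { float depth; float radius; };

// Exponential depth slicing: slice k starts at nearz * (farz / nearz)^(k / slices)
int DepthSlice(float depth, float nearz, float farz, int slices)
{
	float clamped = std::min(std::max(depth, nearz), farz);
	int slice = static_cast<int>(std::log(clamped / nearz) / std::log(farz / nearz) * slices);
	return std::min(slice, slices - 1);
}

// For each slice, list the indices of the lights whose sphere overlaps it
std::vector<std::vector<int>> BinLights(const std::vector<PointLight>& lights, float nearz, float farz, int slices)
{
	std::vector<std::vector<int>> bins(slices);
	for (int i = 0; i < static_cast<int>(lights.size()); ++i)
	{
		const PointLight& light = lights[i];
		// Skip lights entirely outside the depth range
		if (light.depth + light.radius < nearz || light.depth - light.radius > farz) continue;
		int first = DepthSlice(light.depth - light.radius, nearz, farz, slices);
		int last = DepthSlice(light.depth + light.radius, nearz, farz, slices);
		for (int s = first; s <= last; ++s) bins[s].push_back(i);
	}
	return bins;
}
```

With the lists built once per frame, a fragment at a given depth only pays for the lights in its slice, instead of every light in the scene.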


This approach is also more scalable. Extra renders to the normal buffer and other details can be skipped for better scaling on integrated graphics and slower hardware. I'm not really targeting slow hardware as a priority, but I wouldn't be surprised if it ran extremely fast on integrated graphics when the settings are turned down. Of course, the system requirements will be the same because we need modern API features to do this.

I'm still a little foggy on how custom post-processing effects will be implemented. There will definitely be more standard features built into the renderer. For example, SSR will be mixed with probe reflections, and a quality setting (off, static, dynamic) will determine how much processing power is used for reflections. If improved performance and integration come at the cost of reduced flexibility in the post-process shaders, I will accept that tradeoff, but so far I don't foresee any problems.

Vulkan Graphics

The new renderer is being developed with OpenGL 4.1 so that I can make a more gradual progression, but I am very interested in moving to Vulkan once I have the OpenGL build worked out. Valve made an agreement with the developers of MoltenVK to release the SDK for free. This code translates Vulkan API calls into Apple's Metal API, so you basically have Vulkan running on Mac (sort of). I previously contacted the MoltenVK team about a special license for Leadwerks that would allow you guys to release your games on Mac without buying a MoltenVK license, but we did not reach any agreement and at the time the whole prospect seemed pretty shaky. With Valve supporting this I feel more confident going in this direction. In fact, due to the design of our engine, it would be possible to outsource the task of a Vulkan renderer without granting any source code access or complicating the internals of the engine one bit.



5 Comments



Will this feature be released in Leadwerks 4.X or just in 5.0?

If it's just in 5.0, will we get a Linux build soon?


Can't wait to see what the future holds for Leadwerks. You will be able to make way better use of the CPU's threads with Vulkan so that'll be fun (if it happens).

Don't forget to always use RenderDoc when you're changing up the renderer. Best tool ever made, I swear... although I'm sure you've used it already :)


@martyj This is the architecture change stuff I was talking about in Leadwerks 5. It also introduces breaking API changes. I will continue to develop this on Windows and only release a Linux or Mac build when it is further along. This will make a huge difference for the type of game you are working on, since I think your game logic is pretty intensive.

@Crazycarpet I have never heard of that program before but I will definitely try it out. It looks cool!

 


I have very, very basic multithreaded rendering working now. Literally all it does is clear the screen, but all OpenGL calls are occurring on a second thread.


I think it's important to keep trudging forward with newer hardware. AMD's affordable core act (Ryzen) means 6 and 8 core CPUs (with 12 and 16 threads) are going to be the norm in the coming years, because Intel has had to adapt to their pricing and offerings and is likewise offering more cores for less. We already saw them move Coffee Lake to 6 cores after a decade of quad cores. So to me, leveraging this in your core engine is going to be a strong selling point and get developer attention.

AMD has a 15 W mobile CPU with 4 cores and 8 threads. It's getting heated out there on the CPU front. As soon as miners stop buying all the GPUs, I think we're going to see the two trading blows over the next year, starting this summer.

Offering Vulkan support out of the box is going to turn a lot of heads as well, especially if it's well done.

Good work Josh.  Looking forward to playing with Leadwerks on my 8 core AMD and RX Vega =)
