I did some work optimizing the new renderer to discrete cards and came up with a new benchmark. This scene is interesting because there is not much in the scene, so the culling does not benefit much xrom multithreading, since there is so little work to do. The screen is filled with a spotlight in the foreground, and in the background a point light is visible in the distance. Both engines are running at 1280x720 resolution. This is really a pure test of deferred vs. forward lighting.
AMD R9 200
Leadwerks: 448 FPS
Turbo: 648 FPS (44% faster)
On this older AMD card we see that forward rendering automatically gives a significant boost in performance, likely due to not having to pass around a lot of big textures that are being written to.
GEForce 1070M (Laptop)
Leadwerks 4: 120 FPS
Turbo: 400 FPS (333% faster)
Here we see an even bigger performance boost. However, these are two very different cards and I don't think it's safe to say yet if one GPU vendor will get more from the new engine.
Intel Graphics 4000
Leadwerks 4: 67 FPS (18% faster)
Turbo: 57 FPS
This is an interesting result because integrated graphics tend to struggle with pixel fillrate, but seem to have no problems with texture bandwidth since they are using regular old CPU memory for everything. Although I think the new engine will run faster on Intel graphics (especially newer CPUs) when the scene is loaded down, it seems to run a bit slower in a nearly empty scene. I think this demonstrates that if you build a renderer specifically to take advantage of one type of hardware (discrete GPUs) it will not automatically be optimal on other types of hardware with different architectures. I commented out the lighting code and the framerate shot way up, so it definitely does seem to be a function of the pixel shader, which is already about as optimized as can be.
Overall, I think the new engine will show higher relative performance the more you throw at it, but an empty scene may run a bit slower on older Intel chips since the new renderer is optimized very specifically for discrete GPUs. The parallel capabilities of discrete GPUs are relied on to handle complex lighting in a single pass, while texture bandwidth is reduced to a minimum. This is in line with the hardware trend of computing power increasing faster than memory bandwidth. This also shows that low to mid-range discrete GPUs stand to make the most gains.