Hi, welcome, and thanks for registering.
The problem of returning values with consistent timing is one that GPU developers are fairly accustomed to dealing with, I think. Recently there has been something of a spotlight on frame time variance and latency when rendering complex 3D scenes on multiple GPUs.
What's odd about mentioning MMX and SSE, for example, as CPU features that rule out doing the same work on a GPU, is that these extensions were originally created to do 3D graphics type operations; something seems at odds there.
I've never developed using CUDA, so to some extent my guesses here are admittedly uneducated, but even given the understanding that some algorithms are simply serial in nature, I still don't see why each oscillator couldn't have its own thread, reverb1 its own thread, reverb2 its own, and so on. One of the big selling points of the DSP in the Virus is specialized parallel filter processors, so let's say those are parallelized on the GPU or, worst case scenario, each filter and envelope gets its own thread.
In other words, to the best of my knowledge most plugins on the CPU today are not making more efficient use of multi-core CPUs by using parallelism to solve math problems that are serial in nature; they are simply using additional threads to divvy up the workload of separate features.
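To make the "one thread per feature" idea concrete, here is a rough C++ sketch, not how any real plugin does it. The names `render_sine` and `render_block` are made up for illustration, and I'm assuming plain sine oscillators that don't depend on each other. Each oscillator fills its own buffer on its own thread, and the buffers are summed afterwards. A real-time plugin would keep a pool of long-lived worker threads rather than spawning new ones for every block, so treat this as an illustration of the split only.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: render one sine oscillator into its own buffer.
// The loop over samples is serial, but it touches no shared state.
void render_sine(std::vector<float>& out, float freq_hz, float sample_rate) {
    const float two_pi = 6.28318530718f;
    for (std::size_t n = 0; n < out.size(); ++n) {
        out[n] = std::sin(two_pi * freq_hz * static_cast<float>(n) / sample_rate);
    }
}

// Hypothetical block renderer: one thread per oscillator, then a serial mix.
std::vector<float> render_block(const std::vector<float>& freqs_hz,
                                std::size_t block_size, float sample_rate) {
    // One private buffer per oscillator, so the threads never write to the same memory.
    std::vector<std::vector<float>> voices(
        freqs_hz.size(), std::vector<float>(block_size, 0.0f));

    std::vector<std::thread> workers;
    workers.reserve(freqs_hz.size());
    for (std::size_t i = 0; i < freqs_hz.size(); ++i) {
        workers.emplace_back(render_sine, std::ref(voices[i]),
                             freqs_hz[i], sample_rate);
    }
    // Wait for every oscillator to finish before mixing.
    for (auto& w : workers) {
        w.join();
    }

    // Sum the per-oscillator buffers on the calling thread.
    std::vector<float> mix(block_size, 0.0f);
    for (const auto& v : voices) {
        for (std::size_t n = 0; n < block_size; ++n) {
            mix[n] += v[n];
        }
    }
    return mix;
}
```

Calling `render_block({440.0f, 660.0f}, 64, 48000.0f)` would give a 64-sample block containing both oscillators summed. On Linux with GCC or Clang this typically needs `-pthread` when building.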
I hear ya on the lack of a proper standard, although my original thought on CUDA is really the same as the old Virus plugin on TCE Powercore cards. That was certainly not a standard, but a proprietary solution dependent on owning that card. The difference I see with NVidia is that an insane number of people already have these in their systems, going unused for the most part while making music.