With x/y stored as a vector of ints instead of a vector of tuples, the compiler recognizes the computation for what it is: a huge vector multiplication. It now emits AVX2 instructions.
This bumps the speed tenfold.
This implementation is stupid but it works.
The code is also reformatted, but that change is nothing significant.
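
For illustration, the layout change described above might look roughly like the sketch below. It assumes Rust, `i32` elements, and a multiply-and-sum loop; the function names are hypothetical and none of this is taken from the actual code.

```rust
// Hypothetical sketch: function names, the i32 element type, and the
// multiply-and-sum shape of the loop are assumptions for illustration.

/// Before: one slice of (x, y) tuples, so x and y are interleaved in memory.
pub fn sum_products_tuples(points: &[(i32, i32)]) -> i32 {
    points
        .iter()
        .map(|&(x, y)| x.wrapping_mul(y))
        .fold(0i32, |acc, p| acc.wrapping_add(p))
}

/// After: x and y live in separate contiguous slices of ints.
/// The loop body is a plain element-wise multiply over two flat arrays,
/// which is the shape auto-vectorizers tend to handle well.
pub fn sum_products_split(xs: &[i32], ys: &[i32]) -> i32 {
    xs.iter()
        .zip(ys)
        .map(|(&x, &y)| x.wrapping_mul(y))
        .fold(0i32, |acc, p| acc.wrapping_add(p))
}
```

Both functions return the same result for the same data; only the memory layout differs. Whether AVX2 instructions are actually emitted depends on the compiler version, the optimization level, and the target settings (for rustc, for example, `-C target-cpu=native` or `-C target-feature=+avx2`), so the generated assembly is worth checking rather than assuming.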