To leverage the massive thread-level parallelism (TLP) of a GPU, deeply nested convolution loops are lowered (unrolled) into a single large matrix multiplication, which trades memory capacity and bandwidth for ...
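One common form of this lowering is im2col: every receptive-field patch of the input is flattened into a row of a large matrix, so the whole convolution collapses into one GEMM that a GPU can parallelize across thousands of threads. The sketch below is illustrative only (a single-channel NumPy version, not the document's implementation); note how the patch matrix replicates overlapping input elements, which is exactly the extra memory capacity and bandwidth the lowering pays for.

```python
import numpy as np

def im2col(x, kh, kw):
    """Flatten every kh x kw patch of a 2-D input into one row of a matrix.

    Overlapping patches duplicate input elements, so the result holds
    roughly kh*kw times more data than the original image.
    """
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1          # output spatial dims (no padding, stride 1)
    cols = np.empty((oh * ow, kh * kw), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols, oh, ow

def conv2d_gemm(x, k):
    """Convolution via im2col: one matrix product instead of nested loops."""
    kh, kw = k.shape
    cols, oh, ow = im2col(x, kh, kw)
    return (cols @ k.ravel()).reshape(oh, ow)

def conv2d_direct(x, k):
    """Reference implementation with the deeply nested loops."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(25.0).reshape(5, 5)
k = np.ones((3, 3))
assert np.allclose(conv2d_gemm(x, k), conv2d_direct(x, k))
```

With a 5x5 input and a 3x3 kernel, the im2col matrix already stores 9 * 9 = 81 values where the input held 25, making the capacity/bandwidth trade-off concrete.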