
LEVI - KERNEL RITUS

Traditional GPU libraries like cuBLAS are optimized for massive workloads, such as training giant neural networks with millions or billions of parameters. But what happens when you need to multiply small matrices? You get:

  • Setup overhead that takes longer than the actual computation

  • Memory allocation optimized for gigabytes, not kilobytes

  • Complex dispatch logic that assumes large-scale parallelism

  • Enterprise features you don't need for edge computing

Result: A Ferrari stuck in city traffic.

LEVI GPU Library fills the gap between "basic code" and "industrial-strength cuBLAS" with adaptive kernel selection that picks the right tool for each job.

The Sweet Spot: LEVI dominates at matrix sizes from 64×64 to 256×256, where edge computing lives.

LEVI uses proprietary Kernel Ritus technology to automatically select optimal algorithms:
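
As a rough illustration of this selection step, a host-side dispatcher only has to inspect the problem size before launching one of the two kernels profiled below. This is a minimal sketch: the levi_sgemm name, the 128 threshold, and the launch shapes are illustrative assumptions, not LEVI's published API.

// Hypothetical dispatcher: the names, threshold, and launch shapes
// below are illustrative assumptions, not LEVI's actual interface.
__global__ void sgemm_simple(const float*, const float*, float*, int);
__global__ void sgemm_tiled(const float*, const float*, float*, int);

void levi_sgemm(const float* A, const float* B, float* C, int n) {
    if (n <= 128) {
        // Small problem: one thread per output element and no shared
        // memory staging, so launch setup stays minimal.
        dim3 block(16, 16);
        dim3 grid((n + 15) / 16, (n + 15) / 16);
        sgemm_simple<<<grid, block>>>(A, B, C, n);
    } else {
        // Medium problem: shared-memory tiles amortize global loads.
        dim3 block(32, 32);
        dim3 grid((n + 31) / 32, (n + 31) / 32);
        sgemm_tiled<<<grid, block>>>(A, B, C, n);
    }
}

A single size threshold keeps the dispatch branch cheap to evaluate, which matters when many small multiplications are launched back to back.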

 

Two-Kernel Architecture:

 

// Simple Kernel (small matrices)

  • Minimal setup overhead

  • Cache-optimized access patterns

  • Loop unrolling for better IPC

  • Perfect for edge devices
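
A minimal sketch of a kernel in this spirit, assuming the sgemm_simple name and a modest unroll factor (both illustrative, not LEVI's actual code):

__global__ void sgemm_simple(const float* A, const float* B,
                             float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;

    float acc = 0.0f;
    // Unrolling the short inner loop exposes independent multiply-adds
    // and raises instructions per clock on small trip counts.
    #pragma unroll 4
    for (int k = 0; k < n; ++k) {
        // Consecutive threads read consecutive elements of B's row k,
        // so global loads coalesce into full cache lines.
        acc += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = acc;
}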
     

// Tiled Kernel (medium+ matrices)

  • Shared memory utilization

  • Bank conflict avoidance

  • Optimized for throughput

  • Competitive with cuBLAS
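
A sketch of the classic shared-memory tiling this profile describes; the TILE width and the padding column are common choices assumed here, not values taken from LEVI's source:

#define TILE 32

__global__ void sgemm_tiled(const float* A, const float* B,
                            float* C, int n) {
    // The +1 padding column is the standard bank-conflict-avoidance
    // trick: it keeps column-strided tile accesses on distinct banks.
    __shared__ float As[TILE][TILE + 1];
    __shared__ float Bs[TILE][TILE + 1];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Stage one tile of A and B in shared memory; zero-fill
        // out-of-range elements so edge tiles need no special case.
        As[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        // Each staged element is reused TILE times from fast shared
        // memory instead of being re-fetched from global memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < n && col < n)
        C[row * n + col] = acc;
}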

Perfect For Edge Computing

 

Why Edge Needs Different Optimization:

 

Traditional Data Centers:

  • Batch size: 1024+ samples

  • Matrix size: 2048×2048+

  • Memory: Abundant

  • Power: Unlimited
     

Edge Computing:

  • Batch size: 1-32 samples

  • Matrix size: 64×64 to 512×512

  • Memory: Limited

  • Power: Battery-constrained
     

LEVI targets exactly this gap.

When to Use LEVI vs cuBLAS:

 

Use LEVI when:

✅ Matrix size < 512×512

✅ Edge/mobile deployment

✅ Power/memory constraints

✅ Batch processing many small problems

Use cuBLAS when:

✅ Matrix size ≥ 512×512

✅ Data center deployment

✅ Maximum absolute throughput needed

✅ Deep learning training

More Information on our GitHub Repository!

https://github.com/forgottenforge/levi-gpu
