I didn't believe that shared memory could be that fast, so I wrote a version of the app that uses it. On my card there is only 16 KB of shared memory per block of threads. That's very small, but I tried using it for encryption/decryption anyway. The version without shared memory ran in 76 seconds, while the version with shared memory took only 50 seconds. That cuts the run time by roughly a third (about 34%). I'll keep trying to improve performance further.
Edit (2011-02-24): My latest version of nDES CUDA runs about 59% faster using shared memory on my 8600M GT. That's very good for me, but I'll try to improve it even more :)
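For anyone curious what "using shared memory" can look like in practice, here is a minimal sketch of the general pattern. It is not the nDES code: the kernel name `substituteShared`, the helper `runSubstitutionDemo`, the 256-byte table and the launch sizes are all made up for illustration. The idea is just that each block first copies a small lookup table from global memory into `__shared__` memory, and then every thread reads the table from there.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TABLE_SIZE 256  // illustrative table size, not taken from nDES

// Hypothetical kernel: replaces each input byte with table[byte],
// reading the table from shared memory instead of global memory.
__global__ void substituteShared(const unsigned char *table,
                                 const unsigned char *in,
                                 unsigned char *out, int n)
{
    __shared__ unsigned char sTable[TABLE_SIZE];

    // The threads of a block cooperatively copy the table once per block.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        sTable[i] = table[i];
    __syncthreads();  // the table must be complete before anyone reads it

    // Grid-stride loop: each thread handles several bytes.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = sTable[in[i]];
}

// Runs the kernel and returns the number of bytes that differ from a
// CPU reference, or -1 if a CUDA call failed.
int runSubstitutionDemo()
{
    const int n = 1 << 20;
    unsigned char hTable[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; ++i)
        hTable[i] = (unsigned char)(255 - i);  // toy substitution table

    unsigned char *hIn = (unsigned char *)malloc(n);
    unsigned char *hOut = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i)
        hIn[i] = (unsigned char)(i & 0xFF);

    unsigned char *dTable = NULL, *dIn = NULL, *dOut = NULL;
    int result = -1;
    if (cudaMalloc((void **)&dTable, TABLE_SIZE) == cudaSuccess &&
        cudaMalloc((void **)&dIn, n) == cudaSuccess &&
        cudaMalloc((void **)&dOut, n) == cudaSuccess &&
        cudaMemcpy(dTable, hTable, TABLE_SIZE, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dIn, hIn, n, cudaMemcpyHostToDevice) == cudaSuccess) {
        substituteShared<<<64, 128>>>(dTable, dIn, dOut, n);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(hOut, dOut, n, cudaMemcpyDeviceToHost) == cudaSuccess) {
            // Compare the GPU result against the same lookup done on the CPU.
            result = 0;
            for (int i = 0; i < n; ++i)
                if (hOut[i] != hTable[hIn[i]])
                    ++result;
        }
    }

    cudaFree(dTable);
    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return result;
}

int main()
{
    printf("mismatches: %d\n", runSubstitutionDemo());
    return 0;
}
```

The table here takes only 256 bytes, so it fits easily in the 16 KB limit. How much this helps in a real program depends on how often the table is read and on the access pattern, so the timings above should not be expected from this toy example.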