Initial Benchmarks of Cryptosystem
Just a little update: the basics of the cryptosystem are now complete. The receiver public key cloaking and the message reply key embedding are done, and I have just finished the basic implementation of the fast signing/encryption key system described in the previous post.
The speed is looking great. For each message, generating the new private key takes only 21 nanoseconds, which is a lot faster than I'd expected. The caveat is that it requires a second generator key, which supplies the vectors that are added to the base key. That generator key could be refreshed by a background process every few seconds rather than being regenerated for every message. Naturally, keys derived this way will have somewhat lower cryptographic strength than raw, randomly generated keys, so they cannot be depended upon for long periods of time; some amount of rekeying will be required.
Generating fresh random private keys is an expensive operation: it consumes available system entropy every time, and it can take a variable amount of time, because a generated private key that is not in the EC group has to be thrown away and generated again. This "key rolling" operation is much faster, and it may end up being used, along with other strategies, to save on entropy and initial key generation.
The cost of deriving public keys is another thing that I haven't yet benchmarked. It is going to be one of the hotspots of the cryptosystem, because the Elliptic Curve Diffie-Hellman (ECDH) secure cipher generation depends on it. It will certainly be a bottleneck, but I'm optimistic it will not be a severe one.
For this reason, the super shadowy sponsor whose funding is enabling me to work on this full time is doing a little background work, testing and research on embedded hardware solutions for the devices that we will be releasing not long after the system is completed. One of the things that really helps with the throughput of the cryptography (hash functions and elliptic curve keys) and of the forward error correction coding is SIMD (Single Instruction, Multiple Data) extensions such as AVX and AVX512.
Fewer options in the embedded hardware market provide at least AVX, and the performance difference is quite substantial: a SHA256 hash takes 700 nanoseconds without AVX2, and only about 70 nanoseconds with it. I have already found libraries that make use of these SIMD instructions for the hash function and the error correction.
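If you want to get a rough per-hash figure on your own hardware, a minimal timing sketch follows. It measures the standard library's `crypto/sha256` rather than any of the SIMD libraries mentioned above; the helper `nsPerHash`, the 32-byte input size and the iteration count are assumptions for illustration, and the result depends on which implementation the standard library selects for your CPU.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

// sink keeps the compiler from optimising the hash calls away.
var sink byte

// nsPerHash hashes a buffer of the given size iters times and returns the
// average wall-clock time per hash in nanoseconds.
func nsPerHash(size, iters int) float64 {
	buf := make([]byte, size)
	start := time.Now()
	for i := 0; i < iters; i++ {
		buf[0] = byte(i) // vary the input slightly on each round
		sum := sha256.Sum256(buf)
		sink ^= sum[0]
	}
	return float64(time.Since(start).Nanoseconds()) / float64(iters)
}

func main() {
	// 32-byte input and one million iterations are arbitrary choices.
	fmt.Printf("SHA256 of 32 bytes: %.1f ns per hash\n", nsPerHash(32, 1_000_000))
}
```

For serious measurements the `testing` package's benchmark support is a better tool, since it handles warm-up and iteration scaling; this loop is only meant to give a ballpark number.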
There are similar extensions on ARM embedded processors, but Atom and Celeron processors are not so well served. Making use of the ARM SIMD extensions will require wrappers built on the C versions of the APIs, which adds a small overhead and reduces the usefulness of the profiling tools in Go, as opposed to the native assembler support on AMD64 (Intel-type processors) for AVX and AVX512.
All of this is quite important for a key target of this project, which is to add as little latency as possible while providing extremely strong anti-surveillance protection. So far the signs are pretty good that nodes will at least be able to manage somewhere between 100 Mbit/s and 1 Gbit/s of throughput, which will be more than adequate.
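To put those throughput targets next to the nanosecond figures above, it helps to work out how much time a node has per message at a given line rate. The sketch below does that arithmetic; the 1024-byte message size and the helper name `budgetNs` are assumptions for illustration only.

```go
package main

import "fmt"

// budgetNs returns the time in nanoseconds available to process one message
// of msgBytes bytes if the node is to sustain bitsPerSec of throughput.
func budgetNs(msgBytes int, bitsPerSec float64) float64 {
	return float64(msgBytes*8) / bitsPerSec * 1e9
}

func main() {
	const msgBytes = 1024 // hypothetical message size

	for _, rate := range []float64{100e6, 1e9} {
		fmt.Printf("%.0f Mbit/s: %.0f ns per %d-byte message\n",
			rate/1e6, budgetNs(msgBytes, rate), msgBytes)
	}
}
```

Under that assumed message size, the per-message budget is roughly 82 microseconds at 100 Mbit/s and roughly 8 microseconds at 1 Gbit/s on a single core, so a 21 nanosecond key roll is negligible, while hashing and elliptic curve operations are where the budget will actually be spent.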
Benchmarks will illuminate more about the options as we progress. It may well take a little balancing to improve performance at the same time as keeping security as high as possible, but the targets should be reachable.