Can the owner implement this?
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf
Sorry, I can't write assembly; I only know C. Thanks!
After briefly skimming the paper, I would say that most (if not all) of those optimizations are already present in cpuminer. Right now I don't have time to devote to this, but if you want to go over it in detail you can check the portable C implementation of sha256d_ms() in sha2.c. As you can see from the git history, this code is from March 2012, so it predates this thesis by more than one year.
FYI, here's an optimization that isn't mentioned in the thesis, and I don't see it in cpuminer either. It can't be used with hardware SHA. In a looped implementation it replaces one XOR with a MOV in every round except the first; with partial unrolling the MOV can be eliminated as well.
https://issueexplorer.com/issue/openwall/john/4727