Hi x3mEn,
Thanks again for sharing some more details on your algorithm. I have read the paper you provided. Although we cannot judge the efficiency of an algorithm by simply counting the number of nested loops, I admit that your two-loop idea may be better in implementation than the three-loop one in the paper, mainly by its simplicity.
I now have the following suggestions:
1. You may want to use more knowledge in number theory.
Many of the constraints in the three-loop algorithm concern the value mod 4 of results. We know that, for a primitive cuboid, its diagonal is of the form 4k+1, meaning that its sides are 0, 0, 1 mod 4. But I think you have already used this fact.
A more interesting one comes from checking for squares. More precisely, as you said before, you have g^2=a^2+f^2=b^2+e^2, and you want to check for a^2+b^2=d^2 for some d in the decompositions. You are doing it now first with byte residue, then with integer residue (32 bit), then finally with float point (long double). In fact, you can do it using Chinese Remainder Theorem (
https://en.wikipedia.org/wiki/Chinese_remainder_theorem).
Basically, if you want to check an equality x+y=z, with z<n_1 * n_2 * ... * n_k, and all the n_i's are all coprime in pairs, you only need to check the equality mod all the n_i's. If all gives equal, then we must have a+b=c. In your case, a and b are 64 integers, therefore you need a collection of numbers with a product larger than 2^129, with each number less than 2^32. That is fairly easy to do. We know that 2^32-5, 2^32-17, 2^32-65, 2^32-99 are all primes, and you can also use 2^32 (your integer check), which is also coprime with them. You just perform the integer check, then check mod these five numbers (four if you restrict a, b to be less than 2^63). If you can use 64bit integers native to the processor (you can in C, with multiplication that gives you a 128-bit integer, in the form of hi- and low-64 bits), you may simply try 2^64 (the 64-bit integer check), 2^64-59, and (optionally, unnecessary for a,b > 2^63) 2^64-83.
A more interesting fact is that the residue mod a number of the form 2^32-a or 2^64-a for a small a takes much less time to compute than other modulus. Indeed, take 2^32-a as an example. If I want to compute n mod 2^32-a, with n a 64bit integer written as n1*2^32+n2, then we have
n mod 2^32-a = n1*a + n2 mod 2^32 - a
With a shift, a multiplication (with small number!) and an addition, we almost get the residue. On modern processors, this combinations gives you typically a large speed up, since division is much slower than multiplication. Of course, I say almost, because it is possible that n1*a + n2 is larger than 2^32 - a. In this case, if n1*a + n2 is larger than 2^32, you just do it again and it is good; otherwise, you substract 2^32 - a (or add a and take the lower 32 bits) and it is also good. The test is easy and fast.
On modern processors, integer instructions are much, much faster than float point ones. Especially you used extended double (80-bit) here, which must use the FPU. Square root in FPU takes about 40 cycles, while integer multiplication in CPU takes only 3 to 4 cycles (I look up some old reference, may be even better now). So I think there is a margin that you can gain here.
And for your information, on modern processors, checking for byte is not faster than checking for integer. In fact, the most natural check is to the native word length, which is 64 in your case.
2. Where is the hotspot?
I don't know if you have already profile your program and see the hotspot that takes most of the time, and try to improve it. I don't know much about Delphi (I can write code in it, but I mainly works on C), but there must be some tools that you can use for Delphi (or Pascal). Since integer factorization is much more expensive than any other part of your algorithm, I think that up to some turning point, the main cost will be on factorization. In this case, I have a good way to totally eliminate the cost of factorization, but it will complicate your code and the project for quite a bit. I don't recommend to do it unless you find out the hotspot in your current program. Determining hotspot is the basic of code optimization.
Hope that these commentaries can help you.