Fastest Prime Number Generator on NFO! FIGHT!
Page 1 of 2 Goto page 1, 2  Next
paxsali
Banned



Posts: 18352

PostPosted: Sun, 4th Aug 2013 14:31    Post subject: Fastest Prime Number Generator on NFO! FIGHT!
Hi feggits,

how about we make a competition who will write the fastest prime number generator in his favorite programming language?

I go first to set to set the bar. Let's meassure the time it takes to calculate the first 10.000 primes.

CPU:
Code:
Processor Name: Intel Core 2 Duo
      Processor Speed: 2,66 GHz
      Number of Processors: 1


Performance:
~3353 primes per sec (for the first 10.000 only[1])
Code:
real   0m2.982s
user   0m2.890s
sys   0m0.059s


Language:
Python

Implementation/Compiler/Runtime:
PyPy

Code:
http://pastebin.com/qgEnH9HE
Come on guys, it's freaking sunday, there is nothing else you do anyways... Cute Pom Pom's
And because I start with Python, it's easier for all the JavaScripters and C++ers or maybe Java programmers to set the new highscore.

FIGHT!

1) obviously only valid for the first 10.000 primes, because since the densitiy of primes decreases as the numbers get bigger. So the longer you run your programm, the less primes / sec it will produce.


"There will be no end to the troubles of humanity, until philosophers become kings, or kings become philosophers.", Plato.
"Hyperbole will destroy us all.", Matt Dillahunty.
"The hyperbole, the demonization of the other opinion and the unwillingness to even read the opposing opinion destroys the so important political discussions necessary for the well functioning of society.", Couleur
Back to top
WhiteBarbarian




Posts: 6003
Location: Russia
PostPosted: Sun, 4th Aug 2013 21:45    Post subject:
Managed to come up with quite fast algorithm. Implemented in C, generates 10000 primes in about 0.018266 seconds on 2.2 GHz Intel Core 2 Duo.

Source

 Spoiler:
 


Back to top
paxsali
Banned



Posts: 18352

PostPosted: Sun, 4th Aug 2013 22:20    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 5th Aug 2013 00:45    Post subject:
78 milliseconds

http://pastebin.com/1h0jyVPp

It's not robust but it does the job Smile

Language: C++, Compiler: MSVC++


Last edited by BearishSun on Mon, 5th Aug 2013 00:47; edited 1 time in total
Back to top
WhiteBarbarian




Posts: 6003
Location: Russia
PostPosted: Mon, 5th Aug 2013 00:46    Post subject:
paxsali wrote:
Disqualified!

EDIT: you could have translated my program 1:1 into C and still be faster (presumably)...

Damn, with that edit you made me feel guilty Failed Troll I coded my "generator" during AI turns in Civ 5 just for a little bit of honest trolling Mobster with a Fag


Ok, adapted your code to C version

Source: http://pastebin.com/XP5vGLT2

Completes in about 1.848005 seconds, which gives about 5408 primes per second.


Last edited by WhiteBarbarian on Mon, 5th Aug 2013 01:26; edited 2 times in total
Back to top
WhiteBarbarian




Posts: 6003
Location: Russia
PostPosted: Mon, 5th Aug 2013 00:53    Post subject:
BearishSun wrote:
78 milliseconds

http://pastebin.com/1h0jyVPp

It's not robust but it does the job Smile

Language: C++, Compiler: MSVC++


My C++ is VERY rusty. If I am reading it right you made a multithreaded version ?


Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 5th Aug 2013 00:59    Post subject:
Yup, very multithreaded, runs on GPU. 680 GTX in my case.
Back to top
paxsali
Banned



Posts: 18352

PostPosted: Mon, 5th Aug 2013 01:38    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 5th Aug 2013 09:28    Post subject:
Essentially it queues up 150 000 "threads", which get executed on the GPU. This is why I say its not robust, since I'm guesstimating that 150 000 will be enough to find first 10000 prime numbers. (A more robust version would have an additional outer loop that would stop when enough numbers were reached, but I didn't feel like bothering).

Each thread gets executed in parallel (as much as hardware allows), which means the prime numbers will be generated out of sequence. This is why I sort the numbers at the end. For that same reason it is possible that "numPrims" counter will reach 10000 before first 10000 numbers were generated (some threads will be calculating numbers higher than 10000). This is why I allow 10500 primes, having 500 extra to ensure we get the first 10000. Again, another reason why its not very robust.

At the end I just sort it all and ignore the 500 "extra" numbers.

Now that it's morning I can think of a bunch of improvements, maybe someone wants to improve:
- Getting rid of atomically incremented variable "numPrims", as that is essentially a sync point between the threads
- Use tiled execution as that makes use of on-chip memory and might be faster (since we only do writes and don't have memory reads, I don't think it is)
- If you get rid of "numPrims" variable you will likely need a different algorithm altogether and that algorithm should avoid the sorting step.
- There are faster prime number generation algorithms but they rely on accessing previously generated data which I wanted to avoid due to sync issues. There might be a way to get around that.
- A multithreaded vectorized CPU version might be faster. I haven't tried, but it's possible that CPU<->GPU overhead is worse than just executing it on CPU (unlikely, but it would be nice to see AVX implementation)
Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 5th Aug 2013 10:05    Post subject:
P.S. In your guys algorithms almost all time is spent printing out the values, and not actually generating primes. If you get rid of print/printf from the main loop and only calculate the performance of your prim generator you should get pretty close to my result.
Back to top
paxsali
Banned



Posts: 18352

PostPosted: Mon, 5th Aug 2013 12:44    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
WhiteBarbarian




Posts: 6003
Location: Russia
PostPosted: Mon, 5th Aug 2013 13:13    Post subject:
BearishSun wrote:
P.S. In your guys algorithms almost all time is spent printing out the values, and not actually generating primes. If you get rid of print/printf from the main loop and only calculate the performance of your prim generator you should get pretty close to my result.

Yep, this was the first thing I have changed while adapting paxsali's code to C version.


Back to top
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Mon, 5th Aug 2013 15:09    Post subject:
Google Chrome Version 28.0.1500.95 m

i5 M560 (2x2.67GHz)

Javascript: 91ms for the first 10.000

http://pastebin.com/xCjZFudR

So ~110.000/s for the first 10.000? (This is a stupid measurement! Just giving the time it took for the first 10.000 is much better)

Edit:
You could also exchange the inner loop with
Code:

      for (;i < 10; i++)
         if (prime(i))
            primes.push(i);


if you want a little less code


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Back to top
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Mon, 5th Aug 2013 15:39    Post subject:
Same computer, C# with VS2012
http://pastebin.com/XFekkNWB


1020ms Confused


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Back to top
WhiteBarbarian




Posts: 6003
Location: Russia
PostPosted: Mon, 5th Aug 2013 17:57    Post subject:
I derped, been measuring performance on a debug build lol wut Run release build five times, average is 56,3852 ms Very Happy

EDIT: Wow, it's even faster than BearishSun version Shocked


Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 5th Aug 2013 18:41    Post subject:
Mine is far from optimized due to excessive syncing Smile Plus I'm using a brute force algorithm which does way too many iterations than needed.
Back to top
paxsali
Banned



Posts: 18352

PostPosted: Mon, 5th Aug 2013 19:37    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Wed, 7th Aug 2013 13:47    Post subject:
There is no ** in javascript but Math.pow(x,y) but Math.sqrt is faster than pow so that is that...


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Back to top
[mrt]
[Admin] Code Monkey



Posts: 1338

PostPosted: Wed, 7th Aug 2013 22:58    Post subject:
Let me jump in as well, hehe.

Code:

Looking for 10000 Primes...
Found 11302 primes in 9.983ms
First 10: 1 2 3 5 7 11 13 17 19 23
Last 10: 104659 104677 104681 104683 104693 104701 104707 104711 104717 104723
Press any key to continue . . .


Visual Studio 2012, C++11 with threading - My first jab at it.

Ran on a Core i5 2500k.

Here's the code

 Spoiler:
 


Not optimally tweaked. Better management would get rid of the joins, although they help keep the numbers in-line and saves up on an expensive sort. The threads should take up resonably equal amounts of time on a quad-core so the dead-time between threads respawning is minimal.

The threading wasn't that much of a gain, only about a couple of ms. Maybe it would show better on an older architecture, although i think the bigger bottleneck might be something else, like the sqrt() because it involves the switch into the FPU.

Woop woop


teey
Back to top
paxsali
Banned



Posts: 18352

PostPosted: Wed, 7th Aug 2013 23:50    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
[mrt]
[Admin] Code Monkey



Posts: 1338

PostPosted: Fri, 9th Aug 2013 19:21    Post subject:
Is it over? No challengers anymore? Where is everybody

paxsali wrote:
@mrt
You lost me at 1...


C++11? Very Happy

It is only an update to the C++ standard with added features compared to C++ aka C++99 (altho there have been tr1 updates and such).

Im happy to elaborate, if you ask Smile


teey
Back to top
paxsali
Banned



Posts: 18352

PostPosted: Fri, 9th Aug 2013 19:42    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
spankie
VIP Member



Posts: 2958
Location: Belgium
PostPosted: Fri, 9th Aug 2013 20:54    Post subject:
You guys would like http://projecteuler.net/ Wink
Back to top
[mrt]
[Admin] Code Monkey



Posts: 1338

PostPosted: Sun, 11th Aug 2013 10:58    Post subject:
@paxsali Whats the algorithm?


teey
Back to top
BearishSun




Posts: 4484

PostPosted: Sun, 11th Aug 2013 19:21    Post subject:
1.45ms Smile

http://pastebin.com/9h7tw7Wf

i7 2600K (for better comparison, mrts algorithm ran at 12.9ms (shortest measured)), MSVC++
Back to top
BearishSun




Posts: 4484

PostPosted: Sun, 11th Aug 2013 19:28    Post subject:
Scratch that, 1.17ms. Optimized out the sqrt.

http://pastebin.com/2CcHkN83
Back to top
paxsali
Banned



Posts: 18352

PostPosted: Mon, 12th Aug 2013 12:51    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 12th Aug 2013 14:10    Post subject:
It's as clear as you can get it, optimized code is usually pretty ugly.

But in short, it's using vector CPU instructions (i.e. SIMD - Single instruction multiple data), where I can perform various operations on 8 values at once, at the cost of a single operation.

First bit before tmain is emulating abs() function because no such function exists for vector registers.

Then in tmain I populate the first few prime numbers because of the way algorithm works. Nothing special. I also keep a inverse of each prime number because division is a bottleneck in the algorithm, and caching the division saves about 0.5ms.

Then in the main loop I do a bunch of operations to determine if number is divisible by another number. There is no vector equivalent of a modulo operator, and also there is no support for integer division so I do it the hard way. Divide using floating point, then round to nearest integer and then see if the difference between original and rounded is close to 0, and if it is I determine it is divisible and it's not a prime.

And at the end I unpack everything from vector registers back into scalar part of the code. I could have done a loop here instead of doing each unpack manually but I don't trust the compiler to unroll it properly.
Back to top
paxsali
Banned



Posts: 18352

PostPosted: Mon, 12th Aug 2013 14:48    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Mon, 12th Aug 2013 14:49    Post subject:
Pretty smart, so you assumed that (n being the nth prime number after your initial Cool sqrt(n+7) > n, right? Well, then parallelizing the calculations is really nice Smile

Edit: Pax just poked the screen with a stick.

Edit2:


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Back to top
Page 1 of 2 All times are GMT + 1 Hour
NFOHump.com Forum Index - Programmers Corner Goto page 1, 2  Next
Signature/Avatar nuking: none (can be changed in your profile)  


Display posts from previous:   

Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB 2.0.8 © 2001, 2002 phpBB Group