
python prime crunching: processing pool is slower?

So I've been messing around with Python's multiprocessing library for the last few days, and I really like the processing pool. It's easy to implement and I can see a lot of uses for it. I've worked through a couple of projects I'd heard about before to familiarize myself with it, and recently finished a program that brute-forces games of hangman.

Anywho, I was doing an execution time comparison of summing all the prime numbers between 1 million and 2 million, both single-threaded and through a processing pool. Now, for the hangman cruncher, putting the games in a processing pool improved execution time by about 8 times (i7 with 8 cores), but when grinding out these primes it actually increased processing time by almost a factor of 4.

Can anyone tell me why this is? Here is the code for anyone interested in looking at or testing it:

#!/user/bin/python.exe
import math
from multiprocessing import Pool

global primes
primes = []

def log(result):
    global primes

    if result:
        primes.append(result[1])

def isPrime( n ):
    if n < 2:
        return False
    if n == 2:
        return True, n

    max = int(math.ceil(math.sqrt(n)))
    i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True, n


def main():

    global primes

    #pool = Pool()

    for i in range(1000000, 2000000):
        #pool.apply_async(isPrime,(i,), callback = log)
        temp = isPrime(i)
        log(temp)

    #pool.close()
    #pool.join()

    print sum(primes)

    return

if __name__ == "__main__":
    main()

As written it will run in a single thread; to run it through the processing pool, uncomment the pool statements and comment out the other lines in the main for loop.
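One simple way to take the timings is to wrap the call to main() -- a minimal sketch (the post doesn't show how the timings were taken, so this is just one option):

import time

if __name__ == "__main__":
    start = time.time()
    main()
    # switch the commented lines in main() and rerun to compare the two variants
    print "elapsed: %.2f seconds" % (time.time() - start)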

Comments

The most efficient way to use multiprocessing is to divide the work into n equal-sized chunks, with n the size of the pool, which should be approximately the number of cores on your system. The reason for this is that the work of starting subprocesses and communicating between them is quite large. If the work is divided into many small chunks, the overhead of IPC becomes significant relative to the useful work.

In your case, you're asking multiprocessing to process each candidate number individually. A better way to deal with the problem is to pass each worker a range of values (probably just a start and end value) and have it return all of the primes it finds in that range.

In the case of identifying large-ish primes, the work done per number grows with its size, so you probably don't want to divide the total range into exactly n chunks, but rather into n*k equal chunks, with k some reasonable, small number, say 10-100. That way, when some workers finish before others, there's more work left to do and it can be balanced efficiently across all workers.
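As a rough sketch of that chunk-size calculation (the bounds and the value of k here are just illustrative):

from multiprocessing import cpu_count

start, stop = 1000000, 2000000
k = 10                                    # small oversubscription factor, say 10-100
n = cpu_count()                           # pool size, roughly the number of cores
step = max(1, (stop - start) // (n * k))  # gives about n*k equal chunks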

Edit: Here's an improved example to show what that solution might look like. I've changed as little as possible so you can compare apples to apples.

#!/user/bin/python.exe
import math
from multiprocessing import Pool

global primes
primes = set()

def log(result):
    global primes

    if result:
        # since the result is a batch of primes, we have to use 
        # update instead of add (or for a list, extend instead of append)
        primes.update(result)

def isPrime( n ):
    if n < 2:
        return False
    if n == 2:
        return True, n

    max = int(math.ceil(math.sqrt(n)))
    i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True, n

def isPrimeWorker(start, stop):
    """
    find a batch of primes
    """
    primes = set()
    for i in xrange(start, stop):
        if isPrime(i):
            primes.add(i)

    return primes



def main():

    global primes

    pool = Pool()

    # pick an arbitrary chunk size, this will give us 100 different 
    # chunks, but another value might be optimal
    step = 10000

    # use xrange instead of range, we don't actually need a list, just
    # the values in that range.
    for i in xrange(1000000, 2000000, step):
        # call the *worker* function with start and stop values.
        pool.apply_async(isPrimeWorker,(i, i+step,), callback = log)

    pool.close()
    pool.join()

    print sum(primes)

    return

if __name__ == "__main__":
    main()
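As an aside, a similar batching effect can be had without writing a worker at all: Pool.map accepts a chunksize argument, which hands tasks to the workers and collects their results in batches. A rough sketch of that variant, reusing the isPrime above (every per-number result still gets pickled back to the parent, so the range-per-worker version is likely still the faster of the two):

from multiprocessing import Pool

def main():
    pool = Pool()
    # chunksize=10000 batches both the task dispatch and the result collection;
    # isPrime is the same function as in the listing above
    results = pool.map(isPrime, xrange(1000000, 2000000), 10000)
    pool.close()
    pool.join()
    # isPrime returns (True, n) for primes and False otherwise
    print sum(r[1] for r in results if r)

if __name__ == "__main__":
    main()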

