A Python process effectively runs on a single core (the interpreter holds a global lock while executing byte code), so to utilize multiple cores, you need to spawn multiple processes.
Each process starts up its own interpreter and imports the required modules, sharing nothing with its siblings because everything is loaded dynamically. The memory requirement therefore grows linearly with the number of cores in your machine, and can easily reach a hundred megabytes per core.
The trick that doesn't work
There's a common technique on modern operating systems to alleviate this: First start a single process, then load the required library code and/or static data, and finally fork until you've spawned a process for every core. Now your processes will share the already loaded objects using copy-on-write semantics.
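On a POSIX system, the pattern can be sketched roughly like this (a minimal illustration; the data and worker are hypothetical stand-ins, and os.fork is unavailable on Windows):

```python
import os

# Hypothetical stand-in for a large, read-only structure loaded once
# in the parent (imported library code is shared the same way).
SHARED = list(range(1_000_000))

def worker(worker_id):
    # Children read SHARED, in principle without copying the pages.
    return sum(SHARED[:10])

children = []
for worker_id in range(os.cpu_count() or 1):
    pid = os.fork()            # POSIX only; unavailable on Windows
    if pid == 0:               # in the child
        worker(worker_id)
        os._exit(0)            # exit the child without running cleanup
    children.append(pid)       # in the parent

statuses = []
for pid in children:
    _, status = os.waitpid(pid, 0)
    statuses.append(status)
```

The same pattern is what multiprocessing's "fork" start method builds on.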
You can use this technique with PyPy.
Unfortunately, the CPython implementation effectively degrades the effect to copy-on-read, because its garbage collector stores reference counts inside the objects themselves. Whenever a reference to an object is taken or released, the virtual memory page that holds it has to be copied into a private memory allocation – effectively un-shared.
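You can watch this happen with sys.getrefcount: merely binding another name to an object bumps the count stored in the object's own header, dirtying its page even though the program only "read" it:

```python
import sys

data = ["preloaded", "static", "data"]   # any heap object will do

before = sys.getrefcount(data)
alias = data                  # a read-only use: just taking a reference
after = sys.getrefcount(data)

# The count lives inside the object itself, so even this "read"
# wrote to the memory page that holds the object.
assert after == before + 1
```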
It was discussed a little bit in this thread on the "python-ideas" list.
Basically, while you could mark some objects (such as interned strings) as "eternal", and put them in a separate memory allocation, the problem remains for objects that do need garbage-collection (which is usually most of them).
I think this comment puts it nicely:
Forks are common for this, and for good reason; often times languages that start with reference counting end up taking large dependencies on external libraries that manipulate reference counts (like Python with its Py_INCREF and Py_DECREF), at which point the language can't fully move to tracing GC without breaking its ecosystem.
Ruby gets it right
Interestingly, the Ruby 2.0 reference implementation did just that: it moved the GC's mark bits out of the objects themselves and into a separate bitmap:
What this means is that Ruby 2.0 can now mark all of the in-use structures during the “mark” portion of the GC processing without actually modifying the structures themselves, allowing Unix to continue to share memory across different Ruby processes!
And somehow, their ecosystem isn't expected to break. Note that Ruby 2.0 is only in its second release candidate as of this writing.
Apparently, "using bitmaps for mark & sweep GC dates from the early 70's."
There really ain't much new under the sun. Too bad every generation has to rediscover all this stuff.
But this might not be entirely true:
I may be totally wrong, and the paper is entirely in Japanese, but I think there is more to it.
If there's something to it, then perhaps CPython should take a lesson from it. Note that Rubinius has a garbage collector that does support the copy-on-write technique, as does JRuby.
I can't help thinking that it's unfortunate that Python 3 came out and broke everything without fixing this aspect of its garbage collector. Perhaps there are good reasons for it.
All is not lost
Since currently neither PyPy nor CPython can use multiple cores within a single process, let's consider multi-threading on a single core.
So racy programs, even under GIL semantics, are probably incorrect programs. You still need to use locks to write correct multi-threaded programs. The only advantage that GIL semantics affords is that these buggy, racy programs don’t segfault or cause arbitrary memory corruption.
What Adrian Sampson explains is that Python's data structures aren't thread-safe on CPython due to the granularity of the byte code instructions. What looks like an atomic operation in the Python language typically won't be atomic on the interpreter level. Thus, execution is vulnerable to threads interleaving unpredictably:
votes += 1
It looks atomic, and indeed the abstract syntax supports this idea, but the interpreter executes it as four byte code instructions.
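The split is visible with the dis module (the exact opcode names vary between CPython versions, but the separate load, add, and store steps are always there):

```python
import dis

votes = 0

def cast_vote():
    global votes
    votes += 1  # one statement, several byte code instructions

# Collect the opcode names, then print the disassembly.
ops = [instr.opname for instr in dis.get_instructions(cast_vote)]
dis.dis(cast_vote)
```

A thread switch between the load and the store is what makes concurrent increments lose updates.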
Here's a quick fix:
import sys
sys.setcheckinterval(2**30)  # Never surrender! No locking required!
This will prevent threads from interleaving (see sys.setcheckinterval). Instead, the program explicitly yields control, either due to I/O or using a synchronization primitive from the threading module.
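For shared mutable state, the conventional fix is to guard the read-modify-write with a lock from the threading module (a minimal sketch; the counter and thread counts are illustrative):

```python
import threading

votes = 0                      # illustrative shared counter
votes_lock = threading.Lock()

def cast_votes(n):
    global votes
    for _ in range(n):
        with votes_lock:       # makes the read-modify-write atomic
            votes += 1

threads = [threading.Thread(target=cast_votes, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock, the final count is exactly the number of increments issued, regardless of how the threads interleave.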
I argue that multi-threading is mostly suitable for programs that frequently and cooperatively yield control. Most web apps regularly yield due to I/O. It's not uncommon to issue hundreds of database queries to serve a single request (hopefully cheap ones). This is much better than checking every 100 byte code instructions whether a signal has been given.
What if a thread never yields? Then you'd have to kill the whole process anyway.
The future is concurrent
PyPy's probably going to natively support running programs on multiple cores using software transactional memory.
As a preparation for this promise of real concurrency, perhaps we should start testing our code with a checking interval of 1 – let the threads interleave at will and really bring out the incorrectness of programs that operate on shared data structures without proper locking.
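On Python 3, where sys.setcheckinterval has been superseded by sys.setswitchinterval (measured in seconds rather than instructions), such a stress test might look like this (a sketch; any given run may or may not expose lost updates):

```python
import sys
import threading

# Force thread switches as often as possible -- the Python 3
# analogue of a checking interval of 1.
sys.setswitchinterval(1e-6)

votes = 0

def cast_votes(n):
    global votes
    for _ in range(n):
        votes += 1  # deliberately unprotected read-modify-write

threads = [threading.Thread(target=cast_votes, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With frequent switching, increments are often lost, so votes
# may end up well below 400_000. The race is not deterministic.
print(votes)
```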
That, or take up Adrian Sampson's advice:
Or, more radically, it should forbid implicitly shared state altogether and adopt a Concurrent ML-like channel API or explicit sharing.
In which case, we might just as well use multiple processes. This model also does not suffer from "stop the world" garbage collection in which the entire process is locked – an unpredictable GIL.