Spooky Python File Descriptors

After a recent bug hunt I wanted to share what I found, because it caught me thinking too much in Python abstractions and not enough about what's actually happening. Most people I explained it to were also somewhat surprised, so I thought it might be of interest more broadly. It also helped me understand that context blocks (with open(…) as file:) when handling open files are generally unnecessary. And since it's Halloween here in Canada, I couldn't help but pick a silly title for this post.

When you open a file in Python, it returns an object. That object proxies between Python's and your operating system's concept of a file. On POSIX-like operating systems, that takes the form of an open(2) call through libc. That call returns an integer, the so-called file descriptor. It's an ID the system gives your code that indexes into your process's file descriptor table inside the kernel, which holds all the file-specific metadata the kernel needs to operate on that file for you.

open("/dev/zero")
<_io.TextIOWrapper name='/dev/zero' mode='r' encoding='UTF-8'>
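
You can ask the wrapper object which descriptor it holds with fileno(). In a fresh REPL the number is usually 3, since 0, 1, and 2 are already taken by stdin, stdout, and stderr, but the exact value depends on what else your process has open:

file = open("/dev/zero")
file.fileno()
3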

Nothing new there for anyone who's used Python to read files before. The spooky thing is what happens if you close the file descriptor and reuse the variable.

import os

file = open("/dev/zero")
file.read(5)
os.close(file.fileno())

file = open("/dev/zero")
file.read(5)
os.close(file.fileno())

Take a second to think about what happens if you run this in the Python REPL. Once you've thought about what should happen, go ahead and run it line by line. If you don't have /dev/zero (for example, if you're running on Windows), you can replace those filenames with any file on your system.


The first block does pretty much what you'd expect. It opens /dev/zero and then reads up to five characters from it. The REPL shows us the return value of file.read(5) which is the string '\x00\x00\x00\x00\x00'.

The next line is also not too special. file.fileno() returns the integer file descriptor, and os.close() calls the operating system's close(2) function, passing that integer to it. This is something you probably shouldn't do in Python. But why?

Well, the next block, identical to the first, shows us what can happen. If you ran the code above, when you reached the second file.read(5) you probably saw it raise an exception.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor

Why? In Python, a file object's destructor automatically closes its file descriptor. The bug wasn't actually on the line where we read the file, but on the line above, where we open a new file and simultaneously release the last reference we have to the previous file object by assigning the new file to the variable holding the old reference.

What happens is this: in the first block, you ask the operating system for a file descriptor. The operating system returns some integer for the file, let's say 3. Python creates a new object and stores that number as a property so it can perform operations on the file using the normal libc calls (read(2), write(2), etc.).

When you close the file descriptor directly, the Python object doesn't know the file is now closed. Not only that, but the operating system is now free to reuse that file descriptor. That's exactly what happens when you open the file a second time. The operating system sees that the lowest available file descriptor is 3 and issues it to Python, which creates another object and stores it as a property. Unfortunately for us, there are now two objects that both think they own file descriptor 3.
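
You can watch this aliasing happen in a fresh REPL session. This sketch keeps explicit references to both objects so nothing gets garbage collected yet; the exact descriptor number will vary, but both objects end up reporting the same one:

import os

old = open("/dev/zero")
fd = old.fileno()
os.close(fd)            # close the descriptor behind the object's back

new = open("/dev/zero")
new.fileno() == fd      # the kernel handed the same number back
True
old.closed              # the old object still thinks it owns it
False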

Now the assignment takes place. Python binds the new object to the variable file and decrements the reference count on the old one. At this point, the runtime can garbage collect the old object. To help prevent resource leaks, a file object's destructor automatically calls close(2) on its descriptor for us, since as far as the object knows it's still open. One benefit of this is that you can't leak file descriptors by skipping the context block (with open(…):): when the function that opened a file returns, the object is destroyed and its descriptor closed automatically (at least in CPython, where reference counting makes that happen promptly).

The stage is set. The problem is that our new object has suddenly had its descriptor closed by a different object, because both think they own the same resource. At this point, when you try to read from the new file object, the operating system returns an error because the descriptor is already closed.
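
The simple fix on the Python side is to let the object close itself. Calling file.close() (or using a with block) marks the object as closed, so its destructor won't later close a descriptor it no longer owns:

file = open("/dev/zero")
file.read(5)
file.close()            # the object now knows its descriptor is gone

file = open("/dev/zero")
file.read(5)            # works, even though the descriptor number was reused
file.close()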

Creepy Crawly Bugs

In the example above it's pretty simple to avoid the issue. Honestly, it's kind of a weird set of operations to both manually close the file descriptor (instead of calling file.close()) and reuse a variable holding an unrelated reference like this; I've never run into that exact problem before. What I did have to figure out, though, shares the same file descriptor aliasing problem. It's async!

Python async has become all the rage, it seems. There are a bunch of gotchas with Python's asyncio that can trip people up, and one of them is that forking and async don't really mix well. First, it's often not you doing the forking, but some other library or tool doing it for you. Second, many seem to think Python's event loop is automatically managed for them, that there's always an event loop available to schedule work. Third, many people writing libraries want to support both async and sync callers, and they often do so by shipping the code as async with a set of thin synchronous wrapper functions around it.

import os
import asyncio

async def foo():
	print("bar")

loop = asyncio.new_event_loop()
loop.run_until_complete(foo())

if os.fork():
	exit()

loop.run_until_complete(foo())

If you copy this to a file and run it, you'll see you end up with the same OSError: [Errno 9] Bad file descriptor exception. What's going on?

Well, asyncio is built on top of your operating system's socket file descriptor event queue. On BSD-based systems (like macOS) that's kqueue(2). On Linux, epoll(2). For everything else, there's select(2). You can see how it all works by looking at the source of the CPython select module.
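
If you're curious which backend your platform picks, the selectors module (which asyncio's selector event loops are built on) will tell you:

import selectors

sel = selectors.DefaultSelector()
print(type(sel))        # selectors.KqueueSelector on macOS,
                        # selectors.EpollSelector on Linux,
                        # selectors.SelectSelector as the fallback
sel.close()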

What this means, though, is that the Python async loop is built on top of a file descriptor, one that can be closed. In fact, the select module explicitly sets it up to be closed when the program forks, likely to prevent unaware children from receiving events they don't expect and to prevent resource leaks. The problem shows up when libraries that previously weren't async suddenly start using asyncio internally inside a synchronous codebase that forks.
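
If you control the forking code yourself, the workaround is straightforward: don't reuse the parent's loop in the child, create a fresh one instead. A sketch of the earlier example, adjusted:

import os
import asyncio

async def foo():
	print("bar")

loop = asyncio.new_event_loop()
loop.run_until_complete(foo())

if os.fork():
	exit()

# The parent's loop may sit on a descriptor that no longer exists in the
# child, so the child builds its own loop instead of reusing the old one.
loop = asyncio.new_event_loop()
loop.run_until_complete(foo())
loop.close()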

Often I see these libraries have a pattern that in its simplest form looks something like this:

import asyncio

async def async_foo():
	print("bar")

def foo():
	loop = asyncio.get_event_loop()
	return loop.run_until_complete(async_foo())

The problem is that the default event loop may not exist. One way to prevent this as a library author is to not assume there's a global event loop. Definitely don't use asyncio.run(): it installs its own event loop, closes it when it finishes, leaves no current loop behind, and refuses to run at all if a loop is already running. Instead, create a new loop of your own.

import asyncio

async def async_foo():
	print("bar")

def foo():
	loop = asyncio.new_event_loop()
	try:
		return loop.run_until_complete(async_foo())
	finally:
		loop.close()

A bit of a tip if you're maintaining both sync and async classes or submodules: generate the entire synchronous set at import time so you don't have to maintain two copies of the same code. For example:

import asyncio
import inspect
import functools

class AsyncFoo:
	"""
	Main way to interact with the Foo.
	"""
	async def foo(self):
		print("bar")

class Foo:
	"""
	Synchronous wrapper class for interacting with the Foo.
	"""
	def __init__(self):
		self._async = AsyncFoo()

	@staticmethod
	def _setup_async_proxy():
		"""
		Statically proxy async methods of AsyncFoo in Foo at import
		so unittest.mock.patch() can use autospec.
		"""

		def async_proxy(name, method):
			@functools.wraps(method)
			def wrapper(self, *args, **kwargs):
				loop = asyncio.new_event_loop()
				try:
					func = getattr(self._async, name)
					return loop.run_until_complete(func(*args, **kwargs))
				finally:
					loop.close()

			return wrapper

		for name, method in vars(AsyncFoo).items():
			if name[0] != "_" and inspect.iscoroutinefunction(method):
				setattr(Foo, name, async_proxy(name, method))

Foo._setup_async_proxy()
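
With that in place, synchronous callers never have to think about the event loop at all:

foo = Foo()
foo.foo()    # prints "bar", blocking until the coroutine has completed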

It'd be really great if coroutine objects could get a method like .run_sync() that made it easier for synchronous code to execute asynchronous code, possibly bypassing all the asynchronous queuing altogether and just blocking. Then you could run val = foo().run_sync() in synchronous code and we could skip all this.
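
In the meantime, you can get most of that ergonomics from a tiny helper. The run_sync() function below is hypothetical glue, not something asyncio ships, and like the wrappers above it only works when no loop is already running in the current thread:

import asyncio

def run_sync(coro):
	"""Block until the given coroutine completes and return its result."""
	loop = asyncio.new_event_loop()
	try:
		return loop.run_until_complete(coro)
	finally:
		loop.close()

async def foo():
	return "bar"

val = run_sync(foo())    # val == "bar"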

Anyway, all the best. Hope this helped you understand a bit more of what's happening inside asyncio, avoid a puzzling situation, and maybe even simplify your work maintaining dual sync and async ecosystems.