You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

574 lines
21 KiB

<html>
<head>
<style>
body {
max-width: 40em;
margin: 2em;
}
a {
color: inherit;
}
a:hover {
color: red;
}
dt { margin-top: 1em; }
dd { margin-bottom: 1em; }
table {
margin: 1em 0;
padding: 0;
}
table th {
text-align: left;
padding: 0.2em;
white-space: nowrap;
}
table td {
padding: 0.2em;
text-align: left;
white-space: nowrap;
}
</style>
<title>Asynchronous I/O in Windows for Unix Programmers</title>
</head>
<body>
<h1>Asynchronous I/O in Windows for Unix Programmers</h1>
<p>Ryan Dahl ryan@joyent.com
<p>This document assumes you are familiar with how non-blocking socket I/O
is done in Unix.
<p>The syscall
<a href="http://msdn.microsoft.com/en-us/library/ms740141(v=VS.85).aspx"><code>select</code>
is available in Windows</a>
but <code>select</code> processing is O(n) in the number of file descriptors
unlike the modern constant-time multiplexers like epoll which makes select
unacceptable for high-concurrency servers.
This document will describe how high-concurrency programs are
designed in Windows.
<p>
Instead of <a href="http://en.wikipedia.org/wiki/Epoll">epoll</a>
or
<a href="http://en.wikipedia.org/wiki/Kqueue">kqueue</a>,
Windows has its own I/O multiplexer called
<a href="http://msdn.microsoft.com/en-us/library/aa365198(VS.85).aspx">I/O completion ports</a>
(IOCPs). IOCPs are the objects used to poll
<a href="http://msdn.microsoft.com/en-us/library/ms686358(v=vs.85).aspx">overlapped I/O</a>
for completion. IOCP polling is constant time (REF?).
<p>
The fundamental variation is that in a Unix you generally ask the kernel to
wait for state change in a file descriptor's readability or writablity. With
overlapped I/O and IOCPs the programmers waits for asynchronous function
calls to complete.
For example, instead of waiting for a socket to become writable and then
using <a
href="http://www.kernel.org/doc/man-pages/online/pages/man2/send.2.html"><code>send(2)</code></a>
on it, as you commonly would do in a Unix, with overlapped I/O you
would rather <a
href="http://msdn.microsoft.com/en-us/library/ms742203(v=vs.85).aspx"><code>WSASend()</code></a>
the data and then wait for it to have been sent.
<p> Unix non-blocking I/O is not beautiful. A principle abstraction in Unix
is the unified treatment of many things as files (or more precisely as file
descriptors). <code>write(2)</code>, <code>read(2)</code>, and
<code>close(2)</code> work with TCP sockets just as they do on regular
files. Well&mdash;kind of. Synchronous operations work similarly on different
types of file descriptors but once demands on performance drive you to world of
<code>O_NONBLOCK</code> various types of file descriptors can act quite
different for even the most basic operations. In particular,
regular file system files do <i>not</i> support non-blocking operations.
(Disturbingly no man page mentions this rather important fact.)
For example, one cannot poll on a regular file FD for readability expecting
it to indicate when it is safe to do a non-blocking read.
Regular file are always readable and <code>read(2)</code> calls
<i>always</i> have the possibility of blocking the calling thread for an
unknown amount of time.
<p>POSIX has defined <a
href="http://pubs.opengroup.org/onlinepubs/007908799/xsh/aio.h.html">an
asynchronous interface</a> for some operations but implementations for
many Unixes have unclear status. On Linux the
<code>aio_*</code> routines are implemented in userland in GNU libc using
pthreads.
<a
href="http://www.kernel.org/doc/man-pages/online/pages/man2/io_submit.2.html"><code>io_submit(2)</code></a>
does not have a GNU libc wrapper and has been reported <a
href="http://voinici.ceata.org/~sana/blog/?p=248"> to be very slow and
possibly blocking</a>. <a
href="http://download.oracle.com/docs/cd/E19253-01/816-5171/aio-write-3rt/index.html">Solaris
has real kernel AIO</a> but it's unclear what its performance
characteristics are for socket I/O as opposed to disk I/O.
Contemporary high-performance Unix socket programs use non-blocking
file descriptors with a I/O multiplexer&mdash;not POSIX AIO.
Common practice for accessing the disk asynchronously is still done using custom
userland thread pools&mdash;not POSIX AIO.
<p>Windows IOCPs does support both sockets and regular file I/O which
greatly simplifies the handling of disks. For example,
<a href="http://msdn.microsoft.com/en-us/library/aa365468(v=VS.85).aspx"><code>ReadFileEx()</code></a>
operates on both.
As a first example let's look at how <code>ReadFile()</code> works.
<pre>
typedef void* HANDLE;
BOOL ReadFile(HANDLE file,
void* buffer,
DWORD numberOfBytesToRead,
DWORD* numberOfBytesRead,
OVERLAPPED* overlapped);
</pre>
<p>
The function has the possibility of executing the read synchronously
or asynchronously. A synchronous operation is indicated by
returning 0 and <a
href="http://msdn.microsoft.com/en-us/library/ms741580(v=VS.85).aspx">WSAGetLastError()</code></a>
returning <code>WSA_IO_PENDING</code>.
When <code>ReadFile()</code> operates asynchronously the
the user-supplied <a
href="http://msdn.microsoft.com/en-us/library/ms741665(v=VS.85).aspx"><code>OVERLAPPED*</code></a>
is a handle to the incomplete operation.
<pre>
typedef struct {
unsigned long* Internal;
unsigned long* InternalHigh;
union {
struct {
WORD Offset;
WORD OffsetHigh;
};
void* Pointer;
};
HANDLE hEvent;
} OVERLAPPED;
</pre>
To poll on the completion of one of these functions,
use an IOCP, <code>overlapped->hEvent</code>, and
<a
href="http://msdn.microsoft.com/en-us/library/aa364986(v=vs.85).aspx"><code>GetQueuedCompletionStatus()</code></a>.
<h3>Simple TCP Connection Example</h3>
<p>To demonstrate the use of <code>GetQueuedCompletionStatus()</code> an
example of connecting to <code>localhost</code> at port 8000 is presented.
<pre>
char* buffer[200];
WSABUF b = { buffer, 200 };
size_t bytes_recvd;
int r, total_events;
OVERLAPPED overlapped;
HANDLE port;
port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, NULL, 0);
if (!port) {
goto error;
}
r = WSARecv(socket, &b, 1, &bytes_recvd, NULL, &overlapped, NULL);
CreateIoCompletionPort(port, &overlapped.hEvent,
if (r == 0) {
if (WSAGetLastError() == WSA_IO_PENDING) {
/* Asynchronous */
GetQueuedCompletionStatus()
if (r == WAIT_TIMEOUT) {
printf("Timeout\n");
} else {
}
} else {
/* Error */
printf("Error %d\n", WSAGetLastError());
}
} else {
/* Synchronous */
printf("read %ld bytes from socket\n", bytes_recvd);
}
</pre>
<h3>Previous Work</h3>
<p> Writing code that can take advantage of the best worlds on across Unix operating
systems and Windows is very difficult, requiring one to understand intricate
APIs and undocumented details from many different operating systems. There
are several projects which have made attempts to provide an abstraction
layer but in the author's opinion, none are completely satisfactory.
<p><b>Marc Lehmann's
<a href="http://software.schmorp.de/pkg/libev.html">libev</a> and
<a href="http://software.schmorp.de/pkg/libeio.html">libeio</a>.</b>
libev is the perfect minimal abstraction of the Unix I/O multiplexers. It
includes several helpful tools like <code>ev_async</code>, which is for
asynchronous notification, but the main piece is the <code>ev_io</code>,
which informs the user about the state of file descriptors. As mentioned
before, in general it is not possible to get state changes for regular
files&mdash;and even if it were the <code>write(2)</code> and
<code>read(2)</code> calls do not guarantee that they won't block.
Therefore libeio is provided for calling various disk-related
syscalls in a managed thread pool. Unfortunately the abstraction layer
which libev targets is not appropriate for IOCPs&mdash;libev works strictly
with file descriptors and does not the concept of a <i>socket</i>.
Furthermore users on Unix will be using libeio for file I/O which is not
ideal for porting to Windows. On windows libev currently uses
<code>select()</code>&mdash;which is limited to 64 file descriptors per
thread.
<p><b><a href="http://monkey.org/~provos/libevent/">libevent</a>.</b>
Somewhat bulkier than libev with code for RPC, DNS, and HTTP included.
Does not support file I/O.
libev was
created after Lehmann <a
href="http://www.mail-archive.com/libevent-users@monkey.org/msg00753.html">evaluated
libevent and rejected it</a>&mdash;it's interesting to read his reasons
why. <a
href="http://google-opensource.blogspot.com/2010/01/libevent-20x-like-libevent-14x-only.html">A
major rewrite</a> was done for version 2 to support Windows IOCPs but <a
href="http://www.mail-archive.com/libevent-users@monkey.org/msg01730.html">anecdotal
evidence</a> suggests that it is still not working correctly.
<p><b><a
href="http://www.boost.org/doc/libs/1_43_0/doc/html/boost_asio.html">Boost
ASIO</a>.</b> It basically does what you want on Windows and Unix for
sockets. That is, epoll on Linux, kqueue on Macintosh, IOCPs on Windows.
It does not support file I/O. In the author's opinion is it too large
for a not extremely difficult problem (~300 files, ~12000 semicolons).
<h2>File Types</h2>
<p>
Almost every socket operation that you're familiar with has an
overlapped counter-part. The following section tries to pair Windows
overlapped I/O syscalls with non-blocking Unix ones.
<h3>TCP Sockets</h3>
TCP Sockets are by far the most important stream to get right.
Servers should expect to be handling tens of thousands of these
per thread, concurrently. This is possible with overlapped I/O in Windows if
one is careful to avoid Unix-ism like file descriptors. (Windows has a
hard limit of 2048 open file descriptors&mdash;see
<a
href="http://msdn.microsoft.com/en-us/library/6e3b887c.aspx"><code>_setmaxstdio()</code></a>.)
<dl>
<dt><code>send(2)</code>, <code>write(2)</code></dt>
<dd>Windows:
<a href="http://msdn.microsoft.com/en-us/library/ms742203(v=vs.85).aspx"><code>WSASend()</code></a>,
<a href="http://msdn.microsoft.com/en-us/library/aa365748(v=VS.85).aspx"><code>WriteFileEx()</code></a>
</dd>
<dt><code>recv(2)</code>, <code>read(2)</code></dt>
<dd>
Windows:
<a href="http://msdn.microsoft.com/en-us/library/ms741688(v=VS.85).aspx"><code>WSARecv()</code></a>,
<a href="http://msdn.microsoft.com/en-us/library/aa365468(v=VS.85).aspx"><code>ReadFileEx()</code></a>
</dd>
<dt><code>connect(2)</code></dt>
<dd>
Windows: <a href="http://msdn.microsoft.com/en-us/library/ms737606(VS.85).aspx"><code>ConnectEx()</code></a>
<p>
Non-blocking <code>connect()</code> is has difficult semantics in
Unix. The proper way to connect to a remote host is this: call
<code>connect(2)</code> while it returns
<code>EINPROGRESS</code> poll on the file descriptor for writablity.
Then use
<pre>int error;
socklen_t len = sizeof(int);
getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &len);</pre>
A zero <code>error</code> indicates that the connection succeeded.
(Documented in <code>connect(2)</code> under <code>EINPROGRESS</code>
on the Linux man page.)
</dd>
<dt><code>accept(2)</code></dt>
<dd>
Windows: <a href="http://msdn.microsoft.com/en-us/library/ms737524(v=VS.85).aspx"><code>AcceptEx()</code></a>
</dd>
<dt><code>sendfile(2)</code></dt>
<dd>
Windows: <a href="http://msdn.microsoft.com/en-us/library/ms740565(v=VS.85).aspx"><code>TransmitFile()</code></a>
<p> The exact API of <code>sendfile(2)</code> on Unix has not been agreed
on yet. Each operating system does it slightly different. All
<code>sendfile(2)</code> implementations (except possibly FreeBSD?) are blocking
even on non-blocking sockets.
<ul>
<li><a href="http://www.kernel.org/doc/man-pages/online/pages/man2/sendfile.2.html">Linux <code>sendfile(2)</code></a>
<li><a href="http://www.freebsd.org/cgi/man.cgi?query=sendfile&sektion=2">FreeBSD <code>sendfile(2)</code></a>
<li><a href="http://www.manpagez.com/man/2/sendfile/">Darwin <code>sendfile(2)</code></a>
</ul>
Marc Lehmann has written <a
href="https://github.com/joyent/node/blob/2c185a9dfd3be8e718858b946333c433c375c295/deps/libeio/eio.c#L954-1080">a
portable version in libeio</a>.
</dd>
<dt><code>shutdown(2)</code>, graceful close, half-duplex connections</dt>
<dd>
<a
href="http://msdn.microsoft.com/en-us/library/ms738547(v=VS.85).aspx">Graceful
Shutdown, Linger Options, and Socket Closure</a>
<br/>
<a
href="http://msdn.microsoft.com/en-us/library/ms737757(VS.85).aspx"><code>DisconnectEx()</code></a>
</dd>
<dt><code>close(2)</code></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/ms737582(v=VS.85).aspx"><code>closesocket()</code></a>
</dd>
The following are nearly same in Windows overlapped and Unix
non-blocking sockets. The only difference is that the Unix variants
take integer file descriptors while Windows uses <code>SOCKET</code>.
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/ms740496(v=VS.85).aspx"><code>sockaddr</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms737550(v=VS.85).aspx"><code>bind()</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms738543(v=VS.85).aspx"><code>getsockname()</code></a>
</ul>
<h3>Named Pipes</h3>
Windows has "named pipes" which are more or less the same as <a
href="http://www.kernel.org/doc/man-pages/online/pages/man7/unix.7.html"><code>AF_Unix</code>
domain sockets</a>. <code>AF_Unix</code> sockets exist in the file system
often looking like
<pre>/tmp/<i>pipename</i></pre>
Windows named pipes have a path, but they are not directly part of the file
system; instead they look like
<pre>\\.\pipe\<i>pipename</i></pre>
<dl>
<dt><code>socket(AF_Unix, SOCK_STREAM, 0), bind(2), listen(2)</code></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/aa365150(VS.85).aspx"><code>CreateNamedPipe()</code></a>
<p>Use <code>FILE_FLAG_OVERLAPPED</code>, <code>PIPE_TYPE_BYTE</code>,
<code>PIPE_NOWAIT</code>.
</dd>
<dt><code>send(2)</code>, <code>write(2)</code></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/aa365748(v=VS.85).aspx"><code>WriteFileEx()</code></a>
</dd>
<dt><code>recv(2)</code>, <code>read(2)</code></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/aa365468(v=VS.85).aspx"><code>ReadFileEx()</code></a>
</dd>
<dt><code>connect(2)</code></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/aa365150(VS.85).aspx"><code>CreateNamedPipe()</code></a>
</dd>
<dt><code>accept(2)</code></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/aa365146(v=VS.85).aspx"><code>ConnectNamedPipe()</code></a>
</dd>
</dl>
Examples:
<ul>
<li><a
href="http://msdn.microsoft.com/en-us/library/aa365601(v=VS.85).aspx">Named
Pipe Server Using Completion Routines</a>
<li><a
href="http://msdn.microsoft.com/en-us/library/aa365603(v=VS.85).aspx">Named
Pipe Server Using Overlapped I/O</a>
</ul>
<h3>Regular Files</h3>
<p>
In Unix file system files are not able to use non-blocking I/O. There are
some operating systems that have asynchronous I/O but it is not standard and
at least on Linux is done with pthreads in GNU libc. For this reason
applications designed to be portable across different Unixes must manage a
thread pool for issuing file I/O syscalls.
<p>
The situation is better in Windows: true overlapped I/O is available when
reading or writing a stream of data to a file.
<dl>
<dt><code>write(2)</code></dt>
<dd> Windows:
<a href="http://msdn.microsoft.com/en-us/library/aa365748(v=VS.85).aspx"><code>WriteFileEx()</code></a>
<p>Solaris's event completion ports has true in-kernel async writes with <a
href="http://download.oracle.com/docs/cd/E19253-01/816-5171/aio-write-3rt/index.html">aio_write(3RT)</a>
</dd>
<dt><code>read(2)</code></dt>
<dd> Windows:
<a href="http://msdn.microsoft.com/en-us/library/aa365468(v=VS.85).aspx"><code>ReadFileEx()</code></a>
<p>Solaris's event completion ports has true in-kernel async reads with <a
href="http://download.oracle.com/docs/cd/E19253-01/816-5171/aio-read-3rt/index.html">aio_read(3RT)</a>
</dd>
</dl>
<h3>Console/TTY</h3>
<p>It is (usually?) possible to poll a Unix TTY file descriptor for
readability or writablity just like a TCP socket&mdash;this is very helpful
and nice. In Windows the situation is worse, not only is it a completely
different API but there are not overlapped versions to read and write to the
TTY. Polling for readability can be accomplished by waiting in another
thread with <a
href="http://msdn.microsoft.com/en-us/library/ms685061(VS.85).aspx"><code>RegisterWaitForSingleObject()</code></a>.
<dl>
<dt><code>read(2)</code></dt>
<dd>
<a
href="http://msdn.microsoft.com/en-us/library/ms684958(v=VS.85).aspx"><code>ReadConsole()</code></a>
and
<a
href="http://msdn.microsoft.com/en-us/library/ms684961(v=VS.85).aspx"><code>ReadConsoleInput()</code></a>
do not support overlapped I/O and there are no overlapped
counter-parts. One strategy to get around this is
<pre><a href="http://msdn.microsoft.com/en-us/library/ms685061(VS.85).aspx">RegisterWaitForSingleObject</a>(&tty_wait_handle, tty_handle,
tty_want_poll, NULL, INFINITE, WT_EXECUTEINWAITTHREAD |
WT_EXECUTEONLYONCE)</pre>
which will execute <code>tty_want_poll()</code> in a different thread.
You can use this to notify the calling thread that
<code>ReadConsoleInput()</code> will not block.
</dd>
<dt><code>write(2)</code></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/ms687401(v=VS.85).aspx"><code>WriteConsole()</code></a>
is also blocking but this is probably acceptable.
</dd>
<dt><a
href="http://www.kernel.org/doc/man-pages/online/pages/man3/tcsetattr.3.html"><code>tcsetattr(3)</code></a></dt>
<dd>
<a href="http://msdn.microsoft.com/en-us/library/ms686033(VS.85).aspx"><code>SetConsoleMode()</code></a>
</dd>
</dl>
<h2>Assorted Links</h2>
<p>
tips
<ul>
<li> overlapped = non-blocking.
<li> There is no overlapped <a href="http://msdn.microsoft.com/en-us/library/ms738518(VS.85).aspx"><code>GetAddrInfoEx()</code></a> function. It seems Asynchronous Procedure Calls must be used instead.
<li> <a href=http://msdn.microsoft.com/en-us/library/ms740673(VS.85).aspx"><code>Windows Sockets 2</code></a>
</ul>
<p>
IOCP:
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/ms686358(v=vs.85).aspx">Synchronization and Overlapped Input and Output</a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms741665(v=VS.85).aspx"><code>OVERLAPPED</code> Structure</a>
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/ms683209(v=VS.85).aspx"><code>GetOverlappedResult()</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms683244(v=VS.85).aspx"><code>HasOverlappedIoCompleted()</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/aa363792(v=vs.85).aspx"><code>CancelIoEx()</code></a>
&mdash; cancels an overlapped operation.
</ul>
<li><a href="http://msdn.microsoft.com/en-us/library/ms742203(v=vs.85).aspx"><code>WSASend()</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms741688(v=VS.85).aspx"><code>WSARecv()</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms737606(VS.85).aspx"><code>ConnectEx()</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms740565(v=VS.85).aspx"><code>TransmitFile()</code></a>
&mdash; an async <code>sendfile()</code> for windows.
<li><a href="http://msdn.microsoft.com/en-us/library/ms741565(v=VS.85).aspx"><code>WSADuplicateSocket()</code></a>
&mdash; describes how to share a socket between two processes.
<li id="setmaxstdio"><a href="http://msdn.microsoft.com/en-us/library/6e3b887c.aspx"><code>_setmaxstdio()</code></a>
&mdash; something like setting the maximum number of file decriptors
and <a
href="http://www.kernel.org/doc/man-pages/online/pages/man2/setrlimit.2.html"><code>setrlimit(3)</code></a>
AKA <code>ulimit -n</code>. Note the file descriptor limit on windows is
2048.
</ul>
<p>
APC:
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/ms681951(v=vs.85).aspx">Asynchronous Procedure Calls</a>
<li><a href="http://msdn.microsoft.com/en-us/library/ms682016"><code>DNSQuery()</code></a>
&mdash; General purpose DNS query function like <code>res_query()</code> on Unix.
</ul>
Pipes:
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/aa365781(v=VS.85).aspx"><code>Pipe functions</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/aa365150(VS.85).aspx"><code>CreateNamedPipe</code></a>
<li><a href="http://msdn.microsoft.com/en-us/library/aa365144(v=VS.85).aspx"><code>CallNamedPipe</code></a>
&mdash; like <code>accept</code> is for Unix pipes.
<li><a href="http://msdn.microsoft.com/en-us/library/aa365146(v=VS.85).aspx"><code>ConnectNamedPipe</code></a>
</ul>
<code>WaitForMultipleObjectsEx</code> is pronounced "wait for multiple object sex".
Also useful:
<a
href="http://msdn.microsoft.com/en-us/library/xw1ew2f8(v=vs.80).aspx">Introduction
to Visual C++ for Unix Users</a>
<br><br>
<a
href="http://ebookbrowse.com/network-programming-for-microsoft-windows-2nd-edition-2002-pdf-d73663829">Network
Programming For Microsoft Windows 2nd Edition 2002</a>. Juicy details on
page 119.
</body></html>