




Managing NFS and NIS

Mandatory locking and NFS

NLM supports only advisory whole file and byte range locking, and until NFS Version 4 is deployed, this means that the NFS environment cannot support mandatory whole file and byte range locking. The reason goes back to how mandatory locking interacts with advisory fcntl calls.

Let’s suppose a process with ID 1867 issues an fcntl exclusive lock call on the entire range of a local file that has mandatory lock permissions set. This fcntl call is an advisory lock. Now the process attempts to write the file. The operating system can tell that process 1867 holds an advisory lock, and so, it allows the write to proceed, rather than attempting to acquire the advisory lock on behalf of the process 1867 for the duration of the write. Now suppose process 1867 does the same sequence on another file with mandatory lock permissions, but this file is on an NFS filesystem. Process 1867 issues an fcntl exclusive lock call on the entire range of a file that has mandatory lock permissions set. Now process 1867 attempts to write the file. While the NLM protocol has fields in its lock requests to uniquely identify the process on the client that locked the file, the NFS protocol has no fields to identify the processes that are doing writes or reads. The file is advisory locked, and it has the mandatory lock permissions set, yet the NFS server has no way of knowing if the process that sent the write request is the same one that obtained the lock. Thus, the NFS server cannot lock the file on behalf of the NFS client. For this reason, some NFS servers, including Solaris servers, refuse any read or write to a file with the mandatory lock permissions set.
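The fcntl call in this scenario is easy to picture. Below is a minimal sketch (hypothetical file path, error handling trimmed) of the advisory whole-file lock that process 1867 takes; on Linux, the same call is only enforced as a mandatory lock if the file has the setgid bit set with group-execute clear and the filesystem is mounted with -o mand:

/* Minimal sketch: take an advisory exclusive lock on a whole file with
 * fcntl(), as process 1867 does in the example above.
 * The file path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/nfs/datafile", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive (write) lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock the entire file */
    };

    /* F_SETLKW blocks until the lock is granted; on an NFS mount this
     * request is forwarded to the server via the NLM protocol. */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        perror("fcntl");
        return 1;
    }

    /* ... read/write the file. The lock is advisory: the kernel does not
     * check it on each write unless mandatory locking is enabled on the
     * file (setgid set, group-exec clear, filesystem mounted with mand). */

    close(fd);   /* closing the file releases the lock */
    return 0;
}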

/*
 * Open an existing file or directory.
 * The may_flags argument indicates the type of open (read/write/lock)
 * and additional flags.
 * N.B. After this call fhp needs an fh_put
 */
__be32
nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
			int may_flags, struct file **filp)
{
	struct path	path;
	struct inode	*inode;
	int		flags = O_RDONLY|O_LARGEFILE;
	__be32		err;
	int		host_err = 0;

	/*
	 * If we get here, then the client has already done an "open",
	 * and (hopefully) checked permission - so allow OWNER_OVERRIDE
	 * in case a chmod has now revoked permission.
	 *
	 * Arguably we should also allow the owner override for
	 * directories, but we never have and it doesn't seem to have
	 * caused anyone a problem.  If we were to change this, note
	 * also that our filldir callbacks would need a variant of
	 * lookup_one_len that doesn't check permissions.
	 */
	if (type == S_IFREG)
		may_flags |= NFSD_MAY_OWNER_OVERRIDE;
	err = fh_verify(rqstp, fhp, type, may_flags);
	if (err)
		goto out;

	path.mnt = fhp->fh_export->ex_path.mnt;
	path.dentry = fhp->fh_dentry;
	inode = path.dentry->d_inode;

	/* Disallow write access to files with the append-only bit set
	 * or any access when mandatory locking enabled
	 */
	err = nfserr_perm;
	if (IS_APPEND(inode) && (may_flags & NFSD_MAY_WRITE))
		goto out;
	/*
	 * We must ignore files (but only files) which might have mandatory
	 * locks on them because there is no way to know if the accesser has
	 * the lock.
	 */
	if (S_ISREG((inode)->i_mode) && mandatory_lock(inode))
		goto out;

	if (!inode->i_fop)
		goto out;

	host_err = nfsd_open_break_lease(inode, may_flags);
	if (host_err) /* NOMEM or WOULDBLOCK */
		goto out_nfserr;

	if (may_flags & NFSD_MAY_WRITE) {
		if (may_flags & NFSD_MAY_READ)
			flags = O_RDWR|O_LARGEFILE;
		else
			flags = O_WRONLY|O_LARGEFILE;
	}
	*filp = dentry_open(&path, flags, current_cred());
	if (IS_ERR(*filp)) {
		host_err = PTR_ERR(*filp);
		*filp = NULL;
	} else {
		host_err = ima_file_check(*filp, may_flags);

		if (may_flags & NFSD_MAY_64BIT_COOKIE)
			(*filp)->f_mode |= FMODE_64BITHASH;
		else
			(*filp)->f_mode |= FMODE_32BITHASH;
	}
out_nfserr:
	err = nfserrno(host_err);
out:
	return err;
}


/*
 * Write data to a file.
 * The stable flag requests synchronous writes.
 * N.B. After this call fhp needs an fh_put
 */
__be32
nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
		loff_t offset, struct kvec *vec, int vlen, unsigned long *cnt,
		int *stablep)
{
	__be32			err = 0;

	if (file) {
		err = nfsd_permission(rqstp, fhp->fh_export, fhp->fh_dentry,
				NFSD_MAY_WRITE|NFSD_MAY_OWNER_OVERRIDE);
		if (err)
			goto out;
		err = nfsd_vfs_write(rqstp, fhp, file, offset, vec, vlen, cnt,
				stablep);
	} else {
		err = nfsd_open(rqstp, fhp, S_IFREG, NFSD_MAY_WRITE, &file);
		if (err)
			goto out;

		if (cnt)
			err = nfsd_vfs_write(rqstp, fhp, file, offset, vec, vlen,
					     cnt, stablep);
		fput(file);
	}
out:
	return err;
}


/*
 * Check for a user's access permissions to this inode.
 */
__be32
nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
					struct dentry *dentry, int acc)
{
	struct inode	*inode = dentry->d_inode;
	int		err;

	if ((acc & NFSD_MAY_MASK) == NFSD_MAY_NOP)
		return 0;
#if 0
	dprintk("nfsd: permission 0x%x%s%s%s%s%s%s%s mode 0%o%s%s%s\n",
		acc,
		(acc & NFSD_MAY_READ)?	" read"  : "",
		(acc & NFSD_MAY_WRITE)?	" write" : "",
		(acc & NFSD_MAY_EXEC)?	" exec"  : "",
		(acc & NFSD_MAY_SATTR)?	" sattr" : "",
		(acc & NFSD_MAY_TRUNC)?	" trunc" : "",
		(acc & NFSD_MAY_LOCK)?	" lock"  : "",
		(acc & NFSD_MAY_OWNER_OVERRIDE)? " owneroverride" : "",
		inode->i_mode,
		IS_IMMUTABLE(inode)?	" immut" : "",
		IS_APPEND(inode)?	" append" : "",
		__mnt_is_readonly(exp->ex_path.mnt)?	" ro" : "");
	dprintk("      owner %d/%d user %d/%d\n",
		inode->i_uid, inode->i_gid, current_fsuid(), current_fsgid());
#endif

	/* Normally we reject any write/sattr etc access on a read-only file
	 * system.  But if it is IRIX doing check on write-access for a
	 * device special file, we ignore rofs.
	 */
	if (!(acc & NFSD_MAY_LOCAL_ACCESS))
		if (acc & (NFSD_MAY_WRITE | NFSD_MAY_SATTR | NFSD_MAY_TRUNC)) {
			if (exp_rdonly(rqstp, exp) ||
			    __mnt_is_readonly(exp->ex_path.mnt))
				return nfserr_rofs;
			if (/* (acc & NFSD_MAY_WRITE) && */ IS_IMMUTABLE(inode))
				return nfserr_perm;
		}
	if ((acc & NFSD_MAY_TRUNC) && IS_APPEND(inode))
		return nfserr_perm;

	if (acc & NFSD_MAY_LOCK) {
		/* If we cannot rely on authentication in NLM requests,
		 * just allow locks, otherwise require read permission, or
		 * ownership
		 */
		if (exp->ex_flags & NFSEXP_NOAUTHNLM)
			return 0;
		else
			acc = NFSD_MAY_READ | NFSD_MAY_OWNER_OVERRIDE;
	}
	/*
	 * The file owner always gets access permission for accesses that
	 * would normally be checked at open time. This is to make
	 * file access work even when the client has done a fchmod(fd, 0).
	 *
	 * However, `cp foo bar' should fail nevertheless when bar is
	 * readonly. A sensible way to do this might be to reject all
	 * attempts to truncate a read-only file, because a creat() call
	 * always implies file truncation.
	 * ... but this isn't really fair.  A process may reasonably call
	 * ftruncate on an open file descriptor on a file with perm 000.
	 * We must trust the client to do permission checking - using "ACCESS"
	 * with NFSv3.
	 */
	if ((acc & NFSD_MAY_OWNER_OVERRIDE) &&
	    uid_eq(inode->i_uid, current_fsuid()))
		return 0;

	/* This assumes NFSD_MAY_{READ,WRITE,EXEC} == MAY_{READ,WRITE,EXEC} */
	err = inode_permission(inode, acc & (MAY_READ|MAY_WRITE|MAY_EXEC));

	/* Allow read access to binaries even when mode 111 */
	if (err == -EACCES && S_ISREG(inode->i_mode) &&
	    (acc == (NFSD_MAY_READ | NFSD_MAY_OWNER_OVERRIDE) ||
	     acc == (NFSD_MAY_READ | NFSD_MAY_READ_IF_EXEC)))
		err = inode_permission(inode, MAY_EXEC);

	return err? nfserrno(err) : 0;
}


static __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
				loff_t offset, struct kvec *vec, int vlen,
				unsigned long *cnt, int *stablep)
{
	struct svc_export	*exp;
	struct dentry		*dentry;
	struct inode		*inode;
	mm_segment_t		oldfs;
	__be32			err = 0;
	int			host_err;
	int			stable = *stablep;
	int			use_wgather;
	loff_t			pos = offset;

	dentry = file->f_path.dentry;
	inode = dentry->d_inode;
	exp   = fhp->fh_export;

	use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);

	if (!EX_ISSYNC(exp))
		stable = 0;

	/* Write the data. */
	oldfs = get_fs(); set_fs(KERNEL_DS);
	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
	set_fs(oldfs);
	if (host_err < 0)
		goto out_nfserr;
	*cnt = host_err;
	nfsdstats.io_write += host_err;
	fsnotify_modify(file);

	/* clear setuid/setgid flag after write */
	if (inode->i_mode & (S_ISUID | S_ISGID))
		kill_suid(dentry);

	if (stable) {
		if (use_wgather)
			host_err = wait_for_concurrent_writes(file);
		else
			host_err = vfs_fsync_range(file, offset, offset+*cnt, 0);
	}

out_nfserr:
	dprintk("nfsd: write complete host_err=%d\n", host_err);
	if (host_err >= 0)
		err = 0;
	else
		err = nfserrno(host_err);
	return err;
}


Notes on the NFS mount options hard and soft

Notes from looking into the NFS mount options hard and soft (Linux only).


Behavior of hard
Behavior of soft
  • Use hard when reading and writing data that requires integrity.
    • Otherwise, incomplete writes*2 or incomplete reads*3 can occur.
  • Also use hard for filesystems that hold executables.
    • If the NFS server crashes while an executable's data is being read into memory, or while a paged-out page is being read back in, the program may behave unexpectedly*4.


soft / hard

Determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.

NB: A so-called "soft" timeout can cause silent data corruption in certain cases. As such, use the soft option only when client responsiveness is more important than data integrity. Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.


retrans=n

The number of times the NFS client retries a request before it attempts further recovery action. If the retrans option is not specified, the NFS client tries each request three times.

The NFS client generates a "server not responding" message after retrans retries, then attempts further recovery (depending on whether the hard mount option is in effect).

intr / nointr

This option is provided for backward compatibility. It is ignored after kernel 2.6.25.

nfs(5) - Linux manual page
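For example (hypothetical server name and mount points; option values chosen only for illustration), the difference looks like this on the mount command line:

$ mount -t nfs -o hard,intr nfssrv:/export/data /mnt/data
$ mount -t nfs -o soft,retrans=10,timeo=600 nfssrv:/export/tmp /mnt/tmp

With the first mount, requests to a crashed server are retried indefinitely (but interruptibly, thanks to intr); with the second, the client returns an error to the application once retrans retransmissions have failed.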

Managing NFS and NIS

  • 6.3. Mounting filesystems - Mount options


By default, NFS filesystems are hard mounted, and operations on them are retried until they are acknowledged by the server. If the soft option is specified, an NFS RPC call returns a timeout error if it fails the number of times specified by the retrans option.

  • 6.3. Mounting filesystems - Mounting filesystems - Hard and soft mounts

Hard and soft mounts

The hard and soft mount options determine how a client behaves when the server is excessively loaded for a long period or when it crashes. By default, all NFS filesystems are mounted hard, which means that an RPC call that times out will be retried indefinitely until a response is received from the server. This makes the NFS server look as much like a local disk as possible — the request that needs to go to disk completes at some point in the future. An NFS server that crashes looks like a disk that is very, very slow.

A side effect of hard-mounting NFS filesystems is that processes block (or “hang”) in a high-priority disk wait state until their NFS RPC calls complete. If an NFS server goes down, the clients using its filesystems hang if they reference these filesystems before the server recovers. Using intr in conjunction with the hard mount option allows users to interrupt system calls that are blocked waiting on a crashed server. The system call is interrupted when the process making the call receives a signal, usually sent by the user typing CTRL-C (interrupt) or using the kill command. CTRL-\ (quit) is another way to generate a signal, as is logging out of the NFS client host. When using kill, only SIGINT, SIGQUIT, and SIGHUP will interrupt NFS operations.

When an NFS filesystem is soft-mounted, repeated RPC call failures eventually cause the NFS operation to fail as well. Instead of emulating a painfully slow disk, a server exporting a soft-mounted filesystem looks like a failing disk when it crashes: system calls referencing the soft-mounted NFS filesystem return errors. Sometimes the errors can be ignored or are preferable to blocking at high priority; for example, if you were doing an ls -l when the NFS server crashed, you wouldn’t really care if the ls command returned an error as long as your system didn’t hang.

The other side to this “failing disk” analogy is that you never want to write data to an unreliable device, nor do you want to try to load executables from it. You should not use the soft option on any filesystem that is writable, nor on any filesystem from which you load executables. Furthermore, because many applications do not check the return value of the read(2) system call when reading regular files (because those programs were written in the days before networking was ubiquitous, and disks were reliable enough that reads from disks virtually never failed), you should not use the soft option on any filesystem that is supplying input to applications that are in turn using the data for a mission-critical purpose. NFS only guarantees the consistency of data after a server crash if the NFS filesystem was hard-mounted by the client. Unless you really know what you are doing, never use the soft option.

We’ll come back to hard- and soft-mount issues when we discuss modifying client behavior in the face of slow NFS servers in Chapter 18.

  • 18.2. Soft mount issues

Repeated retransmission cycles only occur for hard-mounted filesystems. When the soft option is supplied in a mount, the RPC retransmission sequence ends at the first major timeout, producing messages like:

NFS write failed for server wahoo: error 5 (RPC: Timed out)
NFS write error on host wahoo: error 145.
(file handle: 800000 2 a0000 114c9 55f29948 a0000 11494 5cf03971)

The NFS operation that failed is indicated, the server that failed to respond before the major timeout, and the filehandle of the file affected. RPC timeouts may be caused by extremely slow servers, or they can occur if a server crashes and is down or rebooting while an RPC retransmission cycle is in progress.

With soft-mounted filesystems, you have to worry about damaging data due to incomplete writes, losing access to the text segment of a swapped process, and making soft-mounted filesystems more tolerant of variances in server response time. If a client does not give the server enough latitude in its response time, the first two problems impair both the performance and correct operation of the client. If write operations fail, data consistency on the server cannot be guaranteed. The write error is reported to the application during some later call to write( ) or close( ), which is consistent with the behavior of a local filesystem residing on a failing or overflowing disk. When the actual write to disk is attempted by the kernel device driver, the failure is reported to the application as an error during the next similar or related system call.

A well-conditioned application should exit abnormally after a failed write, or retry the write if possible. If the application ignores the return code from write( ) or close( ), then it is possible to corrupt data on a soft-mounted filesystem. Some write operations may fail and never be retried, leaving holes in the open file.

To guarantee data integrity, all filesystems mounted read-write should be hard-mounted. Server performance as well as server reliability determine whether a request eventually succeeds on a soft-mounted filesystem, and neither can be guaranteed. Furthermore, any operating system that maps executable images directly into memory (such as Solaris) should hard-mount filesystems containing executables. If the filesystem is soft-mounted, and the NFS server crashes while the client is paging in an executable (during the initial load of the text segment or to refill a page frame that was paged out), an RPC timeout will cause the paging to fail. What happens next is system-dependent; the application may be terminated or the system may panic with unrecoverable swap errors.

A common objection to hard-mounting filesystems is that NFS clients remain catatonic until a crashed server recovers, due to the infinite loop of RPC retransmissions and timeouts. By default, Solaris clients allow interrupts to break the retransmission loop. Use the intr mount option if your client doesn’t specify interrupts by default. Unfortunately, some older implementations of NFS do not process keyboard interrupts until a major timeout has occurred: with even a small timeout period and retransmission count, the time required to recognize an interrupt can be quite large.

If you choose to ignore this advice, and choose to use soft-mounted NFS filesystems, you should at least make NFS clients more tolerant of soft-mounted NFS fileservers by increasing the retrans mount option. Increasing the number of attempts to reach the server makes the client less likely to produce an RPC error during brief periods of server loading.


  • Note that the material above does not address whether NFS should be used in the first place for data that requires integrity, or as a place to put executables.

"Reducing Memory Access Latency" が素晴らしすぎる

Reducing Memory Access Latency by Satoru Moriya (Hitachi LTC)



REDUCING MEMORY ACCESS LATENCY | Hitachi Data Systems appears to be the same material.

*1 The nfs(5) man page says this option is ignored from kernel 2.6.25 onward.






oracle.jdbc.ReadTimeout is the socket read timeout

Notes on what I found out about oracle.jdbc.ReadTimeout in the Oracle JDBC Thin Driver.

The slides by Oracle ACE id:yamadamn explain this clearly.






  • In the Oracle JDBC Thin Driver, oracle.jdbc.ReadTimeout (in milliseconds) is the timeout for socket reads
  • Why ReadTimeout should be greater than setQueryTimeout
    • With a SELECT that streams the result set back a little at a time there is no problem, but with a long-running SELECT whose result set does not come back until the very end, or with DML such as UPDATE or DELETE, the request can exceed ReadTimeout and time out even though setQueryTimeout has not been exceeded


$ export CLASSPATH=./ojdbc6.jar:.
$ javac TestReadTimeout.java 
$ strace -ff -Ttt -s 200 -o strace_java_log java TestReadTimeout
Error code: 17002
SQL state: 08006
java.sql.SQLRecoverableException: IO Error: Socket read timed out ★ timed out due to ReadTimeout
	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1057)
	at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1336)
	at oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:1916)
	at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:1878)
	at oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:318)
	at TestReadTimeout.main(TestReadTimeout.java:23)
Caused by: oracle.net.ns.NetException: Socket read timed out
	at oracle.net.ns.Packet.receive(Packet.java:347)
	at oracle.net.ns.DataPacket.receive(DataPacket.java:106)
	at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:324)
	at oracle.net.ns.NetInputStream.read(NetInputStream.java:268)
	at oracle.net.ns.NetInputStream.read(NetInputStream.java:190)
	at oracle.net.ns.NetInputStream.read(NetInputStream.java:107)
	at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:124)
	at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:80)
	at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1137)
	at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:350)
	at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
	at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:195)
	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1036)
	... 5 more
java.sql.SQLRecoverableException: Closed Connection
	at oracle.jdbc.driver.PhysicalConnection.needLine(PhysicalConnection.java:5416)
	at oracle.jdbc.driver.OracleStatement.closeOrCache(OracleStatement.java:1585)
	at oracle.jdbc.driver.OracleStatement.close(OracleStatement.java:1570)
	at oracle.jdbc.driver.OracleStatementWrapper.close(OracleStatementWrapper.java:94)
	at TestReadTimeout.main(TestReadTimeout.java:41)
  • Looking at the call stack with pstack
    • The pstack output below was captured from a separate run.
    • A Java thread dump would be easier to read than pstack, though.
$ java TestReadTimeout &
[1] 23636
$ pstack 23636
Thread 14 (Thread 0x7fc066ebb700 (LWP 23641)):
#0  0x0000003c9b2df0d3 in poll () from /lib64/libc.so.6 ★ waiting in the poll system call
#1  0x00007fc057aec39e in ?? () from /usr/lib/jvm/java-1.7.0-openjdk-
#2  0x00007fc057ae7c3c in Java_java_net_SocketInputStream_socketRead0 () from /usr/lib/jvm/java-1.7.0-openjdk- ★ Java_java_net_SocketInputStream_socketRead0 is being called
#3  0x00007fc05d012d98 in ?? ()
#4  0x000000000000137d in ?? ()
#5  0x00007fc05d0132ac in ?? ()
#6  0x00007fc066eb9d90 in ?? ()
#7  0x00007fc06770c940 in ?? () from /usr/lib/jvm/java-1.7.0-openjdk-
#8  0x00007fc05d0061d4 in ?? ()
#9  0x0000000000000000 in ?? ()
$ view strace_java_log.19377
12:50:53.661210 sendto(8, "\0V\0\0\6\0\0\0\0\0\3^\1\2\1!\0\1\1\30\1\1\r\0\0\0\0\2\177\370\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0call dbms_lock.sleep(10)\1\1\1\1\0\0\0\0\0\0\0\0\0\0\0", 86, 0, NULL, 0) = 86 <0.000027>
★ The sendto above issues dbms_lock.sleep(10), which sleeps for 10 seconds
12:50:53.661321 poll([{fd=8, events=POLLIN|POLLERR}], 1, 4989) = 0 (Timeout) <4.994016>
★ The poll above exceeds the roughly 5 seconds configured via oracle.jdbc.ReadTimeout and times out
12:50:58.655483 lseek(3, 30275151, SEEK_SET) = 30275151 <0.000012>
12:50:58.655567 read(3, "PK\3\4\n\0\0\10\0\0g\10\256F\347pK\356\f\2\0\0\f\2\0\0%\0\0\0", 30) = 30 <0.000011>
12:50:58.655642 lseek(3, 30275218, SEEK_SET) = 30275218 <0.000009>
12:50:58.655705 read(3, "\312\376\272\276\0\0\0003\0\32\n\0\4\0\26\n\0\4\0\27\7\0\30\7\0\31\1\0\20serialVersionUID\1\0\1J\1\0\rConstantValue\5\205:^J\376scT\1\0\6<init>\1\0\25(Ljava/lang/String;)V\1\0\4Code\1\0\17LineNumberTable\1\0\22LocalVariableTable\1\0\4this\1\0!Ljava/net/SocketTimeoutException;\1\0\3m"..., 524) = 524 <0.000010>


Since server-side Java in the wild probably still runs Java 6 more often than not, I looked at the Java 6 source code. Java 7 is much the same, as far as I recall.

    /**
     *  Enable/disable SO_TIMEOUT with the specified timeout, in ★
     *  milliseconds.  With this option set to a non-zero timeout, ★
     *  a read() call on the InputStream associated with this Socket ★
     *  will block for only this amount of time.  If the timeout expires, ★
     *  a <B>java.net.SocketTimeoutException</B> is raised, though the ★
     *  Socket is still valid. The option <B>must</B> be enabled
     *  prior to entering the blocking operation to have effect. The
     *  timeout must be > 0.
     *  A timeout of zero is interpreted as an infinite timeout.
     *
     * @param timeout the specified timeout, in milliseconds.
     * @exception SocketException if there is an error
     * in the underlying protocol, such as a TCP error.
     * @since   JDK 1.1
     * @see #getSoTimeout()
     */
    public synchronized void setSoTimeout(int timeout) throws SocketException {
        if (isClosed())
            throw new SocketException("Socket is closed");
        if (timeout < 0)
          throw new IllegalArgumentException("timeout can't be negative");

        getImpl().setOption(SocketOptions.SO_TIMEOUT, new Integer(timeout)); ★
    }
    /**
     * Reads into a byte array <i>b</i> at offset <i>off</i>,
     * <i>length</i> bytes of data.
     * @param b the buffer into which the data is read
     * @param off the start offset of the data
     * @param len the maximum number of bytes read
     * @return the actual number of bytes read, -1 is
     *          returned when the end of the stream is reached.
     * @exception IOException If an I/O error has occurred.
     */
    public int read(byte b[], int off, int length) throws IOException {
        int n;

        // EOF already encountered
        if (eof) {
            return -1;
        }

        // connection reset
        if (impl.isConnectionReset()) {
            throw new SocketException("Connection reset");
        }

        // bounds check
        if (length <= 0 || off < 0 || off + length > b.length) {
            if (length == 0) {
                return 0;
            }
            throw new ArrayIndexOutOfBoundsException();
        }

        boolean gotReset = false;

        // acquire file descriptor and do the read
        FileDescriptor fd = impl.acquireFD();
        try {
            n = socketRead0(fd, b, off, length, impl.getTimeout()); ★ calls socketRead0
            if (n > 0) {
                return n;
            }
        } catch (ConnectionResetException rstExc) {
            gotReset = true;
        } finally {
            impl.releaseFD();
        }

        /*
         * We receive a "connection reset" but there may be bytes still
         * buffered on the socket
         */
        if (gotReset) {
            impl.setConnectionResetPending();
            impl.acquireFD();
            try {
                n = socketRead0(fd, b, off, length, impl.getTimeout()); ★ calls socketRead0
                if (n > 0) {
                    return n;
                }
            } catch (ConnectionResetException rstExc) {
            } finally {
                impl.releaseFD();
            }
        }
        // (EOF / connection-reset handling follows in the original source)
    }
    /**
     * Reads into an array of bytes at the specified offset using
     * the received socket primitive.
     * @param fd the FileDescriptor
     * @param b the buffer into which the data is read
     * @param off the start offset of the data
     * @param len the maximum number of bytes read
     * @param timeout the read timeout in ms ★
     * @return the actual number of bytes read, -1 is
     *          returned when the end of the stream is reached.
     * @exception IOException If an I/O error has occurred.
     */
    private native int socketRead0(FileDescriptor fd, ★ declared as a native (JNI) method
                                   byte b[], int off, int len,
                                   int timeout)
        throws IOException;
/*
 * Class:     java_net_SocketInputStream
 * Method:    socketRead0
 * Signature: (Ljava/io/FileDescriptor;[BIII)I
 */
JNIEXPORT jint JNICALL
Java_java_net_SocketInputStream_socketRead0(JNIEnv *env, jobject this,
                                            jobject fdObj, jbyteArray data,
                                            jint off, jint len, jint timeout)
{
    char BUF[MAX_BUFFER_LEN];
    char *bufP;
    jint fd, nread;

    if (IS_NULL(fdObj)) {
        /* should't this be a NullPointerException? -br */
        JNU_ThrowByName(env, JNU_JAVANETPKG "SocketException",
                        "Socket closed");
        return -1;
    } else {
        fd = (*env)->GetIntField(env, fdObj, IO_fd_fdID);
        /* Bug 4086704 - If the Socket associated with this file descriptor
         * was closed (sysCloseFD), the the file descriptor is set to -1.
         */
        if (fd == -1) {
            JNU_ThrowByName(env, "java/net/SocketException", "Socket closed");
            return -1;
        }
    }

    /*
     * If the read is greater than our stack allocated buffer then
     * we allocate from the heap (up to a limit)
     */
    if (len > MAX_BUFFER_LEN) {
        if (len > MAX_HEAP_BUFFER_LEN) {
            len = MAX_HEAP_BUFFER_LEN;
        }
        bufP = (char *)malloc((size_t)len);
        if (bufP == NULL) {
            bufP = BUF;
            len = MAX_BUFFER_LEN;
        }
    } else {
        bufP = BUF;
    }

    if (timeout) { ★
        nread = NET_Timeout(fd, timeout);
        if (nread <= 0) {
            if (nread == 0) {
                JNU_ThrowByName(env, JNU_JAVANETPKG "SocketTimeoutException", ★
                            "Read timed out");
            } else if (nread == JVM_IO_ERR) {
                if (errno == EBADF) {
                     JNU_ThrowByName(env, JNU_JAVANETPKG "SocketException", "Socket closed");
                 } else {
                     NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "SocketException",
                                                  "select/poll failed");
                 }
            } else if (nread == JVM_IO_INTR) {
                JNU_ThrowByName(env, JNU_JAVAIOPKG "InterruptedIOException",
                            "Operation interrupted");
            }
            if (bufP != BUF) {
                free(bufP);
            }
            return -1;
        }
    }

    nread = NET_Read(fd, bufP, len);

    if (nread <= 0) {
        if (nread < 0) {

            switch (errno) {
                case ECONNRESET:
                case EPIPE:
                    JNU_ThrowByName(env, "sun/net/ConnectionResetException",
                        "Connection reset");
                    break;

                case EBADF:
                    JNU_ThrowByName(env, JNU_JAVANETPKG "SocketException",
                        "Socket closed");
                    break;

                case EINTR:
                     JNU_ThrowByName(env, JNU_JAVAIOPKG "InterruptedIOException",
                           "Operation interrupted");
                     break;

                default:
                    NET_ThrowByNameWithLastError(env,
                        JNU_JAVANETPKG "SocketException", "Read failed");
            }
        }
    } else {
        (*env)->SetByteArrayRegion(env, data, off, nread, (jbyte *)bufP);
    }

    if (bufP != BUF) {
        free(bufP);
    }
    return nread;
}

(src/solaris/native/java/net/SocketInputStream.c)
(Incidentally, the OpenJDK Makefiles pick up the Unix native sources from the src/solaris tree, which appears to be why the file above lives under src/solaris even on Linux:)

#   Compiler emits things like:  path/file.o: file.h
#   We want something like: relative_path/file.o relative_path/file.d: file.h
CC_DEPEND_FILTER = $(SED) -e 's!$*\.$(OBJECT_SUFFIX)!$(dir $@)& $(dir $@)$*.$(DEPEND_SUFFIX)!g'

  PLATFORM_SRC = $(BUILDDIR)/../src/solaris

# Platform specific closed sources
ifndef OPENJDK
    CLOSED_PLATFORM_SRC = $(BUILDDIR)/../src/closed/solaris
endif

  • TestReadTimeout.java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;

public class TestReadTimeout {
        public static void main(String args[]) {
                Connection conn = null;
                Statement stmt = null;
                ResultSet resultSet = null;
                try {
                        Class.forName("oracle.jdbc.driver.OracleDriver");
                        java.util.Properties info = new java.util.Properties();
                        info.put("user", "scott");
                        info.put("password", "tiger");
                        info.put("oracle.jdbc.ReadTimeout", "4989");
                        // the connection URL is a placeholder; adjust host/SID for your environment
                        conn = DriverManager.getConnection(
                                "jdbc:oracle:thin:@localhost:1521:orcl", info);
                        for (;;) {
                                stmt = conn.createStatement();
                                stmt.execute("call dbms_lock.sleep(10)");
                        }
                } catch (SQLException e) {
                        System.out.println("Error code: " + e.getErrorCode());
                        System.out.println("SQL state: " + e.getSQLState());
                        e.printStackTrace();
                } catch (ClassNotFoundException e) {
                        e.printStackTrace();
                } finally {
                        try {
                                if (resultSet != null) {
                                        resultSet.close();
                                }
                        } catch (SQLException e) {
                        }
                        try {
                                if (stmt != null) {
                                        stmt.close();
                                }
                        } catch (SQLException e) {
                        }
                        try {
                                if (conn != null) {
                                        conn.close();
                                }
                        } catch (SQLException e) {
                        }
                }
        }
}


How to check and estimate the size of Linux page tables

I looked into how to check the size of page tables on Linux kernel 2.6 (x86-64) and how to estimate it.



  • Page table size for the whole OS
$ cat /proc/meminfo 
MemTotal:       16158544 kB
MemFree:        13134056 kB
PageTables:        34428 kB ★ 34MB
$ cat /proc/10225/status # 10225 is the PID
Name:	zsh
State:	S (sleeping)
Tgid:	10225
Pid:	10225
PPid:	10222
VmPTE:	     124 kB ★ 124KB


(physical memory used by the process / 4 KB (page size)) * 8 bytes

More precisely, PTEs are allocated in sets of 512 entries of 8 bytes each, and the x86-64 page size is 4 KB, so I believe the formula is:

ROUNDUP((physical memory used by the process / 4 KB (page size)) / 512 entries) * 4 KB
= physical memory used by the process / 512

For an Oracle Database with a 1 GB SGA (shared memory), the page table size one process uses for the shared memory is:

( 1,073,741,824 / 512 ) = 2,097,152 = 2MB
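As a cross-check, here is the same estimate in C (a sketch of the formula above; the function name pte_bytes and the constants are mine, chosen to match the x86-64 values already given):

/* Estimate the page table size a process needs to map a given amount of
 * physical memory on x86-64, per the formula above: one 4 KB PTE page
 * holds 512 entries of 8 bytes and so covers 2 MB of mappings. */
#include <stdio.h>

#define PAGE_SIZE      4096UL   /* 4 KB base page         */
#define PTE_SIZE       8UL      /* size of one PTE entry  */
#define PTRS_PER_PTE   512UL    /* entries per PTE page   */

static unsigned long pte_bytes(unsigned long rss_bytes)
{
    unsigned long pages     = (rss_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    unsigned long pte_pages = (pages + PTRS_PER_PTE - 1) / PTRS_PER_PTE; /* ROUNDUP */
    return pte_pages * PTRS_PER_PTE * PTE_SIZE;   /* = pte_pages * 4 KB */
}

int main(void)
{
    unsigned long sga = 1024UL * 1024 * 1024;     /* 1 GB SGA */
    printf("PTE size for 1 GB: %lu KB\n", pte_bytes(sga) >> 10);  /* prints 2048 KB */
    return 0;
}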




  • Since the mainframe era, operating systems have used virtual memory, which lets them use more than the installed physical memory (physical memory + swap space) as memory.
    • By default, Linux could (if I remember correctly) even virtually allocate more than physical memory + swap (overcommit).
  • Virtual memory divides the virtual address space either by paging (fixed-size units) or by segmentation (variable-size units); most operating systems use paging, I believe.
  • With paging, the address translation table (the mapping between virtual page numbers and physical page numbers) is stored in a data structure called the page table.
  • Page tables live in kernel space, not user space; they are not included in the user-space memory usage of a process that you can see with ps or pmap.
  • Virtual memory with paging is realized by hardware called the memory management unit (MMU), and the OS implements its part according to the MMU's specification.

Linux page tables

Page tables with Oracle Database on Linux
  • Oracle Database is multi-process and uses shared memory, so with a large SGA (shared memory) and many sessions, the page tables pile up little by little into something large.
  • Spending several to tens of gigabytes on page tables, which are just kernel bookkeeping, is wasteful; with a large SGA and many sessions, using HugePages shrinks the page tables and saves memory.
  • A normal page is 4 KB, while a HugePage is 2 MB. Since each page is 512 times larger, far fewer PTEs are needed to manage the pages, and the page tables shrink accordingly.
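For example, to map a 1 GB SGA: with 4 KB pages, 1 GB / 4 KB = 262,144 PTEs * 8 bytes = about 2 MB of page tables per process; with 2 MB HugePages, 1 GB / 2 MB = 512 PTEs * 8 bytes = about 4 KB per process.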


To be added: test results showing that when 100 sessions connect to an instance with a 1 GB SGA, each session used a little under 400 KB and the 100 sessions together used roughly 40 MB of page tables.

  • Memory usage and page table size before connecting 100 sessions
  • Connect 100 sessions
  • The page table size grows
  • The page table size per process is about ●● KB
  • Disconnecting the 100 sessions frees the page tables

From the Linux kernel source code (apart from the quoted passages)

id:naoya's blog entry explains this clearly, so I will quote it as is.

Understanding the details of /proc/<PID>/status output

/proc/<PID>/status prints a detailed breakdown of a process's memory usage, which makes it very handy. It is worth knowing exactly what each line means. The Linux kernel source has documentation of sorts in Documentation/filesystems/proc.txt, but unfortunately it does not go into detail.

So, let's look at the source. It is a bit old, but we will walk through linux-2.6.23. When /proc/<PID>/status is read, the proc_pid_status() function in fs/proc/array.c is called.

int proc_pid_status(struct task_struct *task, char * buffer)
{
    char * orig = buffer;
    struct mm_struct *mm = get_task_mm(task);

    buffer = task_name(task, buffer);
    buffer = task_state(task, buffer);

    if (mm) {
        buffer = task_mem(mm, buffer);
        mmput(mm);
    }
    buffer = task_sig(task, buffer);
    buffer = task_cap(task, buffer);
    buffer = cpuset_task_status_allowed(task, buffer);
#if defined(CONFIG_S390)
    buffer = task_show_regs(task, buffer);
#endif
    return buffer - orig;
}

The task argument is the process descriptor (task_struct) of the process whose /proc/<PID>/status was read, and task->mm gives you the memory descriptor (mm_struct). The values in the memory-related lines of the status output are stored in the memory descriptor.

proc_pid_status() obtains the memory descriptor with get_task_mm(task), then task_mem(mm, buffer) pulls the needed values out of the memory descriptor and builds the output. task_mem() was implemented as follows.

char *task_mem(struct mm_struct *mm, char *buffer)
{
    unsigned long data, text, lib;
    unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;

    /*
     * Note: to minimize their overhead, mm maintains hiwater_vm and
     * hiwater_rss only when about to *lower* total_vm or rss.  Any
     * collector of these hiwater stats must therefore get total_vm
     * and rss too, which will usually be the higher.  Barriers? not
     * worth the effort, such snapshots can always be inconsistent.
     */
    hiwater_vm = total_vm = mm->total_vm;
    if (hiwater_vm < mm->hiwater_vm)
        hiwater_vm = mm->hiwater_vm;
    hiwater_rss = total_rss = get_mm_rss(mm);
    if (hiwater_rss < mm->hiwater_rss)
        hiwater_rss = mm->hiwater_rss;

    data = mm->total_vm - mm->shared_vm - mm->stack_vm;
    text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
    lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
    buffer += sprintf(buffer,
        "VmPeak:\t%8lu kB\n"
        "VmSize:\t%8lu kB\n"
        "VmLck:\t%8lu kB\n"
        "VmHWM:\t%8lu kB\n"
        "VmRSS:\t%8lu kB\n"
        "VmData:\t%8lu kB\n"
        "VmStk:\t%8lu kB\n"
        "VmExe:\t%8lu kB\n"
        "VmLib:\t%8lu kB\n"
        "VmPTE:\t%8lu kB\n",
        hiwater_vm << (PAGE_SHIFT-10),
        (total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
        mm->locked_vm << (PAGE_SHIFT-10),
        hiwater_rss << (PAGE_SHIFT-10),
        total_rss << (PAGE_SHIFT-10),
        data << (PAGE_SHIFT-10),
        mm->stack_vm << (PAGE_SHIFT-10), text, lib,
        (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
    return buffer;
}

Looking at this implementation should make the meaning of each line of status clear.

あるプロセスが利用しているメモリサイズを procfs 経由で調べる - naoyaのはてなダイアリー

You can see that VmPTE is computed with the following expression:

(PTRS_PER_PTE * sizeof(pte_t) * mm->nr_ptes) >> 10
= 512 entries * 8 bytes (size of one PTE entry) * number of PTE sets, in KB
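Plugging in the zsh process shown earlier: VmPTE = 124 KB corresponds to 124 KB / 4 KB = 31 PTE pages (mm->nr_ptes = 31), enough for up to 31 * 512 = 15,872 mapped 4 KB pages, i.e. roughly 62 MB of address space.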
/* entries per page directory level */
#define PTRS_PER_PTE	512

typedef struct {
	unsigned long pte;
} pte_t;
  • On x86_64, pte_t (an unsigned long) was 8 bytes.
    • Confirmed with the crash command
# crash
crash> struct pte_t
typedef struct {
    pteval_t pte;
} pte_t;
$ cat pte_size.c 
#include <stdio.h>

int main(void) {
    typedef struct {
        unsigned long pte;
    } pte_t;
    printf("Size of pte_t: %zubytes\n", sizeof(pte_t));
    return 0;
}
$ gcc -m64 -o pte_size pte_size.c
$ ./pte_size 
Size of pte_t: 8bytes
int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
{
	pgtable_t new = pte_alloc_one(mm, address);
	if (!new)
		return -ENOMEM;

	/*
	 * Ensure all pte setup (eg. pte page lock and page clearing) are
	 * visible before the pte is made visible to other CPUs by being
	 * put into page tables.
	 *
	 * The other side of the story is the pointer chasing in the page
	 * table walking code (when walking the page table without locking;
	 * ie. most of the time). Fortunately, these data accesses consist
	 * of a chain of data-dependent loads, meaning most CPUs (alpha
	 * being the notable exception) will already guarantee loads are
	 * seen in-order. See the alpha page table accessors for the
	 * smp_read_barrier_depends() barriers in page table walking code.
	 */
	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

	spin_lock(&mm->page_table_lock);
	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
		mm->nr_ptes++;
		pmd_populate(mm, pmd, new);
		new = NULL;
	}
	spin_unlock(&mm->page_table_lock);
	if (new)
		pte_free(mm, new);
	return 0;
}



  • Although not covered in this entry, HugePages also improve the TLB hit ratio.







To Do

  • Investigate how mm->nr_ptes is calculated.
  • Confirm that, with demand paging, a PTE is used only once a page is actually mapped to physical memory.
  • Write up the test results with Oracle Database.
  • Investigate from the kernel source how PageTables in /proc/meminfo is calculated.
  • Write the estimation formula for when HugePages are used.
  • Write about Oracle Database's PRE_PAGE_SGA and LOCK_SGA.


About the Oracle RAC voting disk

オラクルマスター教科書 ORACLE MASTER Expert 【RAC】編(試験番号:1Z0-048)



The CSSD (Cluster Synchronization Services daemon) communicates with the other nodes over the interconnect and records the communication status in the voting disk. If the interconnect fails, I/O to the shared disk can no longer be kept synchronized and the cluster becomes partitioned. Once a node is isolated from the cluster, it can no longer tell whether the other nodes are usable, and unsynchronized access to the same database could leave the database inconsistent. This situation is called "split brain", and the voting disk is used to resolve it.





    • The maximum time allowed (in seconds) before disk I/O to the voting disk is judged to have failed. Once this time elapses, cluster reconfiguration takes place to evict the node. The default is 200 seconds.


サポートエンジニアが語る!RAC 環境のトラブルシューティング







Oracle Database 11g Oracle Real Application Clusters Handbook, 2nd Edition (Oracle Press)

  • CHAPTER 14 Oracle RAC Troubleshooting
    • Debugging Node Eviction Issues

One of the most common and complex issues in Oracle RAC is performing the root cause analysis (RCA) of the node eviction issues. Oracle evicts a node from the cluster mainly due to one of the following three reasons:

Oracle Grid Infrastructureインストレーション・ガイド 11gリリース2 (11.2) for Linux B56271-12

  • 2.13 Enabling Intelligent Platform Management Interface (IPMI)

Intelligent Platform Management Interface (IPMI) provides a common interface to computer hardware and firmware that system administrators can use to monitor system health and manage the system. In Oracle 11g Release 2, Oracle Clusterware can integrate IPMI to support failure isolation and ensure cluster integrity.




IPMI: monitors the system by providing a common interface to hardware and firmware.

In OUI you can configure a mechanism that uses IPMI to shut down a remote node where a failure has occurred. It can also be configured manually after the installation with OUI has finished.

Using IPMI

To use IPMI with Oracle Clusterware, you can configure IPMI in OUI while installing Grid Infrastructure (entering a user name and password with ADMIN privileges), or set it up manually after the installation finishes.

Also, before configuring IPMI here, IPMI must first be configured and made usable on the OS side.

絵で見てわかるシステム構築のためのOracle設計 (DB Selection)
絵で見てわかるシステム構築のためのOracle設計 (DB Selection)