Discussion:
[patch 1/2] fork_connector: add a fork connector
Guillaume Thouvenin
2005-03-25 10:08:53 UTC
This patch adds a fork connector to the do_fork() routine. When enabled,
it sends a netlink datagram that can be read by a user space
application. In this way, the user space application is alerted whenever
a fork occurs.

It uses the userspace <-> kernelspace connector that works on top of
the netlink protocol. The fork connector is enabled or disabled by
sending a message to the connector. This operation should be done by
only one application. Such an application can be downloaded from
http://cvs.sourceforge.net/viewcvs.py/elsa/elsa_project/utils/fcctl.c

Each message carries a unique sequence number that can be used to check
whether a message was lost. The sequence number is maintained per CPU.
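
For illustration, here is a minimal sketch of a user space listener
(hypothetical example code, not the fcctl.c utility above). It assumes
the CN_IDX_FORK id and the ASCII "cpu parent child" payload added by
this patch, plus the NETLINK_CONNECTOR protocol number used by the
mainline connector; error handling is omitted:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>	/* struct cn_msg */

#ifndef CN_IDX_FORK
#define CN_IDX_FORK 0xfeed	/* must match the patched connector.h */
#endif

int main(void)
{
	char buf[256];
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_pid    = getpid(),
		.nl_groups = CN_IDX_FORK,	/* join the fork group */
	};
	int sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);

	bind(sk, (struct sockaddr *)&addr, sizeof(addr));

	/* the fork connector must have been enabled beforehand,
	 * e.g. with the fcctl.c utility mentioned above */
	for (;;) {
		ssize_t len = recv(sk, buf, sizeof(buf), 0);
		struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		struct cn_msg *msg = NLMSG_DATA(nlh);
		int cpu, parent, child;

		if (len <= 0)
			break;
		/* payload is "cpu parent child" plus a trailing '\0';
		 * msg->seq can be tracked per cpu to detect drops */
		if (sscanf((char *)msg->data, "%d %d %d",
			   &cpu, &parent, &child) == 3)
			printf("cpu %d: %d forked %d (seq %u)\n",
			       cpu, parent, child, msg->seq);
	}
	return 0;
}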

lmbench shows that the overhead (building and sending the message) in
the fork() routine is around 7%.

This patch applies to 2.6.12-rc1-mm3. It also requires some other
patches that fix problems in the connector.c file; at a minimum, you
need to apply the patch provided in the second email.

Signed-off-by: Guillaume Thouvenin <***@bull.net>
---

drivers/connector/Kconfig | 11 ++++
drivers/connector/Makefile | 1
drivers/connector/cn_fork.c | 104 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/connector.h | 8 +++
kernel/fork.c | 44 ++++++++++++++++++
5 files changed, 168 insertions(+)

Index: linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Kconfig
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/drivers/connector/Kconfig 2005-03-25 09:47:09.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Kconfig 2005-03-25 10:14:21.000000000 +0100
@@ -10,4 +10,15 @@ config CONNECTOR
Connector support can also be built as a module. If so, the module
will be called cn.ko.

+config FORK_CONNECTOR
+ bool "Enable fork connector"
+ depends on CONNECTOR=y
+ default y
+ ---help---
+ This adds a connector to the kernel/fork.c:do_fork() function. When a
+ fork occurs, netlink is used to transfer information about the parent
+ and its child. This information can be used by a user space application.
+ The fork connector can be enabled/disabled by sending a message to the
+ connector with the corresponding group id.
+
endmenu
Index: linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Makefile
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/drivers/connector/Makefile 2005-03-25 09:47:09.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Makefile 2005-03-25 10:14:21.000000000 +0100
@@ -1,2 +1,3 @@
obj-$(CONFIG_CONNECTOR) += cn.o
+obj-$(CONFIG_FORK_CONNECTOR) += cn_fork.o
cn-objs := cn_queue.o connector.o
Index: linux-2.6.12-rc1-mm3-cnfork/drivers/connector/cn_fork.c
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/drivers/connector/cn_fork.c 2003-01-30 11:24:37.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/drivers/connector/cn_fork.c 2005-03-25 10:14:21.000000000 +0100
@@ -0,0 +1,104 @@
+/*
+ * cn_fork.c
+ *
+ * 2005 Copyright (c) Guillaume Thouvenin <***@bull.net>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+
+#include <linux/connector.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Guillaume Thouvenin <***@bull.net>");
+MODULE_DESCRIPTION("Enable or disable the usage of the fork connector");
+
+int cn_fork_enable = 0;
+struct cb_id cb_fork_id = { CN_IDX_FORK, CN_VAL_FORK };
+
+static inline void cn_fork_send_status(void)
+{
+ /* TODO */
+ printk(KERN_INFO "cn_fork_enable == %d\n", cn_fork_enable);
+}
+
+/**
+ * cn_fork_callback - enable or disable the fork connector
+ * @data: message sent through the connector
+ *
+ * The callback allows the sending of fork information in the do_fork()
+ * routine to be enabled or disabled. To enable the fork connector, the
+ * user space application must send the integer 1 in the data part of
+ * the message. To disable the fork connector, it must send the integer 0.
+ */
+static void cn_fork_callback(void *data)
+{
+ struct cn_msg *msg = (struct cn_msg *)data;
+ int action;
+
+ if (cn_already_initialized && (msg->len == sizeof(cn_fork_enable))) {
+ memcpy(&action, msg->data, sizeof(cn_fork_enable));
+ switch(action) {
+ case FORK_CN_START:
+ cn_fork_enable = 1;
+ break;
+ case FORK_CN_STOP:
+ cn_fork_enable = 0;
+ break;
+ case FORK_CN_STATUS:
+ cn_fork_send_status();
+ break;
+ }
+ }
+}
+
+/**
+ * cn_fork_init - initialization entry point
+ *
+ * This routine will be run at kernel boot time because this driver is
+ * built into the kernel. It adds the connector callback to the
+ * connector driver.
+ */
+static int cn_fork_init(void)
+{
+ int err;
+
+ err = cn_add_callback(&cb_fork_id, "cn_fork", &cn_fork_callback);
+ if (err) {
+ printk(KERN_WARNING "Failed to register cn_fork\n");
+ return -EINVAL;
+ }
+
+ printk(KERN_NOTICE "cn_fork is registered\n");
+ return 0;
+}
+
+/**
+ * cn_fork_exit - exit entry point
+ *
+ * As this driver is always statically compiled into the kernel,
+ * cn_fork_exit() has no effect.
+ */
+static void cn_fork_exit(void)
+{
+ cn_del_callback(&cb_fork_id);
+}
+
+module_init(cn_fork_init);
+module_exit(cn_fork_exit);
Index: linux-2.6.12-rc1-mm3-cnfork/include/linux/connector.h
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/include/linux/connector.h 2005-03-25 09:47:11.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/include/linux/connector.h 2005-03-25 10:14:21.000000000 +0100
@@ -28,10 +28,16 @@
#define CN_VAL_KOBJECT_UEVENT 0x0000
#define CN_IDX_SUPERIO 0xaabb /* SuperIO subsystem */
#define CN_VAL_SUPERIO 0xccdd
+#define CN_IDX_FORK 0xfeed /* fork events */
+#define CN_VAL_FORK 0xbeef


#define CONNECTOR_MAX_MSG_SIZE 1024

+#define FORK_CN_STOP 0
+#define FORK_CN_START 1
+#define FORK_CN_STATUS 2
+
struct cb_id
{
__u32 idx;
@@ -133,6 +139,8 @@ struct cn_dev
};

extern int cn_already_initialized;
+extern int cn_fork_enable;
+extern struct cb_id cb_fork_id;

int cn_add_callback(struct cb_id *, char *, void (* callback)(void *));
void cn_del_callback(struct cb_id *);
Index: linux-2.6.12-rc1-mm3-cnfork/kernel/fork.c
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/kernel/fork.c 2005-03-25 09:47:11.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/kernel/fork.c 2005-03-25 10:14:21.000000000 +0100
@@ -41,6 +41,7 @@
#include <linux/profile.h>
#include <linux/rmap.h>
#include <linux/acct.h>
+#include <linux/connector.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -63,6 +64,47 @@ DEFINE_PER_CPU(unsigned long, process_co

EXPORT_SYMBOL(tasklist_lock);

+#ifdef CONFIG_FORK_CONNECTOR
+
+#define CN_FORK_INFO_SIZE 64
+#define CN_FORK_MSG_SIZE (sizeof(struct cn_msg) + CN_FORK_INFO_SIZE)
+
+static DEFINE_PER_CPU(unsigned long, fork_counts);
+
+static inline void fork_connector(pid_t parent, pid_t child)
+{
+ if (cn_fork_enable) {
+ struct cn_msg *msg;
+ __u8 buffer[CN_FORK_MSG_SIZE];
+
+ msg = (struct cn_msg *)buffer;
+
+ memcpy(&msg->id, &cb_fork_id, sizeof(msg->id));
+
+ msg->ack = 0; /* not used */
+ msg->seq = get_cpu_var(fork_counts)++;
+
+ /*
+ * size of data is the number of characters
+ * printed plus one for the trailing '\0'
+ */
+ memset(msg->data, '\0', CN_FORK_INFO_SIZE);
+ msg->len = scnprintf(msg->data, CN_FORK_INFO_SIZE-1,
+ "%i %i %i",
+ smp_processor_id(), parent, child) + 1;
+
+ put_cpu_var(fork_counts);
+
+ cn_netlink_send(msg, CN_IDX_FORK);
+ }
+}
+#else
+static inline void fork_connector(pid_t parent, pid_t child)
+{
+ return;
+}
+#endif /* CONFIG_FORK_CONNECTOR */
+
int nr_processes(void)
{
int cpu;
@@ -1253,6 +1295,8 @@ long do_fork(unsigned long clone_flags,
if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE))
ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
}
+
+ fork_connector(current->pid, p->pid);
} else {
free_pidmap(pid);
pid = PTR_ERR(p);

dean gaudet
2005-03-25 22:56:39 UTC
On Fri, 25 Mar 2005, Guillaume Thouvenin wrote:

...
Post by Guillaume Thouvenin
The lmbench shows that the overhead (the construction and the sending
of the message) in the fork() routine is around 7%.
...
Post by Guillaume Thouvenin
+ /*
+ * size of data is the number of characters
+ * printed plus one for the trailing '\0'
+ */
+ memset(msg->data, '\0', CN_FORK_INFO_SIZE);
+ msg->len = scnprintf(msg->data, CN_FORK_INFO_SIZE-1,
+ "%i %i %i",
+ smp_processor_id(), parent, child) + 1;
i'm certain that if you used a struct {} and filled in 3 fields, rather
than zeroing 64 bytes of memory and doing 3 conversions to decimal
ascii, you'd see a marked decrease in the overhead of this. it's not
clear to me why you need ascii here -- the rest of the existing bsd
accounting code is not ascii (i'm assuming the purpose of the fork
connector is accounting).
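
something along these lines, for example (a hypothetical layout just to
illustrate the suggestion; the struct and field names are made up):

struct fork_event {		/* fixed-width binary record */
	__u32 cpu;
	__u32 parent;
	__u32 child;
};

	/* in fork_connector(), instead of memset() + scnprintf(): */
	struct fork_event *ev = (struct fork_event *)msg->data;

	ev->cpu    = smp_processor_id();
	ev->parent = parent;
	ev->child  = child;
	msg->len   = sizeof(*ev);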

-dean
Paul Jackson
2005-03-28 21:50:17 UTC
Post by Guillaume Thouvenin
The lmbench shows that the overhead (the construction and the sending
of the message) in the fork() routine is around 7%.
Thanks for including the numbers. The 7% seems a bit costly, for a bit
more accounting information. Perhaps dean's suggestion, to not use
ascii, will help. I hope so, though I doubt it will make a huge
difference. Was this 7% loss with or without a user level program
consuming the sent messages? I would think that the number of interest
would include a minimal consumer task.

I don't see a good reason to make fork_connector() inline. Since it
calls other subroutines and is not just a few lines, perhaps better to
make it a real routine, so we can see it in "nm --print-size" output and
debug stacks.

Having the "#ifdef CONFIG_FORK_CONNECTOR" chunk of code right in fork.c
seems unfortunate. Can the real fork_connector() be put elsewhere, and
the ifdef put in a header file that makes it a no-op if not configured,
or simply a function declaration, if configured?

What's the status of the connector driver patch? I perhaps wasn't
paying close enough attention, but all I see of it now is a couple of
patches sent to lkml, from Evgeniy Polyakov, in September and January.
I don't see it in my copies of *-mm or recent Linus bk trees. Am I
missing something?

This still seems to me like more apparatus than is desirable, just to
get another form of session id, as best as I can figure it. However
we've already been there, and apparently my concerns were not
persuasive. If one does go down this path, then using this connector
patch is as good an alternative as any I know of. Well, that or relayfs.
My uneducated assumption is that relayfs might at least batch data
packets up into big buffer chunks better, but someone more knowledgeable
than me needs to consider that.

It's a little sad, when almost all the required accounting information
comes out in packed 64 byte records, carefully buffered and sent in
big chunks, to minimize per-task costs. Then this one extra detail,
of <parent-pid, child-pid>, requires an entire netlink packet of
its own -- of what size, another 50 or 100 bytes? Is this packet
received as a separate data packet, on its own recv(2) system call,
by the user task, not in a big block of packets? The efficiency
of getting this one extra <parent-pid, child-pid> out of the kernel
seems to be one or two orders of magnitude worse than the rest of
the accounting data.

===

Hmmm ... perhaps one could add a _second_ accounting file, cutting and
pasting code in kernel/acct.c and enabling writing additional
information to that second file, using the same mechanisms as now used
for the primary file. Use a more extensible record format for the
second file (say start each record with a magic cookie, a byte record
type and a byte record length, then that many bytes). That way, we have
an escape valve for adding additional record types in the future.
And that way we can efficiently write short records, with just say
a couple of interesting values, and minimal overhead.
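
For instance, the record header might look something like this (a
sketch of the proposed format only, not existing kernel code; names and
types are illustrative):

struct acct2_record {
	__u16 magic;	/* magic cookie, for resyncing the stream */
	__u8  type;	/* record type, e.g. a made-up ACCT2_RT_FORK */
	__u8  len;	/* number of payload bytes that follow */
	__u8  data[];	/* 'len' bytes, e.g. parent and child pids */
};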

Don't worry if the magic cookie appears as part of the raw data. If one
has to resync such a data stream, one can look for a series of records,
each starting with the magic cookie, sensible record type byte, and a
length that ends right at the next such valid record. The occasional
duplication of the same cookie within the data stream would not thwart a
resync for long. And the main purpose of the magic cookie is to make
sure you are still in sync, not reverting to garbage-in, garbage-out,
mode. Almost any magic value other than 0x0000 will suffice for that
purpose.

I just ran a silly little test on my PC desktop Linux box, scanning
/proc/kcore. The _least_ common 2 byte word seen was 0x2B91, with 31
instances in a half-billion words scanned, so I nominate that value for
the magic cookie ;).

The key reason that it might make sense here to adapt the existing
accounting file direct write mechanism, rather than using "connector" or
"relayfs", is that we really do want to get this data to disk initially.
Relayfs is optimized for getting a lot of data to a user daemon, and the
connector for sending smaller packets of data to a user daemon. But
accounting processing is sometimes done out of a cron job off-hours.
During the day (the busy hours) you might just want to stash the stuff
with as little performance impact as possible. If one can avoid _any_
other task having to context switch in, in order to get this data on its
way, that is a huge win.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <***@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
Evgeniy Polyakov
2005-03-29 07:12:44 UTC
Post by Paul Jackson
Post by Guillaume Thouvenin
The lmbench shows that the overhead (the construction and the sending
of the message) in the fork() routine is around 7%.
Thanks for including the numbers. The 7% seems a bit costly, for a bit
more accounting information. Perhaps dean's suggestion, to not use
ascii, will help. I hope so, though I doubt it will make a huge
difference. Was this 7% loss with or without a user level program
consuming the sent messages? I would think that the number of interest
would include a minimal consumer task.
There is no overhead at all using CBUS.
On my old P2/256mb SMP machine it took about 950 usec to create+exit a
process, both with the fork connector turned on and without it even
compiled.
A direct connector method call took about 1000-1100 usec.
The current fork connector does not use CBUS [yet, I hope].
Post by Paul Jackson
I don't see a good reason to make fork_connector() inline. Since it
calls other subroutines and is not just a few lines, perhaps better to
make it a real routine, so we can see it in "nm --print-size" output and
debug stacks.
Having the "#ifdef CONFIG_FORK_CONNECTOR" chunk of code right in fork.c
seems unfortunate. Can the real fork_connector() be put elsewhere, and
the ifdef put in a header file that makes it a no-op if not configured,
or simply a function declaration, if configured?
What's the status of the connector driver patch? I perhaps wasn't
paying close enough attention, but all I see of it now is a couple of
patches sent to lkml, from Evgeniy Polyakov, in September and January.
I don't see it in my copies of *-mm or recent Linus bk trees. Am I
missing something?
It was dropped from the -mm tree, since the bk tree where it lives
was in maintenance mode.
I think the connector will appear in the next -mm release.
Post by Paul Jackson
This still seems to me like more apparatus than is desirable, just to
get another form of session id, as best as I can figure it. However
we've already been there, and apparently my concerns were not
persuasive. If one does go down this path, then using this connector
patch is a good an alternative as any I know of. Well, that or relayfs.
My uneducated assumption is that relayfs might at least batch data
packets up into big buffer chunks better, but someone more knowledgeable
than me needs to consider that.
It's a little sad, when almost all the required accounting information
comes out in packed 64 byte records, carefully buffered and sent in
big chunks, to minimize per-task costs. Then this one extra detail,
of <parent-pid, child-pid> requires an entire netlink packet of
its own of what size -- another 50 or 100 bytes? Is this packet
received as a separate data packet, on its own recv(2) system call,
by the user task, not in a big block of packets? The efficiency
of getting this one extra <parent-pid, child-pid> out of the kernel
seems to be one or two orders of magnitude worse than the rest of
the accounting data.
It can be easily changed.
One may send the kernel/acct.c acct_t structure out of the kernel - the
overhead will be the same: kmalloc will probably allocate from the same
256-byte pool, and the skb is still in cache.
Post by Paul Jackson
===
Hmmm ... perhaps one could add a _second_ accounting file, cutting and
pasting code in kernel/acct.c and enabling writing additional
information to that second file, using the same mechanisms as now used
for the primary file. Use a more extensible record format for the
second file (say start each record with a magic cookie, a byte record
type and a byte record length, then that many bytes). That way, we have
an escape valve for adding additional record types in the future.
And that way we can efficiently write short records, with just say
a couple of interesting values, and minimal overhead.
Don't worry if the magic cookie appears as part of the raw data. If one
has to resync such a data stream, one can look for a series of records,
each starting with the magic cookie, sensible record type byte, and a
length that ends right at the next such valid record. The occassional
duplication of the same cookie within the data stream would not thwart a
resync for long. And the main purpose of the magic cookie is to make
sure you are still in sync, not reverting to garbage-in, garbage-out,
mode. Almost any magic value other than 0x0000 will suffice for that
purpose.
I just ran a silly little test on my PC desktop Linux box, scanning
/proc/kcore. The _least_ common 2 byte word seen was 0x2B91, with 31
instances in a half-billion words scanned, so I nominate that value for
the magic cookie ;).
The key reason that it might make sense here to adapt the existing
accounting file direct write mechanism, rather than using "connector" or
"relayfs", is that we really do want to get this data to disk initially.
Relayfs is optimized for getting alot of data to a user daemon, and the
connector for sending smaller packets of data to a user daemon. But
accounting processing is sometimes done out of a cron job off-hours.
During the day (the busy hours) you might just want to stash the stuff
with as little performance impact is possible. If one can avoid _any_
other task having to context switch in, in order to get this data on its
way, that is a huge win.
File-writing accounting [kernel/acct.c] is slower: it takes global
locks and requires process context to work with system calls.
relayfs is an interesting project, but it has different aims as far as
I can see - it was created for transferring huge amounts of data, and
it succeeds at that, while the connector is purely a
control/notification mechanism, for example for gathering short-lived
per-process accounting data.
--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski
Greg KH
2005-03-29 07:25:19 UTC
Post by Evgeniy Polyakov
Post by Paul Jackson
I don't see it in my copies of *-mm or recent Linus bk trees. Am I
missing something?
It was dropped from -mm tree, since bk tree where it lives
was in maintenance mode.
I think connector will be appeared in the next -mm release.
Should have been in the last -mm release. If not, please let me know.

thanks,

greg k-h
Evgeniy Polyakov
2005-03-29 08:34:58 UTC
Post by Greg KH
Post by Evgeniy Polyakov
Post by Paul Jackson
I don't see it in my copies of *-mm or recent Linus bk trees. Am I
missing something?
It was dropped from -mm tree, since bk tree where it lives
was in maintenance mode.
I think connector will be appeared in the next -mm release.
Should have been in the last -mm release. If not, please let me know.
Thank you.
If you are not going to sleep right now, I will recreate the rejected
NLMSGSPACE patch in a few minutes.
Post by Greg KH
thanks,
greg k-h
--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski
Paul Jackson
2005-03-29 09:02:49 UTC
Post by Evgeniy Polyakov
There is no overhead at all using CBUS.
This is unlikely. Very unlikely.

Please understand that I am not trying to critique CBUS or connector in
isolation, but rather trying to determine what mechanism is best suited
for getting this accounting data written to disk, which is where I
assume it has to go until some non-real time job gets around to
analyzing it. We already have the rest of the BSD Accounting
information taking this batched up path directly to the disk. There is
nothing (that I know of) to be gained from delivering this new fork data
with any higher quality of service, or to any other place.

From what I can understand, correct me if I'm wrong, we have two
alternatives in front of us (ignoring relayfs for a second):

1) Using fork_connector (presumably soon to include use of CBUS):
- forking process enqueues data plus header for single fork
- context switch
- daemon process dequeues single fork data (is this a read or recv?)
- daemon process merges multiple fork data into a single buffer
- daemon process writes buffer for multiple forks (a write)
- disk driver pushes buffer with data for multiple forks to disk

2) Using a modified form of what BSD ACCOUNTING does now:
- forking process appends single fork data to in-kernel buffer
- disk driver pushes buffer with data for multiple forks to disk

It seems to me to be rather unlikely that (1) is cheaper than (2). It
is no particular fault of connector or CBUS that this is so. Even if
there were no overhead at all using CBUS (which I don't believe), (1)
still costs more, because it passes the data, with an added packet
header, into a separate process, and into user space, before it is
combined with other accounting information, and written back down into
the kernel to go to the disk.
Post by Evgeniy Polyakov
... The efficiency
of getting this one extra <parent-pid, child-pid> out of the kernel
seems to be one or two orders of magnitude worse than the rest of
the accounting data.
It can be easily changed.
One may send kernel/acct.c acct_t structure out of the kernel -
overhead will be the same: kmalloc probably will get new area from the
same 256-bytes pool, skb is still in cache.
I have no idea what you just said.
Post by Evgeniy Polyakov
File writing accounting [kernel/acct.c] is slower,
Sure file writing is slower than queuing on an internal list. I don't
care that getting the data where I want it is slower than getting it
some other place that's only part way.
Post by Evgeniy Polyakov
and requires process' context to work with system calls.
For some connector uses, that might matter. For hooks in fork,
that is no problem - we have all the process context one could
want - two of them if that helps ;).
Post by Evgeniy Polyakov
realyfs is interesting project, but it has different aims,
That could well be ... I can't claim to know which of relayfs or
connector would be better here, of the two.
Post by Evgeniy Polyakov
while connector is purely control/notification mechanism
However connector is, in this regard, overkill. We don't need a single
event notification mechanism here. One of the key ways in which
accounting such as this has historically minimized system load is to
forgo any effort to provide any notification or data packet per event,
and instead immediately work to batch the data up in bulk form, with one
buffer containing the concatenated data for multiple events. This
amortizes the cost of almost all the handling, and of all the disk i/o,
over many data collection events. Correct me if I'm wrong, but
fork_connector doesn't do this merging of events into a consolidated
data buffer, so is at a distinct disadvantage, for this use, because the
data merging is delayed, and a separate, user level process, is required
to accomplish the merging and conversion to writable blocks of data
suitable for storing on the disk.

Nothing wrong with a good screwdriver. But if you want to pound nails,
hammers, even rocks, work better.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <***@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
Guillaume Thouvenin
2005-03-29 09:21:15 UTC
Post by Paul Jackson
This
amortizes the cost of almost all the handling, and of all the disk i/o,
over many data collection events. Correct me if I'm wrong, but
fork_connector doesn't do this merging of events into a consolidated
data buffer, so is at a distinct disadvantage, for this use, because the
data merging is delayed, and a separate, user level process, is required
to accomplish the merging and conversion to writable blocks of data
suitable for storing on the disk.
The goal of the fork connector is to inform a user space application
that a fork has occurred in the kernel. This information (cpu ID, parent
PID and child PID) can be used by several user space applications; it's
not only for accounting. Accounting and fork_connector are two different
things, and thus fork_connector doesn't do any merging of data (and it
never will).

One difference with relayfs is the amount of data that is transferred.
relayfs is designed, as Evgeniy said, for large amounts of data, so I
think it's not suitable for what we want to achieve with the fork
connector.

I hope this helps,
Regards,
Guillaume

Paul Jackson
2005-03-29 15:27:39 UTC
Post by Guillaume Thouvenin
The goal of the fork connector is to inform a user space application
that a fork occurs in the kernel. This information (cpu ID, parent PID
and child PID) can be used by several user space applications. It's not
only for accounting. Accounting and fork_connector are two different
things and thus, fork_connector doesn't do the merge of any kinds of
data (and it will never do).
Yes - it is clear that the fork_connector does this - inform user space
of fork information <cpu, parent, child>. I'm not saying that
fork_connector should merge data; I'm observing that it doesn't, and
that this would seem to serve the needs of accounting poorly.

Out of curiosity, what are these 'several user space applications?' The
only one I know of is this extension to bsd accounting to include
capturing parent and child pid at fork. Probably you've mentioned some
other uses of fork_connector before here, but I missed it.
Post by Guillaume Thouvenin
The relayfs is done, like Evgeniy said, for large amount of
datas. So I think that it's not suitable for what we want to achieve
with the fork connector.
I never claimed that relayfs was appropriate for fork_connector.

I'm not trying to tape a rock to Evgeniy's screwdriver. I'm saying that
accounting looks like a nail to me, so let us see what rocks and hammers
we have in our tool box.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <***@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
Jay Lan
2005-03-29 18:46:39 UTC
Post by Paul Jackson
Post by Guillaume Thouvenin
The goal of the fork connector is to inform a user space application
that a fork occurs in the kernel. This information (cpu ID, parent PID
and child PID) can be used by several user space applications. It's not
only for accounting. Accounting and fork_connector are two different
things and thus, fork_connector doesn't do the merge of any kinds of
data (and it will never do).
Yes - it is clear that the fork_connector does this - inform user space
of fork information <cpu, parent, child>. I'm not saying that
fork_connector should merge data; I'm observing that it doesn't, and
that this would seem to serve the needs of accounting poorly.
Paul,

You probably can look at it this way: the accounting data being
written out by BSD are per process data and the fork connector
provides information needed to group processes into process
aggregates.

Thanks,
- jay
Post by Paul Jackson
Out of curiosity, what are these 'several user space applications?' The
only one I know of is this extension to bsd accounting to include
capturing parent and child pid at fork. Probably you've mentioned some
other uses of fork_connector before here, but I missed it.
Post by Guillaume Thouvenin
The relayfs is done, like Evgeniy said, for large amount of
datas. So I think that it's not suitable for what we want to achieve
with the fork connector.
I never claimed that relayfs was appropriate for fork_connector.
I'm not trying to tape a rock to Evgeniy's screwdriver. I'm saying that
accounting looks like a nail to me, so let us see what rocks and hammers
we have in our tool box.
Evgeniy Polyakov
2005-03-29 10:34:19 UTC
Post by Paul Jackson
Post by Evgeniy Polyakov
There is no overhead at all using CBUS.
This is unlikely. Very unlikely.
Please understand that I am not trying to critique CBUS or connector in
isolation, but rather trying to determine what mechanism is best suited
for getting this accounting data written to disk, which is where I
assume it has to go until some non-real time job gets around to
analyzing it. We already have the rest of the BSD Accounting
information taking this batched up path directly to the disk. There is
nothing (that I know of) to be gained from delivering this new fork data
with any higher quality of service, or to any other place.
From what I can understand, correct me if I'm wrong, we have two
- forking process enqueues data plus header for single fork
Here the fork connector module "exits" and can handle the next fork()
on the same CPU.
Different CPUs are handled independently thanks to per-cpu variables.

That is why the "fast path" is very fast.
Post by Paul Jackson
- context switch
- daemon process dequeues single fork data (is this a read or recv?)
- daemon process mergers multiple fork data into single buffer
- daemon process writes buffer for multiple forks (a write)
- disk driver pushes buffer with data for multiple forks to disk
Not exactly:

- context switch
- the CBUS daemon, which runs at nice +19, takes a bunch of requests
[10 currently; the nice value and queue "length" were determined
experimentally] from each CPU queue and sends them using the
connector's cn_netlink_send.
- cn_netlink_send reallocates a buffer for each message [an skb plus an
allocation from the 256-byte kmalloc pool], walks through the list of
registered sockets and links the skb into the requested socket queues.
- context switch
- the userspace daemon is woken from its recv() syscall and does
whatever it wants with the data. It can write it to disk, process it in
real time, or send it over the network.

The most expensive part is cn_netlink_send()/netlink_broadcast();
with CBUS it is deferred to a safe time, so fork() itself is not
affected (only per-cpu locking + linking + an atomic allocation remain
in the fork path).
Since the deferred message is processed at a "safe" time with low
priority, it should not affect fork() either (but it can).
Post by Paul Jackson
- forking process appends single fork data to in-kernel buffer
It is not that simple.
It takes global locks several times and accesses a bunch of data shared
between CPUs.
It calls ->stat() and ->write(), which may sleep.
Post by Paul Jackson
- disk driver pushes buffer with data for multiple forks to disk
Here the same deferring happens as in the connector, only the
preparation is different.
The acct.c preparation may sleep and works with shared objects and
locks, so it is slower, but it has an advantage - the data is already
written to storage.
Post by Paul Jackson
It seems to me to be rather unlikely that (1) is cheaper than (2). It
is no particular fault of connector or CBUS that this is so. Even if
there were no overhead at all using CBUS (which I don't believe), (1)
still costs more, because it passes the data, with an added packet
header, into a separate process, and into user space, before it is
combined with other accounting information, and written back down into
the kernel to go to the disk.
That work is deferred and does not affect in-kernel processes.
And why should the userspace fork connector daemon write the data to
disk? It can process it in flight and write only the results.
An acct.c processing daemon needs to read the data, i.e. transfer it
from kernel space.
But again, all that work is deferred and does not affect fork()
performance.
Post by Paul Jackson
Post by Evgeniy Polyakov
... The efficiency
of getting this one extra <parent-pid, child-pid> out of the kernel
seems to be one or two orders of magnitude worse than the rest of
the accounting data.
It can be easily changed.
One may send kernel/acct.c acct_t structure out of the kernel -
overhead will be the same: kmalloc probably will get new area from the
same 256-bytes pool, skb is still in cache.
I have no idea what you just said.
The connector's overhead may come from memory allocation -
currently it calls alloc_skb(size, GFP_ATOMIC); skb allocation
calls kmem_cache_alloc() for the skb itself and kmalloc() for
size + sizeof(struct skb_shared_info), which is essentially an
allocation from the 256-byte slab pool.
Post by Paul Jackson
Post by Evgeniy Polyakov
File writing accounting [kernel/acct.c] is slower,
Sure file writing is slower than queuing on an internal list. I don't
care that getting the data where I want it is slower than getting it
some other place that's only part way.
One needs to pay for speed.
In the case of acct.c the price is high, since writing is slow,
but one does not need to care about the receiving part.
In the case of the connector, the price is low, but it requires
an additional process to fetch the data.

For some tasks one may be better than the other, for others it is not.
Post by Paul Jackson
Post by Evgeniy Polyakov
and requires process' context to work with system calls.
For some connector uses, that might matter. For hooks in fork,
that is no problem - we have all the process context one could
want - two of them if that helps ;).
Post by Evgeniy Polyakov
realyfs is interesting project, but it has different aims,
That could well be ... I can't claim to know which of relayfs or
connector would be better here, of the two.
Post by Evgeniy Polyakov
while connector is purely control/notification mechanism
However connector is, in this regard, overkill. We don't need a single
event notification mechanism here. One of the key ways in which
accounting such as this has historically minimized system load is to
forgo any effort to provide any notification or data packet per event,
and instead immediately work to batch the data up in bulk form, with one
buffer containing the concatenated data for multiple events. This
amortizes the cost of almost all the handling, and of all the disk i/o,
over many data collection events. Correct me if I'm wrong, but
fork_connector doesn't do this merging of events into a consolidated
data buffer, so is at a distinct disadvantage, for this use, because the
data merging is delayed, and a separate, user level process, is required
to accomplish the merging and conversion to writable blocks of data
suitable for storing on the disk.
It is a design decision:
one may want to write out all the data, even at the cost of slowing
down the system, and process it all at once later; or one may want to
process it in real time in small pieces and get a direct view of how
the system behaves, which requires very low overhead.
A userspace fork connector daemon may send the data to the network or
transfer it using some other mechanism without touching the disk IO
subsystem, which is faster than writing it to disk, then reading it
(perhaps with seeking), and transferring it again.

One instrument is better for one type of task; others are suited to
different ones.
Post by Paul Jackson
Nothing wrong with a good screwdriver. But if you want to pound nails,
hammers, even rocks, work better.
He-he, while you lift your rock hammer, others can finish all the work
with their small instruments. :)
--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski
Paul Jackson
2005-03-29 17:13:53 UTC
Post by Evgeniy Polyakov
Here forking connector module "exits" and can handle next fork() on the
same CPU.
Fine ... but it's not about what the fork_connector does. It's about
getting the accounting data to disk, if I understand correctly.
Post by Evgeniy Polyakov
That is why it is very fast in "fast-path".
I don't care how fast a tool is. I care how fast the job gets done. If
a tool is only doing part of the job, then we can't decide whether to
use that tool just based on how fast that part of the job gets done.
Post by Evgeniy Polyakov
The most expensive part is cn_netlink_send()/netlink_broadcast(),
with CBUS it is deferred to the safe time,
This is "safe time" for the immediate purpose of letting the forking
process continue on its way. But the deferred work of buffering up the
data and writing it to disk still needs to be done, pretty soon. When
sizing a system to see how many users or jobs I can run on it at a time,
I will have to include sufficient cpu, memory and disk i/o to handle
getting this accounting data to disk, right?
Post by Evgeniy Polyakov
Post by Paul Jackson
- forking process appends single fork data to in-kernel buffer
It is not as simple.
It takes global locks several times, it access bunch of shared between
CPU data.
It calls ->stat() and ->write() which may sleep.
Hmmm ... good points. The mechanisms in the kernel now (and for the
last 25 years) to write out BSD ACCOUNTING data may not be numa friendly.

Perhaps there should be a per-cpu 512 byte buffer, which can gather up 8
accounting records (64 bytes each) and only call the file system write
once every 8 task exits. Or perhaps a per-node buffer, with a spinlock
to serialize access by the CPUs on that node. Or perhaps per-node
accounting files. Or something like that.
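
Something roughly like this, say (an untested sketch of the per-cpu
variant; acct_write_block() is a made-up stand-in for whatever in
kernel/acct.c actually writes a block to the accounting file):

#include <linux/acct.h>
#include <linux/percpu.h>
#include <linux/string.h>

#define ACCT_BATCH	8	/* 8 x 64-byte records = 512 bytes */

struct acct_batch {
	int	count;
	acct_t	rec[ACCT_BATCH];
};

static DEFINE_PER_CPU(struct acct_batch, acct_batch);

static void acct_queue_record(const acct_t *ac)
{
	acct_t block[ACCT_BATCH];
	int full = 0;
	struct acct_batch *b = &get_cpu_var(acct_batch);

	b->rec[b->count++] = *ac;
	if (b->count == ACCT_BATCH) {
		/* copy the batch out before re-enabling preemption,
		 * since the file write below may sleep */
		memcpy(block, b->rec, sizeof(block));
		b->count = 0;
		full = 1;
	}
	put_cpu_var(acct_batch);

	if (full)
		acct_write_block(block, sizeof(block)); /* one write per 8 exits */
}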

Guillaume, Jay - do we (you ?) need to make classic BSD ACCOUNTING data
collection numa friendly? Based on the various frustrated comments at
the top of kernel/acct.c, this could be a non-trivial effort to get
right. Maybe we need it, but can't afford it.

And perhaps my proposed variable length records for supplementary
accounting, such as <parent pid, child pid> from fork, need to allow
for some way to pad out the rest of a buffer, when the next record
won't fit entirely.
Post by Evgeniy Polyakov
That work is deferred and does not affect in-kernel processes.
The accounting data collection cannot be deferred for long, perhaps
just a few minutes. Not until the data hits the disk can we rest
indefinitely. Unless, that is, I don't understand what problem is
being solved here (quite possible ;).
Post by Evgeniy Polyakov
And why userspace fork connector should write data to the disk?
I NEVER said it should. I am NOT trying to redesign fork_connector.

Good grief ... how many times and ways do I have to say this ;)?

I am asking what is the best tool for accounting data collection,
which, if I understand correctly, does need to write to disk.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <***@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
Guillaume Thouvenin
2005-03-29 08:20:56 UTC
Post by Paul Jackson
Post by Guillaume Thouvenin
The lmbench shows that the overhead (the construction and the sending
of the message) in the fork() routine is around 7%.
Thanks for including the numbers. The 7% seems a bit costly, for a bit
more accounting information. Perhaps dean's suggestion, to not use
ascii, will help. I hope so, though I doubt it will make a huge
difference. Was this 7% loss with or without a user level program
consuming the sent messages? I would think that the number of interest
would include a minimal consumer task.
Yes, dean's suggestion helps. The overhead is now around 4%

fork_connector disabled:
Process fork+exit: 149.4444 microseconds

fork_connector enabled:
Process fork+exit: 154.9167 microseconds
Post by Paul Jackson
Having the "#ifdef CONFIG_FORK_CONNECTOR" chunk of code right in fork.c
seems unfortunate. Can the real fork_connector() be put elsewhere, and
the ifdef put in a header file that makes it a no-op if not configured,
or simply a function declaration, if configured?
I think that it can be moved into include/linux/connector.h

Guillaume

Paul Jackson
2005-03-29 14:52:28 UTC
Post by Guillaume Thouvenin
Yes, dean's suggestion helps. The overhead is now around 4%
More improvement than I expected (and I see a CBUS result further
down in my inbox).

Does this include a minimal consumer task of the data that writes
it to disk?
Post by Guillaume Thouvenin
I think that it can be moved in include/linux/connector.h
Good.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <***@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
Guillaume Thouvenin
2005-03-29 12:53:16 UTC
Post by Paul Jackson
Post by Guillaume Thouvenin
The lmbench shows that the overhead (the construction and the sending
of the message) in the fork() routine is around 7%.
Thanks for including the numbers. The 7% seems a bit costly, for a bit
more accounting information. Perhaps dean's suggestion, to not use
ascii, will help. I hope so, though I doubt it will make a huge
difference. Was this 7% loss with or without a user level program
consuming the sent messages? I would think that the number of interest
would include a minimal consumer task.
I ran some tests using CBUS instead of the cn_netlink_send() routine,
and the overhead is nearly 0%:

fork connector disabled:
Process fork+exit: 148.1429 microseconds

fork connector enabled:
Process fork+exit: 148.4595 microseconds

Regards,
Guillaume

Paul Jackson
2005-03-29 15:39:20 UTC
Post by Guillaume Thouvenin
I ran some test using the CBUS instead of the cn_netlink_send() routine
Overhead of what? Does this include merging the data and getting it to
disk?

Am I even asking the right question here - is it true that this data,
when collected for accounting purposes, needs to go to disk, and that
summarizing and analyzing the data is done 'off-line', perhaps hours
later? That's the way it was 25 years ago ... but perhaps the basic
data flow appropriate for accounting has changed since then.

And if the data flow has changed, then how do you account for the fact
that the rest of the accounting data, under the CONFIG_BSD_PROCESS_ACCT
option, is still collected the 'old fashioned way' (direct kernel write
to disk)?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <***@engr.sgi.com> 1.650.933.1373, 1.925.600.0401