Discussion:
[patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07
Ingo Molnar
2005-03-23 21:50:34 UTC
i think the 'migrate read-count' method is not adequate either,
because all callbacks queued within an RCU read section must be called
after the lock has been dropped - while with the migration method
CPU#1 would be free to process callbacks queued in the RCU read
section still active on CPU#2.
i'm wondering how much of a problem this is though. Can there be stale
pointer uses? E.g. code like:

	rcu_read_lock();
	call_rcu(&dentry->d_rcu, d_callback);
	func(dentry->whatever);
	rcu_read_unlock();

but this cannot happen, because call_rcu() is used by RCU-write code.

so the important property seems to be that any active RCU-read section
should keep at least one CPU's active_readers count elevated
permanently, for the duration of the RCU-read section. It doesn't matter
that the reader migrates between CPUs - because the RCU code itself
guarantees that no callbacks will be executed until _all_ CPUs have been
in a quiescent state. I.e. all CPUs have to go through a zero
active_readers value before the callback is done.
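
To make that invariant concrete, a grace period may only end once a check
like the following succeeds (a sketch, not patch code - rcu_data and
active_readers are the names from the patch under discussion, while
all_cpus_quiescent() is an invented helper):

	/* Sketch: a grace period may only end once every CPU has been
	 * seen with its active_readers count at zero. */
	static int all_cpus_quiescent(void)
	{
		int cpu;

		for_each_online_cpu(cpu)
			if (atomic_read(&per_cpu(rcu_data, cpu).active_readers) != 0)
				return 0;
		return 1;
	}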

Ingo
Ingo Molnar
2005-03-24 05:41:37 UTC
Post by Paul E. McKenney
Now, it is true that CPU#2 might record a quiescent state during this
time, but this will have no effect because -all- CPUs must pass
through a quiescent state before any callbacks will be invoked. Since
CPU#1 is refusing to record a quiescent state, grace periods will be
blocked for the full extent of task 1's RCU read-side critical
section.
ok, great. So besides the barriers issue (and the long grace period time
issue), the current design is quite ok. And i think your original flip
pointers suggestion can be used to force synchronization.

Ingo
Paul E. McKenney
2005-03-24 07:48:29 UTC
Post by Ingo Molnar
Now, it is true that CPU#2 might record a quiescent state during this
time, but this will have no effect because -all- CPUs must pass
through a quiescent state before any callbacks will be invoked. Since
CPU#1 is refusing to record a quiescent state, grace periods will be
blocked for the full extent of task 1's RCU read-side critical
section.
ok, great. So besides the barriers issue (and the long grace period time
issue), the current design is quite ok. And i think your original flip
pointers suggestion can be used to force synchronization.
The thing I am currently struggling with on the flip-pointers approach is
handling races between rcu_read_lock() and the flipping. In the earlier
implementations that used this trick, you were guaranteed that if you were
executing concurrently with one flip, you would do a voluntary context
switch before the next flip happened, so that the race was harmless.
This guarantee does not work in the PREEMPT_RT case, so more thought
will be required. :-/

Thanx, Paul
Steven Rostedt
2005-03-24 08:49:57 UTC
I've noticed the following situation which is caused by __up_mutex
assigning an owner right away.
I also see this with non-RT tasks causing a burst of schedules.

1. Process A runs and grabs lock L, then finishes its time slice.
2. Process B runs and tries to grab lock L.
3. Process A runs and releases lock L.
4. __up_mutex gives process B lock L.
5. Process A tries to grab lock L.
6. Process B runs and releases lock L.
7. __up_mutex gives lock L to process A.
8. Process B tries to grab lock L again.
9. Process A runs...

Here we have more unnecessary schedules. So the condition to grab a lock
should be (a sketch in code follows the list):

1. not owned.
2. partially owned, and the owner is not RT.
3. partially owned but the owner is RT and so is the grabber, and the
grabber's priority is >= the owner's priority.
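
A minimal sketch of that test (struct rs_mutex and the owner_pending flag
are invented names for illustration; rt_task() and ->prio are real kernel
symbols, and note that a smaller ->prio value means a higher priority):

	/* Sketch only - not the actual -RT patch code. */
	static int can_grab_lock(struct rs_mutex *lock, struct task_struct *grabber)
	{
		struct task_struct *owner = lock->owner;

		if (!owner)
			return 1;		/* 1. not owned */
		if (!lock->owner_pending)
			return 0;		/* fully owned: must wait */
		if (!rt_task(owner))
			return 1;		/* 2. pending owner is not RT */
		return rt_task(grabber) &&
			grabber->prio <= owner->prio;	/* 3. both RT, grabber >= owner */
	}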

-- Steve

Ingo Molnar
2005-03-24 10:48:06 UTC
Post by Steven Rostedt
I also see this with non-RT tasks causing a burst of schedules.
1. Process A runs and grabs lock L, then finishes its time slice.
2. Process B runs and tries to grab lock L.
3. Process A runs and releases lock L.
4. __up_mutex gives process B lock L.
5. Process A tries to grab lock L.
6. Process B runs and releases lock L.
7. __up_mutex gives lock L to process A.
8. Process B tries to grab lock L again.
9. Process A runs...
Here we have more unnecessary schedules. So the condition to grab a lock
1. not owned.
2. partially owned, and the owner is not RT.
3. partially owned but the owner is RT and so is the grabber, and the
grabber's priority is >= the owner's priority.
yeah, sounds good - but i'd not make any distinction between RT and
non-RT tasks - just make the rule #3 distinction based on ->prio. In
particular on UP a task should only run when it's of higher prio, so if a
lock is 'partially owned' then the priority rule should always be true.
(On SMP it's a bit more complex, there the priority rule could make a
real difference.)
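
In that scheme, rules #2 and #3 collapse into one comparison; as a sketch
(owner_pending and steal_lock() are invented names):

	/* lower ->prio value == higher priority */
	if (lock->owner_pending && current->prio <= lock->owner->prio)
		steal_lock(lock, current);	/* steal_lock() is hypothetical */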

Ingo
Ingo Molnar
2005-03-24 11:44:15 UTC
Post by Steven Rostedt
Here we have more unnecessary schedules. So the condition to grab a
1. not owned.
2. partially owned, and the owner is not RT.
3. partially owned but the owner is RT and so is the grabber, and the
grabber's priority is >= the owner's priority.
there's another approach that could solve this problem: let the
scheduler sort it all out. Esben Nielsen had this suggestion a couple of
months ago - i didn't follow it because i thought that technique would
create too many runnable tasks, but maybe that was a mistake. If we do
the owning of the lock only once the wakee hits the CPU, we avoid the
'partial owner' problem, and we have the scheduler sort out priorities
and policies.

but i think i like the 'partial owner' (or rather 'owner pending')
technique a bit better, because it controls concurrency explicitly, and
it would thus e.g. allow another trick: when a new owner 'steals' a lock
from another in-flight task, then we could 'unwakeup' that in-flight
thread, which would avoid two more context switches on e.g. SMP
systems: hitting the CPU and immediately blocking on the lock. (But this
is a second-phase optimization which needs some core scheduler magic as
well; i guess i'll be the one to code it up.)

Ingo
Steven Rostedt
2005-03-24 14:35:43 UTC
Post by Ingo Molnar
there's another approach that could solve this problem: let the
scheduler sort it all out. Esben Nielsen had this suggestion a couple of
months ago - i didn't follow it because i thought that technique would
create too many runnable tasks, but maybe that was a mistake. If we do
the owning of the lock only once the wakee hits the CPU, we avoid the
'partial owner' problem, and we have the scheduler sort out priorities
and policies.
I've thought about this too, and came to the conclusion that it was
too messy. You either have to give up the information about which
processes are waiting on the lock when you release it, or keep the
information about the waiters (waking only one, "the wakee") - but then
you have a lock with waiters and no owner, which is the messy part.

On an SMP machine, there may even be a chance that a lower priority process
gets it. That would be possible if the low priority process on the
other CPU tries to grab the lock just after it was released but before
the just-woken high priority processes get scheduled. So there's a
window where the lock is open, and the lower priority process snagged it
just before the others got in.
Post by Ingo Molnar
but i think i like the 'partial owner' (or rather 'owner pending')
technique a bit better, because it controls concurrency explicitly, and
it would thus e.g. allow another trick: when a new owner 'steals' a lock
from another in-flight task, then we could 'unwakeup' that in-flight
thread, which would avoid two more context switches on e.g. SMP
systems: hitting the CPU and immediately blocking on the lock. (But this
is a second-phase optimization which needs some core scheduler magic as
well; i guess i'll be the one to code it up.)
Darn! It seemed like fun to implement. I may do it myself anyway on my
kernel just to understand your implementation even better.

Later,

-- Steve

Ingo Molnar
2005-03-24 17:53:46 UTC
Post by Steven Rostedt
Post by Ingo Molnar
but i think i like the 'partial owner' (or rather 'owner pending')
technique a bit better, because it controls concurrency explicitly, and
it would thus e.g. allow another trick: when a new owner 'steals' a lock
from another in-flight task, then we could 'unwakeup' that in-flight
thread, which would avoid two more context switches on e.g. SMP
systems: hitting the CPU and immediately blocking on the lock. (But this
is a second-phase optimization which needs some core scheduler magic as
well; i guess i'll be the one to code it up.)
Darn! It seemed like fun to implement. I may do it myself anyway on my
kernel just to understand your implementation even better.
feel free to implement the whole thing. Unscheduling a task should be
done carefully, for obvious reasons. (I've implemented it once 1-2 years
ago for a different purpose, to unschedule ksoftirqd - it ought to be
somewhere in the lkml archives.)

Ingo
Ingo Molnar
2005-03-24 18:22:58 UTC
Post by Steven Rostedt
On an SMP machine, there may even be a chance that a lower priority
process gets it. That would be possible if the low priority
process on the other CPU tries to grab the lock just after it was
released but before the just-woken high priority processes get
scheduled. So there's a window where the lock is open, and the lower
priority process snagged it just before the others got in.
that's always a possibility, on UP too: if a lower priority task manages
to acquire a lock 'just before' a highprio thread gets interested in it,
there's no way to undo that.

but for the other reasons the explicit approach looks better.

Ingo
Esben Nielsen
2005-03-24 23:11:47 UTC
Post by Ingo Molnar
Post by Steven Rostedt
Here we have more unnecessary schedules. So the condition to grab a
1. not owned.
2. partially owned, and the owner is not RT.
3. partially owned but the owner is RT and so is the grabber, and the
grabber's priority is >= the owner's priority.
there's another approach that could solve this problem: let the
scheduler sort it all out. Esben Nielsen had this suggestion a couple of
months ago - i didn't follow it because i thought that technique would
create too many runnable tasks, but maybe that was a mistake. If we do
the owning of the lock only once the wakee hits the CPU, we avoid the
'partial owner' problem, and we have the scheduler sort out priorities
and policies.
but i think i like the 'partial owner' (or rather 'owner pending')
technique a bit better, because it controls concurrency explicitly, and
it would thus e.g. allow another trick: when a new owner 'steals' a lock
from another in-flight task, then we could 'unwakeup' that in-flight
thread, which would avoid two more context switches on e.g. SMP
systems: hitting the CPU and immediately blocking on the lock. (But this
is a second-phase optimization which needs some core scheduler magic as
well; i guess i'll be the one to code it up.)
I checked the implementation of a mutex I sent in last fall. The unlock
operation does give ownership explicitly to the highest priority waiter,
as Ingo's implementation does.

Originally I planned for just having unlock wake up the highest
priority waiter and set lock->owner = NULL. The lock operation would be
something like

	while (lock->owner != NULL)
		schedule();
	/* grab the lock */

Then the first task, i.e. the one with highest priority on UP, will get it
first. On SMP a low priority task on another CPU might get in and take it.

I like the idea of having the scheduler take care of it - it is a very
optimally coded queue system, after all. That will work on UP but not on
SMP. Having the unlock operation set the mutex to a "partially owned"
state will work better. The only problem I see, relative to Ingo's
implementation, is that the awoken task then has to go in and
change the state of the mutex, i.e. it has to take the wait_lock again.
Will the extra schedulings that are the problem happen often enough in
practice to justify the extra overhead?
Esben

Ingo Molnar
2005-03-25 06:27:29 UTC
Post by Esben Nielsen
I like the idea of having the scheduler take care of it - it is a very
optimally coded queue system, after all. That will work on UP but not
on SMP. Having the unlock operation set the mutex to a "partially
owned" state will work better. The only problem I see, relative to
Ingo's implementation, is that the awoken task then has to go in and
change the state of the mutex, i.e. it has to take the wait_lock
again. Will the extra schedulings that are the problem happen often
enough in practice to justify the extra overhead?
i think this should be covered by the 'unschedule/unwakeup' feature,
mentioned in the latest mails.

Ingo
Steven Rostedt
2005-03-26 16:34:01 UTC
Post by Ingo Molnar
Post by Esben Nielsen
I like the idea of having the scheduler take care of it - it is a very
optimally coded queue system, after all. That will work on UP but not
on SMP. Having the unlock operation set the mutex to a "partially
owned" state will work better. The only problem I see, relative to
Ingo's implementation, is that the awoken task then has to go in and
change the state of the mutex, i.e. it has to take the wait_lock
again. Will the extra schedulings that are the problem happen often
enough in practice to justify the extra overhead?
i think this should be covered by the 'unschedule/unwakeup' feature,
mentioned in the latest mails.
The first implementation would probably just be the setting of a "pending
owner" bit, but the better one may be to unschedule. What would the
overhead of unscheduling be, though, since you need to grab the run-queue
locks for that? This might make for an interesting case study. The waking
up of a process which had the lock stolen may not happen that much. The
lock stealing would (as I see in my runs) happen quite a bit, though. But
on UP, the waking of the robbed owner would never happen, unless it also
owned a lock that a higher priority process wanted.

-- Steve

Ingo Molnar
2005-03-26 19:14:01 UTC
Post by Steven Rostedt
Post by Ingo Molnar
i think this should be covered by the 'unschedule/unwakeup' feature,
mentioned in the latest mails.
The first implementation would probably just be the setting of a
"pending owner" bit, but the better one may be to unschedule. What
would the overhead of unscheduling be, though, since you need to grab
the run-queue locks for that? This might make for an interesting case
study. The waking up of a process which had the lock stolen may not
happen that much. The lock stealing would (as I see in my runs)
happen quite a bit, though. But on UP, the waking of the robbed owner
would never happen, unless it also owned a lock that a higher priority
process wanted.
yeah, let's skip the unscheduling for now - the 'pending owner' bit is the
important one.

Ingo
Steven Rostedt
2005-03-26 16:06:23 UTC
Post by Esben Nielsen
I checked the implementation of a mutex I sent in last fall. The unlock
operation does give ownership explicitly to the highest priority waiter,
as Ingo's implementation does.
Originally I planned for just having unlock wake up the highest
priority waiter and set lock->owner = NULL. The lock operation would be
something like

	while (lock->owner != NULL)
		schedule();
	/* grab the lock */
Then the first task, i.e. the one with highest priority on UP, will get it
first. On SMP a low priority task on another CPU might get in and take it.
This could be dangerous on SMP then: if a higher priority process
on the CPU of the process that grabbed the lock preempts it, the
other CPU can spin on this, since the spinner would be the highest
priority process for that CPU. Not to mention that you have to make sure
that priority inheritance is still implemented for the higher priority
process that was woken up but had the lock stolen by the lower priority
process.
Post by Esben Nielsen
I like the idea of having the scheduler take care of it - it is a very
optimally coded queue system, after all. That will work on UP but not on
SMP. Having the unlock operation set the mutex to a "partially owned"
state will work better. The only problem I see, relative to Ingo's
implementation, is that the awoken task then has to go in and
change the state of the mutex, i.e. it has to take the wait_lock again.
Will the extra schedulings that are the problem happen often enough in
practice to justify the extra overhead?
Another answer is to have the "pending owner" bit be part of the task
structure - a flag, maybe. If a higher priority process comes in and
decides to grab the lock from this owner, it does a test_and_clear on
this flag on the pending owner task. When the pending owner task wakes
up, it does the test_and_clear on its own bit. Whoever had the bit set
on the test wins. If the higher prio task were to clear it first, it
takes the ownership away from the pending owner. If the pending owner
were to clear the bit first, it won and would continue as though it got
the lock. The higher priority tasks would do this test while holding the
wait_lock, to keep more than one of them from trying to grab the lock
from the pending owner, but the pending owner won't need to do anything,
since it will know whether it was the new owner just by testing its own
bit. A sketch of this follows.
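
Here is a rough sketch of that protocol (RS_PENDOWNER, the rt_flags field
and struct rs_mutex are invented names for illustration;
test_and_clear_bit() is the real primitive):

	/* Called by a higher-prio waiter, with lock->wait_lock held.
	 * Returns 1 if it stole the lock from the pending owner. */
	static int try_to_steal_lock(struct rs_mutex *lock)
	{
		struct task_struct *owner = lock->owner;

		if (!test_and_clear_bit(RS_PENDOWNER, &owner->rt_flags))
			return 0;	/* owner already ran and claimed it */
		lock->owner = current;
		return 1;
	}

	/* The pending owner runs the same test on itself, on wakeup,
	 * without taking wait_lock: */
	static int claim_pending_lock(void)
	{
		return test_and_clear_bit(RS_PENDOWNER, &current->rt_flags);
	}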

-- Steve

Steven Rostedt
2005-03-24 08:34:22 UTC
Ingo,

I've noticed the following situation which is caused by __up_mutex
assigning an owner right away.

Given 3 processes, with priorities 1, 2, 3, where 3 is highest and 1 is
lowest, and call them process A, B, C respectively.

1. Process A runs and grabs lock L.
2. Process B preempts A, tries to grab lock L.
3. Process A inherits process B's priority and starts to run.
4. Process C preempts A, tries to grab lock L.
5. Process A inherits process C's priority and starts to run.
6. Process A releases lock L and loses its inherited priority.
7. Process C gets lock L.
8. Process C runs and releases lock L.
9. With __up_mutex, process B automatically gets lock L.
10. Process C tries to grab lock L again, but is now blocked by B.

So we have a needless latency for process C, since it should be able to get
lock L again. The problem arises because process B grabbed the lock
without actually running.

Since I agree with the rule that a lock can't have waiters while not being
owned, I believe that this problem can be solved by adding a flag that
states that the lock is "partially owned". That is, the ownership of the
lock would still be transferred at step 9, but a flag is set to record
that it has not yet been taken up. This flag would be cleared when
process B wakes up and leaves the __down call.

This way, when process C tries to get the lock again, it sees that it's
owned but that B hasn't executed yet. So it would be safe for C to take the
lock back from B - that is, if C is of higher priority than B; otherwise it
would act the same. A sketch of the idea follows.
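
A rough sketch of the handover (struct rt_mutex, the owner_pending flag
and pick_highest_prio_waiter() are names invented for illustration, not
the actual -RT code):

	/* __up_mutex hands the lock over, but marks the handover pending */
	static void __up_mutex(struct rt_mutex *lock)
	{
		struct task_struct *wakee = pick_highest_prio_waiter(lock);

		lock->owner = wakee;
		lock->owner_pending = 1;  /* cleared when wakee leaves __down() */
		wake_up_process(wakee);
	}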

If you agree with this approach, I would be willing to write a patch for
you.

-- Steve

Ingo Molnar
2005-03-24 10:45:40 UTC
Post by Steven Rostedt
This way, when process C tries to get the lock again, it sees that it's
owned but that B hasn't executed yet. So it would be safe for C to take
the lock back from B - that is, if C is of higher priority than B;
otherwise it would act the same.
agreed. In particular this would be a nice optimization for cases where
tasks are delayed for a longer time due to CPU congestion (e.g. lots of
tasks on the same SCHED_FIFO priority). So if a higher prio task comes
along while the wakee is still delayed, it can take the lock straight away.
Post by Steven Rostedt
If you agree with this approach, I would be willing to write a patch
for you.
yeah - please do.

Ingo
Paul E. McKenney
2005-03-24 05:32:49 UTC
+#ifdef CONFIG_PREEMPT_RCU
+
+void rcu_read_lock(void)
+{
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_data = &get_cpu_var(rcu_data);
+		atomic_inc(&current->rcu_data->active_readers);
+		put_cpu_var(rcu_data);
Need an smp_mb() here for non-x86 CPUs. Otherwise, the CPU can
re-order parts of the critical section to precede the rcu_read_lock().
Could precede the put_cpu_var(), but why increase latency?
ok. It's enough to put a barrier into the else branch here, because the
atomic op in the main branch is a barrier by itself.
For x86, the atomic op is a barrier, but not for other architectures.
You don't need a barrier in the else branch, because in that case
you are already in an enclosing RCU read-side critical section, so
any bleeding of code will be into this enclosing section, thus still
safe.
+void rcu_read_unlock(void)
+{
[...]
And need an smp_mb() here, again for non-x86 CPUs.
ok.
Assuming that the memory barriers are added, I can see a bunch of ways
for races to extend grace periods, but none so far that result in the
fatal too-short grace period. Since rcu_qsctr_inc() refuses to
increment the quiescent-state counter on any CPU that started an RCU
read-side critical section that has not yet completed, any long
critical section will have a corresponding CPU that will refuse to go
through a quiescent state. And that will prevent the grace period
from completing.
i'm worried about the following scenario: what happens when a task is
migrated from CPU#1 to CPU#2 while in an RCU read section that it
acquired on CPU#1, and queues a callback? E.g. d_free() does a
call_rcu() to queue the freeing of the dentry.
That callback will be queued on CPU#2 - while the task still keeps
current->rcu_data of CPU#1. It also means that CPU#2's read counter did
_not_ get increased - and a too-short grace period may occur.
Let me make sure I understand the sequence of events:

	CPU#1				CPU#2

	task1: rcu_read_lock()

	task1: migrate to CPU#2 --->

					task1: call_rcu()

					task1: rcu_read_unlock()

This sequence will be safe because CPU#1's active_readers field will
be non-zero throughout, so that CPU#1 will refuse to record any
quiescent state from the time that task1 does the rcu_read_lock()
on CPU#1 until the time that it does the rcu_read_unlock() on CPU#2.

Now, it is true that CPU#2 might record a quiescent state during this
time, but this will have no effect because -all- CPUs must pass through
a quiescent state before any callbacks will be invoked. Since CPU#1
is refusing to record a quiescent state, grace periods will be blocked
for the full extent of task 1's RCU read-side critical section.

Or am I misunderstanding your scenario? Or, for that matter, your code?
it seems to me that the only safe method is to pick an 'RCU CPU' when
first entering the read section and then stick to it, no matter
where the task gets migrated to - or to 'migrate' the +1 read count from
one CPU to the other, within the scheduler.
This would work too, but I don't believe it is necessary given what
you are already doing.
the 'migrate read count' solution seems more promising, as it would keep
other parts of the RCU code unchanged. [ But it seems to break the nice
'flip pointers' method you found to force a grace period. If a 'read
section' can migrate from one CPU to another then it can migrate back as
well, at which point it cannot have the 'old' pointer. Maybe it would
still work better than no flip pointers. ]
Well, I do believe that suppressing migration of tasks during RCU read-side
critical sections would simplify the flip-pointers/counters code. But
that is easy to say in advance of actually producing this code. ;-)

Thanx, Paul
Paul E. McKenney
2005-03-24 07:05:12 UTC
Post by Ingo Molnar
i think the 'migrate read-count' method is not adequate either,
because all callbacks queued within an RCU read section must be called
after the lock has been dropped - while with the migration method
CPU#1 would be free to process callbacks queued in the RCU read
section still active on CPU#2.
i'm wondering how much of a problem this is though. Can there be stale
pointer uses? E.g. code like:

	rcu_read_lock();
	call_rcu(&dentry->d_rcu, d_callback);
	func(dentry->whatever);
	rcu_read_unlock();

but this cannot happen, because call_rcu() is used by RCU-write code.
The code would not look exactly like this, but acquiring the update-side
lock inside an RCU read-side critical section can be thought of as
a reader-to-writer lock upgrade. RCU can do this unconditionally,
which was one of the walls I was banging my head against when trying
to think up a realtime-safe RCU implementation.

So something like the following would be legitimate RCU code:

	rcu_read_lock();
	spin_lock(&dcache_lock);
	call_rcu(&dentry->d_rcu, d_callback);
	spin_unlock(&dcache_lock);
	rcu_read_unlock();

The spin_lock() call upgrades from a read-side to a write-side critical
section.
Post by Ingo Molnar
so the important property seems to be that any active RCU-read section
should keep at least one CPU's active_readers count elevated
permanently, for the duration of the RCU-read section.
Yep!
Post by Ingo Molnar
It doesn't matter
that the reader migrates between CPUs - because the RCU code itself
guarantees that no callbacks will be executed until _all_ CPUs have been
in a quiescent state. I.e. all CPUs have to go through a zero
active_readers value before the callback is done.
Yep again!

Thanx, Paul
Paul E. McKenney
2005-03-24 06:47:39 UTC
i think the 'migrate read-count' method is not adequate either, because
all callbacks queued within an RCU read section must be called after the
lock has been dropped - while with the migration method CPU#1 would be
free to process callbacks queued in the RCU read section still active on
CPU#2.
how about keeping the rcu callback list in process context and only
splicing it to a global (per-CPU) list on rcu_read_unlock?
hm, that would indeed solve this problem. It would also solve the grace
period problem: all callbacks in the global (per-CPU) list are
immediately processable. Paul?
If I understand the proposal, it would break in the following situation
(lifted from earlier email):

	rcu_read_lock();
	list_for_each_entry_rcu(p, head, list) {
		if (unlikely(p->status == DELETE_ME)) {
			spin_lock(&p->mutex);
			if (likely(p->status == DELETE_ME)) {
				p->status = DELETED;
				list_del_rcu(&p->list);
				call_rcu(&p->rcu, sublist_finalize_rcu);
			}
			spin_unlock(&p->mutex);
		}
	}
	rcu_read_unlock();

Here, sublist_finalize_rcu() just finds the front of the block and
kfree()s it.

Here is the scenario:

	CPU 1					CPU 2

	task 1 does rcu_read_lock()
						task 2 does rcu_read_lock()
	task 1 sees DELETE_ME
						task 2 sees DELETE_ME
	task 1 acquires the lock
						task 2 blocks/spins on lock
	task 1 does call_rcu()
	task 1 releases lock
	task 1 does rcu_read_unlock()
						task 2 acquires lock
	RCU puts the callback on global list
	RCU invokes callback, kfree()!!!
						task 2 now sees garbage!!!

Callbacks cannot be invoked until all RCU read-side critical sections
that were in progress at the time of the call_rcu() have
completed.

Thanx, Paul
Paul E. McKenney
2005-03-24 06:16:07 UTC
the 'migrate read count' solution seems more promising, as it would
keep other parts of the RCU code unchanged. [ But it seems to break
the nice 'flip pointers' method you found to force a grace period. If
a 'read section' can migrate from one CPU to another then it can
migrate back as well, at which point it cannot have the 'old' pointer.
Maybe it would still work better than no flip pointers. ]
the flip pointer method could be made to work if we had a NR_CPUS array
of 'current RCU pointer' values attached to the task - and that array
would be cleared if the task exits the read section. But this has memory
usage worries with large NR_CPUS. (full clearing of the array can be
avoided by using some sort of 'read section generation' counter attached
to each pointer)
The only per-task data you need to maintain is a single pointer to
the per-CPU counter that was incremented by the outermost rcu_read_lock().
You also need a global index, call it "rcu_current_ctr_set" (or come
up with a better name!) along with a per-CPU pair of counters:

	struct rcu_ctr_set {		/* or maybe there is a way to drop array */
		atomic_t ctr[2];	/* into DECLARE_PER_CPU() without a struct */
	};				/* wrapper... */
	int rcu_curset = 0;
	DEFINE_PER_CPU(struct rcu_ctr_set, rcu_ctr_set) = { { 0, 0 } };

You need two fields in the task structure:

	atomic_t *rcu_preempt_ctr;
	int rcu_nesting;

Then you have something like:

	void rcu_read_lock(void)
	{
		if (current->rcu_nesting++ == 0) {
			preempt_disable();
			current->rcu_preempt_ctr =
				&__get_cpu_var(rcu_ctr_set).ctr[rcu_curset];
			atomic_inc(current->rcu_preempt_ctr);
			preempt_enable();
			smp_mb();
		}
	}

	void rcu_read_unlock(void)
	{
		if (--current->rcu_nesting == 0) {
			smp_mb();	/* might only need smp_wmb()... */
			atomic_dec(current->rcu_preempt_ctr);
			current->rcu_preempt_ctr = NULL;	/* for debug */
		}
	}

One can then force a grace period via something like the following,
but only if you know that all of the rcu_ctr_set.ctr[!rcu_curset] are zero:

	void _synchronize_kernel(void)
	{
		int cpu;

		spin_lock(&rcu_mutex);
		rcu_curset = !rcu_curset;
		for_each_cpu(cpu) {
			/* wait for this CPU's old-set counter to drain */
			while (atomic_read(&per_cpu(rcu_ctr_set, cpu).ctr[!rcu_curset]) != 0)
				cpu_relax();	/* or really yield the CPU for a bit */
		}
		spin_unlock(&rcu_mutex);
	}

In real life, you need a way to multiplex multiple calls to
_synchronize_kernel() into a single counter-flip event, by
setting up callbacks. And so on...

The above stolen liberally from your patch and from my memories of
Dipankar's RCU patch for CONFIG_PREEMPT kernels. Guaranteed to have
horrible bugs. ;-)
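
For illustration, the multiplexing can be done roughly the way mainline's
synchronize_kernel() does it - each caller queues a callback and sleeps on
a completion (a simplified sketch along the lines of kernel/rcupdate.c,
not code from this patch):

	struct rcu_synchronize {
		struct rcu_head head;
		struct completion completion;
	};

	/* awaken the task waiting for a full grace period to elapse */
	static void wakeme_after_rcu(struct rcu_head *head)
	{
		struct rcu_synchronize *rcu =
			container_of(head, struct rcu_synchronize, head);
		complete(&rcu->completion);
	}

	void synchronize_kernel(void)
	{
		struct rcu_synchronize rcu;

		init_completion(&rcu.completion);
		call_rcu(&rcu.head, wakeme_after_rcu);
		wait_for_completion(&rcu.completion);
	}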

Thanx, Paul
Paul E. McKenney
2005-03-24 06:41:31 UTC
That callback will be queued on CPU#2 - while the task still keeps
current->rcu_data of CPU#1. It also means that CPU#2's read counter
did _not_ get increased - and a too-short grace period may occur.
it seems to me that the only safe method is to pick an 'RCU CPU' when
first entering the read section and then stick to it, no matter
where the task gets migrated to - or to 'migrate' the +1 read count
from one CPU to the other, within the scheduler.
i think the 'migrate read-count' method is not adequate either, because
all callbacks queued within an RCU read section must be called after the
lock has been dropped - while with the migration method CPU#1 would be
free to process callbacks queued in the RCU read section still active on
CPU#2.
Right -- the limitation is that you cannot declare a grace period
over until -all- RCU read-side critical sections that started before
the call_rcu() have completed.
i'm wondering how much of a problem this is though. Can there be stale
pointer uses? E.g. code like:

	rcu_read_lock();
	call_rcu(&dentry->d_rcu, d_callback);
	func(dentry->whatever);
	rcu_read_unlock();
In this code segment, a correct RCU will not invoke the callback
until after the rcu_read_unlock(). Ugly code, though. But see
below...
would be unsafe, because the pointer is still accessed within the RCU
read section, and if we get migrated from CPU#1 to CPU#2 after the
call_rcu() but before the dentry->whatever dereference, the callback may
be processed early by CPU#1, making the dentry->whatever read operation
unsafe.
the question is, does this occur in practice? Does existing RCU-using
code use pointers it has queued for freeing, relying on the fact that
the callback won't be processed until we drop the RCU read lock?
Using a pointer that had been passed to call_rcu() would be in theory
safe, but I would usually object to it. In most cases, it would cause
confusion at the very least. However, there are exceptions:

	rcu_read_lock();
	list_for_each_entry_rcu(p, head, list) {
		if (unlikely(p->status == DELETE_ME)) {
			spin_lock(&p->mutex);
			if (likely(p->status == DELETE_ME)) {
				p->status = DELETED;
				list_del_rcu(&p->list);
				call_rcu(&p->rcu, sublist_finalize_rcu);
			}
			spin_unlock(&p->mutex);
		}
	}
	rcu_read_unlock();

You can drop the spinlock before the call_rcu() if you want, but doing
so makes the code uglier. Besides, you have to allow for the fact that
another instance of this same code might be running on the same list
on some other CPU. You cannot invoke the callback until -both- CPUs/tasks
have exited their RCU read-side critical sections.

Thanx, Paul
Paul E. McKenney
2005-03-24 06:55:53 UTC
ok. It's enough to put a barrier into the else branch here, because the
atomic op in the main branch is a barrier by itself.
Since the else branch is only taken when rcu_read_lock_nesting > 0, do
we need the barrier at all?
Actually, since atomic_inc isn't a barrier, we do need that mb.
However, it should only be necessary in the main branch and we
can use smp_mb__after_atomic_inc which is optimised away on a
number of architectures (i386 in particular).
You are right, I should have said smp_mb__after_atomic_inc() instead of
smp_mb() in the rcu_read_lock() case and smp_mb__after_atomic_dec()
in the rcu_read_unlock() case.
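
As a sketch, the rcu_read_lock() side would then read something like this
(assembled from the patch hunk quoted earlier; the exact barrier placement
is an editorial assumption):

	void rcu_read_lock(void)
	{
		if (current->rcu_read_lock_nesting++ == 0) {
			current->rcu_data = &get_cpu_var(rcu_data);
			atomic_inc(&current->rcu_data->active_readers);
			/* order the critical section after the increment;
			 * optimised away on i386, a real barrier elsewhere */
			smp_mb__after_atomic_inc();
			put_cpu_var(rcu_data);
		}
		/* nested case: no barrier needed - code can only bleed
		 * into the enclosing read-side critical section */
	}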

Thanx, Paul