Session:The Kernel Report ELC 2012
- ELC 2012
- February 15, 2012
- Jonathan Corbet
- The Kernel Report ELC 2012
- here (linux foundation) and here (free-electrons)
- 58 minutes
The Linux kernel is at the core of any Linux system; the performance and capabilities of the kernel will, in the end, place an upper bound on what the system as a whole can do. This talk will review recent events in the kernel development community, discuss the current state of the kernel and the challenges it faces, and look forward to how the kernel may address those challenges. Attendees of any technical ability should gain a better understanding of how the kernel got to its current state and what can be expected in the near future.
Jonathan Corbet got his first look at the BSD Unix source back in 1981, when an instructor at the University of Colorado let him "fix" the paging algorithm. He has been digging around inside every system he could get his hands on ever since, working on drivers for VAX, Sun, Ardent, and x86 systems on the way. He got his first Linux system in 1993, and has never looked back. Mr. Corbet is currently the co-founder and executive editor of Linux Weekly News; he lives in Boulder, Colorado with his wife and two children.
- Transcribed by
- Chris Dudding
0:00 - 1:00:
[ELC Slide - Thank you to our sponsors]
>> INTRODUCER: So, I'd like to welcome you all out to this year's Embedded Linux conference. Very happy to see you all here and I hope you're as anxious as I am.
We're.. well erm. anxious is not the right word but excited as I am about the great sessions we've got planned for this week. Um..
This is always erm.. the best part of getting ready for a conference is when you actually get the thing under way and there's been a lot of prep work behind the scenes and we're very excited to have you here and excited about the programme we've got.
[ELC Slide - Mobile App]
I've got just a couple of quick announcements I'd like to make:
First, about the mobile application so in the guide it talks about a mobile app um.. the actual name you look for in the marketplace is wrong in the guide
1:00 - 2:00:
You need to look under Linux Foundation conference. And that's available on the Android Market. There's actually a kind of a funny story why it's not available for iPhone and that was because when it was first submitted to the iPhone erm. I don't know what they call their..
>> AUDIENCE: App Store
[Laughter] Well, so. Yes, thank you. It actually it was for.. It had Android listed in the description because it covers both conferences: Android Developers.. [rephrases] Android Builders Summit and ELC and they rejected it because of the word Android. So, go figure!
So, not for lack of trying. I'm sorry we don't have an iPhone app for you. Get yourself a more open phone! [Laughter]
[ELC Slide - Intel Atom Processor Giveaway]
Anyway, also I'd like to talk about this. So this um.. [holds Intel Atom development board] Ignore the antennas. Inside is a little development board
2:00 - 3:00:
that is being manufactured by Intel and it contains an Atom processor. It's actually called the.. I love Intel code names.. the board is called a Fish River Island 2 and it's erm. E6XX Tunnel Creek Processor and they will be manufacturing these.. these are not available yet. They will be manufacturing 'em and sending them out to attendees. If you are interested in getting one.. we have.. unfortunately we don't have enough for everyone but what they are doing is they are taking proposals. You can register at the Intel desk just outside in the lobby. And kind of just give an idea what you would plan to do with the board and the first 120 good ideas will get one shipped to you sometime in April or May. So, that's pretty nice. Let's thank Intel for that. [Claps] So that's pretty nice. Um.. So that's pretty cool
3:00 - 4:00:
Nice little development board for you [inaudible comment from audience] [laughs] No, too low.
and let's see I want to mention the Yocto reception. Hosted by Yocto and Intel, the reception is tonight. Should be a lot of fun. It's over at the Hiller Aviation Museum and there's a little.. cute little boarding pass thing you got talking about that. Also we've got little stickers done by jambe that you can get if you want to proudly proclaim that you came to ELC
and the last thing to do is to introduce our keynote speaker for this morning. So, I've known Jon Corbet for a number of years. He's kind of.. I don't know if this is the right term because he has no beard.. he's one of the grey beards in the Linux industry and an incredible incredible asset
4:00 - 5:00:
At these events it is very customary for us to put in a plug for LWN.net and this event is no exception. In my opinion one of the premier sources for information about Linux, the Linux kernel and the industry and open source.
If you are not a subscribing member of LWN.net, you should be one.. shame on you.. because this is an asset. Really a community asset that we should support and Jon very graciously accepted to give our kernel report and he'll tell us all about what's going on with the kernel in the last little bit and maybe a little bit about what's coming up in the future
So without further ado, let me introduce Jon Corbet.
5:00 - 6:00: [Slide 1 - The kernel report]
>> JON CORBET: Hi. Thanks a lot. Good morning everybody. How many of you have seen me give one of these talks before? [Laughter] A fair number.
[Slide 2 - The plan]
Well you'll be glad to hear I've reorganised it.
The plan remains the same, which is to look back over a year's worth of kernel developments with an eye towards what's going on in the future. I've changed the way things are done. Hopefully it will work out well.
[Slide 3 - Starting off 2011]
We're actually going to start just over a year ago with.. at the beginning of 2011 we saw the release of the 2.6.37 kernel.
What better to start a new year than with a new kernel? This was a fairly big release with well over 11,000 changesets in it.
This kernel brought in the first of a set of scalability patches for the virtual filesystem layer which added a fair amount to the complexity of that layer but also if you had the right kind of workload brought about something of the order of 30% performance improvement if you were doing lots of opens and closes that sort of thing. So that was good to have.
Block I/O bandwidth controller.. it's actually a second bandwidth controller working at a higher level in the I/O scheduler stack allowing
6:00 - 7:00:
the placement of absolute limits on block I/O bandwidth
Finally got some support for the Point-to-Point Tunneling Protocol in the mainline kernel.
Basic support for the parallel NFS protocol and
Wakeup sources which is an interesting.. on its own wakeup sources is just an accounting mechanism for tracking devices in the system that can wake the system from a sleep state but it's part of a bigger effort to replicate the Android opportunistic suspend mechanism and provide an implementation of that mechanism that works well within the mainline kernel. So we are still seeing pieces of that going in and pieces of it under discussion but at some point we may have a solution for that
So that was 2.6.37 - a lot went in there.
[Slide 4 - What have we done since then?]
And a lot has happened since then. So what has gone on since 2.6.37?
We've made 5 more kernel releases. I can call them out as I come to them going through the year.
We've merged almost 60,000 changesets over the course of the last year. These have come from over 3000 developers and at this point we have over 400 companies that we can identify that have contributed to the kernel
7:00 - 8:00:
So, I've put up numbers like this before. We've seen them before.
We know at this point that the kernel is a very active, very fast moving project. Perhaps the biggest on the planet, hard to say and it continues to move on and it shows no real signs of slowing down
[Slide 5- February]
So February of 2011
[Slide 6 - Greg K-H quote]
One of the things that happened early on in February was a little note from Greg Kroah-Hartman congratulating Ralink, saying "as you can see Ralink has stopped dumping drivers on us and instead is now working on patching the driver that we already have in the upstream kernel and trying to make that driver support their new hardware". This, he said, "shows a huge willingness to learn how to work with the kernel community and they need to be praised for this change in attitude".
This is.. I mean it's a nice note but there is nothing all that special. In a way it's something that we go through with a lot of companies: they take a little while to figure out how to work with us and then they do
8:00 - 9:00:
and so we see a lot of progress as companies figure out how does mainline kernel process work, how can we get our code in there, why is it in our interest to do so and then they figure how to do it and they become part of the machine. And we see this happening over and over again
[Slide 7 - Employer Contributions]
So this seems as good a time as any to put up this slide. This is a variant of a slide that I've been putting up for a while showing the top contributors to the kernel over the course of the last year from the beginning of the 2.6.38 development cycle through the 3.2 release
So, we see as always, volunteers top of the list at just under 14%. The percentage of changes coming in from people working on their own time has actually slowly fallen over the years and why that is is hard to say. One could take a pessimistic view and say that the kernel is getting too big and too complex, the easy projects are done and so we are putting off our volunteers that way. On the other hand, one could look at this and one could say well anybody who has shown any ability to actually get code into the kernel in any kind of reliable way
9:00 - 10:00:
tends not to stay a volunteer for very long unless they really really want to because they tend to get buried in job offers.
so that of course is not going to be a bad thing.
other than that we see a lot of the same companies that we've been seeing for quite a long time
we can see companies that are not only competing fiercely in other areas of the market but in fact at this point they all seem to be suing each other elsewhere [Laughter] but they are still working together quite well at this level and um.. the situation hasn't changed a whole lot
but here's one change I want to call out
this is.. was.. alright.. we'll do it the old fashioned way. no we won't..
[Slide 9 - Kernel changeset contributions by employer]
speak to me.. come on.. um.. alright. well. this is the one i was going to get to eventually. um.. what is going on here.. alright.. we'll get there
10:00 - 11:00:
never done that before
This is just a plot of the percentage contributions from the top companies. The same companies we saw on the last slide and we see.. in a noisy sort of way.. we have horizontal lines. Right? The companies who were contributing back around 2.6.20 which is when this slide starts are still contributing heavily now.
The big change I want to call out now is those two heavy lines at the bottom. Those two lines correspond to TI and Samsung. Right? And they correspond to a trend we're seeing in general which is a much increased level of involvement of the mobile and embedded community. As mobile Linux grows in importance and grows in deployments these companies are figuring out that they need to work with the mainline and they are doing so.
So we're seeing this strong upward trend in contributions from that part of the market. I think we will continue to see that for quite some time
[Slide 10 - Also in February]
One other thing that happened in February, just thought I'd point out quickly,
11:00 - 12:00:
This didn't actually happen then but this is when it came out and became well known. Red Hat for their enterprise kernel stopped shipping individual patches with their kernel and now you just get one big blob with all their changes. All 7699 changes in their current kernel kind of mixed in together into a single thing. So it's become much harder for people who want to look in there and see what are the individual changes that Red Hat has actually made to the kernel. And why have they done that. And that's kind of a sad statement on the nature of the competition at that level of the marketplace. And there's not much to be done for it. It is of course entirely compliant with the licensing and so on but it's not something that our community has really welcomed
[Slide 11 - March]
[Slide 12 - 2.6.38]
Alright.. Let's move on to March.. Well. Another month. Another kernel release. That's when 2.6.38 came out
A couple of very interesting patches that came out then. Including per-session group scheduling. This is the famous 200 line kernel patch that people were talking about for a while that turned into about an 800 line kernel patch by the time it actually went into the mainline
12:00 - 13:00:
that allows the kernel to employ the group scheduling mechanism automatically and partition processes into different groups and schedule them.. schedule the groups against each other. This can give you much nicer interactivity in desktop situations. It's useful in a lot of other situations as well, to keep different groups of processes from interfering with each other in the scheduler and the nice thing about this particular patch is not that it added this feature because we've had this feature for years. It sort of made it just work. There's a whole lot of nice things that we have in the kernel that people don't really make a whole lot of use of because you have to do fiddly things with them to make them work. If you can make them simply work for people then things are a whole lot better
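Editor's illustration (not part of the talk): a minimal Python sketch of why grouping helps interactivity. It assumes an idealised fair scheduler that splits CPU time equally, first between groups and then between the tasks inside each group. The function name `cpu_shares` and the task and session names are hypothetical; the real scheduler is far more involved.

```python
from collections import defaultdict

def cpu_shares(tasks, grouped):
    """Return {task_name: fraction of CPU} under an idealised fair split.

    tasks   -- list of (task_name, session_name) pairs (hypothetical names)
    grouped -- False: every task competes equally with every other task
               True:  sessions compete equally, then tasks within a session
    """
    if not grouped:
        return {name: 1 / len(tasks) for name, _ in tasks}

    by_session = defaultdict(list)
    for name, session in tasks:
        by_session[session].append(name)

    group_share = 1 / len(by_session)
    return {
        name: group_share / len(members)
        for members in by_session.values()
        for name in members
    }

# One interactive task on the desktop session, eight build jobs in another.
tasks = [("browser", "desktop")] + [(f"make-{i}", "build") for i in range(8)]
flat = cpu_shares(tasks, grouped=False)
fair = cpu_shares(tasks, grouped=True)
print(round(flat["browser"], 3), fair["browser"])  # 0.111 0.5
```

In this toy model the interactive task goes from one ninth of the CPU to one half once the eight build jobs are scheduled as a single group.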
Transparent huge pages, which is also on the list, is another example of that. The huge page mechanism in most processors allows the memory management unit to work with much larger pages. You get a nice performance improvement out of that for just about any workload you can think of because you save a lot of pressure on your translation lookaside buffer and such.
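Editor's illustration (not part of the talk): a small Python calculation of "TLB reach", the amount of memory a TLB can map at once. The entry count of 64 is an assumption chosen for the example, and the 4 KiB and 2 MiB page sizes are the common x86 values; real hardware varies.

```python
KIB = 1024
MIB = 1024 * KIB

def tlb_reach_bytes(entries, page_size):
    """Memory covered by a TLB with `entries` slots, one page per slot."""
    return entries * page_size

ENTRIES = 64  # assumed TLB size, for illustration only

small = tlb_reach_bytes(ENTRIES, 4 * KIB)  # regular pages
huge = tlb_reach_bytes(ENTRIES, 2 * MIB)   # huge pages

print(small // KIB, "KiB with 4 KiB pages")  # 256 KiB with 4 KiB pages
print(huge // MIB, "MiB with 2 MiB pages")   # 128 MiB with 2 MiB pages
print(huge // small, "times more memory per TLB fill")  # 512 times ...
```

With the same number of entries, each huge page covers 512 times as much memory, which is where the reduced TLB pressure comes from.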
Transparent huge pages used to require you set up hugetlbfs to the side
13:00 - 14:00:
make changes to your application to make use of it and so on. So there weren't too many people actually using it. Now we're all using it because it simply works in the system and we all get the benefits from that
So we had various other things. Including the other half of the virtual file system scalability patch set I mentioned before. Transmit packet steering which is a networking scalability change that came out of Google and improvements to the block I/O bandwidth controller and of course a whole bunch of other stuff but it's those two patches I think that were really the significant features of 2.6.38
[Slide 13 - Linus quote]
Something else that happened in March was that Linus threw a temper tantrum. This is not necessarily all that surprising, it happens but this was a fairly big one
He got a bunch of pull requests from the ARM community and he decided that he was really just fed up with them so he stopped actually pulling them and said "ok somebody needs to get a grip in the ARM community". So, what's going on here
[Slide 14 - What is the "ARM problem"?]
14:00 - 15:00:
I probably don't need to tell the people in this room that the ARM architecture is interesting, in that it doesn't really come with the platform definition the way some other architectures do. Every ARM system is unique in its own very special way. And so, there's a lot less commonality in terms of the code needed to support those systems
When I say the "Embedded" mindset what I mean is.. we have a lot of people working on their own special little projects working under intense time pressures and so on and so they've tended to solve their specific problem and if they put code upstream at all they've worked in their own piece of the tree and not really thought about making the kernel better for everybody involved.. so we've ended up with.. um.. with a lot of stuff.. there's really been nobody overseeing this tree to try to make this better. So we get a lot of little sub-trees with a lot of duplicated code and a fairly big ugly mess has resulted in that tree
15:00 - 16:00:
[Slide 16 - Why is this happening]
Another way of putting this is that for years we've been asking the embedded community to contribute back to the kernel. Please please give us your code back. And the fact is that they are now doing that. Right? What we have here is a problem that is a result of our own success. We've gotten what we wished for. You should always be careful what you wish for. But in fact what is happening is really a good thing. The only real problem is that we haven't developed the processes within the community to manage all of this stuff
[Slide 17 - Cleaning up the mess]
So, what's doing.. what's happening? We're trying to clean things up. More high level oversight now. Arnd is running a tree that funnels a lot of the system on chip stuff into the kernel and so on. Trying to keep a better grip on it. There's a lot of cleanup going on at various levels. Consolidating code that was duplicated across all the various different ARM sub architectures and a move towards the device tree mechanism that can hopefully someday help us get rid of these board files and such
16:00 - 17:00:
and maybe someday.. I doubt we'll ever have a single kernel that runs on every ARM system but we can maybe have a kernel that runs a lot of the more popular ones without having to build it specifically for whatever system you're working on at any given time. We're heading in that direction. Things are getting a whole lot better
And ARM is not only becoming a proper first tier architecture, it's actually starting to drive things in interesting ways. If you look on LWN today we've just posted an article from an ARM developer about a concept called big.LITTLE. We've got system on chips now that have two different classes of ARM processors on them. There's the big fast power hungry one and the little slow power efficient one and you try to figure out which one of those you should be using for any given workload at any given time. This is going to stress the scheduler in very interesting ways. The scheduler has not been written with that kind of stuff in mind so we now have stuff coming up from the mobile embedded world that is going to be driving the future directions of the core kernel and it's going to be very interesting to watch
17:00 - 18:00:
[Slide 18 - April]
Moving on to April. That's what April looks like in Utah. Nice place although the network connectivity is kind of poor, but you've got to take it.
[Slide 19 - Native Linux KVM Tool]
So, we had an interesting fight that came about in April. As a result of the posting of a thing called the native Linux KVM tool. This is just a small replacement for QEMU. It's a hardware emulator aimed mostly at kernel developers who don't want to mess around with a big full functional system
The sticking point is that they wanted to merge this code (which is user-space code) into the kernel source tree.
[Slide 20 - Ingo Molnar quote]
So, we've had people actually pushing this idea pretty hard. This is Ingo Molnar who says "some day somebody is going to integrate into a single tree and make the killer distribution that pushes everything else aside". So, not everybody agrees with this needless to say. This is an idea that is being pushed.
[Slide 21 - User-space code in the kernel tree?]
The people that push the idea of putting user space into the kernel tree cite a number of advantages
18:00 - 19:00:
The code becomes a whole lot more visible because you can patch it and you end up in the kernel change log. Tend to get more developers. They say you can develop kernel ABI and users together so that the two can develop into a much.. sort of.. homogeneous whole. It encourages developers to think across this kernel user-space boundary which is a pretty hard boundary that people usually lurk on one side of or the other. The idea is you are supposed to get better integration across the whole system if you do this
[Slide 22 - User-space code in the kernel tree?]
On the other hand, people complain that it makes the kernel tree bigger. The kernel tree.. source tree.. is not all that small now. There are claims that you get ABI stability problems because if you can change the user space ABI and its primary user at the same time, because you've got them all together in the same tree, you can in theory create stability problems for other users of that ABI. This is not supposed to happen. We are not supposed to do that in the kernel but there are people who claim this does happen
19:00 - 20:00:
They claim that other out of tree projects are disadvantaged because they just don't get the same visibility and they ask where does it end? Do we put our desktop environment into the kernel tree? Do we throw LibreOffice in there? You never quite know what's going to go on there. So this is a fight that's going to go on for a while. These people are not going to give up on this. Linus seems to be holding the line about pulling more user space stuff in for now, but I don't think we're done with that discussion
[Slide 23 - April]
One other thing. Sorry Mark wherever you are.. but um. but this is a quote that really stuck with me at the time. That for all the progress that we have made about getting companies to work with the development process, we still have pockets of resistance where they say we need to work with proprietary drivers and binary only code
20:00 - 21:00:
and this is something that we have been fighting for years. In the end it seems like we almost always win and that companies figure out that it is not in their own interest to ship their code in this way but it continues to happen and we're seeing it certainly in the mobile area with graphics drivers and some others. I hope that once again we can get past this and get to where we have truly free platforms that we can work on from top to bottom
[Slide 24 - May]
Alright. Moving on to May
One thing that happened in May was the posting of a mechanism called secure computing or an enhancement to this. The idea here was to add a feature where a process could install a bitmap in the kernel saying only the system calls that are indicated in this bitmap will be accessible to me from here on out. It was meant for use with the Chromium browser as part of their sandboxing scheme.. just trying to reduce the attack surface of the kernel.. to take a whole bunch of system calls they know they won't use and take them off limits so they cannot be accessed at all
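Editor's illustration (not part of the talk): a minimal user-space simulation in Python of the bitmap idea described above, where bit N set means system call number N stays available. The class `SyscallBitmap`, the helper `invoke`, and the small table of system call numbers are hypothetical and only model the concept; this is not the kernel interface that was proposed or later merged.

```python
class SyscallBitmap:
    """One bit per system call number; a set bit means 'still allowed'."""

    def __init__(self, allowed_numbers):
        self.bits = 0
        for nr in allowed_numbers:
            self.bits |= 1 << nr

    def permits(self, nr):
        return bool((self.bits >> nr) & 1)


# Illustrative numbering only; real numbers depend on the architecture.
SYSCALLS = {"read": 0, "write": 1, "open": 2, "ptrace": 101}


def invoke(bitmap, name):
    """Pretend to enter the kernel: refuse anything the bitmap excludes."""
    if not bitmap.permits(SYSCALLS[name]):
        raise PermissionError(f"{name} is off limits in this sandbox")
    return f"{name} ok"


# A sandboxed process keeps only read and write.
sandbox = SyscallBitmap([SYSCALLS["read"], SYSCALLS["write"]])
print(invoke(sandbox, "read"))  # read ok
try:
    invoke(sandbox, "ptrace")
except PermissionError as err:
    print(err)  # ptrace is off limits in this sandbox
```

The check is a single shift and mask per call, which is part of the appeal of a bitmap: it removes calls from the attack surface without any per-call policy logic.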
21:00 - 22:00:
So people look at this and they say "ok, cool.. but how about we enhance this a bit, we can add a filtering mechanism, we maybe hook it into the tracepoints mechanism because we've got all these points in the kernel"
at which point the tracepoint people jump in and say "hey wait a minute you're not doing that to my tracepoints" and a big fight results and the end result after the dust is settled is that nothing gets merged and the developer goes away
So one can point at this and can say it shows the kernel development process at its absolute worst. Right?
We had this big fight and we had a promising young developer trying to do cool stuff who got discouraged because people tried to throw their own irons in the fire and nothing happened. On the other hand, last month or so he came back with a completely reworked version of this patch using the Berkeley Packet Filter which works a whole lot better, much nicer design, and it looks like it probably will go into the kernel perhaps as soon as 3.4
So one can say in fact this shows we did things right. We push back on things until we get something that suits us..
22:00 - 23:00:
that solves the problem correctly and we wait until we get the right solution in there so you know you can interpret it either way
[Slide 26 - Yet another kernel release]
Also we had yet another kernel release in May when 2.6.39 came out.
Various interesting sorts of additions including directed yield, which is a virtualisation optimisation; the ipset mechanism for managing a large number of IP addresses in the firewalling mechanism; transcendent memory, which is an interesting approach to memory management used again mostly in the Xen area.. the core went in at this point not the users of it.. User namespaces are part of the solution to the containers problem. Containers can be thought of as a form of lightweight virtualisation where you wall off a group of processes but they are all running on the host kernel, right? You're not running on a virtual machine, you're running on the host kernel. This is actually a much harder problem to solve than um. full virtualisation is
23:00 - 24:00:
they've been working on it for many years and they will continue to be for a little while yet but we got another one of the pieces in there
The media controller is a response to hardware complexity. We've now got.. in this case.. video acquisition devices, webcam devices, or your phone camera devices that now have multiple hardware components. You are trying to patch the data between them in various sorts of ways. An exported interface to user space that actually allows this to be used. This is complex stuff and I don't think we've solved this problem very well yet but we're trying to do our best
[Slide 27 - Big Kernel Lock]
one other nice thing that happened.. in May
only took us 15 years. thanks to Arnd who really carried that across the finishing line although a whole lot of people worked on it to get it there. given enough time we can solve just about any problem we have to
so we got 2.6.39 out
24:00 - 25:00:
[Slide 28 - During the 2.6.40 merge window]
and then Linus started hearing voices.. right after that.. during the merge window. Saying the numbers are getting too big and maybe he's going to call this thing 2.8.0. Because when the voices speak, he listens
So of course when you say something like that, one of the first things that's going to happen is you're going to get more voices.
[Slide 29 - During the 2.6.40 merge window]
So Greg Kroah-Hartman comes out and says well if you do this I'm going to buy you a bottle of whatever whisky you want
Well it must have worked because, as we all now know, Linus did this
He didn't call it 2.8.0, he called it 3.0. This was, as I think pretty much everybody knows at this point, just a numbering change. There was nothing particularly special about the 3.0 kernel that merited a major number change.. just we were getting really really tired of these big version numbers. So we went on from there
[Slide 30 - June]
So the week after that Greg did in fact present him with a bottle of whisky on stage.. with cups. He didn't open it.. although what happened later that evening.. was.. um.. not necessarily well advised
25:00 - 26:00:
and so we now have the 3.x kernel series instead of 2.6 and everybody is happy ever after until we get to 3 dot something big
[Slide 31 - Ext4 snapshots posted]
Alright.. in June we saw another posting of an interesting feature for the ext4 filesystem.. being the snapshotting feature
Snapshots of course are a nice feature. You save a copy of the state of the filesystem at some particular point. You can use it for things like rolling back from a failed system upgrade, you can use it to create a [...] version of the filesystem for backups, you could also actually make use of a feature like this in an embedded device for a factory reset kind of functionality where you simply have a snapshot being the initial factory state of the device and all you've got to do is go back to that snapshot and you've done your reset
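Editor's illustration (not part of the talk): a toy Python model of the factory reset use case. The class `SnapshotFS` and the file names are hypothetical, and the snapshot here is a plain copy of a dictionary; a real filesystem would share unchanged blocks rather than copy them, so this only shows the behaviour, not the implementation.

```python
class SnapshotFS:
    """Toy filesystem: paths map to contents, snapshots are saved copies."""

    def __init__(self):
        self.live = {}
        self.snapshots = {}

    def write(self, path, data):
        self.live[path] = data

    def snapshot(self, name):
        # Plain copy for simplicity; a real filesystem shares unchanged blocks.
        self.snapshots[name] = dict(self.live)

    def rollback(self, name):
        self.live = dict(self.snapshots[name])


fs = SnapshotFS()
fs.write("/etc/device.conf", "volume=5")
fs.snapshot("factory")                       # taken before the device ships

fs.write("/etc/device.conf", "volume=11")    # user changes a setting
fs.write("/home/user/notes.txt", "hello")    # user adds a file

fs.rollback("factory")                       # factory reset
print(fs.live)  # {'/etc/device.conf': 'volume=5'}
```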
nice feature. everybody likes it
[Slide 32 - Quote from Josef Bacik]
But we still had some complaints about it. This [...] is from Josef Bacik who is a btrfs developer
and he's saying why is it that we're shoehorning these big features into ext4 which is supposed to be a stable filesystem, right?
26:00 - 27:00:
Meanwhile we've got btrfs that already does all this stuff and we're trying to stabilise it
other people have asked very similar questions, why are we doing stuff like this to ext4 when ext4 was meant to be our stopgap until the real next generation filesystems
[Slide 33 - What's up with ext4?]
Well, as it happens, there is an awful lot going on with ext4
With 3.2 we saw the addition of a significant new feature called bigalloc, which allows it to finally move away from allocating blocks in 4KB chunks so you can allocate much larger chunks. If you've got a filesystem mostly holding large files then what you end up with is much more efficient operations on those files. This is useful within Google, for example, where they're trying to do this and where a lot of the driving force came from, and in other places as well. A significant speed up. Various other things in the works: the snapshot feature that I mentioned, inline data which is an optimisation for very small files
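Editor's illustration (not part of the talk): a short Python calculation of the trade-off behind larger allocation units. The 1 MiB cluster size and the example file sizes are assumptions picked for the arithmetic, and the helper names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def allocation_units(file_size, unit_size):
    """Number of units needed to hold the file (integer ceiling division)."""
    return -(-file_size // unit_size)

def slack_bytes(file_size, unit_size):
    """Space allocated but not used by the file's data."""
    return allocation_units(file_size, unit_size) * unit_size - file_size

# A large file: far fewer units to track with bigger clusters.
print(allocation_units(1 * GIB, 4 * KIB))  # 262144
print(allocation_units(1 * GIB, 1 * MIB))  # 1024

# A small file: bigger clusters waste most of the allocation.
print(slack_bytes(10 * KIB, 4 * KIB))  # 2048
print(slack_bytes(10 * KIB, 1 * MIB))  # 1038336
```

The first pair of numbers is why the feature suits filesystems that mostly hold large files; the second pair is why it is not a universal default.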
27:00 - 28:00:
secure erase.. if you delete a file it actually goes away from the media completely. Checksumming of metadata.. trying to defend against corruptions on the media and so on
[Slide 34 - In other words]
So what's going on is that ext4, rather than sitting there and gathering dust and fading away, looks to develop and grow for some time yet because there are people adopting it and using it and who want to do other stuff with it. So as long as people keep hacking on it and adding new features that stuff is going to go in. I don't think ext4 is going to just kind of fade away anytime real soon
[Slide 35 - UEFI secure boot]
Also in June people started talking about this feature called UEFI secure boot. The objective of this feature of course is to create hardware that will only give control of the system to a trusted boot loader. This is a boot loader that has been signed by a key that is known to the system level BIOS. It's actually a useful feature: you can use it to thwart attacks on the BIOS or the bootloader of the system. Very low level rootkit kind of stuff
28:00 - 29:00:
and it helps you to ensure the system is running what you think it is. there is of course just one tiny nagging problem.
[Slide 36 - Who is “trusted”?]
This thing.. who is trusted? Who do you trust? Is it the person who thinks they actually bought a computer? Or is it the one who sold it to them? The person that put the hardware onto it? Is it the entertainment industry? All of these people are making a play for control over this and to be the trusted party who says what you can actually run on your computer
[Slide 37 - UEFI secure boot...]
So UEFI secure boot is another one of these mechanisms that could result in the loss of control over our system. Something that we have been fighting for for years, we have been gaining. We have a whole lot more open systems than we have had in years. The situation is nice and getting better. This could take things the other direction
[Slide 38 - Where things stand]
so what's going on. there has of course been a whole lot of work to call attention to the problem: white papers have been written, blog postings have been posted
29:00 - 30:00:
some concessions have actually been gained from all of this, including the understanding that any x86 system that you buy can be put into setup mode. this means that for any of these systems the user can not only boot whatever system they want on there but they can install their own keys and boot whatever system they want in the trusted mode, which is useful, they can actually make use of that capability if they want to do so. it won't be impossible to install your own keys.
[Slide 39 - Where things stand]
A few little glitches still.. installing that key may not be easy.. in fact it's going to have to be done at a low level so users wanting to do this are going to be entrusting their lives to the user interface skills of BIOS engineers
which seriously may involve typing in big long keys as big long series of hex digits and so on not going to be fun. okay.
there's no provision at all for booting from CD in a trusted mode so live CDs could be a thing of the past
30:00 - 31:00:
and i've put down here that ARM systems can be totally locked down. in fact if they are going to meet the Windows labelling requirements, ARM systems must be totally locked down. it's not just that they can, it's actually a requirement that they be done in that mode and that of course is not what we want to see at all. so there's a lot to be done in this area still and we're going to be fighting about UEFI and its successors for a while yet
[Slide 40 - July]
alright. moving on
[Slide 41 - 3.0-rc7-rt]
in july one of the things we saw was the first version of the realtime patchset to come out since march. the realtime patchset had been stuck on 2.6.33 and users have been stuck there for a while much to their chagrin. it wasn't what they wanted. they ran into some pretty gnarly technical problems that kept things from advancing for a while but we finally did get a new version in july
[Slide 42 - The state of realtime]
so where does the realtime patchset stand at this point. well of course we get very nice determinism out of it if you have the right hardware at this point. if you don't have the right hardware you still lose
31:00 - 32:00:
there's not much to be done about that. one of the most thorny problems with this has been the problem of per CPU data. per CPU data is one of the most significant scalability techniques that's used on larger systems but does not really work that well in the realtime world its not good for determinism. so they've come up with a solution that involves some fairly scary locking assumptions, so locking assumptions tend to be scary in general, so if they are designated as such they are really scary, so this is going to have to be watched for a while but the plan is to merge a lot of the realtime tree in the next year
[Slide 43 - Open issues in realtime]
we will see. but this group of people back in october at the realtime mini summit did in fact partition this stuff and do it, i know steve's been rushing right out to merge that stuff and so on. we'll see how that's coming along. still a couple of open issues, one of which is deadline scheduling
32:00 - 33:00:
there is actually a pretty nice deadline scheduling patch out there but the developer who was working on that seems to have got distracted by other things we haven't seen a revision of that for a while somebody else may have to pick up deadline scheduling before we actually get..
>>AUDIENCE: inaudible comment
somebody else is - oh cool. i haven't seen that yet.
and there's a lot of interest in cpu isolation where you take one or more CPUs on the system and you say this is only going to run the application, no interrupts, no operating system, nothing else just the application. there have been various proposals how to implement that. people are working on that. we will perhaps see a solution emerging over the course of the next year. we will see
[Slide 44 - The 3.0 release is delayed]
also in july as we were coming around to the 3.0 release. all of a sudden it got delayed. it got delayed to the point that it pushed the merge window into Linus's vacation which is not something he wanted at all. It made him a little grumpy. but we had this bug in the dcache scalability patches that would cause files to just sort-of disappear for a little while. disappearing files are just not one of the things on the intended feature set for 3.0
33:00 - 34:00:
so we had a fairly high profile debugging crew, being Linus Torvalds, Al Viro, the chief VFS maintainer and Hugh Dickins, arguably the top memory management developer and it still took them about a week to figure this out. it was a really subtle and strange bug
[Slide 45 - Some parts of the kernel have reached a truly scary level of complexity]
which drives home a point some people have been making for a while. some parts of the kernel especially the core kernel have reached the level of complexity that is truly daunting. really pretty scary. try to go in there and figure out what is going on i'm not sure there is anyone who understands it all anymore at this point. and i don't know how you fix that but that kind of complexity is the sort of thing that can kill operating systems over the long term. so we need to really think about how it is that we can if not get rid of some of this complexity at least keep it from getting worse. we will see
[Slide 46 - 3.0 kernel released]
anyway despite that 3.0 did come out towards the end of july
34:00 - 35:00:
new posix clocks waking from suspend and all that. we now have a just-in-time compiler built into the kernel for the Berkeley packet filter subsystem for filtering packets that you are trying to inspect coming off the net sort of thing. send multiple message system call. scalability for applications that are sending lots and lots of little messages. ICMP sockets so you can write an unprivileged ping client at last. namespace file descriptors another piece of the containers problem allowing the system administrator to move into and out of containerized systems. cleancache.. which is some of that transcendent memory stuff i was talking about before. a different way of caching pages and such in the system. and that was our 3.0 kernel
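The ICMP socket feature in that list can be illustrated with a short sketch. This is a minimal Python example, not something shown in the talk, and it assumes a Linux system where the `net.ipv4.ping_group_range` sysctl admits the calling user's group. The helper names `internet_checksum`, `build_echo_request`, and `ping_once` are invented for illustration. It builds an ICMP echo request and sends it over a datagram ICMP socket, so no raw socket and no root privileges should be needed.

```python
import socket
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type for an echo request
ICMP_ECHO_REPLY = 0    # ICMP type for an echo reply


def internet_checksum(data: bytes) -> int:
    """Standard 16-bit ones'-complement Internet checksum."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes = b"ping") -> bytes:
    """Build an ICMP echo request: type, code, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, ident, seq)
    return header + payload


def ping_once(host: str, timeout: float = 1.0) -> bool:
    """Send one echo request over an unprivileged ICMP datagram socket."""
    # Assumption: on Linux the kernel rewrites the identifier and checksum
    # for this socket type and hands back the reply starting at the ICMP header.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_ICMP) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_echo_request(ident=0, seq=1), (host, 0))
        try:
            reply, _ = sock.recvfrom(1024)
        except socket.timeout:
            return False
        return len(reply) > 0 and reply[0] == ICMP_ECHO_REPLY
```

Calling `ping_once("127.0.0.1")` would exercise the socket path; if the sysctl does not allow the caller's group, creating the socket is expected to fail with a permission error.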
[Slide 47 - August]
in August I went to Taipei and got a lesson that certain aspects of agriculture really don't change no matter where you go it all comes down to pizza and beer of course.
[Slide 48 - x32]
35:00 - 36:00:
We had a discussion about an ABI called x32 in august. which is an interesting approach to 64 bit computing. 64 bit mode is.. you know.. 64 bit processors are very nice things.. do a lot of stuff.. you can address vast amounts of memory.. and all that. you really want to run your kernel in 64 bit mode if you can but as it turns out very few applications actually need either 64 bit data or 64 bit pointers
so for these applications expanding all that stuff out to 64 bits is really just bloat. you've expanded your application quite a bit that will slow it down.. of course and use more memory, not something you want
the response to this is the x32 ABI. where you're running your processor in 64 bit mode but you're using a restricted set of instructions that deal in 32 bit data so you're dealing.. 32 bit data and 32 bit pointers so you get all that space back you shrink the applications back down but you still have a processor that can natively address however many gigabytes of memory you want to put into it and so on
36:00 - 37:00:
most of the work in this area is to be done in the user space at this point.. getting the libraries and all that stuff but we do need to define an ABI with the kernel for processors running in this mode or personality you can think about
this has been mostly done, the developer has got distracted with other stuff and slowed down but i wouldn't be surprised to see sometime in this year some distributor starting to show some real interest in putting together an x32 version of their distributions
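The space argument behind x32 can be made concrete with a little arithmetic. The following is a rough Python sketch, not taken from the talk: it lays out a hypothetical doubly linked list node (a 32 bit value plus two pointers) using natural alignment, once with 8-byte pointers and once with 4-byte pointers. The function names and the node layout are assumptions chosen for illustration.

```python
def struct_size(field_sizes):
    """Size of a C-style struct whose fields are naturally aligned."""
    offset = 0
    max_align = 1
    for size in field_sizes:
        align = size  # natural alignment: each field aligns to its own size
        offset = (offset + align - 1) // align * align  # pad up to alignment
        offset += size
        max_align = max(max_align, align)
    # the struct itself is padded to a multiple of its strictest alignment
    return (offset + max_align - 1) // max_align * max_align


def list_node_size(pointer_bytes):
    # hypothetical node: int32 value, next pointer, prev pointer
    return struct_size([4, pointer_bytes, pointer_bytes])


size_with_64bit_pointers = list_node_size(8)
size_with_32bit_pointers = list_node_size(4)
print(size_with_64bit_pointers, size_with_32bit_pointers)
```

Under these assumptions the node shrinks from 24 bytes to 12, which is the kind of saving a pointer-heavy application could hope for; real programs will vary.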
[Slide 49 - 20 years of Linux]
In August, we celebrated 20 years of Linux of course 20 years since Linus first came out and said i've got this little project
that line plot there is just a count of lines of code in the kernel over the course of 20 years. you can see a pretty obvious trend, right? despite economic crises and whatever else went on over the course of these 20 years the kernel just continues to grow
from its beginnings, the first kernel posting was about 10000 lines of code right? tiny
37:00 - 38:00:
there's only one place where we ever got smaller, which was 2.6.36 when we got rid of defconfig files. the only time in the entire time of the kernel that this has happened. i suspect that's a record that's going to remain unbroken for some time yet
[Slide 50 - Kernel.org compromised]
of course the other sad thing that happened in august was the compromise of kernel.org so.. still piecing that together. we know attackers were on the system for some time before they were discovered. they got on using stolen credentials and then they installed trojan SSH binaries so they could steal more credentials and move onto other systems and a number of other systems were compromised as a result of this
as far as we can tell, there was no attempt to actually compromise the software distributed on kernel.org systems. we would have caught a lot of it pretty quickly if they had tried to. in other places it might have taken longer but there was a really determined effort to find any attempt to compromise. as far as we can tell, that was just not what they were trying to do
38:00 - 39:00:
the result of this was that we lost kernel.org for almost 2 months. which was kind of a pain and that resulted in the delayed 3.1 release and so on
[Slide 51 - What has been done]
what has been done. well we have a new kernel.org. they basically tore it down and rebuilt it with a bunch of new machines. split the functionality apart. hired more staff so that we have a better handle on what's going on there. restricted access because it occurred to somebody somewhere along the line maybe having 450 shell accounts into your master system is not the best of ideas if you really want to avoid going through this yet again. so we don't have that anymore. there's a new web of trust for the kernel developers built around a set of GPG keys and so on. a lot of this stuff was enabled by support from the linux foundation. really helped out a lot with this
so we have a new kernel.org that is more functional and more secure and hopefully we don't have to go through this for some time
39:00 - 40:00:
but i will say something that i have been saying for some time which is we don't take security seriously enough. we just don't. level of the attackers out there and their motivation is such that we're going to see instances across the free software community we've seen them in the past, we will continue to. we have to get more serious about this and that's not going to be easy or fun to do
[Slide 53 - September]
alright moving on to september.
[Slide 54 - Oracle to use Btrfs by default]
one thing that happened in september was that oracle said they would ship btrfs by default. i kind of felt like i was being watched as I was reviewing the slide this morning so took a picture out my window.. you really want to photoshop a red eye over that or something.
but you know. what oracle is doing with btrfs is actually very good stuff. what they are doing is that they are pushing it. they want it to be the core of their enterprise distribution
so what's going on with btrfs. well there is some new feature work happening with btrfs
40:00 - 41:00:
but the btrfs developers are very much focused on stability in getting this filesystem to the point where it is ready for production use. distributions are thinking about it. fedora have once again pushed back its plans to go to btrfs by default but this will happen sometime soon and we will have people actually starting to use it
[Slide 56 - Still missing]
but there are some missing pieces. the biggest one arguably being the file system checker and in particular the file system repair tool. writing a filesystem repair tool is actually a very hard task because if you are not really careful when you fix a broken file system you can break it worse than it was before and take data that might have been recoverable and make it unrecoverable so this has been a very long term project that's been done with a lot of care and is taking a while
in the meantime they have added a couple of features including what they are calling the root block history array this is just an array of the previous states of the filesystem btrfs being a copy on write filesystem
41:00 - 42:00:
essentially every previous state of the filesystem is a snapshot so you save a few of those and if something goes wrong and you catch it soon enough you just go back to a previous state of the filesystem and your problems go away
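The idea of keeping a few previous filesystem roots around can be sketched with a toy model. This Python example is only an illustration of the copy-on-write principle described here, not btrfs's on-disk format; the class name `CowStore`, the history depth of 4, and the use of a dictionary as the "filesystem" are all assumptions. A real copy-on-write filesystem shares unchanged blocks between states rather than copying everything.

```python
from collections import deque


class CowStore:
    """Toy copy-on-write key/value store that remembers recent roots."""

    def __init__(self, history=4):
        self.root = {}
        # bounded array of earlier roots; the oldest falls off automatically
        self.previous_roots = deque(maxlen=history)

    def commit(self, changes):
        """Write changes into a new root, leaving the old root untouched."""
        self.previous_roots.append(self.root)
        new_root = dict(self.root)  # simplification: copy instead of sharing
        new_root.update(changes)
        self.root = new_root

    def rollback(self, steps=1):
        """Return to an earlier state by reinstating a saved root."""
        for _ in range(steps):
            self.root = self.previous_roots.pop()


store = CowStore()
store.commit({"a": 1})
store.commit({"a": 2, "b": 3})  # suppose this commit turned out to be bad
store.rollback()                # go back to the state before it
```

Because each commit builds a new root instead of overwriting the old one, going back is just a matter of pointing at a saved root again, as long as the problem is caught before that root drops out of the history.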
also written a read only recovery tool that can extract the data from a badly broken filesystem without making changes to it so we have a lot of the pieces there but the full filesystem checker is actually out there in the repository if you go look for it but it hasn't been announced ready for anybody to use on a real filesystem. we're also missing the RAID support, the patches for that actually exist, they have existed for a few years, they just haven't been merged
that stuff will come around and we'll see btrfs stabilising quite a bit. starting to go into real production use in some places
[Slide 57 - October]
october, october everyone was in prague because that's where the action was including the 2011 kernel summit. so we talked about a lot of stuff in europe but
42:00 - 43:00:
i really wanted to point out just two outcomes. two outcomes that were the key outcomes of this. one was a statement. not a new one but we heard it again.
that kernel maintainers should be saying no more often. that we take stuff into the kernel too easily that perhaps shouldn't be there and we need to push back a little more and insist on really good reasons to get more code into the kernel especially code that adds complexity
on the other hand, it was understood that code that was widely used out there, that has been widely shipped should perhaps be brought into the kernel even if it doesn't live up to our technical standards otherwise. this was of course said with the android code in mind. it's been out there, it's been shipped to millions of people and it's kind of silly at this point that we don't have that sort of stuff in the mainline. so those were a couple of the decisions that came out
[Slide 60 - A slow moment at the Summit]
also, in a slow moment actually while he was sitting at a table at the kernel summit, linus released the 3.1 kernel. came out
43:00 - 44:00:
this was a 95 day development cycle. which is quite a bit longer than usual. the reason for this of course was kernel.org outage and the lack of a desire in particular to open up a merge window without having kernel.org as a place to host a lot of trees. added various features including dynamic writeback throttling which is an attempt to solve our writeback problems writeback being the process of flushing dirty pages of memory back to their backing store on disk pages that had been written to file and so on. we've had performance problems in that area for a while. some of that stuff went in for 3.1. we got a new architecture.. openRISC which is actually an open source hardware design and some improvements to the ptrace system call, lseek system call. various other things that went into the 3.1 kernel that was released during the kernel summit.
[Slide 61 - Embedded long-term support initiative]
one other thing that happened in prague was the announcement of the embedded long term support initiative. so there are various aspects to this
44:00 - 45:00:
including this concept that we'll pick one kernel a year and maintain it for 2 years. going forward this is just going to be part of the regular stable kernel process. already happened with the 3.0 kernel
couple of other trees involved with this initiative. helping the embedded and mobile communities standardize on a single tree and to get their code back into the mainline. Shibata-san is talking about this later today so if you're interested in this I advise going to see his talk. this is a very welcome and very worthwhile effort and i wish it the best of luck.
[Slide 63 - Per-group TCP buffer limits]
something i want to talk about very quickly. the addition of per group TCP buffer limits. this is a limitation on the amount of memory in the kernel that can be taken up by outgoing network data. it's interesting because it's the first mechanism put in place to limit the amount of memory used by the kernel on behalf of user space processes. this is something that there is interest to do in a lot of other areas in various caches and so on
45:00 - 46:00:
we haven't had the mechanism to do that. we do have that now with the TCP buffer limits
[Slide 64 - Control Groups]
and its worth talking just a little bit about control groups in general. control groups of course are a simple mechanism for grouping processes into hierarchies and then allowing the addition of controllers to apply various policies to those groups
developers complain a lot about control groups. they really hate them. a lot of the real trouble is actually not in the control group mechanism but in the controllers. we have these various controllers for things like memory usage, block I/O bandwidth, cpu scheduling and a bunch of other stuff. a lot of these controllers don't necessarily integrate well not only with each other but with the systems they are meant to be controlling and that was partly by design because people writing controllers have tried to be nonintrusive how they did it. but it means that we end up with these warts on the side of the kernel implementing these controllers
46:00 - 47:00:
so there's going to have to be cleanup work done there. what we really need is a central control group maintainer to beat heads together and make this stuff integrate better with the kernel and with itself. i think that's going to have to come together fairly soon
[Slide 65 - LTTng pulled into staging]
one other thing happened in november. Linux trace toolkit next generation. its a tracing toolkit with a whole lot of nice features nice user space support and so on. its been out there for years. widely used in some areas. it was pulled into the staging tree with the idea that it would be merged for the 3.3 kernel and what happened was a big long discussion
[Slide 66 - Two pivotal summit outcomes]
that really recalled the discussion here the same slide i put up before. right?
on one hand, people were saying maintainers are supposed to say no more often. we're going to say no to this one because it doesn't integrate with the existing tracing mechanisms. we don't like the way some parts of it work. we say no
47:00 - 48:00:
on the other hand, people were saying this is code that has in fact been shipped to a whole lot of people. its widely used. this is something we should bring into the kernel under the decision that we should bring in this sort of code.
so people thought about it and all that.
[Slide 67 - The outcome]
the end result is that the linux tracing toolkit lost in this case. it was ushered back out of the staging tree. if you look in the kernel history you actually see it in there then it goes away again. that functionality if it's to get into the kernel will need to be broken up and integrated with our existing tracing functionality and i honestly don't know if that's going to happen to any degree or not. it's kind of a shame; there's some good stuff there. it's a long story working with the development community that hasn't always worked out that well
[Slide 68 - December]
december. that's what december looks like in colorado
[Slide 69 - The Android mainlining project]
we had an announcement in december of what was called the android mainlining project
48:00 - 49:00:
this was the joining together of a couple of efforts that had been going on out there. reinvigorated by the decision at the kernel summit to try to get this kernel code merged. brought together a whole lot of stuff
[Slide 70 - The Android mainlining project]
in fact i need to make one change already. because one piece has already been shoved back out. pmem contiguous memory allocation we talked about that a bit on monday for those of you that were here. various other pieces to the android system. couple of pieces that aren't here yet including wake locks and the ion memory management system. a lot of this stuff has found its way into the staging tree. there's been some discussion about moving it into the mainline. there's been all the usual same old arguments all over again. we'll see how long that takes. slowly stuff is merging there. as greg will tell you with a 3.3 kernel you will actually be able to run android user space on a mainline kernel if you just don't care about little things like battery life but we're getting there. this is progress in the right direction
49:00 - 50:00:
[Slide 71 - January] [Slide 72 - Happy New Year]
so, jan of 2012. coming around towards the end. exactly one year after 2.6.37. once again on january 4th. we had a kernel release this was 3.2. this was a big release with almost 12,000 changesets. proportional rate reduction being a networking change to deal better with losses of data in transit. extended verification module is a trusted computing mechanism that is meant to defend the files on your system from offline attacks. you can detect if somebody has been messing with things. we got a new controller for the control group system where you can now say this group of processes never gets more than 10% of the CPU even if that cpu time is available because they are not paying for it. cross-memory attach is an interprocess communication mechanism. we've got the hexagon architecture yet another new architecture. the btrfs root block array that i mentioned before went in.
50:00 - 51:00:
and a thing called I/O-less dirty throttling which is a rather complex application of control systems to the writeback problem. trying to balance the level of dirty pages in the system with what the backing store devices can do and keep dirty pages from overcoming things. it looks like good code and both people that can actually understand it really approve of it. for the rest of us, we'll see
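The CPU limit from the 3.2 feature list ("never gets more than 10% of the CPU") can be sketched as a quota over a period. This Python example is a hedged illustration rather than anything shown in the talk: it assumes the version 1 control group `cpu` controller with its `cpu.cfs_period_us` and `cpu.cfs_quota_us` files, and the helper names and the cgroup directory passed to `apply_limit` are hypothetical.

```python
import os


def cfs_settings(percent, period_us=100_000):
    """Translate a CPU percentage into period/quota values in microseconds."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    return {
        "cpu.cfs_period_us": period_us,
        # the group may run for this long in each period, then it is throttled
        "cpu.cfs_quota_us": period_us * percent // 100,
    }


def apply_limit(cgroup_dir, percent):
    """Write the settings into a cgroup directory (path is an assumption)."""
    for name, value in cfs_settings(percent).items():
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(str(value))


print(cfs_settings(10))
```

With a 100 ms period, a 10% cap works out to 10 ms of CPU time per period; the group is held back once that is used up, even if the CPU would otherwise sit idle.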
[Slide 73 - 3.3 merge window]
also we had the 3.3 merge window in january. this is the merge window for the upcoming kernel. that we will see sometime in march
a lot of networking stuff went in for 3.3. including team device this is a lightweight bonding device. network priority control group controlling access to the outgoing interface based on control groups. tcp buffer size i mentioned before. byte queue limits are an attack on the buffer bloat problem. putting in limits of the amount of data that can be queued to any outgoing interface. trying to keep from buffering too much data and messing up with our protocols
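The byte queue limit idea, capping queued data by bytes rather than by packet count, can be shown with a toy model. This Python sketch uses a fixed limit and invented names (`ByteLimitedQueue`, `try_enqueue`, `complete`); the kernel mechanism adjusts its limit dynamically, which is not modelled here.

```python
from collections import deque


class ByteLimitedQueue:
    """Toy transmit queue that stops accepting packets at a byte limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.queued_bytes = 0
        self.packets = deque()

    def try_enqueue(self, packet: bytes) -> bool:
        # refuse once the bytes already queued have reached the limit,
        # so the caller holds the packet instead of the device queue
        if self.queued_bytes >= self.limit_bytes:
            return False
        self.packets.append(packet)
        self.queued_bytes += len(packet)
        return True

    def complete(self) -> bytes:
        """The device finished sending the oldest packet; free its bytes."""
        packet = self.packets.popleft()
        self.queued_bytes -= len(packet)
        return packet


queue = ByteLimitedQueue(limit_bytes=3000)
results = [queue.try_enqueue(b"x" * 1500) for _ in range(3)]
```

In this sketch the third 1500-byte packet is refused until `complete()` frees some space, which keeps the amount of data sitting in front of the interface small.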
51:00 - 52:00:
Open vSwitch is a complex new virtual switch used in the virtualisation area it actually holds out the potential for other interesting flow control types of applications and so on but people aren't doing that yet
outside of networking, we also got support for large physical address extensions in the ARM architecture so now you'll be able to put more than 4GB of memory in a 32 bit ARM system and we're going to have all the same kind of ugly stuff that we've had with x86 in that area but people want to do it
android drivers return as i mentioned before
dma buffer sharing APIs under user space control. share buffers between different devices. attack on hardware complexity problem
and a whole bunch of other stuff went in for 3.3. that is now stabilising. like i said sometime in march likely to see that.. that particular release
52:00 - 53:00:
[Slide 74 - February] [Slide 75 - Greg KH joins the Linux Foundation]
that brings us around to right about now. there's really only one thing I'm going to mention
Greg Kroah-Hartman has now joined the linux foundation and will be able to work on the stable trees and such as something closer to a full time job although one wonders, because if you look one of the first things he did was actually start submitting patches to libre office
which may not be what the linux foundation had in mind even though it's a worthy cause for sure.. but, no this is a good thing. it will help to beef up the stable tree. people who watch the stuff he does.. wonder how it is that he gets stuff done. so hopefully he will have some more time to do that
[Slide 76 - Stuff not covered]
so, there's a whole lot of other things i could have talked about although i have talked you all into the ground over the course of the last hour
[Slide 77 - Questions?]
instead i will stop here and I will be happy to answer any questions that people may have. somebody must have a question alright
53:00 - 54:00:
>>AUDIENCE: you had a slide that which said you have like 3000 developers do we need more?
>>JON CORBET: ok, question is, you had a slide saying that there are about 3000 developers over the course of one year. do we need more?
this is a question we've actually asked at kernel summits and so on. how do we bring in more developers and so on
and in a sense yeah we always need more. there are plenty of unsolved problems, a lot we can do. people do ask how big can our community get before it simply becomes unwieldy. certainly there's got to be a limit but i don't think we are there.. at this point. i think we are handling the size of our community quite nicely. the whole sub-maintainers system works well enough. things are working pretty smoothly at the moment so whether we need more that depends on the problems to be solved and if there are people that feel they need to be solved then we need them
54:00 - 55:00:
>>AUDIENCE: not that i know anything about the topic but i was wondering if you could speak more about the LTT stuff because it seems to me the problem is that the people who say no are not actually working to replicate the functionality... people who added the functionality.. module based system.. dead end situation here.. elaborate on the way out of this
>>JON CORBET: OK, so. the question from this guy who claims to know nothing about LTT
being the guy who started the project many years ago.
essentially, how do we get around the problem of getting LTT into the kernel? since the people who are blocking it are not working to replicate its functionality .. and you know.. its a hard problem.. people who are blocking it ..
55:00 - 56:00:
not only don't feel the need.. don't feel it's their job.. people who want a particular functionality in the kernel pretty much get to take it upon themselves.. that's the way kernel development works.. can't tell other people to implement something for you.. LTT is a hard problem.. and i think it comes down to perhaps people in the kernel not always understanding what the benefits of LTT are because if you look at it from an outside point of view. you see things like complicated trace format standards documents, yet another ring buffer, things like that that don't necessarily look useful to the kernel
on the other side, we see a project that i have to say is not entirely unhappy to be outside the mainline frankly. i think that's part of it.
what ltt needs i think is somebody who both really understands it, who really appreciates what it does and is willing to work with the kernel community to re-create that functionality, whatever is missing from the kernel's tracing functionality, in ways that work with the LTT tools
56:00 - 57:00:
its not going to be a small job
>>AUDIENCE: to what extent.. inaudible
>>JON CORBET: to what extent will the realtime patch set make it into the mainline kernel?
and everytime i've made a prediction on this i've been really badly burned. when it comes to getting the stuff into the kernel, there are no deterministic timelines at all. but the intent is to merge if not all of it most of it.. they like to do a lot of this within a course of a year. thomas ? has indicated he's getting a little tired of that
57:00 - 58:00:
doesn't want to maintain it out of tree for much longer
i think we will see more of it coming in.. some of the stuff is sufficiently scary. that it may take a while yet. i won't say that in a year we will see most of it merged. the desire is to merge it and someday i think.. the interesting thing is if you look at it.. a whole bunch of stuff has to come in
a lot of stuff we think of as core mainline functionality was initially realtime stuff.. and we've all got the benefits of that.. part of the problem is that they keep adding stuff to the realtime tree.. its a treadmill in a way but we'll get there
its probably getting to be about break time.. one more
58:00 - 59:00:
>>JON CORBET: Does this mean there is a conservation of determinism?
Who let this guy into the room? He snuck in.. i don't know.. Every patch gets into the mainline on its own schedule. there's not much to be done for that. on that cheery note. I think I will thank you all very much and enjoy the rest of the conference