In this readme we want to provide some more technical information about RPM and PyRPM itself and how everything works. Some of the information collected here can be found in more detail in other documents spread out over the Internet; see the related projects section at the end of this file.
Use pychecker or pylint to improve code quality; use 4-space indents and no tabs. Most scripts accept --hotshot as the first option to run them with the hotshot Python profiler. oldpyrpm.py has some TODO items listed in the source.
Python is a very good scripting language, very nice for fast prototyping and implementing new features. We currently see limits when too much data needs to be processed. This happens with parsing XML data as well as with the huge amount of dependency data in rpm packages (where all files in rpm packages can also be used for dependencies; those are about 350000 files in FC-development).
These python modules could see some improvements:
python-elementtree is about 3 times faster than libxml2 at reading the comps.xml file. elementtree still does some internal automatic detection and conversion between data encodings that looks like it might add overhead. Migrating to python-elementtree should either be done for the complete code or not at all. Since libxml2 is the default xml library, we'll stick with it for now. XML processing is one of the biggest CPU consumers right now.
gzip decompression should offer a file-like object with .read(). The current hacks in the class PyGZIP should go away.
bsddb access seems very slow (reading rpmdb)
bzip2 decompression does not have streaming support
At first we want to provide some high-level information about how rpm works without going into too much technical detail. This should help understand how rpm works and which problems it solves (and which it doesn't and why yum is needed). Following that is a section that describes what yum does and which problems it solves (and again, which it can't solve and which we try to solve in our yum derivative). Afterwards we will go more into the rpm binary format and the format of yum repositories as well as several interesting points about the way rpm works and the problems that appear with doing so.
RPM's whole concept is based on so-called dependencies, which define the relations between packages. Those are simply expressed using Requires and Provides. In mathematical terms this can be viewed as a directed graph where the nodes are packages and the edges are requirements, going from packages that require something to packages that provide that requirement. To make this a little more visible, here is a small piece of ASCII art:
A ----> B ----> C
In this example:
A requires B
and
B requires C
So in order to be able to install A we need to install B and C as well, as B needs C.
Requires and provides can be versioned and have additional flags like <, <=, ==, >=, > which need to be handled properly.
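To make the flag handling a bit more concrete, here is a minimal Python sketch of how a versioned requirement could be matched against a provide. The function names and tuple layout are made up for illustration (this is not PyRPM's real API), the provide is assumed to be unversioned or versioned with "=" (real rpm does a full range-overlap check), and vercmp() is only a naive stand-in for the rpmvercmp comparison described below:

    def vercmp(a, b):
        # naive placeholder, NOT the real rpmvercmp algorithm (see below)
        return (a > b) - (a < b)

    def req_matches_prov(req, prov):
        # req is e.g. ("foo", ">=", "1.2"), prov is e.g. ("foo", "1.3")
        (rname, rflag, rver) = req
        (pname, pver) = prov
        if rname != pname:
            return False
        if rflag is None or pver is None:
            return True               # unversioned: matching names suffice
        c = vercmp(pver, rver)
        if c < 0:
            return rflag in ("<", "<=")
        if c > 0:
            return rflag in (">", ">=")
        return rflag in ("<=", "=", "==", ">=")   # versions are equal

    print(req_matches_prov(("foo", ">=", "1.2"), ("foo", "1.3")))   # True
    print(req_matches_prov(("foo", "<", "1.2"), ("foo", "1.3")))    # False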
The rpmvercmp algorithm compares two labels (like the Version or the Release tag) to see which is newer.
In this algorithm, "digits" and "letters" are defined as ASCII digits (0-9) and ASCII letters (a-z and A-Z). Other Unicode digits and letters (like accented Latin letters) are not considered digits or letters. ASCII letters and digits are called "alphanumeric" characters.
Please note that the algorithm's behavior is undefined in some cases, in ways that may make the resulting comparisons stop working sanely (see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178798 for an example where the order of the comparison is more important than the operands). To avoid these cases, make sure that all your labels start and end with alphanumeric characters. So while things like "1.", "+a", or "_" are allowed as labels, the results of such comparisons are undefined. For the exact (non-symmetric) algorithm, see lib/vercmp.c in the RPM source code. The following algorithm is a simplification based on the version available in FC4 and is considered to be stable, as the last time it changed in any way was in January 2003.
Each label is separated into a list of maximal alphabetic or numeric sections, with separators (non-alphanumeric characters) ignored. Any extra non-alphanumeric characters at the end are ignored as well. So, 2.0.1 becomes (2, 0, 1), while 2xFg33.+f.5 becomes (2, xFg, 33, f, 5).
All numbers are converted to their numeric value. So 10 becomes 10, 000230 becomes 230, and 00000 becomes 0.
The elements in the list are compared one by one using the following algorithm. If two elements are decided to be different, the label with the newer element wins as the newer label. If the elements are decided to be equal, the next elements are compared until we either reach different elements or one of the lists runs out. In case one of the lists runs out, the other label wins as the newer label. So, for example, (1, 2) is newer than (1, 1), and (1, 2, 0) is newer than (1, 2).
The algorithm for comparing list elements is as follows:
If one of the elements is a number, while the other is alphabetic, the numeric element is considered newer. So 10 is newer than abc, and 0 is newer than Z.
If both the elements are numbers, the larger number is considered newer. So 5 is newer than 4 and 10 is newer than 2. If the numbers are equal, the elements are decided equal.
If both the elements are alphabetic, they are compared using the Unix strcmp function, with the greater string resulting in a newer element. So b is newer than a, add is newer than ZULU (because lowercase characters win in strcmp comparisons), and aba is newer than ab. If the strings are identical, the elements are decided equal.
Examples
Some random examples, to make sure you understand the rpmvercmp algorithm:
1.0010 is newer than 1.9 because 10 is more than 9.
1.05 is equal to 1.5, because both 05 and 5 are treated as the number 5.
1.0 is newer than 1, because it has one more element in the list, while previous elements are equal.
2.50 is newer than 2.5, because 50 is more than 5.
fc4 is equal to fc.4, because the alphabetic and numeric sections will always get separated into different elements anyway.
FC5 is older than fc4, because it uses uppercase letters.
2a is older than 2.0, because numbers are considered newer than letters.
1.0 is newer than 1.fc4 because numbers are considered newer than letters.
3.0.0_fc is the same as 3.0.0.fc, because the separators themselves are not important.
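To tie the splitting and comparison rules together, here is a small Python sketch of the simplified algorithm described above. It is only meant as an illustration and is not the exact lib/vercmp.c implementation, nor PyRPM's own code:

    import re

    def rpmvercmp(a, b):
        # return 1 if label a is newer, -1 if b is newer, 0 if equal
        def split(label):
            # maximal alphabetic or numeric sections, separators dropped
            return re.findall(r"[0-9]+|[A-Za-z]+", label)
        la, lb = split(a), split(b)
        for x, y in zip(la, lb):
            xdigit, ydigit = x.isdigit(), y.isdigit()
            if xdigit and ydigit:
                x, y = int(x), int(y)        # "000230" -> 230, "05" -> 5
            elif xdigit != ydigit:
                return 1 if xdigit else -1   # numbers beat letters
            if x != y:                       # numeric or strcmp-style compare
                return 1 if x > y else -1
        if len(la) == len(lb):               # completely equal
            return 0
        return 1 if len(la) > len(lb) else -1   # leftover elements win

    print(rpmvercmp("1.0010", "1.9"))    # 1: 10 is more than 9
    print(rpmvercmp("1.05", "1.5"))      # 0: both are the number 5
    print(rpmvercmp("2a", "2.0"))        # -1: numbers are newer than letters
    print(rpmvercmp("fc4", "fc.4"))      # 0: separators are not important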
Sometimes it is necessary to replace an installed package with a new package that provides the same functionality. By obsoleting the installed package in the new package, the installed package will be removed during an update and replaced with the new package. Obsoletes can be versioned and have flags just like requires and provides.
An example was when ucd-snmp got replaced by net-snmp. The new net-snmp package contained an "Obsoletes: ucd-snmp" and, as it provided the functionality of the old ucd-snmp, a "Provides: ucd-snmp".
In rpm you can specify that your package conflicts with something another package provides. This can be used so that e.g. packages that are known not to work with one another refuse to be installed if the other is already there. Conflicts are tested against the provides, just like requires, but instead of fulfilling a requirement they produce an error if there is a match. Conflicts can be versioned and have flags just like requires and provides.
If you look at a set of rpm packages, then a fileconflict is any duplicate filename where the two files differ in their filemode (file permissions as well as file type like directory, regular file, etc), their filemd5sum (which is only set for regular files), their fileusername or their filegroupname. To support multilib/compat-arch installations, fileconflicts where one file is an ELF32 and the other an ELF64 binary are also ignored.
Since fileconflicts do occur pretty frequently, the default setting is to ignore them. yum within Fedora Core has fileconflict detection enabled.
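As a rough illustration, a check for a fileconflict between two packages that both contain the same filename might look like the following sketch. The dictionary keys are made up for this example; PyRPM's real file data structures differ:

    def is_fileconflict(f1, f2):
        # f1 and f2 describe the same filename coming from two packages
        # multilib exception: ELF32 vs. ELF64 never counts as a conflict
        if (f1["elfclass"], f2["elfclass"]) in ((32, 64), (64, 32)):
            return False
        # differing mode, md5sum (only set for regular files), owner or
        # group makes it a fileconflict
        for key in ("filemode", "filemd5sum", "fileusername", "filegroupname"):
            if f1[key] != f2[key]:
                return True
        return False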
Dependency checking is fairly simple: For a given set of packages take all requirements and check for every requirement if any package provides this requirement, possibly restricted by its version and flags.
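A minimal sketch of such a check, reusing the hypothetical package layout and the req_matches_prov() matcher from the earlier sketch (file requirements are included because, as mentioned above, all files in rpm packages can be used for dependencies):

    def check_deps(packages):
        # return a list of (package, requirement) pairs that are unresolved
        unresolved = []
        for pkg in packages:
            for req in pkg.requires:
                ok = any(req_matches_prov(req, prov)
                         for p in packages for prov in p.provides)
                if not ok and req[0].startswith("/"):
                    # file requirements like "/usr/bin/foo" are satisfied
                    # by any package containing that file
                    ok = any(req[0] in p.filenames for p in packages)
                if not ok:
                    unresolved.append((pkg, req))
        return unresolved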
Given a set of packages, to determine the order in which the packages need to be installed (or removed) a dependency graph is constructed. All packages with no requirements can be installed first. Afterwards all requires of the remaining packages that were provided by the installed packages are removed. This normally results in a new set of packages without any requirements. This process is then repeated until either no more packages need to be installed or we have requirement loops between packages. These loops then need to be broken up by intelligently selecting one dependency and removing it until we have packages without requirements again. If there are several packages in one round without requirements, their order is independent and they can be installed in any order.
A special case for the loop breaking are PreReqs. Those requirements are special in the sense that they should never be broken up if possible, as they specify that a certain package absolutely has to be installed before the other one can be safely and correctly installed. This happens often if a package needs some specific binaries in one of its pre or post scripts.
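The ordering loop described in the previous two paragraphs could be sketched roughly like this (hypothetical data structures; the real loop breaking in PyRPM is considerably smarter, especially about not breaking PreReqs):

    def order_packages(packages, requires_of):
        # requires_of maps each package to the set of packages it still
        # requires within this transaction
        order = []
        remaining = set(packages)
        while remaining:
            # packages with no open requirements can go next, in any
            # order relative to each other
            ready = [p for p in remaining if not requires_of[p]]
            if not ready:
                # requirement loop: drop one dependency edge so that
                # progress can be made again (prefer non-PreReqs here)
                victim = min(remaining, key=lambda p: len(requires_of[p]))
                requires_of[victim].pop()
                continue
            for p in ready:
                order.append(p)
                remaining.discard(p)
            for p in remaining:
                # requirements provided by the just-handled packages
                # are now fulfilled
                requires_of[p] -= set(ready)
        return order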
Now that we know how dependency checks and ordering work, the next step is dependency resolving. Here we need to leave the space of a single set of rpms and include some kind of repository from which the depresolver can pull packages. A typical scenario is an installed system with a set of packages B, a set of packages to be installed or updated I, and a set of packages in one or more repositories called R.
The depresolver now takes all requirements from the packages in I and checks which are still not resolved by B. For those unfulfilled requirements it then looks at the packages in R, tries to find packages that fulfill those requirements and adds the best matching packages to I. This process is repeated until all requirements are fulfilled, or aborted if they can't be resolved using R.
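A condensed sketch of that resolver loop, again reusing the hypothetical package objects and req_matches_prov() from the earlier sketches; B, I and R are sets of package objects, and the real depresolver is of course more careful about picking the best matching package:

    def provided_by(req, pkgs):
        return any(req_matches_prov(req, prov)
                   for p in pkgs for prov in p.provides)

    def resolve(B, I, R):
        # B: installed, I: to be installed/updated, R: repository packages
        while True:
            missing = [req for p in I for req in p.requires
                       if not provided_by(req, B | I)]
            if not missing:
                return I                     # all requirements fulfilled
            progress = False
            for req in missing:
                candidates = [p for p in R if provided_by(req, [p])]
                if candidates:
                    I.add(candidates[0])     # real code: best arch/version
                    progress = True
            if not progress:
                raise ValueError("cannot resolve %s using R" % (missing[0],))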
An important fact is that dependencies are "arch-less", meaning that on multilib systems a requirement can lead to 64bit packages being pulled in by 32bit packages and vice versa. For libraries this can and usually will be prevented automatically, as 64bit libraries on 64bit systems have a (64bit) marker at the end of their provides, so they specifically and correctly will be required by binaries linked against them.
More problematic on the other hand are development packages, where a package foo-devel typically only has a "Requires: foo" as a requirement, which would pull in a 64bit foo package for a 32bit foo-devel package, which is most of the time not what you want.
Multilib systems (mixed 32 and 64bit systems like AMD64 or S390x) have some special rules that apply to them, as on those architectures both 32bit and 64bit packages can be installed at the same time. Each file in an rpm has a color. The color is 0 if the file is arch independent (text files, etc), 1 if it is a 32bit binary and 2 if it is a 64bit binary. If the color is 0, normal file conflict handling is done. If the colors are larger than 0 and equal, again, normal file conflict handling is done. If both are larger than 0 and differ, then the "higher" color wins, meaning the 64bit binary; no file conflict handling is done in that case.
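Expressed as code, the color rule amounts to something like this small sketch (colors as described above: 0 for arch independent, 1 for ELF32, 2 for ELF64; the function name and return values are made up for illustration):

    def handle_colored_files(color1, color2):
        # return "conflict-check" when normal fileconflict handling applies,
        # otherwise report which of the two files silently wins
        if color1 == 0 or color2 == 0 or color1 == color2:
            return "conflict-check"
        # both colored and different: the "higher" color, i.e. the 64bit
        # file, wins and no fileconflict is generated
        if color1 > color2:
            return "file1 wins"
        return "file2 wins"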
There are 2 basic forms of updates: specified or complete updates. The former is easily done by looking for packages that match the ones given on the command line. Those can either be names (with version, release and arch information, of course), regular expressions or binary rpm packages. For the latter a special pre-update pass has to be done in order to correctly obsolete old packages with newer ones (like the changes from xorg-x11 to modular-x from FC3 to FC4). Other than that it is identical to the specified case, where the names to be updated are simply the names of all packages installed on the system.
For a single arch system it is fairly easy to find the best update package: We just look for the package with the "highest" arch (except if exactarch is specified in which case the arch of the update package needs to exactly match the one of the installed package) and then the highest version.
For multilib systems things get a lot trickier. If the given update package is either marked as "install only" (like kernel packages usually are) or if no package with that name has been installed yet, we use the simple algorithm as before, selecting the highest arch/version match for the update. If exactarch has been set then we just need to go through all installed archs of each name and find the highest version for each arch.
If exactarch isn't set we now need to determine if more than one major arch for a name is installed. Major arch in this case means either 32bit or 64bit, e.g. i386, i486, i586, i686, athlon for 32bit and x86_64 for 64bit. If there is only one, we can and do allow even major arch changes (from 32bit to 64bit and vice versa). If packages for more than one major arch are already installed, we need to find updates for each of the installed major archs and do the final selection as in the previous cases.
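The selection logic of the last few paragraphs could be sketched like this; packages are hypothetical (name, arch, version) tuples, newest() is only a placeholder (the real code compares epoch/version/release with rpmvercmp and also prefers the highest arch), and "install only" packages are left out:

    MAJOR_ARCH = {"i386": "32bit", "i486": "32bit", "i586": "32bit",
                  "i686": "32bit", "athlon": "32bit", "x86_64": "64bit"}

    def newest(candidates):
        # placeholder: plain string comparison instead of rpmvercmp
        return max(candidates, key=lambda p: p[2])

    def select_updates(name, installed, available, exactarch=False):
        inst = [p for p in installed if p[0] == name]
        avail = [p for p in available if p[0] == name]
        if not inst:
            return [newest(avail)]           # nothing installed: simple case
        if exactarch:
            # one update per installed arch, arch has to match exactly
            return [newest([a for a in avail if a[1] == arch])
                    for arch in set(p[1] for p in inst)]
        majors = set(MAJOR_ARCH[p[1]] for p in inst)
        if len(majors) == 1:
            return [newest(avail)]           # major arch changes are allowed
        # both 32bit and 64bit installed: find an update per major arch
        return [newest([a for a in avail if MAJOR_ARCH[a[1]] == m])
                for m in majors]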
RPM provides the ability to have packages execute a set of scripts before and after installation and uninstallation. This is often needed to either prepare the system for the actual installation, do some post processing after a package is installed, do some pre processing before a package is uninstalled or clean up after a package has been uninstalled.
Although the basic idea is fairly simple, it still can cause some confusion when scripts are executed, for example during updates of packages. This is the order in which an update of a package is done in relation to the scripts:
Run %pre of new package
Install new files
Run %post of new package
Run %preun of old package
Delete any old files not overwritten by newer ones
Run %postun of old package
So during an update the new package gets installed first with all its scripts and then the old package with all its scripts gets uninstalled.
Sometimes it is necessary for packages to run scripts when other packages are installed or uninstalled. This can be done using trigger scripts. Triggers are therefore very similar to requirements and are written the same way in the spec files. The order of how and when trigger scripts are called looks like this:
Run %pre for new version of package being installed
Install new files
Run %post for new version of package being installed
Run %triggerin from other packages set off by new install
Run %triggerin of new package
Run %triggerun of old package
Run %triggerun from other packages set off by old uninstall
Run %preun for old version of package being removed
Delete any old files not overwritten by newer ones
Run %postun for old version of package being removed
Run %triggerpostun of old package
Run %triggerpostun from other packages set off by old uninstall
This looks a little more complicated, but allows other packages to perform postprocessing after some package has been installed or uninstalled. Take notice that there isn't a %triggerpostin and that %triggerin basically behaves like a %triggerpostin.
Last updated 2008-02-18 08:02:43 CEST