Man Pages

tc-hfsc(7) - phpMan tc-hfsc(7) - phpMan

Command: man perldoc info search(apropos)  

TC-HFSC(7)                           Linux                          TC-HFSC(7)

       tc-hfcs - Hierarchical Fair Service Curve

       HFSC - Hierarchical Fair Service Curve was first presented at SIGCOMM'97. Developed as a part of ALTQ (ALTerna-
       tive Queuing) on NetBSD, found its way quickly to other BSD systems, and then a few years ago  became  part  of
       the  linux  kernel. Still, it's not the most popular scheduling algorithm - especially if compared to HTB - and
       it's not well documented from enduser's perspective. This introduction aims to explain how HFSC  works  without
       going to deep into math side of things (although some if it will be inevitable).

       In short HFSC aims to:

           1)  guarantee precise bandwidth and delay allocation for all leaf classes (realtime criterion)

           2)  allocate excess bandwidth fairly as specified by class hierarchy (linkshare & upperlimit criterion)

           3)  minimize  any  discrepancy  between  the service curve and the actual amount of service provided during

       The main "selling" point of HFSC is feature (1), which is achieved by  using  nonlinear  service  curves  (more
       about  what  it  actually  is later). This is particularly useful in VoIP or games, where not only guarantee of
       consistent bandwidth is important, but initial delay of a data stream as well. Note that it  matters  only  for
       leaf classes (where the actual queues are) - thus class hierarchy is ignored in realtime case.

       Feature  (2) is well, obvious - any algorithm featuring class hierarchy (such as HTB or CBQ) strives to achieve
       that. HFSC does that well, although you might end with unusual situations, if you define service  curves  care-
       lessly - see section CORNER CASES for examples.

       Feature (3) is mentioned due to the nature of the problem. There may be situations where it's either not possi-
       ble to guarantee service of all curves at the same time, and/or it's impossible to do so fairly. Both  will  be
       explained later. Note that this is mainly related to interior (aka aggregate) classes, as the leafs are already
       handled by (1). Still - it's perfectly possible to create a leaf class w/o realtime service, and in such case -
       the caveats will naturally extend to leaf classes as well.

       For the remaining part of the document, we'll use following shortcuts:

           RT - realtime
           LS - linkshare
           UL - upperlimit
           SC - service curve

       To  understand how HFSC works, we must first introduce a service curve.  Overall, it's a nondecreasing function
       of some time unit, returning amount of service (allowed or allocated amount  of  bandwidth)  by  some  specific
       point in time. The purpose of it should be subconsciously obvious - if a class was allowed to transfer not less
       than the amount specified by its service curve - then service curve is not violated.

       Still - we need more elaborate criterion than just the above (although in most generic case it can  be  reduced
       to it). The criterion has to take two things into account:

           ?   idling periods

           ?   ability to "look back", so if during current active period service curve is violated, maybe it isn't if
               we count excess bandwidth received during earlier active period(s)

       Let's define the criterion as follows:

           (1) For each t1, there must exist t0 in set B, so S(t1-t0) <= w(t0,t1)

       Here 'w' denotes the amount of service received during some time period between t0 and t1. B is a  set  of  all
       times,  where  a  session  becomes active after idling period (further denoted as 'becoming backlogged'). For a
       clearer picture, imagine two situations:

           a)  our session was active during two periods, with a small time gap between them

           b)  as in (a), but with a larger gap

       Consider (a) - if the service received during both periods meets (1), then all is good. But what if it  doesn't
       do  so  during the 2nd period ? If the amount of service received during the 1st period is bigger than the ser-
       vice curve, then it might compensate for smaller service during the 2nd period and the gap  -  if  the  gap  is
       small enough.

       If  the  gap  is larger (b) - then it's less likely to happen (unless the excess bandwidth allocated during the
       1st part was really large). Still, the larger the gap - the less interesting is what happened in the past (e.g.
       10 minutes ago) - what matters is the current traffic that just started.

       From  HFSC's  perspective, more interesting is answering the following question: when should we start transfer-
       ring packets, so a service curve of a class is not violated. Or rephrasing it: How much X() amount  of  service
       should  a session receive by time t, so the service curve is not violated. Function X() defined as below is the
       basic building block of HFSC, used in: eligible, deadline, virtual-time and fit-time curves. Of course, X()  is
       based on equation (1) and is defined recursively:

           ?   At the 1st backlogged period beginning function X is initialized to generic service curve assigned to a

           ?   At any subsequent backlogged period, X() is:
               min(X() from previous period ; w(t0)+S(t-t0) for t>=t0),
               ... where t0 denotes the beginning of the current backlogged period.

       HFSC uses either linear, or two-piece linear service curves. In case of linear or two-piece linear convex func-
       tions  (first  slope  <  second  slope),  min()  in  X's definition reduces to the 2nd argument. But in case of
       two-piece concave functions, the 1st argument might quickly become lesser for some t>=t0. Note, that  for  some
       backlogged  period,  X()  is  defined  only  from that period's beginning. We also define X^(-1)(w) as smallest
       t>=t0, for which X(t) = w. We have to define it this way, as X() is usually not an injection.

       The above generic X() can be one of the following:

           E() In realtime criterion, selects packets eligible for sending.  If  none  are  eligible,  HFSC  will  use
               linkshare criterion. Eligible time 'et' is calculated with reference to packets' heads ( et = E^(-1)(w)
               ). It's based on RT service curve, but in case of a convex curve, uses its 2nd slope only.

           D() In realtime criterion, selects the most suitable packet from the ones chosen by E(). Deadline time 'dt'
               corresponds  to  packets'  tails  (dt = D^(-1)(w+l), where 'l' is packet's length). Based on RT service

           V() In linkshare criterion, arbitrates which packet to send next. Note that V() is function  of  a  virtual
               time  -  see  LINKSHARE  CRITERION section for details. Virtual time 'vt' corresponds to packets' heads
               (vt = V^(-1)(w)). Based on LS service curve.

           F() An extension to linkshare criterion, used to limit at which speed linkshare  criterion  is  allowed  to
               dequeue.  Fit-time  'ft'  corresponds  to  packets' heads as well (ft = F^(-1)(w)). Based on UL service

       Be sure to make clean distinction between session's RT, LS and UL service curves and the above "utility"  func-

       RT  criterion ignores class hierarchy and guarantees precise bandwidth and delay allocation. We say that packet
       is eligible for sending, when current real time is bigger than eligible time. From all  packets  eligible,  the
       one  most suited for sending, is the one with the smallest deadline time. Sounds simply, but consider following

       Interface 10mbit, two classes, both with two-piece linear service curves:

           ?   1st class - 2mbit for 100ms, then 7mbit (convex - 1st slope < 2nd slope)

           ?   2nd class - 7mbit for 100ms, then 2mbit (concave - 1st slope > 2nd slope)

       Assume for a moment, that we only use D() for both finding eligible packets, and choosing the most fitting one,
       thus  eligible  time  would be computed as D^(-1)(w) and deadline time would be computed as D^(-1)(w+l). If the
       2nd class starts sending packets 1 second after the 1st class, it's of course impossible to  guarantee  14mbit,
       as  the interface capability is only 10mbit.  The only workaround in this scenario is to allow the 1st class to
       send the packets earlier that would normally be allowed. That's where separate E() comes to help.  Putting  all
       the  math aside (see HFSC paper for details), E() for RT concave service curve is just like D(), but for the RT
       convex service curve - it's constructed using only RT service curve's 2nd slope (in our example - 7mbit).

       The effect of such E() - packets will be sent earlier, and at the same time D() will be updated  -  so  current
       deadline  time  calculated  from it will be bigger. Thus, when the 2nd class starts sending packets later, both
       the 1st and the 2nd class will be eligible, but the 2nd session's deadline time will be smaller and its packets
       will be sent first. When the 1st class becomes idle at some later point, the 2nd class will be able to "buffer"
       up again for later active period of the 1st class.

       A short remark - in a situation, where the total amount of bandwidth available on the interface is bigger  than
       the  allocated  total  realtime parts (imagine interface 10 mbit, but 1mbit/2mbit and 2mbit/1mbit classes), the
       sole speed of the interface could suffice to guarantee the times.

       Important part of RT criterion is that apart from updating its D() and E(), also V() used by  LS  criterion  is
       updated.  Generally  the RT criterion is secondary to LS one, and used only if there's a risk of violating pre-
       cise realtime requirements. Still, the "participation" in bandwidth distributed by LS criterion  is  there,  so
       V() has to be updated along the way. LS criterion can than properly compensate for non-ideal fair sharing situ-
       ation, caused by RT scheduling. If you use UL service curve its F() will be updated as well (UL  service  curve
       is an extension to LS one - see UPPERLIMIT CRITERION section).

       Anyway  -  careless specification of LS and RT service curves can lead to potentially undesired situations (see
       CORNER CASES for examples). This wasn't the case in HFSC paper where LS and RT service curves couldn't be spec-
       ified separately.

       LS  criterion's  task  is to distribute bandwidth according to specified class hierarchy. Contrary to RT crite-
       rion, there're no comparisons between current real time and virtual time - the  decision  is  based  solely  on
       direct comparison of virtual times of all active subclasses - the one with the smallest vt wins and gets sched-
       uled. One immediate conclusion from this fact is that absolute values don't matter - only ratios  between  them
       (so  for example, two children classes with simple linear 1mbit service curves will get the same treatment from
       LS criterion's perspective, as if they were 5mbit). The other conclusion is, that  in  perfectly  fluid  system
       with linear curves, all virtual times across whole class hierarchy would be equal.

       Why is VC defined in term of virtual time (and what is it) ?

       Imagine  an  example:  class A with two children - A1 and A2, both with let's say 10mbit SCs. If A2 is idle, A1
       receives all the bandwidth of A (and update its V() in the process). When A2 becomes active, A1's virtual  time
       is  already  far  bigger  than A2's one. Considering the type of decision made by LS criterion, A1 would become
       idle for a lot of time. We can workaround this situation by adjusting virtual time of the class becoming active
       -  we  do that by getting such time "up to date". HFSC uses a mean of the smallest and the biggest virtual time
       of currently active children fit for sending. As it's not real time anymore (excluding trivial case  of  situa-
       tion where all classes become active at the same time, and never become idle), it's called virtual time.

       Such  approach  has its price though. The problem is analogous to what was presented in previous section and is
       caused by non-linearity of service curves:

       1)  either it's impossible to guarantee service curves and satisfy fairness during certain time periods:

           Recall the example from RT section, slightly modified (with 3mbit slopes instead of 2mbit ones):

           ?   1st class - 3mbit for 100ms, then 7mbit (convex - 1st slope < 2nd slope)

           ?   2nd class - 7mbit for 100ms, then 3mbit (concave - 1st slope > 2nd slope)

           They sum up nicely to 10mbit - interface's capacity. But if we wanted to only use  LS  for  guarantees  and
           fairness  - it simply won't work. In LS context, only V() is used for making decision which class to sched-
           ule. If the 2nd class becomes active when the 1st one  is  in  its  second  slope,  the  fairness  will  be
           preserved  -  ratio  will be 1:1 (7mbit:7mbit), but LS itself is of course unable to guarantee the absolute
           values themselves - as it would have to go beyond of what the interface is capable of.

       2)  and/or it's impossible to guarantee service curves of all classes at the same time [fairly or not]:

           This is similar to the above case, but a bit more subtle. We will  consider  two  subtrees,  arbitrated  by
           their common (root here) parent:

           R (root) - 10mbit

           A  - 7mbit, then 3mbit
           A1 - 5mbit, then 2mbit
           A2 - 2mbit, then 1mbit

           B  - 3mbit, then 7mbit

           R arbitrates between left subtree (A) and right (B). Assume that A2 and B are constantly backlogged, and at
           some later point A1 becomes backlogged (when all other classes are in their 2nd linear part).

           What happens now ? B (choice made by R) will always get 7 mbit as R is only (obviously) concerned with  the
           ratio  between  its  direct  children. Thus A subtree gets 3mbit, but its children would want (at the point
           when A1 became backlogged) 5mbit + 1mbit. That's of course impossible, as they can only get  3mbit  due  to
           interface limitation.

           In  the left subtree - we have the same situation as previously (fair split between A1 and A2, but violated
           guarantees), but in the whole tree - there's no fairness (B got 7mbit, but A1 and A2 have to  fit  together
           in  3mbit) and there's no guarantees for all classes (only B got what it wanted). Even if we violated fair-
           ness in the A subtree and set A2's service curve to 0, A1 would still not get the required bandwidth.

       UL criterion is an extensions to LS one, that permits sending packets only if current real time is bigger  than
       fit-time  ('ft').  So the modified LS criterion becomes: choose the smallest virtual time from all active chil-
       dren, such that fit-time < current real time also holds. Fit-time is calculated from F(), which is based on  UL
       service  curve.  As you can see, it's role is kinda similar to E() used in RT criterion. Also, for obvious rea-
       sons - you can't specify UL service curve without LS one.

       Main purpose of UL service curve is to limit HFSC to bandwidth available on the  upstream  router  (think  adsl
       home  modem/router,  and linux server as nat/firewall/etc. with 100mbit+ connection to mentioned modem/router).
       Typically, it's used to create a single class directly under root, setting linear UL service curve to available
       bandwidth  - and then creating your class structure from that class downwards. Of course, you're free to add UL
       service (linear or not) curve to any class with LS criterion.

       Important part about UL service curve is, that whenever at some point in  time  a  class  doesn't  qualify  for
       linksharing due to its fit-time, the next time it does qualify, it will update its virtual time to the smallest
       virtual time of all active children fit for linksharing. This way, one of the main things LS criterion tries to
       achieve  -  equality of all virtual times across whole hierarchy - is preserved (in perfectly fluid system with
       only linear curves, all virtual times would be equal).

       Without that, 'vt' would lag behind other virtual times, and could  cause  problems.  Consider  interface  with
       capacity  10mbit,  and  following  leaf  classes (just in case you're skipping this text quickly - this example
       shows behavior that doesn't happen):

       A - ls 5.0mbit
       B - ls 2.5mbit
       C - ls 2.5mbit, ul 2.5mbit

       If B was idle, while A and C were constantly backlogged, they would normally (as far as LS  criterion  is  con-
       cerned) divide bandwidth in 2:1 ratio. But due to UL service curve in place, C would get at most 2.5mbit, and A
       would get the remaining 7.5mbit. The longer the backlogged period, the more virtual times  of  A  and  C  would
       drift  apart.  If  B  became  backlogged  at  some  later  point  in  time,  its  virtual  time would be set to
       (A's vt + C's vt)/2, thus blocking A from sending any traffic, until B's virtual time catches up with A.

       Another difference from original HFSC paper, is that RT and LS SCs can be specified separately. Moreover - leaf
       classes  are  allowed to have only either RT SC or LS SC. For interior classes, only LS SCs make sense - Any RT
       SC will be ignored.

       Separate service curves for LS and RT criteria can lead to certain traps, that  come  from  "fighting"  between
       ideal linksharing and enforced realtime guarantees. Those situations didn't exist in original HFSC paper, where
       specifying separate LS / RT service curves was not discussed.

       Consider interface with capacity 10mbit, with following leaf classes:

       A - ls 5.0mbit, rt 8mbit
       B - ls 2.5mbit
       C - ls 2.5mbit

       Imagine A and C are constantly backlogged. As B is idle, A and C would divide bandwidth in 2:1 ratio, consider-
       ing  LS service curve (so in theory - 6.66 and 3.33). Alas RT criterion takes priority, so A will get 8mbit and
       LS will be able to compensate class C for only 2 mbit - this will cause discrepancy between virtual times of  A
       and C.

       Assume  this situation lasts for a lot of time with no idle periods, and suddenly B becomes active. B's virtual
       time will be updated to (A's vt + C's vt)/2, effectively landing in the middle  between  A's  and  C's  virtual
       time.  The  effect - B, having no RT guarantees, will be punished and will not be allowed to transfer until C's
       virtual time catches up.

       If the interface had higher capacity - for example 100mbit, this example would behave perfectly fine though.

       Let's look a bit closer at the above example - it "cleverly" invalidates one of the basic things  LS  criterion
       tries to achieve - equality of all virtual times across class hierarchy. Leaf classes without RT service curves
       are literally left to their own fate (governed by messed up virtual times).

       Also - it doesn't make much sense. Class A will always be guaranteed up to 8mbit, and this  is  more  than  any
       absolute  bandwidth that could happen from its LS criterion (excluding trivial case of only A being active). If
       the bandwidth taken by A is smaller than absolute value from LS criterion, the unused part  will  be  automati-
       cally  assigned  to  other active classes (as A has idling periods in such case). The only "advantage" is, that
       even in case of low bandwidth on average, bursts would be handled at the speed defined by RT criterion.  Still,
       if extra speed is needed (e.g. due to latency), non linear service curves should be used in such case.

       In the other words - LS criterion is meaningless in the above example.

       You can quickly "workaround" it by making sure each leaf class has RT service curve assigned (thus guaranteeing
       all of them will get some bandwidth), but it doesn't make it any more valid.

       Keep in mind - if you use nonlinear curves and irregularities explained above happen only in the first segment,
       then there's little wrong with "overusing" RT curve a bit:

       A - ls 5.0mbit, rt 9mbit/30ms, then 1mbit
       B - ls 2.5mbit
       C - ls 2.5mbit

       Here,  the  vt  of A will "spike" in the initial period, but then A will never get more than 1mbit, until B & C
       catch up. Then everything will be back to normal.

       In certain situations, the scheduler can throttle itself and setup so called watchdog to wakeup  dequeue  func-
       tion  at some time later. In case of HFSC it happens when for example no packet is eligible for scheduling, and
       UL service curve is used to limit the speed at which LS criterion is allowed to dequeue  packets.  It's  called
       throttling, and accuracy of it is dependent on how the kernel is compiled.

       There're  3  important  options  in modern kernels, as far as timers' resolution goes: 'tickless system', 'high
       resolution timer support' and 'timer frequency'.

       If you have 'tickless system' enabled, then the timer interrupt will trigger as slowly as  possible,  but  each
       time  a  scheduler  throttles  itself (or any other part of the kernel needs better accuracy), the rate will be
       increased as needed / possible. The ceiling is either 'timer frequency' if 'high resolution timer  support'  is
       not  available  or  not  compiled  in, or it's hardware dependent and can go far beyond the highest 'timer fre-
       quency' setting available.

       If 'tickless system' is not enabled, the timer will trigger at a fixed rate specified by  'timer  frequency'  -
       regardless if high resolution timers are or aren't available.

       This is important to keep those settings in mind, as in scenario like: no tickless, no HR timers, frequency set
       to 100hz - throttling accuracy would be at 10ms.  It  doesn't  automatically  mean  you  would  be  limited  to
       ~0.8mbit/s  (assuming  packets at ~1KB) - as long as your queues are prepared to cover for timer inaccuracy. Of
       course, in case of e.g. locally generated udp traffic - appropriate socket size is needed as well. Short  exam-
       ple  to  make  it  more understandable (assume hardcore anti-schedule settings - HZ=100, no HR timers, no tick-

       tc qdisc add dev eth0 root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10mbit

       Assuming packet of ~1KB size and HZ=100, that averages to ~0.8mbit - anything beyond it (e.g. the above example
       with specified rate over 10x bigger) will require appropriate queuing and cause bursts every ~10 ms. As you can
       imagine, any HFSC's RT guarantees will be seriously invalidated by  that.   Aforementioned  example  is  mainly
       important  if  you deal with old hardware - as it's particularly popular for home server chores. Even then, you
       can easily set HZ=1000 and have very accurate scheduling for typical adsl speeds.

       Anything modern (apic or even hpet msi based timers + 'tickless  system')  will  provide  enough  accuracy  for
       superb 1gbit scheduling. For example, on one of basically cheap dual core AMD boards I have with following set-

       tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1
       tc class add dev eth0 paretn 1:0 classid 1:1 hfsc rt m2 300mbit

       And simple:

       nc -u 54321 </dev/zero
       nc -l -p 54321 >/dev/null

       ...will yield following effects over period of ~10 seconds (taken from /proc/interrupts):

       319: 42124229   0  HPET_MSI-edge  hpet2 (before)
       319: 42436214   0  HPET_MSI-edge  hpet2 (after 10s.)

       That's roughly 31000/s. Now compare it with HZ=1000 setting. The obvious drawback of it is that cpu load can be
       rather extensive with servicing that many timer interrupts. Example with 300mbit RT service curve on 1gbit link
       is particularly ugly, as it requires a lot of throttling with minuscule delays.

       Also note that it's just an example showing capability of current hardware.   The  above  example  (essentially
       300mbit TBF emulator) is pointless on internal interface to begin with - you will pretty much always want regu-
       lar LS service curve there, and in such scenario HFSC simply doesn't throttle at all.

       300mbit RT service curve (selected columns from mpstat -P ALL 1):

       10:56:43 PM  CPU  %sys     %irq   %soft   %idle
       10:56:44 PM  all  20.10    6.53   34.67   37.19
       10:56:44 PM    0  35.00    0.00   63.00    0.00
       10:56:44 PM    1   4.95   12.87    6.93   73.27

       So, in rare case you need those speeds with only RT service curve, or with UL service curve  -  remember  about

       For  reasons unknown (though well guessed), many examples you can google love to overuse UL criterion and stuff
       it in every node possible. This makes no sense and works against what HFSC tries to do (and  does  pretty  damn
       well).  Use UL where it makes sense - on the uppermost node to match upstream router's uplink capacity. Or - in
       special cases, such as testing (limit certain subtree to some speed) or customers that must never get more than
       certain  speed.  In  the last case you can usually achieve the same by just using RT criterion without LS+UL on
       leaf nodes.

       As for router case - remember it's good to differentiate between "traffic to router" (remote console, web  con-
       fig, etc.) and "outgoing traffic", so for example:

       tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
       tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50mbit
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2mbit ul m2 2mbit

       ... so "internet" tree under 1:1 and "router itself" as 1:999

       Please refer to tc-stab(8)

       tc(8), tc-hfsc(8), tc-stab(8)

       Please direct bugreports and patches to: <>

       Manpage created by Michal Soltys (

iproute2                        31 October 2011                     TC-HFSC(7)