The basic definition of network is a system of links (or channels) that interconnect nodes (or peer entities) in order to move information between them.
Obviously there are lots of different networks, but we will largely speak about the internet.
- Ties together 20,000+ ISP networks
- Uses the IP protocol to tie the different networks together.
- It is bloody huge, with a huge range of technologies involved
- We know all of the other problems (failures, trust etc)
Bandwidth/Capacity - is the 'width' of the link, the number of bits that can be sent/received per unit time (bits per second)
Latency/Delay - is the 'length' of the link, the propagation time to travel along the link (seconds)
Bandwidth-Delay Product - is the 'volume' of the link, the amount of data that can be in flight at a time ($\text{bandwidth} \times \text{delay}$, bits)
It gives the maximum amount of data that can be transmitted by the sender at a given time before waiting for acknowledgment: it is the maximum amount of unacknowledged data.
Packet delay - time it takes for an entire packet to travel from A to B.
Packet Delay = Transmission Delay + Propagation Delay = (Packet Size / Bandwidth) [how long it takes us to put the entire packet onto the link] + Link Latency [how long it takes 1 bit to travel the link]
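Plugging in some illustrative numbers (mine, not the notes'), a quick sanity check in Python:

```python
# A 1500-byte packet over a 100 Mb/s link with 10 ms one-way latency
# (all numbers illustrative).
L = 1500 * 8          # packet size in bits
R = 100e6             # bandwidth, bits per second
prop = 10e-3          # propagation delay, seconds

transmission = L / R                 # 0.12 ms to put the packet on the link
packet_delay = transmission + prop   # ~10.12 ms in total
bdp = R * prop                       # 1,000,000 bits can be in flight at once
print(packet_delay, bdp)
```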
Jitter: is the variation in the time between data packets arriving, caused by network congestion, or route changes
Pipe View: Diagramatic view of packets in a link
- y: Bandwidth, x: Time
- <img src="/notes/typora-user-images/Screenshot 2021-01-22 at 17.51.53.webp" alt="Screenshot 2021-01-22 at 17.51.53" style="zoom:50%;" />
- Top Bracket: Packet transmission time
- Arrow: A packet, a unit of data to make transmission 'easier'
- The length of the pipe indicates its latency.
But it isn't that easy, we can't have EVERY node having a direct link to EVERY other node, so we need some smarts - switched networks. We have a box (a switch) sitting in our graph moving the data around.
Circuit switching - a source node reserves capacity along a path to the destination node; it then has sole use of the link, so it sends data when it wants and tears down the link at the end. This can be improved with:
Frequency Division Multiplexing (multiplexing = sharing a physical resource) - users get less bandwidth because we split the channel into multiple frequency bands (like radio stations)
Time Division Multiplexing - users have full bandwidth, but can only transmit at certain points (their slot). Time is divided into frames (a full sequence of senders) which go into slots (each sender has one slot):

x: Slots
Top arrow: empty slots are wasted capacity
Bottom arrow: slots are at recurring time.
😊 guaranteed performance + fast transfers over a dedicated link
😔 wastes bandwidth if traffic is bursty (not a continuous stream, send-wait-send...)
😔 connection setup (reserve and teardown) is overhead for small exchanges
😔 recovery from failure is slow as it needs to RESERVE a new link if the current one dies
eg POTS: Plain Old Telephone System
Packet switching - data is sent as chunks (packets) with a header (telling the network what to do with the packet) and a payload (data). We then get switches to forward packets to the 'right' place to go based on their forwarding table.
Connectionless assumptions of a packet network (a packet goes out automatically without determining whether the receiver is ready, or even whether a receiver exists)
We almost always assume switches are store and forward so they receive a packet in full before sending it on (this means we can use checksums etc).
The other option would be cut through switch which sends as data comes in (as the header comes first and tells us where next) BUT this is rarely used.
We don't reserve link capacity in advance, so we leverage statistical multiplexing - multiple senders can share the link and we can handle it. Because:
we share our link OVER TIME, but data is transmitted based on demand rather than in a predetermined time slot.
When data is there to be sent we send it. This works pretty well when traffic is bursty (connections sending at different times).
But at some point we will get data in from two places at once, at which point we get transient overload and we use a queue/buffer to store packets while we wait to be able to send them - the buffer absorbs the changing rate of transmission.
However with persistent overload we will fill the buffer and end up dropping packets.
This queuing means we have queuing delay caused by packet interference (when the network is busy our packet goes slower)
- Our key number is Traffic Intensity $= La/R$, where L: packet length, a: average packet arrival rate, R: bandwidth
- Traffic intensity $\approx 0$ ⇒ small queuing delay
- Traffic intensity close to 1 ⇒ large queuing delay
- Traffic intensity > 1 ⇒ infinite queuing delay (packets dropped) as more data arrives than can be handled
😊 Fixes pretty much all the cons of circuit switching: no wasted bandwidth, no setup overhead, resilient
😔 No guaranteed performance due to queuing delays
😔 Send extra information (headers)
Circuit switching: has a set maximum number of simultaneous users (total link capacity / bandwidth needed per user), so even if users barely use the link, they still stop other people using it.
Packet switching - has no set maximum, but we can calculate the probability of the link being overloaded:
Using a binomial distribution: n is number of users we WANT. p is probability of user being active at any time.
Overloaded number (k): number of users who would have to be active to overload network
$P(\text{overload}) = \sum_{i=k}^{n} \binom{n}{i} p^i (1-p)^{n-i}$
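A minimal sketch of this calculation (the 35-users/10%-active numbers are just an illustration):

```python
from math import comb

def p_overload(n, p, k):
    """Probability that at least k of n independent users are active,
    i.e. the chance the shared link is overloaded."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# e.g. 35 users, each active 10% of the time, on a link that copes with 10 at once:
print(p_overload(35, 0.1, 11))   # ~0.0004: packet switching admits far more users
```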
This analysis assumes user activity follows the law of large numbers (we have a big enough sample) and that the central limit theorem is valid, i.e. the traffic characteristics are normal: ON-OFF periods are normally distributed.
In reality, bursts of traffic activity (packet-bursts, simultaneous bursts of traffic) exhibit heavy tails and are not normal.
Smoothing (e.g. in RTT computation) provides a smooth, continuous response.
Traffic may be synchronised - either overtly (the router-update example) or accidentally (packet loss at the same buffer) - breaking the assumption that each user being active is independent,
creating synchronised loss and retransmission events where the flows themselves are not independent.
Centralised multiplexing
coordinating how the different nodes share the physical resource with knowledge of flows, e.g. queue lengths (for variable-rate flows) and priorities, can maximise channel usage (consider static real-time scheduling)
also no latency when making a centralised (local) decision (compared to CSMA/CD, where we might need to retry).
however requires a centralised scheduler - a central “point of decision” having both control over multiplexing and the required flow-specific knowledge
The internet has tiers of ISP's
- Tier 3 and Local ISP's - these are the last hop of bits in the network; they are customers of the higher-tier ISP's.
- Tier 2 - these are regional and are customers of Tier 1, but also peer with each other privately.
- Tier 1 - the backbone of the internet which have wires worldwide.
How to build an internet cake
We use the concept of abstraction to split the internet (just a big computer system) into modules; unsurprisingly there are positives and negatives. We say we divide the system into layers where:
Interaction only happens between adjacent layers (in the same system)
Bottom layer is physical media
Top layer is application
Entities (a layer in one system) communicate with their peer entities; this is supported by the layers below.
OSI Model
Application: The thing
Presentation: Allow applications to interpret meaning of data (encrypt, compress).
Session: Tracks dialog between computers as sessions
Transport: (Un..) reliable transport
Network: Best effort global packet delivery
Data link: Best effort local packet delivery
Physical: Physical transfer of bits
Actual Internet: {from OSI}
Application {Application/Presentation/Session}: SMTP/HTTP/DNS
Transport {Transport}: TCP/UDP
Network {Network}: IPv4, IPv6
Data link {Data link}: Ethernet, 802.11
Physical {Physical}: Copper, Fiber, Radio
We implement layering via embedding:

Basically high-layer info is embedded with lower layer info.
We can partly decode in transit to figure out something in the switch/router.
In these layers we have protocols, we say that in order for entities to communicate they must speak the same protocol (& version number). So the protocols are standardized, so that this can happen. The intermediate layers between application and physical layers gives us abstractions to make life easier.
😊
Modularity hides underlying detail/complexity
Innovation in different layers can proceed in parallel.
Applications only need to understand a couple (TCP, UDP) of transport mechanisms to work.
Networks using different lower-level protocols can communicate (primarily with IP)
😔
Layers may duplicate functionality
There are layer violations (one layer using information from another) when the gains are large (TCP-over-wireless) or when there is no trust (firewalls)
Information hiding can hurt performance: TCP doesn't get told WHY a packet was lost: corruption vs congestion.
Headers can get bigggg.
But overall the layering approach is the reason the internet exists
End-to-end principle is massively cited/debated but broadly means that implementing functionality IN the network does not mean we don't do the work in the ENDS as well. There are several interpretations of this:
Only if sufficient - only implement functionality at lower levels if it can be completely implemented at this level:
if we can stop the hosts doing work at a lower level, then do so, but if host will still have to do work then DONT.
Only if necessary - Don't implement anything in the network that can be implemented correctly by the hosts. We make the network layer minimal so the system remains more flexible.
Only if useful - Only implement functionality at a lower layer if there is performance enhancements when using the functionality AND when applications don't use it.
End-to-end vs hop-by-hop
hop-by-hop, do everything at every hop
hop-by-hop encipherment is a necessary but insufficient condition for safe/secure data
hop-by-hop cannot ensure a web page will not be intercepted; however, wifi security modes such as WEP/WPA/WPA2 help - as does 802.1x - by ensuring all (wireless) traffic is enciphered at least within the wireless/eduroam domain, preventing trivial interception.
end-to-end: do as little as necessary in the network - push functionality to the ends.
end-to-end ensures secure carriage between client and server; however it removes the ability for the network to optimise carriage (no caches work)
end-to-end is expensive but can ensure your website order / credit card is not readable from any intermediate systems.
Makeup of a network:
Hosts - implement the full stack, from bits on the wire all the way to the application layer.
Routers - participate in global delivery, so must read up to the IP layer to decide where to forward next.
They don't support reliable delivery as they do nothing with the Transport Layer.
Switches - only participate in local delivery, so read the data link layer to find the next switch to send to.
There are alternatives to standardisation:
- Only have one implementation of the thing, and everyone uses it e.g. Java
- Have an open-source project create the implementation e.g. Python
- These are sole-sourced implementations.
The Data Link Layer: Layer 2
Where a packet is called a frame, which encapsulates a datagram by adding a header & a trailer. It gives:
Physical addressing
Reliable delivery between adjacent nodes
Flow control
Error Control
Access control
We implement the link layer in adaptors or Network Interface Cards, which attach to the host system.
Line coding
Is how we convert binary to voltage patterns:
Non-Return-to-Zero: the simplest technique, the signal is high for 1 and low for 0
😔 Long streams of 0's/1's cause baseline wander
we measure what is 0/1 relative to a running average of the signal, so with long runs of one value our baseline drifts
😔 Long streams of 0's/1's cause clock drift
we haven't got enough transitions to resync/recover the clock
Non-Return-to-Zero-Mark: a 1 causes a transition (from either direction), 0 means do nothing
Non-Return-to-Zero-Inverted: same as NRZM but transitions exactly on the 1 (in the middle of the signal being held).
Manchester: XOR NRZ data and the clock together, where each clock cycle is 1 bit (a full up and down)
😔 Our bit rate (rate of actual data transfer) is half our baud rate (number of hi/lo's sent) as we send two signal levels per bit
NRZM/I has the same clock-drift problem as NRZ (we could still see few transitions), so we use block coding to make sure there are always some transitions (no long runs of 0's or 1's). We turn any 4-bit symbol into a 5-bit symbol with a magic table, then code that with line coding. This is most commonly used as an 8b/10b codec, so every byte is turned into a 10-bit string with guaranteed transitions, which is then line coded.
This could also be done with line-code scrambling: choose a string of bits which both parties know, then XOR it with the data before sending (statistically this gives an even mix of 0's and 1's). XOR is symmetric, so the receiver XORs the received data with the same string to get the actual data back.
Note this doesn't provide secrecy, but it can be combined with block coding to give the idea of framing (marking the start/end of a frame)
Perfect signals do not exist: problems cause bits to flip:
Attenuation - loss of energy in signal
Distortion - signal has its form changed
Noise - all sorts of different noise
We send symbols on the wire which represent data. Normally we just have hi/lo for 1/0, but we can increase the number of symbols using a quad-level code (four levels - say 0, ⅓, ⅔, 1 - corresponding to 00, 01, 10, 11); this increases our bit rate (data transfer speed) if our baud rate (symbol transfer speed) stays constant.
Error detection
add a parity bit - every $n$ bits we add a parity bit such that there is an even number of 1's present; when receiving, we spot an error when there is an odd number of 1's
😔 doesn't catch two-bit errors; to make it useful we need a lot of extra bits
cyclic redundancy check - every $n$ bits we add a $c$-bit CRC
😊 can detect multi-bit errors
😔 doesn't fix errors
Action: For both of these we can detect errors then ask the sender to send again
Error correction
Parity bit grid - sort the bits in a grid, have parity bits for each column AND row, then for single bit error, we find the problem bit as one column and row will both be wrong.
Forward error correction - assign only some bit sequences as valid (as far away from each other as possible) then when we get a 'bad bit sequence' find its closest error-free data (where distance is number of bits needing to change) and fix it.
Multiple access protocol
An algorithm distributed across nodes to decide how they share the medium (e.g. WiFi or Ethernet), where the medium is their ONLY coordination method. (An example was CDMA.)
Ideal Protocol:
when one node wants to transmit, it sends at the full rate $R$
when $N$ nodes want to transmit, each sends at rate $R/N$.
Fully decentralised so no clocks etc
Channel partitioning:
Time Division Multiplexing and/or Frequency Division Multiplexing, but now called Time Division Multiple Access & Frequency Division Multiple Access.
😔 we get wasted bandwidth/time 😊 is fair
CDMA: Code Division Multiple Access
This is another multiplexing scheme which is used in shared mediums (3G etc)
We assign a unique code/chipping sequence (must be orthogonal) to each user, and each user encodes their data by XOR'ing the chipping sequence with the data. Decoding is then the opposite.
The technique allows for multiple users to transmit simultaneously on the same frequency.

(Figures omitted: worked examples of what the data looks like, how a data bit and the code are combined, how the channel output is decoded, and how this works with multiple simultaneous senders on a single channel.)
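A toy sketch of the chipping idea (chips written as ±1, so the notes' XOR view becomes multiplication; the codes and bits are my own examples):

```python
import numpy as np

code_a = np.array([+1, +1, +1, +1])    # orthogonal chipping sequences:
code_b = np.array([+1, -1, +1, -1])    # code_a @ code_b == 0

def encode(bit, code):                 # bit is +1 or -1
    return bit * code

# both users transmit at once; the shared channel simply sums the signals
channel = encode(+1, code_a) + encode(-1, code_b)

def decode(signal, code):              # correlate with the user's own code
    return int(np.sign(signal @ code))

print(decode(channel, code_a), decode(channel, code_b))   # 1 -1: both recovered
```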
Taking turns (Bluetooth, IBM Token Ring)
Polling - a master invites the slaves to submit data, and they do. The master polls the slaves in turn.
😔 Polling overhead + lots of latency + single point of failure
Token passing - we pass a control token around, saying who's allowed to speak at any moment
😔 Token overhead + lots of latency + single point of failure (from losing token)
😊 gives bounded latency
Asynchronous Transfer Mode - we have slots like Time Division MA, but anyone can send at any time, they just attach 'who they are' as a header
😃 variant of packet switching
😔 virtual circuits need setting up ⇒ OVERHEAD
Random access - basic concept: when a node has a packet, send it; when multiple nodes are transmitting, stop & wait.
we are broadcasting rather than point-to-point, so we can get collisions when a node receives multiple signals simultaneously -> neither signal gets heard.
Needs three ideas:
Carrier sense - listen before sending and don't interrupt if channel is busy
Collision detection - if two users start simultaneously stop and have another go later.
Randomness - determine the later in some random fashion, so you don't re-collide.
Chosen solution in wired connections is: Carrier Sense Multiple Access/Collision Detection (CSMA/CD)
We incorporate carrier sense and collision detection together
carrier sense alone isn't enough due to propagation delay
this is easy in wired connections but hard in wireless
When a collision is detected we STOP and try again later; this avoids wasting time on garbage signals.
😔 the time for a collision to be detected can be up to $2d$, where $d$ is the propagation time between the two nodes; this can waste a LOT of time
performance (efficiency): time spent transmitting packets / total time $= \frac{p/b}{p/b + Cd} = \frac{1}{1 + Cbd/p}$
$p$: packet size
$b$: bandwidth
$C$: a constant reflecting how many collisions occur
$d$: (propagation time over the) distance between two nodes
so large packets, small distances ⇒ performance $\to$ 1
but when we increase bandwidth ⇒ the network is 'fully used' less
because of this equation, high-speed networks have switches to REDUCE $d$.
The random access part is done with exponential backoff: after the m^th^ collision:
choose k randomly from {0, ..., 2^m^ - 1}
wait (k \* slot size) before trying again.
min packet size used as slot
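A minimal sketch of the backoff rule (the exponent cap of 10 and giving up after 16 attempts are how classic Ethernet does it):

```python
import random

def backoff_slots(m, exponent_cap=10, attempt_limit=16):
    """After the m-th collision, pick k uniformly from {0, ..., 2^m - 1};
    the caller then waits k * slot_size before retrying."""
    if m > attempt_limit:
        raise RuntimeError("too many collisions: give up")
    return random.randrange(2 ** min(m, exponent_cap))
```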
Ethernet Frame Format
A lot of what we've been speaking about is incorporated into some version of Ethernet; it can be used on all sorts of cables, network setups and media. Exactly what is implemented depends on the version, but the common attributes are:
Preamble: white noise to synchronise clocks
Destination MAC address: 48-bit addresses have no 'proper' structure, but are globally unique (probably) and don't normally change
Source MAC address
Type
Payload
CRC: error detection
A pair of twisted cables forms a single connection (alternatives include coaxial/fibre).
802.11 (WiFi)
The basic idea is that we have access points set to a specific channel, which broadcast their Service Set ID (SSID) and MAC address periodically. Then hosts scan for access points and send them stuff.
One of our key problems is that collision detection is really hard, because we can't listen and send simultaneously - our own signal drowns out anything we might receive.
If we try and do normal collision detection we have two problems:
Hidden Terminals

A is hidden from C, so neither realises they have caused a collision at B.
Exposed Terminals

C sees that B is sending to A, so decides not to send to D, even though it could do.
Key issues are:
There is no concept of a global collision; everyone reaches different people.
Collisions occur at the receiver, not the sender - we only care about what the receiver can hear, not the sender.
Wireless CSMA - (collision avoidance)
CA (collision avoidance):
Our carrier sense is still similar: if the medium is busy, wait a random time (rather than keep checking); other than that, no change.
We infer collisions from the (lack of) receipt of an ACK from a receiver.
On collisions we use exponential backoff to adapt contention window (avoid having contention again)
😊 Fairly low latency and little wasted bandwidth with low collision probability
😔 doesn't work well with hidden terminals still
when the hidden terminal sees an ACK, it doesn't know whether the transmission is done.
RTS/CTS:
Method
Sender sends a Request to Send frame with the length of the transmission
Receiver responds with a Clear to Send frame
Sender sends data
Receiver sends an ACK
Situations
When sender doesn't get CTS back, assume collision and backoff
If a node hears the RTS but not the CTS - send data, as the relevant receiver is not in range of our node, so we can send
If a node hears a CTS - stay quiet until you hear an ACK
😊 good for high-traffic Access Points and helps with hidden terminals problem
😔 slower than standard CA (but could be turned off dynamically).
Channels: another technique for avoiding wireless collisions
split our frequency band into channels which don't interfere with each other
so two nodes in some range won't interfere with each other (if they choose different channels).
For Wifi:
we split into 14 overlapping channels, but there are 3 completely separate channels - 1, 6, 11.
but other technologies already share some of WiFi's frequency space so cause interference eg Zigbee.
The hardware: behind the data link layer
Hubs - dumb physical repeater; bits come in from one node and hub sends out bits to everyone else. NOTHING SMART, no CD etc.
Switch - a smart link-layer device that takes an active role and selectively forwards frames to the correct place based on the MAC address and its Switch Table.
They are transparent (hosts unaware of presence) and self-learning (they figure out MAC addresses as they go)
It allows us to connect different physical media as it needs to do conversions between different media protocols.
Our aim is to make a switch table with each entry containing
MAC address of host
Interface number to reach host: which 'wire' to send stuff down
Time to Live: how long entry lives
Approach
Firstly we say that for EVERY packet received we record the source MAC address and interface number.
Then, when a destination MAC address isn't in the table, we flood the frame out of every other interface ("who has this MAC address?")
and learn the right interface from the response.
😊 this works with connected switches as well, so every switch has a table of the next step for ALL hosts.
😔 problems when we have loops!
Solution to the loop problem!
create a spanning tree: a subgraph connecting every vertex WITHOUT ANY LOOPS; if a link is not in the spanning tree, we won't use it.
Each switch determines whether each interface is on the shortest path from the root; if not, it is not in the spanning tree and isn't used.
distributed minimum spanning tree algorithm:
send messages of form (Y, d, x)
Y: proposed root
d: distance to root
x: node message came from
everyone proposes themselves as root: (X, 0, X)
arbitrarily decide root based on lowest ID
switches constantly receive messages and will change their distance (and flood everyone else) if
message has lower root ID than current choice
message has shorter distance to some root
algorithm must react to failure:
the root switch periodically announces itself as root (and this propagates around the tree).
other switches detect failure through timeout and claim to be root if they think a failure has happened (the algorithm starts completely again)
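A sketch of the per-message update rule (message format (Y, d, x) as above; link costs of one hop and the tuple tie-breaking are my simplifications):

```python
def on_message(my_root, my_dist, msg_root, msg_dist):
    """Run on every received (Y, d, x) message. Returns the (possibly new)
    (root, dist) view and whether we should re-flood it to our neighbours."""
    candidate = (msg_root, msg_dist + 1)       # one hop further than the sender
    if candidate < (my_root, my_dist):         # lower root ID wins, then distance
        return candidate, True
    return (my_root, my_dist), False

# every switch starts by proposing itself: (my_id, 0)
print(on_message(my_root=7, my_dist=0, msg_root=3, msg_dist=2))  # ((3, 3), True)
```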
Local networks can't be too big as:
creating MST's is hard for larger networks
every switch knows the location of every other host - BIG tables.
The Network Layer
Aim: Produce a global system with its own addresses: building on top of the data link layer - which was fundamentally LOCAL.
Router: the tech that operates on the network level.
It does two jobs at different timescales:
Forwarding
we direct a data packet to an outgoing link based only on the individual router state (a forwarding table).
this has to be very fast
Routing
we compute the paths packets will take and generate the individual router state based off of conversations with other routers.
This can afford to be slower as we shouldn't have to do this too often.
Router Capacity = N (Number of external router paths) x R (Speed/Line Rate of a port)
We normally assume ports are full duplex so routers can send & receive simultaneously along the path.
Inside our router:
Data plane
Linecards (input)
process packets on their way in and strips away ethernet frame to get bits of the (IP) packet and make forwarding decisions.
they also manage queues, and each has its own forwarding table.
Interconnect Fabric
basically does what linecards want, and moves packets between linecards.
Linecards (output)
repackage packets as ethernet frames and send on its port.
Control plane
Route/Control Processor
makes decisions over a long time; it creates the forwarding table and pushes it to EVERY (input) linecard simultaneously.
Routing: control plane
We split the internet into Autonomous Systems/Domains and manage routing at two levels:
Intra-Domain routing - each domain chooses its routing protocol to route packets around its own domain (each domain is controlled by one admin).
Inter-Domain routing - the common protocol to allow domains to speak to each other, we use Border Gateway Protocol (it allows companies to do as little comms as possible), but we won't cover that in this course 😃.
The actual preference becomes "whatever BGP considers best" so it may pick either solution.
Devolves a BGP routing solution problem to a Longest prefix match problem. LPM always wins.
Local, short path routes for a particular IP prefix are injected into the local BGP routes. These cause IP traffic for an address in that prefix to be routed to a locally designated destination.
Can also use a content-distribution system's specialist DNS - useful when there is no local datacenter.
Implement netblock filtering to restrict the size of the routing table.
The whole concept of routing protocols: each router calculates its own routing table by treating the network as a graph where edges have an associated "cost" (some number/function of distance). They want to find the least-cost path from A to B.
multi-homing allows devices to be on multiple networks, but this increases the routing table size.
Intra-Domain routing algorithms
Link State
Each node maintains its local link state (a list of directly attached nodes + cost).
Each node floods its local link state, so nodes learn the global link state as they forward on the information.
Each node knows the whole graph, so it computes Dijkstra to find its shortest path to every other node (<u>each iteration we find one more closest node, so we have to do $n$ iterations!</u>); its forwarding table then comes from this info (just the first hop needing to be done).
😔 scalability
we need to send $O(nE)$ messages to discover the global state (every node's update crosses every link).
every node has an $O(n^2)$ (can become $O(n \log n)$ with a priority queue) operation to compute Dijkstra.
In Dijkstra you explore the best node possible from your current frontier.
Keep a stack on the left to show order you explored.
our link-state database in each node is $O(E)$
we have $O(n)$ entries in our forwarding table
😔 transient disruptions
failures result in some routers knowing before others so the link-state DB is inconsistent.
this can lead to forwarding loops of nodes thinking the other is the shortest path, so goes back and forth.
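A compact Dijkstra that emits the forwarding table (first hop per destination) directly; the heap makes it $O(E \log n)$ rather than the naive scan, and the toy graph is my own:

```python
import heapq

def forwarding_table(graph, source):
    """graph: {node: {neighbour: cost}}. Returns {dest: first_hop} for source."""
    dist, first_hop = {source: 0}, {}
    pq = [(0, source, None)]                 # (distance, node, first hop used)
    while pq:
        d, node, hop = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue                         # stale queue entry
        if hop is not None:
            first_hop[node] = hop
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # the first hop is inherited, except for the source's neighbours
                heapq.heappush(pq, (nd, nbr, hop if hop else nbr))
    return first_hop

g = {"A": {"B": 1, "C": 4}, "B": {"A": 1, "C": 1}, "C": {"A": 4, "B": 1}}
print(forwarding_table(g, "A"))   # {'B': 'B', 'C': 'B'}: reach C via B
```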
Distance Vector
Every node maintains a distance vector of how far away from EVERY other node it is.
It shares this with its neighbors.
On receiving a distance vector a node compares it to their current distance vector and updates any entries that are shorter (with a note to say where to send it).
This iterative process means nodes converge to the set of shortest paths while only knowing the NEXT node they need to send to.
😔 using a non-additive metric (eg maximal capacity of link) - we get a problem with looping as the 'cost' won't necessarily change in a loop, so we keep going looking for the best.
😔 routers use different metrics - if nodes care about different things their distance state vector will have different meaning to different nodes, this is a problem.
😔 routers can lie - either on purpose or not; we can get ALL traffic going through one router because it says it's next to EVERYTHING.
😔 Overall it is more vulnerable to loops
😊 It has 'good' complexities
#messages = $O(\#\text{iterations} \times E)$
where #iterations is ideally proportional to the network diameter, but varies due to routing loops or the count-to-infinity problem: nodes feed each other bad information after a node fails, and the recorded distance to the failed node keeps increasing.
poisoned reverse: we advertise a distance of $\infty$ to the mutual node.
processing = $O(n)$ per received distance vector (compare every entry)
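The core Bellman-Ford update a node runs on each received vector, as a dict-based sketch (names are mine):

```python
def on_receive(my_vector, next_hop, nbr, nbr_vector, link_cost):
    """On receiving neighbour nbr's distance vector, adopt any route that is
    shorter via that neighbour. Returns True if anything changed (in which
    case we should send our own vector to our neighbours)."""
    changed = False
    for dest, d in nbr_vector.items():
        if link_cost + d < my_vector.get(dest, float("inf")):
            my_vector[dest] = link_cost + d    # shorter path found via nbr
            next_hop[dest] = nbr
            changed = True
    return changed
```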
Metrics for routing:
Propagation delay
Congestion
Load balance
Bandwidth (available, capacity, maximal)
Price
Reliability
Loss rate
Combinations
Message delivery types:
Broadcast: send message to everyone
Multicast: send message to all members of some group
Anycast: send message to any members of some group.
Datapath
will keep working even when the control plane is broken (it will keep using the most recent table update).
per-packet delay = processing + queueing + transmission + propagation (each discussed below, along with how to speed it up)
typical values
RTT is .15s and BW 100Mb/s
for a LAN link
queueing ≫ propagation ≫ transmission ≫ processing
for a Data Centre link (unusually large b/w but bounded (small) RTT values)
(queueing) ≫ transmission ≫ propagation ≫ processing
absurdly low latency and large bandwidth (sub-millisecond RTTs, 10 Gb/s) such that the bandwidth-delay product is actually much lower.
implementations: some of the considerations in DCN include practical circuit-orientated networks, with the lowest layer having no switches or queues.
the cause of propagation (proportional to distance)
the cause of queueing (proportional to contention)
the cause of processing (unusual packets)
the cause of transmission (the capacity of the link)
processing delay: time required to examine a packet's headers and determine where to direct the packet.
use circuit switching (however needs reservation of link etc)
simple deltas on the amount of processing per packet
can include e.g., bit-level error checking (which you could remove)
queueing delay: packets experience the queueing delay as they are stored in queues awaiting transmission onto the link. The delay depends on the amount of congestion and length of queues.
accept more occasional loss with smaller queues
ranges from ~0 (negligible) up to the time to drain a full buffer; calculate it from the number of queued packets and the buffer size.
transmission delay & propagation delay already discussed.
per-packet processing:
accept packet arriving on link.
lookup packet destination in forwarding table: speed-limit on the operation: a naive forwarding table for a 32 bit address would have 4 billion entries!
we group entries to keep table size small
implement Longest Prefix Match by ordering the entries from most to least specific, then searching downwards. That way the first matching entry we find will be the most specific - we should choose Cambridge rather than England (see the LPM sketch after this list).
Exception processing requires that a packet be sent to a control processor, causing significant additional delay - the control plane is much slower and will also take up transfer time.
manipulate packet header eg decrement Time to Live
send packet to outgoing port.
buffer packet in the queue: need to decide how big to make our buffers: based a lot on TCP as this is a majority of the traffic
| Rule of thumb | Packets | Intuition | Assume | Evidence |
|---|---|---|---|---|
| $2T \times C$ | 1,000,000 | TCP Sawtooth | Single TCP Flow, 100% Util | Simulation, Emulation |
| $2T \times C / \sqrt{n}$ | 10,000 (with $n$ flows) | Sawtooth Smoothing | Many Flows, 100% Util | Simulations, Test-bed and Real Network Experiments |
| Constant | 20-50 | Non-bursty arrivals | Paced TCP, 85-90% Util | Simulations, Test-bed Experiments |

Def: T (RTT: round-trip time), C (capacity), W (window size)
Row 1: This is huge and presents loads of problems: big buffer ⇒ longer queuing time, and we must use slower DRAM.
Row 2: Can use SRAM feasibly. Basically says the peak sawtooth won't be reached by everyone simultaneously, we're more likely to have a fairly flat line.
Row 3: If access network is slower than the backbone this makes sense.
transmit packet
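The Longest Prefix Match idea from the list above, sketched with the stdlib ipaddress module (the table contents are made up; 128.232.0.0/16 stands in for 'Cambridge' inside 128.0.0.0/8's 'England'):

```python
import ipaddress

# Hypothetical table, ordered most-specific (longest prefix) first
table = [
    (ipaddress.ip_network("128.232.0.0/16"), "port-cambridge"),
    (ipaddress.ip_network("128.0.0.0/8"),    "port-england"),
    (ipaddress.ip_network("0.0.0.0/0"),      "port-default"),
]

def lookup(addr):
    """First hit in a most-to-least-specific table is the longest match."""
    ip = ipaddress.ip_address(addr)
    for net, port in table:
        if ip in net:
            return port

print(lookup("128.232.1.1"))   # 'port-cambridge', not the broader /8
```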
IPv4: the de facto transport layer for the internet
Structure

Version number: necessary so we know what else to expect
Header length: number of 32-bit words in the header typically 5
TOS: allows packets to be treated in different ways; very complex, largely ignored
Number of bytes of packet: the max size in theory is (2^16^ - 1) bytes, but Ethernet frames have a smaller max
2nd row: used for fragmentation if the packet is too big (see below).
Time To Live (TTL): Decremented at each hop and discarded if reaches 0: Time exceeded ICMP (Internet Control Message Protocol) message sent when TTL=0
Protocol: Identifies the higher level protocol: 6 is TCP; 17 is UDP
Checksum: checks the header is correct: if it is wrong, routers discard the packet. Recalculated at every router.
Source: useful for reply mechanisms and acceptance decisions.
Destination: get there! Allows node to make forwarding decisions
Options: asks routers to do some extra work
Fragmentation
When forwarding a packet, an internet router can split it into pieces for the next hop; this is needed because network links have Maximum Transmission Units (MTUs) - the largest possible data-link frame. The MTU changes as we travel, as different technologies have different capacities.
We then reassemble at the final destination, even if fragmented multiple times.
Structure:
The identifier (16 bits) is used to tell which fragments belong together.
The flags (3 bits) have:
Reserved (RF)
Don't Fragment (DF) - tells routers to not fragment, they instead DROP the packet and send back "Too large" ICMP message.
More (MF) - says that this fragment isn't the last one from the original packet.
The offset (13 bits) is used to say which part of the original datagram the fragment covers in 8 byte units.
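A sketch of how a router would compute the fragments (the 4000-byte/1500-MTU numbers are the classic textbook example, not from these notes):

```python
def fragment(payload_len, header_len, mtu, ident):
    """Split an IP payload for a link with the given MTU. Offsets are in
    8-byte units, so every fragment's data (except the last) must be a
    multiple of 8 bytes."""
    max_data = (mtu - header_len) // 8 * 8
    frags, offset = [], 0
    while offset < payload_len:
        size = min(max_data, payload_len - offset)
        more = offset + size < payload_len          # MF flag: not the last piece
        frags.append({"id": ident, "MF": more, "offset": offset // 8, "len": size})
        offset += size
    return frags

# 4000-byte payload, 20-byte headers, 1500-byte MTU
for f in fragment(4000, 20, 1500, ident=777):
    print(f)   # 1480 + 1480 + 1040 bytes, offsets 0 / 185 / 370
```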
Addressing
An IP address is 32 bits long in the form 128.12.3.244.
We assign an address to each interface (connection between computer and physical link).
Generally hosts have one interface, whereas routers have multiple.
Subnets are groups of interfaces
the first part of the IP address is the subnet part
the second part is host part
/24 in the address is the subnet mask: it tells us how many bits are 'subnet'.
Devices in a subnet can interface with each other without a router intervening
e.g: subnet[223.1.2].host[3]/24
We use Classless Inter-Domain Routing (CIDR):
we AND the subnet mask (a run of 1's) with the IP address to get the subnet address (see the sketch after this list).
we get IP address by:
Hard coding it in a system admin file.
Dynamic Host Configuration Protocol - we dynamically ask for an IP address from a server and it gives us one.
this means a network can reuse IP addresses when computers are turned off.
the server knows the addresses it's allowed to allocate by being told by ISP's, who get told what they can allocate by ICANN. Note the hierarchical organisation makes routing easier, in the form of adverts by ISP's saying "send me anything with addresses xxx".
special addresses:
0.0.0.0/8 - labels the current network (to be used as source address)
10.0.0.0/8 - used for local communications within a private network
127.0.0.0/8 - used for loopback addresses/localhost, connecting to itself.
192.0.0.0/24 - used for private network
20.3.2.0/24 or 20.3.0.0/16 - the all-zero host part (so it looks like a subnet mask) is the network identifier
20.3.2.255/24 or 20.3.255.255/16 - subnet then all 1's is the broadcast address
192.108.1.1/24 or 192.108.1.254/24 - is OFTEN used as the default gateway: the IP address of the router, in the NAT sense.
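The subnet-mask AND from the CIDR bullet above, sketched by hand (a real program would just use the ipaddress module):

```python
def subnet_of(ip, prefix_len):
    """AND a dotted-quad address with its /prefix_len mask to get the subnet."""
    addr = int.from_bytes(bytes(int(o) for o in ip.split(".")), "big")
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return ".".join(str(b) for b in (addr & mask).to_bytes(4, "big"))

print(subnet_of("223.1.2.3", 24))   # '223.1.2.0': hosts in here need no router
```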
NAT: Network Address Translation
A technique developed to avoid address exhaustion in IPv4
We allocate our local network to be the IP address range 10.0.0.0/24, and each interface gets a local address.
Then the router is given a single address to represent the whole network (this is the address the rest of the internet sees), and we identify hosts with port numbers within the packets.
Specifically for a datagram:
Outgoing datagrams replace (source IP addr, #port number) [LAN side address] with (NAT IP addr, #new source port) [WAN side address] in the router.
Remember the translation pairs we use in a NAT Translation Table.
Incoming datagrams use the table to replace the destination fields in IP & TCP headers.
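A toy NAT translation table (the public address 138.76.29.7 and the port numbers are invented for the example):

```python
NAT_IP, table, next_port = "138.76.29.7", {}, 5001

def outgoing(src_ip, src_port):
    """Rewrite (LAN source IP, port) -> (NAT IP, fresh port), remembering the pair."""
    global next_port
    if (src_ip, src_port) not in table:
        table[(src_ip, src_port)] = next_port
        next_port += 1
    return NAT_IP, table[(src_ip, src_port)]

def incoming(dst_port):
    """Reverse lookup: which internal host owns this public port?"""
    for (ip, port), public in table.items():
        if public == dst_port:
            return ip, port

print(outgoing("10.0.0.1", 3345))   # ('138.76.29.7', 5001)
print(incoming(5001))               # ('10.0.0.1', 3345)
```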
😊
Only one IP address needed for multiple devices: reduces address exhaustion
Can mess with our local IP addresses without telling outside world
Can change external address and internal hosts don't need any change
More security as devices aren't explicitly addressable: FIREWALL
😔
Violates end-to-end principle as needs to deal with layer 4 (TCP ports)
Address shortage could be solved in other ways (IPv6)
We may want a server behind a NAT.
difficult as there is no prior connection from the source - the outside machine must know the right port number to be directed to
😯 statically configure NAT to designate a certain machine to a specific port
😯 Universal Plug and Play (UPnP) allows NATted hosts to learn their public IP address and add/remove port mappings for external access.
😯 Relay system - client talks to relay, NATted server initiates comms with relay.
ICMP: Internet Control Message Protocol
Its messages are carried in IP datagrams: they don't carry DATA, just a fixed set of messages.
Structure:
Type - Code - Checksum
Combined in an IP packet with protocol value of 1
They are sent for error reporting and echo request/reply ALL over the place.
An example use is traceroute:
send a series of UDP messages with increasing TTLs (from 1) to an unlikely port number.
when the n^th^ datagram reaches the n^th^ router, the TTL expires: the router discards the datagram and returns an ICMP message (with the router name - found via reverse lookup - plus its IP address)
When we get an ICMP message back, we can calculate the RTT
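A rough sketch of that loop (raw ICMP sockets need root, and a real traceroute also checks whether the reply is "time exceeded" or "port unreachable" to know when it has reached the destination - omitted here):

```python
import socket, time

def traceroute(dest, max_hops=30, port=33434):
    for ttl in range(1, max_hops + 1):
        recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        recv.settimeout(2.0)
        start = time.time()
        send.sendto(b"", (dest, port))            # unlikely port, empty payload
        try:
            _, (router, _) = recv.recvfrom(512)   # ICMP time-exceeded from hop n
            print(ttl, router, round((time.time() - start) * 1000, 1), "ms")
        except socket.timeout:
            print(ttl, "*")                       # router ignored us (see cons below)
        finally:
            send.close(); recv.close()
```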
😔
ICMP messages sometimes follow different paths to IP packets
It discovers routes at the interface level, not the router level.
Sometimes routers just ignore our messages because they don't like ICMP.
Address Resolution Protocol: sending an IP address to the right place - linking network and data link layer:
Every node maintains an ARP Table of <IP address, MAC address> pairs.
When sending a packet we consult the table; if the address is in there, send it.
If the IP address isn't there, we broadcast "who has this IP address?" and someone responds with their MAC address; the sender caches the result in its table.
😔 ARP allows people to impersonate IP addresses really easily, ARP poisoning is a classic security flaw.
Example: A Sending a Packet to B

Host A constructs an IP packet to send to B
Host A has a gateway router R: this is used to reach addresses outside its subnet.
Address for R learned via DHCP/config through broadcast
Host A learns the MAC address of R's interface
ARP request & response
Host A encapsulates the packet and sends it to R
Router R's adaptor receives the packet
R extracts the IP packet from Ethernet frame and reads destination
Router R consults its forwarding table
Router R learns the MAC address of host B
ARP request & response
Router R encapsulates the packet and sends to B
IPv6: solving the unnecessary complexity of IPv4 and address exhaustion
Protocol changes
Larger Address Space - addresses 128 bits long rather than 32
Increased MTU - networks can handle far more traffic than 1970s
No enroute/in-network fragmentation - simplifies handling of packets in the network; hosts are expected to do Path MTU Discovery before sending.
No checksums in IP header but REQUIRED in UDP: this is inline with the end-to-end principle.
Extension headers - are used for extensions to IPv6, they are specified with the next header number which tells us the extension OR the protocol of the next layer.
Auto-configuration
Flow-label
Replace broadcasts with multicast
Addresses Details
The basic format is 8 16-bit blocks separated by colons (4 hex digits per block). We condense the representation by eliminating leading zeros and replacing ONE run of zero blocks with "::", e.g. 2001:0db8:0000:0000:0000:0000:0000:0001 becomes 2001:db8::1.
We split them into:
Network prefix (often split into routing prefix & subnet ID) is 64 bits and used for routing a packet.
Interface identifier - is 64 bits and used for identifying a specific host on a subnet
A /64 mask routes to 2^0^ subnets, a /63 routes to 2^1^, etc...
Reserved:
::/0 default route/specify all networks
host-local: localhost; ::1 is equivalent to 127.0.0.1
link-local: not routed: fe80::/10 is equivalent to 169.254.0.0/16
site-local: not routed globally: fc00::/7 is equivalent to 192.168.0.0/16 or 10.0.0.0/8
global unicast: 2000::/3 is basically any v4 address not reserved in some other way
multicast: ff00::/8 is equivalent to 224.0.0.0/4
Broadcasting
ping6 ff02::1 All nodes in broadcast domain
ping6 ff02::2 All routers in broadcast domain
Actual ISP numbers (example)
ISPs have 16 bits for site prefixes (from /32 to /48)
Site prefixes have 16 bits for subnet IDs (from /48 to /64)
Each interface ID has 64 bit address space
Assume privacy addresses (no 64 bit EUI-64 constraint. . . random numbers generated for arbitrary sized interface IDs)
Say the ISP keeps control of the subnet bits: that gives them 32 bits of customer address space
Or they even take 16 bits of the interface IDs: that gives them 48 bits of customer address space (256T addresses!)
The Neighbour Discovery Protocol is a subset of ICMPv6 that allows hosts to discover other hardware on the network.
A host can solicit a neighbour host to get the MAC address of the host (or check if the address is in use)
A host can solicit a router to learn about network configuration.
It knows where to send this due to a well-known multicast address.
Address assignment
The network prefix is derived from Router Advertisements (basically they are told this by a 'higher up', which has been allocated by ICANN).
The identifier:
can use Stateless Address Auto Configuration (SLAAC): allows hosts to automatically configure/choose their addresses and learn the routes
when a host starts up it decides upon a link-local address, just so it can get the network prefix from a router (or wait for a periodic router advert announcement)
once it has the network prefix it derives an Extended Unique Identifier from its MAC address (maths magic!). This is actually how it generates its unique link local address.
After router solicitation, the router knows the IP address generated belongs to that host!
😔 this has privacy concerns as the EUI-64 doesn't change over time so can be tracked around the world.
😔 embedding the MAC address in the IP address means a host's vendor is identifiable ⇒ bad for security.
we can use stateful DHCPv6 which is just like the old DHCP.
Transition of IPv4 to IPv6
We note that the two protocols are fundamentally incompatible, so to introduce IPv6 we need new infrastructure while keeping IPv4 working.
We use dual-stacked hosts in the public internet which can handle IPv4 and IPv6; these may use only ONE of the protocols in the backend, but that doesn't matter as users see the dual-stacked host. You can turn off IPv6 by not publishing an AAAA record.
IPv6 carries the obligation to try IPv6 before IPv4; this led to some conflicts with DNS resolution (since it differs from IPv4), so Happy Eyeballs (so you don't stare at the screen) was a solution which modifies application behaviour:
IPv6-only to the host but an IPv4-only app trying to access an IPv4-only service:
464XLAT: DNS64 + NAT64 + shim layer on the host itself to offer IPv4 addresses / understanding to apps.
Improvements
Have verifiable source addresses for both accountability & anonymity
Use different packet headers for the edge and the core as they do different jobs
A payment field or responsibility field for who is in charge of packet delivery (ISP's etc).
Transport Layer
Location: normally at the end - hosts.
Aim:
Multiplexing - multiple applications on one host can use the network, we need to direct packets to the correct application.
Reliable data delivery - if we just used IP, we can get packets doing all sorts and we won't necessarily have a reliable service, so the transport layer aims to give this.
Provided by TCP, but not NEEDED by every application.
Paced data delivery - we want to do some pacing control so we don't overwhelm the network (or waste bandwidth).
Provided by TCP, but not NEEDED by every application.
Our main two protocols:
User Datagram Protocol (UDP): a minimalist protocol on which we build everything else
Transmission Control Protocol (TCP): gives a reliable, in-order byte-stream abstraction with congestion control but no performance guarantees.
UDP:
Our application layer talks to us via sockets - software abstraction in the OS. So each application can define these sockets to put data into and take data out of it. We then have ports which are transport layer identifiers. We store a mapping between sockets and ports, and our packets contain source and destination port numbers. So when a host receives an IP packet; it combines destination IP address AND destination port number to send the TCP payload to the correct socket.
Ports are 16 bit numbers with 0-1023 having particular services on them:
SSH: 22
HTTP: 80
HTTPS: 443
Then 1024-65535 have dynamically chosen meanings - they are ephemeral ports.
And that is basically all UDP has, meaning it's very lightweight and adaptable; its packets (which are carried in IP packets) have:
Source port
Dest Port
Checksum
was optional in IPv4 but is now mandatory in IPv6
Length
length of UDP header + UDP payload
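The whole UDP "API" in a few lines (loopback address and port picked arbitrarily for the demo):

```python
import socket

# Receiver: a socket bound to a port; the OS maps (dest IP, dest port) -> this socket
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 54321))

# Sender: no connection setup at all, just fire a datagram at (IP, port)
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", 54321))

data, addr = server.recvfrom(2048)   # addr carries the sender's ephemeral port
print(data, addr)
```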
Reliable Data Transfer: the IP layer only gives us best effort transfer, but things can happen to packets: corrupted - lost - delayed - reordered - duplicated.

Version 1.0: assume reliable channel
We represent the protocol with state machines,
where 'state' means the next state is uniquely determined by the next event.

Version 2.0: we now consider that packets may get corrupted (and get detected with checksum) but they aren't ever lost.
We have receiver feedback so receiver can send ACK or NAK to the sender

Sender:
IDLE: waiting for application to ask
Waiting: hold onto packet sent, until ACK
Receiver
Check corrupt: send ACK or NAK accordingly
But we have a problem - the ACK/NAK can also get corrupted; the sender doesn't know what happened at the receiver and can't just retransmit: possible duplicate.
Solution - data and ACK packets carry sequence numbers, and we check the ACK/NAK for corruption; on corruption we resend as though it was a NAK - the receiver discards (doesn't deliver) duplicate packets.
we only need 0, 1 for sequence numbers because there is only ever one outstanding message.
From this addition we can also eliminate NAKs, as the ACK now carries the # of the last received packet: if that number is the expected # we are good; if not, treat it as a NAK.
Version 3.0: we can lose packets and they can be corrupted.
We introduce timeouts at the sender, it waits a reasonable amount of time then resends packet if no ACK received.
This can lead to duplicates being received as it could be the ACK got lost, not the packet.
This is already handled with sequence numbers introduced earlier - the receiver checks if the packet has the sequence number it is expecting, if not it sends back ACK of previous number.
Performance issue: we get a bottleneck as we are waiting for EVERY packet
time to transmit packet $= L/R$
L: packet length
R: bandwidth
Utilisation = time sending packets / total time $= \frac{L/R}{RTT + L/R}$.
It's conceivable for RTT to be 30 ms and $L/R$ to be 0.008 ms; we have A LOT of wasted time.
Sender:

topRdt_send: set sequence number of packet
topRight: if a reply is mangled or wrong, we do nothing, but leave timer running
lowerTopRight: guess that receiver has not got packet and send again.
upperBotRight: correct reply so happy that the right packet has been received.
botRight: if we receive any ACK's just discard them as we aren't expecting any.
botRdt_send: same as the other half, but with a seq # of 1.
Receiver
Wait for #0 packet
same
udt_rcv(packet) && ( corrupt || hasSeq1 )
udt_send(ack)
next
udt_rcv(packet) && ( not_corrupt && hasSeq0 )
udt_send(ack) && rdt_rcv(data)
Wait for #1 packet: analogous
Solution to Performance Issues
Pipelined/PacketWindow protocols allow us to send multiple packets, before getting the ACK back (lots of packets in transit simultaneously).
Our window is a set of adjacent sequence numbers; we have a sliding packet window which moves along on successful ACKs. We have to limit the size of the window, as any packet in the window must be stored in a buffer so it can be resent when issues arise.
The back edge of our sender window is controlled by the last non-ACK'd packet.
Everything in the window is sendable
A window of size n with the last contiguously ACK'd packet being A: {A + 1, A + 2, ..., A + n}
where A + n + 1 can't be sent before A + 1 is ACK'd.
The back edge of our receiver window is controlled by the last received packet; we can only accept packets with numbers IN the window (B + 1 ... B + n), where B is the last contiguously received packet.
But we have two options for how we ACK:
Cumulative ACK - receiver sends ACK with number of Next In Order sequence number it expects. So our ACK for B + 4 will still be B + 1 until B + 1 is received.
Selective ACK- ACK individually acknowledges correctly received packets.
We have two general schemes of use:
Go Back N - uses cumulative ACK's and ONLY accepts packets in-order. If it receives any packet which is not in order it discards it.

rdt_send
if in window then send packet and increment seq number
otherwise buffer it and block.
timeout
resend whole window
success
update last received packet i.e back edge of window

on anything but the correct packet, send an ACK of the old packet (i.e. a NAK). If correct, send the correct ACK, accept the data and move the window.
😊 Simple to implement
😔 Perfectly good packets get thrown away on receipt - and one error resends the WHOLE window, causing lots of network congestion.
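A minimal Go-Back-N sender sketch; `udt_send` and the restartable `timer` are hypothetical stand-ins for the notes' primitives:

```python
class GoBackNSender:
    """Go-Back-N sender state: cumulative ACKs, resend everything on timeout."""
    def __init__(self, window, udt_send, timer):
        self.window, self.udt_send, self.timer = window, udt_send, timer
        self.base, self.next_seq, self.buffer = 0, 0, {}

    def rdt_send(self, data):
        if self.next_seq >= self.base + self.window:
            return False                     # window full: caller must wait
        self.buffer[self.next_seq] = data
        self.udt_send(self.next_seq, data)
        if self.base == self.next_seq:       # first unACKed packet starts the timer
            self.timer.restart()
        self.next_seq += 1
        return True

    def on_ack(self, ack):                   # cumulative ACK: next expected seq
        self.base = max(self.base, ack)
        if self.base < self.next_seq:
            self.timer.restart()             # still waiting on later packets
        else:
            self.timer.stop()

    def on_timeout(self):                    # resend the WHOLE window
        self.timer.restart()
        for seq in range(self.base, self.next_seq):
            self.udt_send(seq, self.buffer[seq])
```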
Selective Repeat - uses selective ACK and allows receiver to accept out of order packets (as long as they are in window). Sender has a timer for EVERY packet sent and resends it (and only it) when packet timer runs out.
😊 don't throw away out of order packets
😔 takes a lot more space i.e. SACK mask for 127x2 outstanding packets is 8-16 bytes.
😔 hard to implement
From all this we see we CAN fully utilise a link with a high enough window size (governed by buffer size)
Throughput ≈ n/RTT packets per second (for window size n; multiply by packet size for bits/s)
Remember windows are ONLY for efficiency, situations exist where they don't help:
with low latency and low bandwidth a simple Automatic Repeat reQuest (ARQ) system would be fine without windows.
When the connection is TERRIBLE, we are governed by packets being lost, NOT by the ACK system.
When the user doesn't have much to send.
Transmission Control Protocol (TCP):
Format of TCP packets:
It obviously has source port, dest port, checksum and header length for all the same reasons as UDP.
Checksum: computed over its header and data (not whole packet).
We use sequence numbers which are the byte offsets from a file.
It represents which byte in the larger stream the packet payload starts from, so sequence numbers don't count up in 1's but in ANY number (commonly in Maximum Segment Sizes, where MSS = MTU - (IP header) - (TCP header)).
We also start the sequence number from a random Initial Sequence Number, to make it harder for an off-path attacker to inject into the connection.
We treat data transfer as a byte stream, hence why our sequence numbers count in bytes.
There is a Must be Zero and an Urgent pointer which we ignore
There are also Flags and an Advertised Window to be discussed later.
Acknowledgments
We make TCP full duplex by having a sequence number (for the attached payload) and an acknowledgement (saying what has been received). It implements a form of Go Back N, so the acknowledgement number is always the NEXT byte it is expecting, but the receiver can buffer out of order packets (like selective repeat).
It implements a fast retransmit optimisation which is triggered by k duplicate ACKs (k is normally 3): we immediately resend the required packet after the k duplicates.
We fudge the retransmission timeout based on an RTT estimate, which is an exponential average of previous RTTs.
The Karn/Partridge algorithm says to only sample the RTT for original transmissions (not retransmissions) and sets $\alpha$ (the history weight in the exponential averaging) to 0.875.
it sets timeout value = 2 \* Est RTT, and uses exponential backoff when we don't have values to estimate from.
We don't include retransmissions because we could get the ACK for the original packet and then the RTT sample will be really small.
exponential averaging: $\text{EstRTT} = \alpha \times \text{EstRTT} + (1 - \alpha) \times \text{SampleRTT}$
Jacobson/Karels algorithm measures the deviation as well as using exponential averaging then timeout value = Est RTT + 4\*(Est Deviation). This is because deviation is caused by queuing time in routers.
often a linear deviation.
ensures that the tail probability of a spurious retransmission is small.
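Both estimators in one small sketch (the gains 0.875/0.125 and 0.75/0.25 are the classic constants; the class shape is my own):

```python
class RttEstimator:
    """Exponential averaging of RTT samples plus Jacobson/Karels deviation:
    timeout = EstRTT + 4 * EstDev. Per Karn/Partridge, only feed this samples
    from original transmissions, never retransmissions."""
    def __init__(self):
        self.est, self.dev = None, 0.0

    def sample(self, rtt):
        if self.est is None:
            self.est = rtt                       # first sample initialises
        else:
            err = rtt - self.est
            self.est += 0.125 * err              # keeps 0.875 of the old estimate
            self.dev += 0.25 * (abs(err) - self.dev)
        return self.timeout()

    def timeout(self):
        return self.est + 4 * self.dev
```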
Initialising a TCP connection
We have host A wanting to speak to host B; they first have to synchronise ISN's, as ports can get reused and we want to establish a NEW connection.
It takes the form of a three way handshake:
SYN - A sends a packet with the SYN flag and its chosen ISN as the seq #
SYN ACK - B sends a packet with the SYN+ACK flags, its chosen ISN as the seq #, and A's ISN + 1 in the ACK field.
ACK - A sends a packet with the ACK flag, seq # = A's ISN + 1 and ack # = B's ISN + 1
Note that the two ISN's will be DIFFERENT, as we are setting up a channel for FULL duplex communication.
Losing SYN has large effect
We deal with SYN packets getting lost by starting an arbitrary timer (~3 seconds); if we haven't got a SYN ACK by that point, we send again.
Termination of connection
Normal termination, one side at a time
A sends FIN
B sends ACK (A's side of the connection is closed)
B sends FIN
A sends ACK
A could still receive another FIN from B if the ACK is lost, so it stays open a bit; then the connection is closed
Normal termination, both together
A sends FIN
B sends FIN ACK
A sends ACK
Connection closes after a small wait period
Abrupt termination
A sends RST (because the application on A has crashed)
B doesn't need to ACK anything.
But A will continue to send RST any time B sends it a message
as RST is not sent reliably
Advertised window
the receiver indicates its window in its ACK's.
This helps with flow control - adjusting sending rate so we don't overwhelm receiver. The receiver advertises: W = B - (Last Byte Received - Last Byte Read)
W: window size
B: buffer size
Last byte received: RHS of buffer used
Last byte read: last byte of contiguous received
Stuff before this CAN still be in the buffer; it just needs processing
The sender agrees to send no faster than W/RTT; this means it can't overwhelm the receiver, AND the receiver can adjust if things change.
We also do congestion control - adjusting sending rate so we don't overwhelm network
Due to our queuing system at routers we can delay packets and even drop packets when our network is busy. So we can't put too much load on the network or it becomes useless.
TCP solves this with end hosts adjusting their sending rates based on the implicit feedback from the network
We could do a far better job if routers told us the problem.
Jacobson's solution to congestion control
we adapt our congestion window size in relation to network congestion; to do this we have to:
Discover available BW
Adjust to variations in BW
Share BW between flows
This is necessary to solve as:

Basically the knee is the point where throughput is increasing slowly and delay increases fast and the cliff is when throughput plummets and delay goes to infinity.
General approaches:
Send Without care: the market will regulate itself DUDEEEE: leads to lots of lost packets
Reserve BW: pre-arrange BW allocations and enforce them; requires negotiation and gives low utilisation of the wire.
Pricing: who pays wins, packets bid against each other to not be dropped.
Dynamic adjustment: hosts get info about congestion (from network report or inferring) and adjust their level.
also have On-Off flow control: explicit messages from receiver to transmitter saying stop or go.
These message may be embedded in a layer: physical coding, DLL (PAUSE frames), etc.
Appropriate when delay has a constant and known range (bounding).
In TCP: we use dynamic adjustment and our ACTUAL window size is minimum of congestion window size and receiver advertised window size.
Bandwidth Discovery: done with slow start
start with CWND = 1 (packet) then double CWND every time we have an RTT without loss (the whole window comes back okay).
When we get our first loss we have an estimate of available BW.
Adjusting to BW variations: by oscillating around available BW using Additive Increase Multiplicative Decrease.
the window grows by 1 MSS (Maximum Segment Size) for every RTT without loss; on the loss of a packet we halve the CWND.
We switch to this when CWND = ssthresh (slow-start threshold), an initially large number that halves every time we get a loss ⇒ the first few sawteeth may be exponential.
Fast recovery: we grant credit for every duplicate ACK so that halving the window doesn't mean everything grinds to a halt.
We enter fast recovery after 3 duplicate ACKs (at which point we resend the missing packet).
We set CWND = old CWND/2 + 3.
Then every extra duplicate Ack we increment CWND as we are clearly still getting stuff through its just one packet was lost.
We leave fast recovery when we stop getting duplicate ACKs and return to normal
On timeout: we go back to slow start and set CWND to 1,
since we have lost a packet AND we haven't been receiving duplicate ACKs, so the rest of our packets probably aren't getting through!
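A toy simulation of these rules (slow start, AIMD, fast recovery, timeout). The event strings and their sequence are made up, and real stacks track much more state (e.g. window inflation during fast recovery), but the transitions match the description above:

```python
# Toy Jacobson-style congestion control: one CWND update per event.
def step(cwnd, ssthresh, event):
    if event == "ack":                  # a full window came back without loss
        if cwnd < ssthresh:
            return cwnd * 2, ssthresh   # slow start: double per loss-free RTT
        return cwnd + 1, ssthresh       # AIMD: +1 MSS per loss-free RTT
    if event == "3dupack":              # triple duplicate ACK: fast retransmit/recovery
        new_ssthresh = max(cwnd // 2, 2)
        return new_ssthresh + 3, new_ssthresh  # CWND = old CWND/2 + 3 credits
    if event == "timeout":              # nothing is getting through: start over
        return 1, max(cwnd // 2, 2)

cwnd, ssthresh = 1, 64
for ev in ["ack", "ack", "ack", "3dupack", "ack", "timeout", "ack"]:
    cwnd, ssthresh = step(cwnd, ssthresh, ev)
    print(f"{ev:8s} -> cwnd={cwnd:3d} ssthresh={ssthresh}")
```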
🤨 There are a variety of TCP flavours which do congestion control differently (with different constants and techniques) and the fact all of those interact means that analysing congestion is a NIGHTMARE.
Modelling TCP throughput
(Figure: the CWND sawtooth, oscillating between W/2 and W over time.)
Left arrow: Note this increase looks more like steps when zoomed in: <img src="/notes/Computer Networking Notes.assets/Screenshot 2021-01-26 at 19.54.34.webp" alt="Screenshot 2021-01-26 at 19.54.34" style="zoom:25%;" />
Over one sawtooth cycle the window grows linearly from W/2 to W, which takes W/2 RTTs.
The area under one tooth is A = (3/8)W² packets.
A is the total data transferred per cycle; one packet is lost per cycle, so p = 1/A, and doing some maths with this we get:
Throughput = √(3/2) · MSS / (RTT · √p)
p: packet drop rate / probability of packet loss
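Plugging illustrative numbers into the formula:

```python
from math import sqrt

def tcp_throughput(mss, rtt, p):
    """Steady-state TCP throughput ~ sqrt(3/2) * MSS / (RTT * sqrt(p))."""
    return (mss / rtt) * sqrt(3 / (2 * p))

# 1460-byte segments, 100 ms RTT, 1% loss -> roughly 179 KB/s
print(tcp_throughput(mss=1460, rtt=0.1, p=0.01))
```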
Assumptions:
RTT is predictable and stationary. In reality RTT may fluctuate at short time scales, for example due to queueing delay.
The loss probability p is predictable. In reality it can be dynamic, depending on the level of congestion.
All loss indications must be exclusively triple dupACKs. TCP retransmission timeout leads to slow start, deviating from the "sawtooth" pattern.
Only the network is the bottleneck. The receiver may also throttle the rate with Flow Control.
😔
1. We can be misled by non-congestion-related losses (e.g. the receiver crashes)
2. We can fill up queues and just be delayed (during discovery)
3. Short flows may complete before discovering the available capacity.
4. Using AIMD is impractical for high-speed links as we spend AGES building up our speed.
5. Sawtooth discovery is choppy and some apps want a CONSTANT throughput
A hack here is to set your throughput according to the estimate from the previous equation (so we're TCP-friendly), but don't chop about - just send at that one rate.
6. Unfair under different round-trip times (throughput is inversely proportional to RTT)
7. Tight coupling with reliability mechanisms (the window)
8. Endhosts can cheat
😯
1&2 could be solved: if routers tell endpoints of congestion
3-7 solved: if routers tell endpoints what rate to send at
8 solved: if routers enforce fair sharing
Elements of congestion control and why TCP isn't up to par.
Isolation/Fairness: TCP assumes good citizens
Adjustment: TCP is often slow in adjusting due to AIMD
Detecting congestion: TCP uses implicit feedback
Router Assisted Congestion Control
Fair Queuing tries to overcome some of the problems with TCP congestion control by getting routers to do some work!
Basic concept:
is that we compute the time at which the last bit of a packet would have left the router if it were served bit-by-bit, then we serve packets in order of these calculated deadlines.
It is a generalisation of round robin queuing but with packets of varying sizes being handled in a fair manner.

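A simplified sketch of the deadline computation. It assumes all packets are queued at once, so the global virtual clock (which a real fair queue maintains as flows come and go) is approximated as zero:

```python
import heapq

def fair_queue(packets):
    """packets: list of (flow_id, size_in_bits), all assumed to arrive together.
    Returns packets in the order their last bit would leave under
    bit-by-bit round robin."""
    finish = {}   # per-flow virtual finish time of that flow's previous packet
    order = []
    for seq, (flow, size) in enumerate(packets):
        f = finish.get(flow, 0) + size   # deadline of this packet's last bit
        finish[flow] = f
        heapq.heappush(order, (f, seq, flow, size))  # seq breaks ties fairly
    return [heapq.heappop(order)[2:] for _ in range(len(order))]

# Flow A sends two big packets, flow B three small ones: B isn't starved.
print(fair_queue([("A", 1000), ("A", 1000), ("B", 300), ("B", 300), ("B", 300)]))
# [('B', 300), ('B', 300), ('B', 300), ('A', 1000), ('A', 1000)]
```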
😊
Isolation as cheating hosts don't get benefit
Bandwidth share doesn't depend on RTT
Flows can have any congestion control method they want
😔
More bookkeeping/work being done per flow and per packet in routers
undermines the connectionless assumptions of a packet-network such as the Internet (thereby breaking things like alternative path routing)
😯 It doesn't eliminate congestion, it just manages it!
🤨 Saying "this is fair" is difficult - what should count as fair? If every flow gets equal BW, then there is an incentive to open lots of flows to get a greater share, which seems weird, doesn't it...
Explicit Congestion Notification: basically routers can set bits in packets to tell hosts that they're busy right now, then a host can treat that bit as a dropped packet and adjust accordingly.
😊 Don't confuse corruption with congestion (there are other reasons for packet loss)
😊 Can be an early indicator to get hosts to avoid delays
😔 Not used very much for historical reasons - some routers used to throw away packets with ECN turned on.
Application Layer
We often talk about communications getting into the client-server paradigm, where hosts are either:
Servers: always on, with a permanent IP address
Clients: may be intermittently connected with dynamic IP address, which want to communicate with server (clients initialise connection).
DNS
For clients to talk to servers we have a human readable name which needs to be translated to an IP address, this is because we want:
To be able to map one name to multiple IP addresses (load balancing)
To be able to move our website to a new IP address, and still be accessible.
To have multiple names for the same website (.org and a .co.uk)
But we need to be able to translate between the two, so we need a Domain Name System. We manage it as a hierarchical database, where different domains (.uk, .org, ...) are managed by different people, with an unnamed root at the top of the hierarchy commanding the whole thing.
We have Top-Level Domain servers for .co.uk etc. and authoritative DNS servers, which actually do the mapping and sit underneath the TLDs.
We actually have 13 root servers spread out across the world, which are then further replicated by anycasting to give around 100 root server instances.
We use resolution: we don't ask these servers directly.
Iterative resolution: we ask the local DNS server for an IP.

The local DNS server (if it doesn't know) asks the root for the relevant next server to contact, then we ask that next server, and so on, until we reach the authoritative DNS server for the domain, which tells us the IP address.
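A runnable sketch of the iterative loop. The referral and answer tables here are made-up stand-ins for what real root, TLD, and authoritative servers would return; a real resolver would send actual DNS queries (e.g. with the dnspython library) instead:

```python
# Hypothetical referral chain: root -> .org TLD -> authoritative server.
REFERRALS = {
    "root":    {"org.": "tld-org"},
    "tld-org": {"example.org.": "auth-ex"},
}
ANSWERS = {"auth-ex": {"example.org.": "192.0.2.1"}}  # dummy IP (TEST-NET)

def query(server, name):
    """Return (answer, referral); exactly one of the two is set."""
    if server in ANSWERS and name in ANSWERS[server]:
        return ANSWERS[server][name], None
    for suffix, next_server in REFERRALS.get(server, {}).items():
        if name.endswith(suffix):
            return None, next_server
    raise LookupError(name)

def resolve_iteratively(name, server="root"):
    while True:
        answer, referral = query(server, name)
        if answer is not None:
            return answer      # the authoritative server gave us the IP
        server = referral      # follow the referral, one hop at a time

print(resolve_iteratively("example.org."))   # 192.0.2.1
```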
Recursive resolution: means the OTHER DNS servers do the work, so they recursively ask the next DNS server themselves.

😔 puts extra burden on root DNS server that it doesn't want.
😊 Recursive resolution gives a cache advantage across multiple hosts making the same enquiry; iterative does not.
Hybrid resolution: we use recursion INSIDE our organisation, then iterative resolution OUTSIDE.

😊 most used system
Caching: necessary because fresh requests take a long time (increases availability as replica)
DNS servers cache responses to common queries; each cached record has a TTL field and expires after that time, so data does not go stale in our local DNS server.
Some servers also implement negative caches, where they remember 'dud' domain names (especially common misspellings) which take time to fail. But this isn't often implemented.
Generally DNS records do not change frequently, so a fairly large TTL is possible. However, if a record does change, the domain could be inaccessible for up to the TTL, i.e. we must be able to tolerate that much downtime when it changes.
86400 seconds, or one day, is a good tradeoff.
With MX records they do not want any messages lost if their IP changes, so one day of downtime is unacceptable. As a result, the record expires after only 60 seconds, and is thus resolved almost every time an email is sent.
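A minimal TTL-cache sketch of this behaviour (the resolver callback and IP here are dummies):

```python
import time

cache = {}   # name -> (ip, expiry_timestamp)

def lookup(name, resolve, ttl=86400):   # 86400 s = the one-day tradeoff above
    entry = cache.get(name)
    if entry and entry[1] > time.time():
        return entry[0]                 # still fresh: answer from cache
    ip = resolve(name)                  # stale or missing: ask upstream
    cache[name] = (ip, time.time() + ttl)
    return ip

# MX-style record: short TTL, so it is re-resolved almost every time.
print(lookup("example.org", resolve=lambda n: "192.0.2.1", ttl=60))
```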
The nameserver record indicates which DNS server is authoritative for that domain
Reliability
Replication of servers (primary/secondary) - so that a name service (DNS for a domain) always has a server up - this also helps us load balance.
Use special UDP - we build on UDP as we want speed but we add reliability techniques so messages aren't lost.
If a request times out, ask a different server or retry with exponential backoff.
Each version of a query has the same ID - we don't care where the data comes from: all replicas should ALWAYS have the same entries.
Increase availability by having name server on separate routable entities and/or mirrors.
Security: on its own isn't hugely secure
There is no verification of answers - there are many attacks (eg Man in the Middle) which could exploit this
Requires trust of local DNS server - we fundamentally have to trust the local DNS server to be 'a goodie'.
Cache poisoning - local records on a machine can easily be messed with.
DNSSEC: provides message authentication and integrity verification with cryptographic signatures.
Does not:
Provide Authorisation
Provide Confidentiality
Protect against DDOS
We distribute keys via DNS, by building chains of trust from the root downwards, so parents sign the keys of their children.
World Wide Web
Internet: includes email, file transfer etc. - anything can be transmitted.
Web: a set of specific protocols for webpages etc.
We use Hyper Text Markup Language to describe web pages.
We use a **Uniform Resource Locator** to describe the names of pages:
protocol://hostname[:port]/directorypath/resource
protocol: http, ftp, https etc
hostname: DNS name or IP address
port: default to protocol's standard port
HTTP: used for most things on the internet (sending HTML files, JPGs, ...)
We send requests which are generally just headers in plaintext with
Structure: Message type - Host - UserAgent - ...
Where message type:
GET: this is asking for a certain file/bit of data in its entirety
HEAD: the same as GET, but only wants the header (so it can check content size, accessibility, recent modifications etc)
POST: we submit stuff to the server (often from a form or something) to make something new.
PUT: we update or modify some information
Servers send responses:
Structure: A status line - A header - Some data
Where status line: generally contains the protocol, status code and status message. Status codes (a minimal request/response sketch in code follows this list):
1XX: information response
2XX: class of successful responses
200: just means 'ok', generally comes back with complete data requested
3XX: redirection messages about URI being elsewhere
4XX: class of Client Error Messages
400: says that we made a bad request
404: page not found
5XX: server error responses
500: internal server error
502: bad gateway
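A minimal exchange using Python's standard http.client, showing the pieces above (assumes network access to example.com):

```python
import http.client

# One GET over a fresh connection; http.client fills in the required Host
# header for us.
conn = http.client.HTTPConnection("example.com", 80)
conn.request("GET", "/", headers={"User-Agent": "notes-demo"})
resp = conn.getresponse()
print(resp.status, resp.reason)        # status line pieces, e.g. "200 OK"
print(resp.getheader("Content-Type"))  # one of the response headers
body = resp.read()                     # the data that follows the headers
conn.close()
```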
It is stateless by design
😊 Scalability of servers is easy as failure handling/out of order requests don't matter
😔 We need state!
We use cookies: an arbitrary string with some meaning to the server (e.g. DB index)
The server hands out a cookie in its response to the client's first request (Set-Cookie).
Every subsequent client request sends that cookie back, so the server can interpret the cookie to associate the requests with that client and send back the relevant information.
Performance
We note that a webpage needs multiple GET requests as we fetch HTML then images, CSS etc...
This means HTTP/1.0's stop & wait is very slow as each request does:
SYN, SYN ACK, ACK /w GET, Data, ACK
Solutions
Concurrent requests/responses - we open up multiple connections in parallel. We don't necessarily need to preserve ordering because it's stateless, BUT it can screw the network (remember fair sharing??)
Pipelined Requests/responses - we use one TCP connection, but we batch requests (for the five different images or something)
Persistent connections - we maintain TCP connection across multiple requests (including transfers subsequent to HTML such as images/CSS)
| Scorecard | Get n small objects (latency-bound) | Get n large objects (bandwidth-bound) | General |
|---|---|---|---|
| One-at-a-time | 2n · RTT | n(F/B) | SLOW |
| Persistent | (n + 1) · RTT | n(F/B) | Amortises TCP overhead, but closing connections is hard |
| M concurrent | 2(n/m) · RTT | (n/m)(F/B) | Bad for the network |
| Pipelined | 2 · RTT | n(F/B) | Sensible, and takes advantage of TCP slow start |
| Pipelined + Persistent | 2 RTT first time, RTT after that | n(F/B) | Pretty good effort |
Get n small objects: dominated by latency
Get n large objects: dominated by bandwidth
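The arithmetic behind the scorecard as a quick sanity check (these include both the latency and transfer terms; the numbers are illustrative):

```python
def one_at_a_time(n, rtt, F, B): return 2 * n * rtt + n * (F / B)
def persistent(n, rtt, F, B):    return (n + 1) * rtt + n * (F / B)
def concurrent(n, rtt, F, B, m): return 2 * (n / m) * rtt + (n / m) * (F / B)
def pipelined(n, rtt, F, B):     return 2 * rtt + n * (F / B)

# 10 small objects (10 kb each) on a 10 Mb/s link with 50 ms RTT:
args = (10, 0.05, 10_000, 10_000_000)
print(one_at_a_time(*args), persistent(*args),
      concurrent(*args, 5), pipelined(*args))
# 1.01 s, 0.56 s, 0.202 s, 0.11 s -- latency dominates for small objects
```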
Caching
Cache at the client using If-Modified-Since
This reduces network use.
We note that clients often transfer the same information, so we want to reduce server work.
We add an If-Modified-Since header to our GET requests (sketched in code below), so the server replies:
Not Modified (304) if the resource has not changed since the given time
The content (200) if the resource has changed
Because of this we add two fields to response headers:
Expires - tells us how long to cache for
No-cache - tells us to ALWAYS ask server for recent version.
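The conditional-GET sketch, again with http.client (the date is arbitrary; a real cache would replay the stored Last-Modified value):

```python
import http.client

# If-Modified-Since lets the server reply 304 instead of resending the body.
conn = http.client.HTTPConnection("example.com")
conn.request("GET", "/", headers={
    "If-Modified-Since": "Mon, 25 Jan 2021 00:00:00 GMT"})
resp = conn.getresponse()
if resp.status == 304:
    print("Not modified - serve the cached copy")
else:
    print("Changed - cache the fresh copy:", resp.status)
```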
Reverse proxies (run by the content provider) keep a cache of static documents close to the server, and are what clients actually speak to.
This reduces server work.
Forward proxies (run by ISPs) keep a cache of static documents near the clients, in the ISP's network (Netflix!)
This reduces network use and server work.
CDN: Content Distribution Networks (eg Akamai) integrate forward and reverse proxies: we have caches all over the place which are fed by reverse proxies.
We even move some processing to these servers which handles
Dynamic Web Pages
Transcoding
Some Security
We have to add infrastructure all over the place, but it's worth it and makes LOTS of cash.
We can do this with:
DNS entries
multiple machines at many locations
an authoritative DNS server for a page returns the IP address of a CDN server
URL rewriting
rewritten URLs point to specific CDN servers
Akamai
Anycast
multiple servers advertise the same IP address, so routing delivers each client to the most local one.
SPDY to HTTP/2 (from HTTP/1.1)
Binary protocol: so parsing and transport is more efficient and we have less errors.
Multiplexing using interleaving and persistence
Server push: because servers now need to GENERATE pages, we push static content (eg CSS or static JS) first, THEN push the HTML once the page is built.
Header compression: we use gzip or some similar techniques to compress our headers.
Multi-hosting
Multiple sites on one machine: we have multiple server processes with different IPs or ports, and HTTP requests indicate the site name in the (required) Host header.
😊 useful for web hosting companies.
Multiple machines for one site: we take advantage of address translation, so different DNS servers have different entries for some URL.
😊 helps to handle load and reduce latency
at one location: we only need one IP address, then use a load balancer to direct traffic to different machines (with one TCP connection always going to the same machine).
multiple locations: DNS is configured to return 'best' address
QUIC: Quick UDP Internet Connections, it throws away TCP and starts again
Does Reliable Transport over UDP - even though UDP is normally fire and forget
Uses Forward Error Correction (a toy XOR-parity sketch follows this list)
Does default crypto to reduce startup cost of TLS
we haven't spoken about this, but TLS has A LOT of startup cost as lots of things need to be agreed.
Offers restartable connections
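The FEC idea in miniature: one parity packet per group lets the receiver rebuild any single lost packet without a retransmission. (QUIC experimented with, and later dropped, an XOR-parity scheme along these lines; this shows only the principle, with equal-length dummy packets.)

```python
def parity(packets):
    """XOR all packets together byte-by-byte (all must be the same length)."""
    p = bytes(len(packets[0]))
    for pkt in packets:
        p = bytes(a ^ b for a, b in zip(p, pkt))
    return p

group = [b"pkt1", b"pkt2", b"pkt3"]
fec = parity(group)               # transmitted alongside the group

# Suppose pkt2 is lost in transit: XOR everything that DID arrive to recover it.
recovered = parity([group[0], group[2], fec])
print(recovered)                  # b'pkt2'
```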
Voice over IP
So up to this point we've basically just spoken about getting web pages, but we also want to send other data - VOICE/VIDEO 😳
In comes Session Initiation Protocol (that's right, we have sessions - no longer stateless) to allow telephoning over the internet:
User location
User availability
User capabilities
Session setup
Session management
It is just an application-layer protocol, nothing wild; it uses status codes and so on, similar to HTTP.
The key difference is we need resource reservation so we can be sure packets will reach destination fast enough.
This DOESN'T actually happen over wider internet, but it does still exist in local networks where we control the infrastructure so we guarantee reliability.
The other (more common) solution is to just use HTTP and accept some packets are dropped (losing one frame isn't the end of the world).
Peer to peer networking: there are some situations where we don't have an always on server, we just have peers with intermittent connection and changing IP addresses. We have three main situations for this.
File distribution (eg BitTorrent or BBC iPlayer 1.0)
The basic concept is that if our one server is sending N copies of a file to N machines the server does a lot of work.
Instead P2P has the server sending one copy and clients share the files around each other.
Time to distribute a file of size F to N clients:
u_s: upload speed of the server
d_i: download speed of peer i
u_i: upload speed of peer i
Server-client: D_cs ≥ max{ N·F/u_s , F/min_i(d_i) }
1: the server must upload N copies, so its output speed onto the network is limiting.
2: the slowest client download speed is limiting.
P2P: D_p2p ≥ max{ F/u_s , F/min_i(d_i) , N·F/(u_s + Σ_i u_i) }
1: the server must upload the file at least once.
2: the slowest client download speed is limiting.
3: the total number of bits downloaded across the network (N·F) divided by the fastest possible aggregate upload rate (server + all clients).
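These bounds in code (the speeds are illustrative):

```python
def d_client_server(F, u_s, d):        # d: list of client download speeds
    return max(len(d) * F / u_s, F / min(d))

def d_p2p(F, u_s, d, u):               # u: list of client upload speeds
    N = len(d)
    return max(F / u_s, F / min(d), N * F / (u_s + sum(u)))

F, u_s = 8e9, 10e6                     # 1 GB file, 10 Mb/s server uplink
d = [5e6] * 100                        # 100 peers, 5 Mb/s down each...
u = [1e6] * 100                        # ...and 1 Mb/s up each
print(d_client_server(F, u_s, d) / 3600)   # ~22.2 hours
print(d_p2p(F, u_s, d, u) / 3600)          # ~2.0 hours
```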
BitTorrent
We have a tracker server, which tracks the peers participating in a torrent.
It gives the list of peers to one of the clients
The torrent is the group of peers exchanging chunks (256KB pieces) of the file.
We pull chunks by asking our neighbours for their chunk lists and then asking for the chunks we're missing (see the sketch below).
We then have tit-for-tat: neighbours send us chunks WHILE they get chunks from us.
If we give them less, they give us less.
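A sketch of chunk selection. The real client prefers the rarest chunk first so that rare pieces don't disappear from the swarm; this mimics that heuristic with made-up data:

```python
from collections import Counter

def next_chunk(have, neighbour_lists):
    """Pick a chunk we're missing, rarest among neighbours first."""
    counts = Counter(c for chunks in neighbour_lists for c in chunks)
    missing = [c for c in counts if c not in have]
    if not missing:
        return None
    # rarest first; break ties by lowest chunk id for determinism
    return min(missing, key=lambda c: (counts[c], c))

have = {0, 1}
neighbours = [{1, 2, 3}, {2, 3, 4}, {3, 4}]
print(next_chunk(have, neighbours))   # 2 -- only two neighbours hold it
```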
We use a Distributed Hash Table to keep track of peers with a decentralised DB. This has its complexities, but basically each peer is only aware of its two successors (in a circular DHT), which lets us handle peer churn and find the appropriate data.
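A tiny circular-DHT lookup sketch (peer IDs are made up). Each hop uses only knowledge of a peer's successor, walking the ring until the key falls in the right arc:

```python
peers = sorted([3, 8, 15, 24])     # peer IDs placed around the ring

def successor(pid):
    return peers[(peers.index(pid) + 1) % len(peers)]

def lookup(key, start):
    cur = start
    while True:
        nxt = successor(cur)
        # key belongs to nxt if it sits in the half-open arc (cur, nxt],
        # including the wrap-around past the highest ID back to the lowest.
        if (cur < key <= nxt) or (nxt < cur and (key > cur or key <= nxt)):
            return nxt
        cur = nxt

print(lookup(11, start=3))   # 15 owns key 11
print(lookup(30, start=3))   # wraps around the ring: 3 owns key 30
```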
Skype: using peers as relays
We have a Skype login server to handle authentication, but comms use a hierarchical overlay with Super Nodes.
A super node is any computer which is routable and has good network connectivity.
We need them primarily due to NAT networks: as an outside peer cannot initiate a call to a peer inside the NAT (as it is not routable).
So the super nodes choose a relay node, which both peers send data to, and which forwards the info on.
The reason we need super nodes is to reduce server work of trying to direct calls. It is a nifty little use of P2P.