IntelligentMirror Gets Even More Intelligent (1.0.1)

Warning : This version of IntelligentMirror is compatible with only squid-2.7 as of now. It is NOT compatible even with squid-3.0.

IntelligentMirror Version 1.0.1

I have been following squid development regularly (at least the part in which I am interested) and they have introduced a new directive in squid-2.7 known as StoreUrlRewrite (storeurl_rewrite_program). Using this directive you can instruct squid to cache url A (http://abc.com/foo/bar/version/crap.rpm) as url B (http://proxy.fedora.co.in/intelligentmirror/crap.rpm). In simple words you can direct squid to cache any url as any other url without any extra efforts.

So keeping the above directive in mind, I have worked out a different version of intelligentmirror especially for squid-2.7.

IntelligentMirror : Old method of operation

  1. IntelligentMirror gets a client request for a URL.
  2. Check: if URL is not in (RPM, metadata file)
    • Then its none of our business.
    • Let proxy handle it the normal way.
    • Done and exit.
  3. Check: if RPM/metadata is available in cache
    • Stream the RPM/metadata from cache.
    • Done and exit.
  4. Check: if RPM/metadata is not available in cache
    • Download in parallel for caching in some dir and stream.
    • Done and exit.

IntelligentMirror : New method of operation

  1. IntelligentMirror gets a client request for a URL.
  2. Check: if request for rpm
    1. Direct squid to cache the request as http://<same_host_all_the_time>/intelligentmirror/<rpmname>.rpm
  3. Check: if request for deb
    1. Direct squid to cache the request as http://<same_host_all_the_time>/intelligentmirror/<debname>.deb
  4. Done and exit.

So your squid will see every request for an rpm package as a request http://<same_host_all_the_time>/intelligentmirror/<rpmname>.rpm. So, if you happen to request the same rpm from a different mirror, it’ll still be served from cache 🙂

Improvements

  1. No need to check if the url supplied by squid is for rpm or not because storeurl_rewrite_program has an acl controller attached which will invoke intelligentmirror for urls ending in .rpm .
  2. No need to check if the url is already cached or not. No need to worry about the directory where you are going to store the packages. No human intervention is needed in maintaining the cache. Almighty squid is doing everything for us.
  3. No need to worry if the target package has changed because of the resigning or whatever because squid will do that for you.
  4. No need to actually download the package in parallel for caching because squid is already doing that.
  5. No need to worry about the hashing algorithms and storage optimizations for the cached content.

Availability

  1. RPM for Fedora/Red Hat
  2. Source RPM for Fedora/Red Hat
  3. Source Tarball

Install and Configure

The install and configure files should be enough to guide you through the installation if you choose the tar ball way. Otherwise you can always install from rpm from the above link.

Note1: You have to configure your squid to use intelligentmirror as a plugin even if you install via rpm. Check the configure file at the above link.

Note2: StoreUrlRewrite will probably be available in squid-3.1.

 

IntelligentMirror: RPM and DEB Caching Improved (0.4)

IntelligentMirror version 0.4 is available now. There have been significant improvements in intelligent mirror since last release.

Improvements

  1. Fixed defunct process problem. You will not see defunct python processes hanging around anymore. Previously every forked daemon used to got defucnt because parent never waited for the forked child to finish.
  2. IntelligentMirror now supports caching of Debian packages just like rpms. So now IntelligentMirror is best suited shared environments where people have different tastes.
  3. Intelligent Mirror now uses url_rewrite_program instead of redirect_program. This boosts the efficiency of IntelligentMirror by a significant factor as url_rewrite_program has an acl controller url_rewrite_access. And using url_rewrite_access only requests for rpm/deb packages will be passed to Intelligent Mirror. So, IM now need not process each and every incoming request. Also, it has redirector_bypass directive which will bypass IM in case all the instances of IM are busy serving requests. So, squid will not die with a fatal error in case of huge requests.
  4. Options to enable/disable caching for rpm and Debian packages have been added.
  5. Options to control the total size of caching directories and the size of individual package to be cached have also been introduced.
  6. Proxy authentication is also supported now just the way it is supported in yum.
  7. Packages are not checked for last-modified time anymore. Because in principle two rpms A and B can only have same name iff they have the same contents. So, the delay in response time in case of hits has reduced.

Availability

  1. RPMs for Fedora/Red Hat
  2. Source RPMs for Fedora/Red Hat
  3. Source Tar balls

Installation and configuration is easy and the INSTALL and README files should serve the purpose.

In case you have any suggestions or problems, leave a comment here or file a ticket on project page.

 

IntelligentMirror: Available for Testing

Note : A newer version of intelligentmirror is available now. Please check this.

Intelligent Mirror is basically a tool or squid plugin (redirector) to cache rpm packages so that the subsequent requests for the same package can be served from the local cache which will eventually save a lot of bandwidth and downloading time.

Who needs Intelligent Mirror?

  1. If you are on a shared network where a lot of people use linux distros with RPM as their package manager, then you need this. Universities should come under this category.
  2. If you have a set of systems having red hat derivatives and almost identical OS versions, you need this. LAN setups at home should come under this category.
  3. If you can’t afford to or don’t want to mirror entire fedora repo for local access due to bandwidth limitations, you need this.

What it does?

As described above, Intelligent Mirror, just caches rpms which are requested by the clients in a shared network. And subsequent requests for those rpms are served from the cache. For a detailed description, check the project page.

Why not use Squid in caching mode?

Squid caching is based on url hashing. Let me explain with an example how Intelligent Mirror is actually intelligent as compared to squid while caching rpms.

Let us say there is an rpm yum-3.2.0-1.fc7.i386.rpm . You executed “yum update yum“. And let us say the newer version of yum is yum-3.2.18-1.fc9.i386.rpm which was fetched from one of the fedora mirrors http://abc.com/ (say). Now someone on the same network launched “yum update yum” and he got the same rpm yum-3.2.18-1.fc9.i386.rpm. But this time rpm was fetched from another mirror http://xyz.com/ (say).

Case I : Squid caching

Squid will cache http://abc.com/linux/fc9/updates/i386/yum-3.2.18-1.fc9.i386.rpm . And when http://xyz.com/linux/fc9/updates/i386/yum-3.2.18-1.fc9.i386.rpm will be requested, it’ll result in a cache miss and squid will again download the same package and will cache this one as well. Now there are two problems

  1. Squid is not able to serve from the cache, though the package was the same.
  2. Additional storage space is being wasted in caching the same package. And this can really harm if unluckily a different mirror is picked in all the subsequent queries.

Case II : IntelligentMirror caching

Intelligent Mirror will cache the package yum-3.2.18-1.fc9.i386.rpm without bothering about its origin. And even if yum picks up a different mirror for the subsequent request, the package will be served from the cache and will not be fetched from upstream. So, the obvious advantage of saving the bandwidth and downloading time.

Download

Intelligent Mirror source tarball, rpm, source rpm are available for download from here.

Installing and Configuring Intelligent Mirror

Install Guide

Configuration Guide

Issues and Suggestions

If you see any issue or you have any suggestions for improving the functionality, either mail me at kulbirsaini25 AT GMAIL DoT COM or file a ticket on the project page.

 

How To: Configure Squid Proxy Server

Mission

To configure squid for simple proxying without caching anything.

Use Cases

  1. When you want to have control on what people browse on your lan.
  2. When number of machine is more than the number of IP addresses you can afford to buy.
  3. When you want to help this holy world in saving some IPV4 addresses 😛

Assumptions

  1. You have a machine connected directly to internet that you are going to use as a proxy server for other machines on your network.
  2. The machines on your network are using 192.168.0.0/16 as private address space. You can use anyone/multiple address spaces of the available but for this howto we assume 192.168.0.0/16 as the local network.
  3. The local IP address of the machine which will run squid proxy server is 192.168.36.204. You can have any IP, but for this howto we assume this.

How to proceed

First of all ensure that you have squid installed. After installing squid, you need to set access control in squid configuration file which resides in /etc/squid by default. Open /etc/squid/squid.conf and add/edit following lines according to your preferences. Few lines already exist in the configuration file, you can add the rest.

# The port on which squid will listen for requests
http_port 8080
# If 'cgi-bin' or '?' is in query, squid should not check with neighbours'/parents' cache
# and should go to target web-server.
hierarchy_stoplist cgi-bin ?
# If url contains 'cgi-bin' or '?', then it must not be cached
acl QUERY urlpath_regex cgi-bin \?
cache deny QUERY
acl apache rep_header Server ^Apache
broken_vary_encoding allow apache
# Absolute path to squid access log.
access_log /var/log/squid/access.log squid
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern .               0       20%     4320
# Access control list to control every IP address
acl all src 0.0.0.0/0.0.0.0
# Access control list for source machine in LAN
acl lan_src src 192.168.0.0/16
# Access control list for destination machine in LAN
acl lan_dst dst 192.168.0.0/16
# Access control list to manage squid cache
acl manager proto cache_object
# Access control list to define IP address allowed for source localhost
acl localhost src 127.0.0.1/255.255.255.255
# Access control list to define IP addresses allowed for localhost as destination
acl to_localhost dst 127.0.0.0/8
# Access control list to define Safe ports that should be allowed by default
acl SSL_ports port 443 563 1863 5190 5222 5050 6667
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443         # https
acl Safe_ports port 70          # gopher
acl Safe_ports port 210         # wais
acl Safe_ports port 1025-65535  # unregistered ports
acl Safe_ports port 280         # http-mgmt
acl Safe_ports port 488         # gss-http
acl Safe_ports port 591         # filemaker
acl Safe_ports port 777         # multiling http
acl CONNECT method CONNECT
# Allow cache management only from localhost
http_access allow manager localhost
# Deny cache management from remote hosts
http_access deny manager
# Deny http access via all the ports which are not listed as safe
http_access deny !Safe_ports
# Deny all connections via all ports which are not listed as safe
http_access deny CONNECT !SSL_ports
# Allow http access from localhost
http_access allow localhost
# Allow http access from machines on LAN
http_access allow lan_src
http_access deny all
http_reply_access allow all
icp_access allow all
# Deny caching for everyone so that there is not caching at all
cache deny all
coredump_dir /var/spool/squid
# Never allow direct connection to machines on the internet
prefer_direct off
never_direct allow all
# Allow direct connetion if the destination machine is on LAN
always_direct allow lan_dst
# Delete this line if you don't have /etc/hosts file
hosts_file /etc/hosts
# Allow AIM connections
# Delete the following 9 lines if you don't want people to connect to AIM
acl AIM_ports port 5190 9898 6667
acl AIM_domains dstdomain .oscar.aol.com .blue.aol.com .freenode.net
acl AIM_domains dstdomain .messaging.aol.com .aim.com
acl AIM_hosts dstdomain login.oscar.aol.com login.glogin.messaging.aol.com toc.oscar.aol.com irc.freenode.net
acl AIM_nets dst 64.12.0.0/255.255.0.0
acl AIM_methods method CONNECT
http_access allow AIM_methods AIM_ports AIM_nets
http_access allow AIM_methods AIM_ports AIM_hosts
http_access allow AIM_methods AIM_ports AIM_domains
# Allow connections to Yahoo Messenger
# Delete the following 6 lines if you don't want people to connect to Yahoo Messenger
acl YIM_ports port 5050
acl YIM_domains dstdomain .yahoo.com .yahoo.co.jp
acl YIM_hosts dstdomain scs.msg.yahoo.com cs.yahoo.co.jp
acl YIM_methods method CONNECT
http_access allow YIM_methods YIM_ports YIM_hosts
http_access allow YIM_methods YIM_ports YIM_domains
# Allow connections to Google Talk
# Delete the following 6 lines if you don't want people to connect to Google Talk
acl GTALK_ports port 5222 5050
acl GTALK_domains dstdomain .google.com
acl GTALK_hosts dstdomain talk.google.com
acl GTALK_methods method CONNECT
http_access allow GTALK_methods GTALK_ports GTALK_hosts
http_access allow GTALK_methods GTALK_ports GTALK_domains
# Allow connections to MSN
# Delete the following 6 lines if you don't want people to connect to Google Talk
acl MSN_ports port 1863 443 1503
acl MSN_domains dstdomain .microsoft.com .hotmail.com .live.com .msft.net .msn.com .passport.com
acl MSN_hosts dstdomain messenger.hotmail.com
acl MSN_nets dst 207.46.111.0/255.255.255.0
acl MSN_methods method CONNECT
http_access allow MSN_methods MSN_ports MSN_hosts

Now, start the squid proxy server as

service squid start

Also, if you want squid to be started every time you boot the machine, execute the following command

chkconfig --level 345 squid on

You have a squid proxy server running now. You can ask clients to configure there browsers to use 192.168.36.204 as a proxy server with 8080 as proxy port. Command line utilities like elinks, lynx, yum, wget etc. can be asked to use proxy by exporting http_proxy variable as below. Users can also add these lines to ~/.bashrc file to avoid exporting every-time.

export http_proxy='http://192.168.36.204:8080'
export ftp_proxy='http://192.168.36.204:8080'

I highly recommend the book “Squid Proxy Server 3.1: Beginner’s Guide (Paperback)” for further reading.

 

How To: Configure Hierarchicy of Proxy Servers (Squid)

Yesterday I came across this idea of caching all the data that I browse on my hard disk so that the average load time of a website decreases. Actually the idea is I’ll cache all the static data that I browse like images, static html pages, CSS files and similar things which does not change frequently and can be served from the cache. But while setting up the proxy server on my machine, I faced the problem that my machine which is going to act as a proxy server is behind my institute’s proxy. So, a simple caching proxy server can’t serve my needs and I have to really figure out how to setup a hierarchical proxy server. Below we’ll see how to setup a hierarchical proxy server.

Approach

When I thought of setting up a caching proxy server, squid immediately struck my mind. Actually I don’t know about any other proxy servers. I never setup proxy server before this ( I tried a lot of time, but in vain). So, I started googling about squid setup. There were a lot of tutorials, but either they were too small to get things going or they were too verbose that I couldn’t manage to read them. So, I directly jump into squid configuration file squid.conf . And with references from here and there, I managed to setup the proxy server successfully.

Note: The configurations below worked on Fedora 7 with squid 2.6STABLE16. The same configurations may work with other squid versions and on other operating systems as well, but try them at your own risk.

Part 1 : Setting up simple proxy server with squid

Setting up a very simple and usable proxy server is really easy. You need to add/edit only 2-3 lines /etc/squid/squid.conf to get started.

Add your ip to the access list.

1
2
3
acl myip src 172.17.8.175 #<your_ip_which_will_use_the_proxy_server> (e.g. )
http_access allow myip
http_port 8080 #<http_proxy_port> (this is 3128 by default. you can set it to anything you like. e.g. 8080)

Save the squid.conf file. Then issue these commands.

1
2
[root@localhost squid]# squid -z [Enter] (as root) (This needs to be executed only once.)
[root@localhost squid]# service squid start [Enter] (as root)

If you want to start the squid server on boot, issue this command.

[root@localhost squid]# chkconfig --level 345 squid on [Enter] (as root)

Now, your machine is a proxy server. You can setup your browser to use the machine as a proxy server.

Conditions

The proxy server will work only if your machine has a public IP and is directly connected to internet.

Part 2: Setting up a hierarchical caching proxy server with squid

The above setup works fine if a machine is directly connected to internet. But my machine itself is behind a proxy, so setting up a proxy on my machine is of no use unless the proxy on my machine uses the institute proxy for connecting to internet. So, here we jump into squid.conf again and this time we have to really do some brain storming. If you are a newbie to Linux and don’t know how to make a system work when nothing seems to help, you will probably be better off by using institute’s proxy.

Here is the scenario.

1
2
3
4
5
6
7
8
9
10
11
12
13
1. Your browser sends a content request to proxy on your machine.
2. Check: if a cache HIT from institute proxy cache (HIT means content was found in cache)
	2a. Check: if content is older than the original upstream content
		2aa. Fetch content from upstream and serve the client
	2b. else
		2ba. Serve the content from the cache
3. Check: if cache HIT from proxy on your machine
	3a. Check: if content is older than the original upstream content
		3aa. Fetch content from upstream and serve the client
	3b. else
		3ba. Serve the content from the cache
4. Cache MISS from both the proxies
	4a. Fetch the content from upstream and serve the client

The above method of operation is very basic and is my understanding of squid. It may not be the exact squid behavior.

Now, lets see the configurations needed for setting up the hierarchical caching proxy server with squid.

Assumptions

I assume that we already have squid setup at institute’s proxy whether in caching mode or not. The best way to add/edit the following lines in your squid.conf is to search for particular parameter and then edit the value to set as given.

I also assume that you have simple proxy server setup on your machine and now we want to make it act as child proxy of the institute’s proxy.

Configuration

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
# Your local machine will act as a sibling proxy
cache_peer 172.17.8.175 sibling 3128 3130 no-query weight=10
# The institute's proxy server will act as a parent proxy
# 'default' mean the last-resort
cache_peer 192.168.36.204 parent 8080 3130 no-query proxy-only no-digest default
# allow accessing peer cache for access list 'myip'
cache_peer_access 172.17.8.175 allow myip
# Don't cache dynamic content
hierarchy_stoplist cgi-bin ?
acl QUERY urlpath_regex cgi-bin \?
cache deny QUERY
# Size of main memory to be used for caching
cache_mem 200 MB
# max size of content to be stored in main memory
maximum_object_size_in_memory 7000 KB
# policy for cache replacement if memory is full
cache_replacement_policy heap LFUDA
# the directory to be used for storing cache on your hdd
cache_dir aufs /var/spool/squid 200 16 256
# max file descriptor open at a time .. 0(unlimited)
max_open_disk_fds 0
# min object size to cache on hdd
minimum_object_size 0 KB
# max object size to cache on hdd
maximum_object_size 16384 KB
# access log
access_log /var/log/squid/access.log squid
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern .               0       20%     4320
store_avg_object_size 20 KB
acl apache rep_header Server ^Apache
broken_vary_encoding allow apache
refresh_stale_hit 5 seconds
acl SSL_ports port 443 563 1863 5190 5222 5050 6667
# Allow AIM protocols
acl AIM_ports port 5190 9898 6667
acl AIM_domains dstdomain .oscar.aol.com .blue.aol.com .freenode.net
acl AIM_domains dstdomain .messaging.aol.com .aim.com
acl AIM_hosts dstdomain login.oscar.aol.com login.glogin.messaging.aol.com toc.oscar.aol.com irc.freenode.net
acl AIM_nets dst 64.12.0.0/255.255.0.0
acl AIM_methods method CONNECT
http_access allow AIM_methods AIM_ports AIM_nets
http_access allow AIM_methods AIM_ports AIM_hosts
http_access allow AIM_methods AIM_ports AIM_domains
# Allow Yahoo Messenger
acl YIM_ports port 5050
acl YIM_domains dstdomain .yahoo.com .yahoo.co.jp
acl YIM_hosts dstdomain scs.msg.yahoo.com cs.yahoo.co.jp
acl YIM_methods method CONNECT
http_access allow YIM_methods YIM_ports YIM_hosts
http_access allow YIM_methods YIM_ports YIM_domains
# Allow GTalk
acl GTALK_ports port 5222 5050
acl GTALK_domains dstdomain .google.com
acl GTALK_hosts dstdomain talk.google.com
acl GTALK_methods method CONNECT
http_access allow GTALK_methods GTALK_ports GTALK_hosts
http_access allow GTALK_methods GTALK_ports GTALK_domains
# Allow MSN
acl MSN_ports port 1863 443 1503
acl MSN_domains dstdomain .microsoft.com .hotmail.com .live.com .msft.net .msn.com .passport.com
acl MSN_hosts dstdomain messenger.hotmail.com
acl MSN_nets dst 207.46.111.0/255.255.255.0
acl MSN_methods method CONNECT
http_access allow MSN_methods MSN_ports MSN_hosts
# Turn this off if hierarchical behavior is needed
nonhierarchical_direct off
never_direct deny myip
hosts_file /etc/hosts
coredump_dir /var/spool/squid

That’s the minimal configuration you need for running squid in hierarchical way. Save the squid.conf file and start/restart/reload the squid service. Setup your browser to use your machine as proxy and while using it’ll cache all the static content. You should experience some reduction in average page load time.

Advantages

I am currently using squid in above configuration. And its turning out to be nice for me. I am browsing websites faster and saving a chunk of bandwidth for my institute.

Disadvantages

Introduction of another proxy server increases the latency for dynamic content.

Notice

The above configurations and views are a result of my understanding of squid. If you feel this may break your system or it may have adverse effects, don’t use them. At least don’t use these on a production system.