How to copy files between Unix systems

Sun co-founder John Gage coined the phrase "The Network is the Computer," which later became a Sun slogan. Taken literally it may not hold up, but he is basically correct: the true power of a computer is not realized until it is connected to something. MP3 players, for example, are little computers that are networked to your computer just long enough to load music onto them. One useful and basic way to make computers cooperate is to copy files between them. Sometimes it's most effective to just do this with a disk, but if the machines share a network, copying the files that way is usually easiest, if not always fastest.

Depending on your situation, you might choose any of the following methods (or anything you invent on the spot) to copy files between systems.

Sneaker Net

The original way to copy files between computer systems was via "sneaker net", or by hand-carrying. Originally this involved punch cards or paper tape (punch cards were at first punched by hand and used only to get data into computers; the computer-driven card punch was an early and significant development in computing), but today we have a plethora of options, including floppy discs; magnetic tape, from the olden reel-to-reel 1" tape days to the modern optically-tracked DLTs; optical discs of various types, including CD-ROM, DVD-ROM, WORM, MO, and Blu-ray; hard disk distribution, including USB hard drives and even whole RAID sets; and flash storage devices connected via USB, PCI, ATA, SATA, or nearly any other bus you can come up with.

Compression

Data compression is often used to get more data onto a storage device, and of course it's also useful while copying files over the network. Tape drives often include hardware to perform compression, but in almost every case it is possible to get better results by doing software compression; hardware compression was much more useful back when minicomputers had 25MHz processors and the like. Compression basically comes in two forms: solid archives, where the entire archive must be decompressed (either fully or on the fly) before you can extract files; and archives in which each file is compressed separately and entered into an index.
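
To make the distinction concrete, here's a preview using tools covered below (the archive and file names are placeholders, and the z option assumes GNU tar). Extracting one file from a solid gzipped tar means decompressing the stream from the beginning to find it; a ZIP archive's index lets the extractor seek straight to the file:

tar xzf solid.tar.gz directory/file
unzip indexed.zip directory/file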

tar

tar (or Tape ARchive) is not a compressor; it is an archiver. It was originally meant for putting files on a tape and being able to extract them individually later, but it can also write the archive to an ordinary file instead. It is a de facto standard on Unix systems, although these days you have the freedom to use ZIP, RAR, or other archivers which also compress; more on that later. tar has a lot of very nifty options that allow you to choose which files to include or exclude, where to put them, and so on, but I'm just going to show you the simplest example for now:

tar cf myfile.tar directory

This command will put the directory named directory (and everything in it) into a tar file called myfile.tar; 'c' means create and 'f' means file. Options to tar can be specified right after the command and may include a leading dash (-) or not:

tar -tvf myfile.tar

The above example will produce a verbose listing of the archive (t for table of contents, v for verbose, f for file; see the manpage for tar.) The below will extract it silently:

tar xf myfile.tar
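
As a taste of the include/exclude options mentioned above, GNU tar (an assumption; legacy tar implementations may lack this) can skip files matching a pattern while creating an archive:

tar cf myfile.tar --exclude='*.o' directory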

pack, compress, gzip, and bzip2

An 'archive' file is a collection of files stuck together into one file for convenience. A tar file is not compressed, but you can compress it with one of the above utilities. This can be achieved multiple ways. The simplest, most straightforward, and most disk-hungry option is to tar first and then (for example) gzip:

gzip myfile.tar

This will produce a file called myfile.tar.gz in the current directory; myfile.tar is processed, and a compressed file written. Once this is successful, the original file is removed. Unix originally had the 'pack' command, which produced '.z' archives. Later there was 'compress', which produces '.Z' archives (note the capital letter; filenames are case-sensitive on Unix.) Further down the road we got the 'gzip' (.gz) and 'bzip2' (.bz2) compressors. While there are functional and historical differences between them, all you generally need to know at this point is that to take them apart we use the command with 'un' prefixed (unpack, uncompress) or added into the middle (gunzip, bunzip2):

gunzip myfile.tar.gz

This command would decompress the "gzipped" tar archive, leaving behind the tar file myfile.tar. We can also do this with the command gzip -d which means decompress. The binary gunzip is either a symbolic link to gzip or a stub launcher binary which runs gzip for you; gzip knows whether it should compress or decompress based on the name under which it has been run.
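
bzip2 and bunzip2 work the same way:

bzip2 myfile.tar
bunzip2 myfile.tar.bz2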

Now, if you have GNU tar you can append the 'z' or 'j' options to automatically g(un)zip or b(un)zip2 while you create or disassemble a tar file. However, if you are on a Legacy UNIX system which has a more traditional implementation of tar, you can still do the following to decompress a gzipped tar in one step:

gzip -dc myfile.tar.gz | tar xvf -

In this example, we decompress myfile.tar.gz to STDOUT and then pipe it to tar. Tar is run with the options to extract, be verbose, and extract from a file; the filename is specified as - which means "read from STDIN". Don't be too worried about all this STDIN and STDOUT stuff; just know that when you see output from a command it's normally being written to STDOUT, and you can "redirect" this to another program or to a file. If you redirect that output to a program, then it reads the information from STDIN. You can also create compressed archives this way:

tar cvf - directory | gzip > myfile.tar.gz

Here we create a tape archive, write it to STDOUT, and then pipe that to gzip; gzip will happily compress anything we feed it on STDIN, and when we do, it automatically writes its output to STDOUT. Here, we "redirect" that output into a file with the greater-than symbol.
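
Incidentally, with GNU tar the 'z' option mentioned above does this piping for you, so each of the last two pipelines collapses into a single command:

tar xzvf myfile.tar.gz
tar czvf myfile.tar.gz directory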

As an aside, you can also use this functionality to use tar to copy whole directories, which it tends to do much faster than a recursive 'cp' command:

tar cf - directory | tar -C output_directory -xf -

The -C flag's argument tells the second tar to change into output_directory before writing out the files.

zip, rar, and friends

In the PC world, the standard format for compressing files is PKZIP, invented by the late, great Phil Katz. One of the earliest formats around (although predated by ARC, ZOO, and LHarc), it offered ease of use and a low $25 shareware registration fee (and would never disable itself if you didn't pay.) The combination made it an instant winner, and the whole world is now familiar with ZIP files. The Info-ZIP utilities provide free, full PKZIP functionality for Unix systems including Linux, and of course they work on Windows as well. Zipping a directory is fairly easy:

zip -r file.zip directory

The -r flag tells zip to scan recursively so that it will store whole directories (otherwise only an empty directory entry is added to the zip file.) We also have other options; for example there's RAR, developed by Eugene Roshal. RAR provides substantially better compression than ZIP, although it is by no means the best on the block. RAR uses a more tar-like syntax:

rar a file.rar directory

The a command means to "add files to archive". Since the archive does not exist, it will be created.
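
For completeness, taking these apart again is just as easy, assuming the Info-ZIP unzip utility and the unrar tool are installed:

unzip file.zip
unrar x file.rar

The x command tells unrar to extract with full paths.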

Loading files onto disks

When you write files to a disk, you can choose to use a filesystem or not. Sometimes the easiest way to transport the files is just to leave them on the hard disk, and to take the disk pack someplace and plug it in. So long as the destination system has the filesystem driver needed to actually mount and read the disk volume, this approach is fast and easy (although it does not provide for any compression and has other failings which should be obvious.) This is one reason we use archiver programs; reading (or writing!) an archive is simpler than reading a filesystem.

Hard Disk or Flash Disk

If you want to write a file to a disk without using a filesystem, you can do this trivially on Unix, because part of the Unix metaphor is that "everything is a file". While this breaks down in numerous places, it works fine here. If I have a flash device on /dev/sdc (the first partition would be /dev/sdc1), I can write a file to the entire volume like so:

cat filename > /dev/sdc

This actually fails more often than you would think, because most storage devices are actually "block devices" which must be written a full block at a time. The size of a block varies, but it's usually somewhere around 512 bytes or 2 kibibytes (2048 bytes, or two to the eleventh power.) We can instead use dd, which takes such things into account:

dd if=filename of=/dev/sdc

dd has lots of other options, most of which are useful when you're trying to convert a file (for instance, EBCDIC to ASCII.) They're not so interesting here. The problem with using this approach (however you get the file onto the volume) is that if the file is smaller than the volume, you're going to have to know how much smaller it is to get the original file back faithfully.
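
As a hedged sketch of what that recovery might look like, assuming you noted the file's exact size before writing it (the 216080 below is a made-up placeholder for that size):

dd if=/dev/sdc bs=1 count=216080 of=filename

A block size of 1 is slow but exact; a larger bs with a matching count works when the size divides evenly. You could, however, also tar to a volume: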

tar cvf /dev/sdc directory

Note that you will usually have to specify the 'f' flag to tar even when writing to a tape. There are rare exceptions, but as far as I know they all apply to truly ancient tape drives, which needed to be given control instructions.
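
For example, on a Linux system (an assumption; tape device names vary between Unixes) the first SCSI tape drive typically appears as /dev/st0, so writing a tape looks just like writing the flash device above:

tar cvf /dev/st0 directory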

Optical Disc

While WORM and MO drives may either behave like a hard disk or require special software, basically all CD and DVD burners require the use of "CD burning" software to create a readable disc. CDs and DVDs generally come in one of two formats, ISO9660 or UDF; you must build one of these formats and then use special software commands to write it to the disc. While a number of programs will do this on the fly (including burn4free and Nero Linux), you can do it yourself with the "cdrecord" tool and "mkisofs" (or, more recently, genisoimage.) It's also possible to make an ISO image instead of a tape archive or whatever; the advantage is that you can burn it to a CD or DVD and stick it in your drive to read it, if you so desire. Here's an extremely simple example:

mkisofs -o image.iso directory
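
Plain ISO9660 allows only short, limited filenames; if you care about long names and Unix permissions, mkisofs can add the Rock Ridge (-R) and Joliet (-J) extensions, which preserve them for Unix and Windows respectively:

mkisofs -R -J -o image.iso directory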

If you have only one CD burner in your system, burning this to a CD is usually as simple as the following:

cdrecord image.iso

Check the manpage for cdrecord if you have problems; you will need to specify a device to which the image will be written.
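
For instance, cdrecord -scanbus will list the devices it can see; the dev numbers below are placeholders for whatever it reports on your system:

cdrecord -scanbus
cdrecord dev=1,0,0 image.iso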

The Network

Now that you've seen a basic introduction to how files tend to be copied manually, let's see how this is done over the network. Before the internet we had a tool called UUCP for "Unix to Unix CoPy". UUCP was designed to use slow, high-cost links at times when they were cheapest (read: at night) in order to send batched files to remote systems. Using a proper "mail delivery agent", mail and USENET news can actually be batched for later delivery to "UUCP nodes". But we can use the 'uucp' command just to copy files, or the 'uux' command to execute commands remotely. I'm not going to go into how you accomplish this in detail for two reasons: First, you will probably never see UUCP. Second, it is a bear to configure. I only mention it at all because it will work without the benefit of internet links and is available for practically every operating system ever conceived; I have personally run it on MS-DOS (Waffle BBS and UUPC), AmigaDOS (AmigaUUCP), Linux (HoneyDanBer UUCP) and on SCO Xenix.

FTP

FTP, one of the first file transfer protocols used on the internet (actually, it was used on its predecessor, the ARPAnet), is an extremely crufty and annoying protocol. It unnecessarily uses a pair of TCP connections to the server, one for control and one for file transfer. This makes it annoying to use through NAT; passive (PASV) mode FTP was created to address this sort of problem. There are dozens of FTP clients, both graphical and not; my favorites are ncftp on the console and gFTP in the GUI, on Linux anyway. For Windows use, I suggest FileZilla, which is graphical-only; ncftp for Win32 has caused me tons of problems.

FTP uses a client-server architecture, which means that a different class of software must be used on each end of the connection. The FTP "client" package is fairly simple; it needs to know only how to read and write files and speak FTP. We use it to establish a connection to the server (which has to understand things like security) and read and/or write files. FTP has a huge disadvantage in that usernames and passwords are sent in the clear, so anyone in a position to sniff your network connection can collect your password. You can use technologies like IPsec to prevent this through end-to-end encryption.

An FTP session might look like the following:

drink@agamemnon:~$ ftp localhost
Connected to localhost.
220-ProFTPD 1.3.0 Server (Debian) [127.0.0.1]
Name (localhost:drink): drink
331 Password required for drink.
Password:
230 User drink logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> ls
200 PORT command successful
150 Opening ASCII mode data connection for file list
drwxr-xr-x   5 drink    drink        4096 Feb 16 14:49 Desktop
(output truncated...)
-rwxr-xr-x   1 drink    drink      216080 Feb  2 12:56 xromwell.xbe
226 Transfer complete.
ftp> get xromwell.xbe
local: xromwell.xbe remote: xromwell.xbe
200 PORT command successful
150 Opening BINARY mode data connection for xromwell.xbe (216080 bytes)
226 Transfer complete.
216080 bytes received in 0.02 secs (13317.5 kB/s)
ftp> quit
221 Goodbye.
drink@agamemnon:~$

This is a straightforward example. It is not very useful because I connected to my own computer, but it gives you an idea of what is going on: I connected and retrieved a file. You can also mget a number of files (via a wildcard, like *) or even mput a group of files, as shown below.
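
For example, an mget with a wildcard might look like the following (the pattern is made up); the prompt command first turns off the per-file confirmation that mget and mput normally do:

ftp> prompt
ftp> mget *.tar.gz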

rsh/ssh

rsh, which stands for "remote shell", is a means of executing commands remotely, via the network. It is inherently insecure in that it has no real security: even current versions, which can use Kerberos for authentication, will send the command unencrypted. Thus I will not actually give any examples of how to use rsh. I mention it because it exists on legacy Unix systems, and you might want to enable it momentarily and then use it to get something real installed. Read the (Fine) Manual, please.

ssh, on the other hand, stands for "secure shell", and ssh is essentially an encrypted replacement for rsh. A variant, HPN-SSH, provides a multithreaded implementation which can optionally encrypt just the authentication and then send the rest of the data in the clear for speed; with standard ssh, though, you can rest easy knowing that the entire exchange is secured by cryptography. Here's an example of how to use tar to copy files via ssh:

tar cf - directory | ssh -e none user@host tar -C output_directory -xf -

This will cause tar to write the files in directory into a tape archive and write it to STDOUT. ssh then takes the input on STDIN and (once it has authenticated) writes it to STDIN of the command running on the remote host, in this case another tar process. As before, the - character is used to denote STDIN/STDOUT to tar. If you do not have automatic login authentication set up for ssh, you will be prompted for a password before the command completes. Add a 'v' character to the tar flags on the far side to see what files are being unpacked. The -e none argument to ssh disables the escape character and prevents potential problems with that character appearing in your archive (although I've never actually had a problem.)
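
The same trick works in reverse to pull files from the remote host down to the local machine:

ssh user@host tar cf - directory | tar -C output_directory -xvf -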

Incidentally, ssh comes with a utility called 'scp', which stands for 'secure copy'. It should, however, mean 'slow copy', because scp pauses for verification between each file. On a high-latency connection which might nonetheless have plenty of throughput, this can cause multi-second delays between files. For this reason you should prefer tar or another option, like rsync.
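
For the record, a recursive scp copy looks like this:

scp -r directory user@host:/directory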

rsync

Here's another way to copy files between systems. The "rsync" command is a far superior way to accomplish this in many cases because it will detect unchanged files and simply not transfer them. By default the comparison is based on file size and modification time; with the -c flag, rsync compares whole-file checksums instead. The major difference between rsync and tar in this context is that rsync will permit us to resume a file transfer. Incidentally, FTP also has this ability (but again, it is lame and should never be used.)

Here's an example rsync command line:

cd wherever
rsync -r . user@host:/directory

This pretty much does what it says. The -r flag makes rsync recurse into directories, and the dot (.) means to send the contents of the current directory; you could send * instead, but the dot will get hidden files and the splat (asterisk) won't. You can specify some nice flags to help you. Here's what it really looks like when I rsync:

rsync -av -e ssh files user@host:/directory

We use rsync with a variety of options (-a is equivalent to -rlptgoD; see the manpage) including -v (verbose) and the all-important -e ssh, which tells rsync to run ssh as its remote shell and communicate over it. So you can not only tar over ssh, but you can also rsync over it. This is a highly secure and efficient way to transfer files, permitting resumption of partial transfers and all kinds of other goodies, as sketched below.
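
As a hedged sketch of that resume-friendly usage: adding --partial tells rsync to keep partially transferred files so a rerun can pick up where it left off, and --progress shows per-file progress:

rsync -av --partial --progress -e ssh files user@host:/directory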

Summary

This article has shown you a number of ways to copy files across the network. Some of the methods even work between Windows and Unix systems, especially FTP; and using Cygwin it is possible to run all of these goodies (including rsync, with an ssh server!) on Windows NT or even Win9x.
