LFTP mirror
What is LFTP
LFTP is a command-line program that supports several file transfer protocols. It is designed for Unix and Linux-like operating systems and is distributed under the GNU General Public License, so we are free to install and use it.
LFTP can transfer files via FTP, FTPS, HTTP, HTTPS, FISH, SFTP, BitTorrent, and FTP over an HTTP proxy. It also supports the File eXchange Protocol (FXP), which allows the client to transfer files from one remote FTP server to another.
LFTP's features include transfer queues, segmented file transfer, resuming partial downloads, directory mirroring, bandwidth throttling, and recursive copying of directories. The client can be used interactively or automated with scripts.
As part of my job, we pull large numbers of files from remote systems as and when the data becomes available, using a Java-based SFTP/FTP pull built on the JSch library. This works well for most scenarios, but for every new data collection we had to write a bunch of code on top of the common JSch-based framework, mainly because the directory structures and the number of files to collect differ widely from one source to another.
One cool feature that caught my eye, and a perfect solution for the problem above, is LFTP mirror.
LFTP mirror
I will focus only on the mirror command in this blog, but LFTP has much more functionality that can be explored. The mirror command is mainly used to sync a remote or local source with a local destination.
We might think of rsync or rrsync for syncing source and destination, and yes, I agree we can use either of these to sync files, but there are a few notable problems with rsync.
Below are a few of the problems with using rsync:
- rsync typically communicates over SSH, and many source systems may not enable SSH (or shell access) for security reasons.
- rsync always talks to an rsync process on the remote side, which means rsync must be installed on the remote machine. This may not be possible if the remote machine is managed outside our organization.
- rsync behaves erratically when it has to sync thousands of directories between source and destination while small files are being created or updated.
All these problems could be very well addressed with LFTP mirror.
Sample LFTP mirror commands
lftp sftp://${USER}:${PASSWORD}@${SRC_IP} -e "set xfer:log false; set ftp:timezone; set net:timeout 30; set net:max-retries 2; set sftp:connect-program \"ssh -a -x -p ${PORT} -o StrictHostKeyChecking=no -i ${KEYFILE} \"" << EOF
mirror -v --parallel=5 --use-pget-n=1 --skip-noaccess --only-newer --log=${LOG_FILE} --use-cache --include=${FILERE} ${SRCBASE} ${DESTDIR};
exit
EOF
--parallel=N : number of files that can be downloaded in parallel.
--use-pget-n=N : use pget to download each file in N segments. This is really useful when downloading bigger files.
--skip-noaccess : skip files that lack adequate read permissions, so they do not fail the mirror.
--log=FILE : write the lftp commands that mirror executes (such as the get for each downloaded file) to the given log file.
--use-cache : use cached directory listings on retries. This is very useful when there are thousands of directories and subdirectories to sync.
--include=RX : filter files using a regular expression. We can add as many filters as we want.
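To see how these options combine before running a real transfer, a small helper can assemble the mirror command and print it for review. This is only a minimal sketch: build_mirror_cmd, its argument order, the regex, and the paths are hypothetical and not part of the setup described here. Passing mirror's --dry-run option should make lftp print the commands it would run instead of transferring anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble an lftp mirror command from its parts.
# Arguments: parallel count, include regex, source dir, destination dir,
# followed by any extra mirror options (for example --dry-run).
build_mirror_cmd() {
  local parallel=$1 include_re=$2 src=$3 dest=$4
  shift 4
  local cmd="mirror -v --parallel=${parallel} --use-pget-n=1 --skip-noaccess --only-newer --use-cache"
  local opt
  for opt in "$@"; do
    cmd="${cmd} ${opt}"
  done
  printf '%s\n' "${cmd} --include=${include_re} ${src} ${dest}"
}

# Print a dry-run command for review; the regex and paths are placeholders.
build_mirror_cmd 5 '[.]csv$' /data/out /data/in --dry-run
```

The printed line could then be handed to lftp in place of a hand-written mirror line once the dry run looks right.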
We can also set optional connection-level properties, such as:
set net:timeout N : network timeout of N seconds
set net:max-retries N : retry a failing operation at most N times
set ftp:timezone ZONE : set the source time zone (for example, America/Chicago)
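Rather than repeating these settings inline in every command, they can be generated once and kept in a file. The sketch below assumes a hypothetical helper named lftp_settings and example values; it only prints the set lines and writes them to a temporary file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print connection-level settings as lftp "set" commands.
# Arguments: timeout in seconds, maximum retries, source time zone.
lftp_settings() {
  local timeout=$1 retries=$2 zone=$3
  printf 'set net:timeout %s\n' "${timeout}"
  printf 'set net:max-retries %s\n' "${retries}"
  printf 'set ftp:timezone %s\n' "${zone}"
}

# Write the settings to a temporary command file and show its contents.
settings_file=$(mktemp)
lftp_settings 30 2 America/Chicago > "${settings_file}"
cat "${settings_file}"
```

Such a file could be extended with open and mirror lines and run with lftp -f, or the same set lines could go into ~/.lftprc so that every session picks them up.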
Note: the PASSWORD field seems to be required, but it does not need to hold a valid password if we use private key authentication.
We can use other mirror options to pull files based on time boundaries. I created a hack that moves only the latest files to a different directory and creates a hard link to the original file. This is usually achieved with timestamp files passed to the find command (find -newer oldtimestamp -not -newer newtimestamp) combined with rsync or any SFTP pull mechanism. With mirror, we can instead use string representations of the timestamps and pull the files in a single command with more controlled parallel processing.
lftp sftp://${USER}:${PASSWORD}@${SRC_IP} -e "set xfer:log false; set ftp:timezone ${SRCTZ}; set net:timeout 30; set net:max-retries 1; set sftp:connect-program \"ssh -a -x -p ${PORT} -o StrictHostKeyChecking=no -i ${KEYFILE} \"" << EOF
mirror -v --newer-than="${LASTDATE}" --older-than="${SYSDATE}" --parallel=5 --use-pget-n=1 --skip-noaccess --only-newer --log=${LOG_FILE} --use-cache --include=${FILERE} ${SRCBASE} ${DESTDIR}/${SRCBASE}/;
exit;
EOF
--newer-than : string representation of the old (lower-bound) timestamp (yyyyMMdd HHmm); only files newer than this are pulled.
--older-than : string representation of the new (upper-bound) timestamp (yyyyMMdd HHmm); only files older than this are pulled.
The logic below searches the log for the files pulled in each run and creates a hard link to each of them:
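One way to produce the two timestamp strings is to remember where the previous run stopped. The following is a rough sketch under stated assumptions: the state file location, the helper names now_stamp and last_stamp, and the first-run default of "19700101 0000" are all hypothetical, and the variable names LASTDATE and SYSDATE simply mirror those in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical state file that remembers the upper bound of the previous run.
STATE_FILE=${STATE_FILE:-${TMPDIR:-/tmp}/lftp_mirror.last}

# Print the current time in the "yyyyMMdd HHmm" form.
now_stamp() {
  date +"%Y%m%d %H%M"
}

# Print the stamp saved by the previous run, or an assumed default on the first run.
last_stamp() {
  if [ -s "${STATE_FILE}" ]; then
    cat "${STATE_FILE}"
  else
    echo "19700101 0000"
  fi
}

LASTDATE=$(last_stamp)
SYSDATE=$(now_stamp)
echo "window: newer than '${LASTDATE}', older than '${SYSDATE}'"

# The lftp mirror command would run here. This sketch records the upper bound
# unconditionally; a real script should do so only after lftp exits successfully.
echo "${SYSDATE}" > "${STATE_FILE}"
```

Recording the upper bound only after a successful transfer means a failed run is retried with the same lower bound next time.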
# RE must be defined beforehand with two capture groups: directory and file name.
grep "get" "${LOG_FILE}" | (
  while read -r ofile
  do
    if [[ $ofile =~ $RE ]]
    then
      D=${BASH_REMATCH[1]}
      F=${BASH_REMATCH[2]}
      FILE="$D/$F"
      echo "Link file : $INPUT_DIR/$F" >>"${LOG_FILE}"
      ln "$FILE" "$INPUT_DIR/$F"   # hard link to the original file
      F_CNT=$((F_CNT + 1))
    fi
  done
)
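The regular expression RE is not shown above, so here is one possible shape for it, as a sketch only. It assumes the log contains lines of the form "get [options] -O <local-dir> <remote-url>/<file>". The exact format may differ between lftp versions, so check your own log first. The helper name and the sample line are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed log line shape: get [options] -O <local-dir> <remote-url>/<file>
# Group 1 captures the local directory, group 2 the file name.
RE='^get .*-O ([^ ]+) [^ ]*/([^/ ]+)$'

# Print "<dir>/<file>" for a matching log line; return 1 otherwise.
local_path_from_log_line() {
  local line=$1
  if [[ $line =~ $RE ]]; then
    printf '%s/%s\n' "${BASH_REMATCH[1]%/}" "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

# Hypothetical sample line, for illustration only.
sample='get -O /data/in/2024 sftp://user@host/data/out/2024/report.csv'
local_path_from_log_line "${sample}"
```

This pattern does not handle paths containing spaces, which would appear quoted in the log, so names like that would need a more careful parser.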
Note: the mirror command's FXP support is worth exploring for remote-to-remote file transfers.
Conclusion:
LFTP is a very good Unix/Linux command-line utility for data transfers and is worth trying in every operation.
LFTP is reliable: any non-fatal error is ignored and the operation is retried. So if a download breaks, it is restarted automatically from the point where it stopped. Even if the FTP server does not support the REST command, LFTP will try to retrieve the file from the very beginning until it is transferred completely.