Download results in a larger file than the original

Dec 21, 2012 at 2:52 PM
Edited Dec 21, 2012 at 2:54 PM

When downloading ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.gbff.gz (239.858.593 bytes), the resulting file is too big (240.410.592 bytes). When I download the same file with FileZilla this does not happen (239.858.593 bytes).

Both files result in the same unpacked file, so the extra data is not needed. 

 

I have no idea what the cause is.

 

regards,

Coordinator
Dec 21, 2012 at 3:15 PM

I've got code running now to see if I get the same results and if I can figure out why.

Coordinator
Dec 21, 2012 at 3:44 PM
Edited Dec 21, 2012 at 4:01 PM

Well, the number of bytes transferred matches what the server reported as the size, and the size of the file on disk matches as well:

File properties in Windows 7:

Size: 228 MB (239,858,593 bytes)

From test program: Bytes Transferred/File Size Reported by Server

239,858,593/239,858,593 100.00 %

Internet Explorer:

12/17/2012 04:25PM 239,858,593 human.rna.gbff.gz

Chrome:

229 MB 12/17/12 4:25:00 PM

This is a definite oddity in the way the file size is being calculated.

If you calculate with 1 MB = 2^20 bytes, the file size is 228.75 MB, which rounds to 229, so that explains what Chrome reports. Windows 7 seems to have a rounding error (or does no rounding at all).

What difference were you seeing exactly with filezilla?

-- edit --

If you calculate with 1 MB = 1,000,000 bytes, the file size comes out to 239.86 MB. The Transmit FTP client calculates file sizes this way. I'm guessing that there is no real problem, just a discrepancy in unit sizes between the programs doing the math here.
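The unit arithmetic above can be checked with a few lines of C#. This is only a sketch of the calculation; the class and method names are made up for illustration, and the byte count is the one quoted in this thread.

```csharp
using System;
using System.Globalization;

static class FileSizeUnits
{
    // Binary megabytes: 1 MB = 2^20 = 1,048,576 bytes.
    public static double ToBinaryMegabytes(long bytes)
    {
        return bytes / 1048576.0;
    }

    // Decimal megabytes: 1 MB = 1,000,000 bytes.
    public static double ToDecimalMegabytes(long bytes)
    {
        return bytes / 1000000.0;
    }

    static void Main()
    {
        long size = 239858593; // size reported by the server, as quoted above
        CultureInfo inv = CultureInfo.InvariantCulture;

        Console.WriteLine(ToBinaryMegabytes(size).ToString("F2", inv));             // 228.75
        Console.WriteLine(ToDecimalMegabytes(size).ToString("F2", inv));            // 239.86
        Console.WriteLine(Math.Floor(ToBinaryMegabytes(size)).ToString("F0", inv)); // 228
    }
}
```

Rounding 228.75 gives the 229 MB that Chrome shows, and dropping the fraction gives the 228 MB in the Windows 7 properties dialog.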

Coordinator
Dec 21, 2012 at 4:06 PM
Edited Dec 21, 2012 at 4:06 PM

I totally missed where you provided the file sizes. I'm not sure why that happened; I can't reproduce it here with TYPE I. I'm running the test now with TYPE A to see if that accounts for the discrepancy.

Coordinator
Dec 21, 2012 at 4:35 PM

With TYPE A I get 229 MB (240,764,281 bytes), so I still don't know. Here is the code I'm testing with:

    using (Stream
        istream = FtpClient.OpenRead(new Uri("ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.gbff.gz")),
        ostream = new FileStream("human.rna.gbff.gz", FileMode.OpenOrCreate, FileAccess.Write)
    ) {
        byte[] buf = new byte[8192];
        int read;
        long total = 0;

        while ((read = istream.Read(buf, 0, buf.Length)) > 0) {
            ostream.Write(buf, 0, read);
            total += read;

            Console.WriteLine("{0}/{1} {2:p}",
                total, istream.Length,
                (double)total / (double)istream.Length);
        }
    }

Coordinator
Dec 21, 2012 at 4:39 PM

If you've got code you can share, please do. My first guess: are you writing the number of bytes read, as opposed to the buffer length, to your output stream? I.e.,

read = istream.Read(buf, 0, buf.Length);
ostream.Write(buf, 0, read); 

vs.:

istream.Read(buf, 0, buf.Length);
ostream.Write(buf, 0, buf.Length); 
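To make the difference between the two patterns concrete, here is a small self-contained sketch that runs both against an in-memory stream. The class and method names are illustrative only, and a MemoryStream stands in for the FTP data stream.

```csharp
using System;
using System.IO;

static class CopyLoopDemo
{
    // Correct pattern: write only the bytes that Read actually returned.
    public static long CopyUsingReadCount(Stream src, Stream dst)
    {
        byte[] buf = new byte[8192];
        int read;
        long total = 0;
        while ((read = src.Read(buf, 0, buf.Length)) > 0)
        {
            dst.Write(buf, 0, read);
            total += read;
        }
        return total;
    }

    // Faulty pattern: always write the whole buffer, so a short read
    // drags stale bytes from the previous iteration into the output.
    public static long CopyUsingBufferLength(Stream src, Stream dst)
    {
        byte[] buf = new byte[8192];
        long total = 0;
        while (src.Read(buf, 0, buf.Length) > 0)
        {
            dst.Write(buf, 0, buf.Length);
            total += buf.Length;
        }
        return total;
    }

    static void Main()
    {
        byte[] payload = new byte[20000]; // not a multiple of the 8192-byte buffer
        using (MemoryStream good = new MemoryStream())
        using (MemoryStream bad = new MemoryStream())
        {
            CopyUsingReadCount(new MemoryStream(payload), good);
            CopyUsingBufferLength(new MemoryStream(payload), bad);
            Console.WriteLine("read count:    {0} bytes", good.Length); // 20000
            Console.WriteLine("buffer length: {0} bytes", bad.Length);  // 24576
        }
    }
}
```

With the second pattern the output is always a whole number of buffers, so the file grows whenever any read comes back short.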

Dec 22, 2012 at 11:21 AM
Edited Dec 22, 2012 at 11:27 AM

 

    class FtpDownloader
    {
        public string ftpServer;
        public string file;
        public string ftpPath;
        private string downloadPath;
        public event EventHandler statusUpdate;
        public  FtpClient cl;
        private long size = -1;    

        /// <summary>
        /// Initializer for downloading
        /// </summary>
        /// <param name="server">ftp://mlpa.com</param>
        /// <param name="ftpPath">/download/hg19/</param>
        /// <param name="downloadPath">d:\blat\</param>
        public FtpDownloader(string server, string ftpPath, string downloadPath)
        {            
            this.ftpServer = server;
            this.ftpPath = ftpPath;
            this.downloadPath = downloadPath;
            if(this.downloadPath[this.downloadPath.Length-1] != '\\')
                this.downloadPath += '\\';

            cl = new FtpClient();
            cl.Credentials = new NetworkCredential("anonymous", "karel@dng.com");
            cl.Host = this.ftpServer;            
            cl.Connect();
            cl.SetDataType(FtpDataType.Binary);
            cl.SetWorkingDirectory(ftpPath);                       
        }

        /// <summary>
        /// Checks for changes, has to be a gzip file
        /// </summary>
        /// <param name="fileIn">snp135.txt.gz</param>
        /// <returns>true if changed</returns>
        public bool UpdateNeeded(string fileIn)
        {

            this.file = fileIn;

            this.StatusUpdate("Checking if " + file + " is up to date.", 0);

            FileInfo fileInfo = new FileInfo(this.downloadPath + this.file);
            if(!fileInfo.Exists)
                return true;            

            string fileName = this.ftpPath + this.file;
            DateTime ftpFiletime = cl.GetModifiedTime(fileName);            
            this.size = cl.GetFileSize(file);
            
            // Note: the size check uses > here because downloads can come out too large.
            if (ftpFiletime.Date > fileInfo.LastWriteTimeUtc.Date || 
                size > fileInfo.Length)
                return true;
            return false;

        }

        /// <summary>
        /// Download and unpack
        /// </summary>
        /// <param name="fileIn"></param>
        public void Download(string fileIn)
        {
            this.file = fileIn;
            string fileName = this.ftpPath + this.file;
                
            using (Stream iStream = cl.OpenRead(fileName))
            {
                byte[] buf = new byte[8192];
                int read = 0;
                long total = 0;
                this.StatusUpdate("Downloading " + file, 0);

                using (FileStream oStream = new FileStream(downloadPath + this.file, FileMode.OpenOrCreate))
                {                    
                    while ((read = iStream.Read(buf, 0, buf.Length)) > 0)
                    {
                        total += read;

                        // write the bytes to another stream...

                        oStream.Write(buf, 0, read);

                        this.StatusUpdate("", (int) ((total*1000) / size));                           
                    }
                }
                this.StatusUpdate("Unpacking " + file.Replace(".gz", ""),0);

                FileInfo f = new FileInfo(downloadPath + this.file);
                if(f.Length != this.size)
                {
                }

                //decompres();
            }
        }
    }

 

This is the code I am using. The weird thing is that on two different FTP servers there are no issues. Another thing I noticed is that the files appear to be exactly the same, except for the extra bytes that are appended to the file I download using this library.

Coordinator
Dec 22, 2012 at 4:09 PM

How many bytes are transferred (the total variable) and what is the size being reported by the server? Based on that code, all that should be written is what's being sent over the data stream, which is what should be happening. So what we need to see is how many bytes are actually read off the stream.

Coordinator
Dec 22, 2012 at 4:16 PM

Also, for good measure, close out the oStream object when the while() loop is finished. I usually use try { } finally { } around stream objects and always call Stream.Close() in the finally block to ensure it's closed out. It probably has nothing to do with what's happening, but right now we really don't know how extra data could be getting written to the stream, so I think it's worth doing.
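A minimal sketch of that try/finally pattern follows. The class and method names are hypothetical, a MemoryStream stands in for the FTP data stream, and the output goes to a temporary file.

```csharp
using System;
using System.IO;

static class ClosingStreamsDemo
{
    // Copies input to a file and closes the file in a finally block.
    // Returns the number of bytes read from the input stream.
    public static long CopyAndClose(Stream input, string outputPath)
    {
        // FileMode.OpenOrCreate matches the code posted in this thread.
        FileStream output = new FileStream(outputPath, FileMode.OpenOrCreate, FileAccess.Write);
        long total = 0;
        try
        {
            byte[] buf = new byte[8192];
            int read;
            while ((read = input.Read(buf, 0, buf.Length)) > 0)
            {
                output.Write(buf, 0, read);
                total += read;
            }
        }
        finally
        {
            // Runs even if Read or Write throws; flushes and releases the file handle.
            output.Close();
        }
        return total;
    }

    static void Main()
    {
        string path = Path.GetTempFileName(); // creates an empty temporary file
        try
        {
            long written = CopyAndClose(new MemoryStream(new byte[20000]), path);
            Console.WriteLine("{0} bytes read, {1} bytes on disk", written, new FileInfo(path).Length);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Comparing the byte count returned by the loop with the length on disk after the stream is closed answers the question of how many bytes were actually read off the stream.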

Dec 22, 2012 at 5:40 PM
Edited Dec 22, 2012 at 6:05 PM

File from Filezilla: 228 MB (239.858.593 bytes)

File From Library: 229 MB (240.410.592 bytes)

File according to GetFileSize() 239.858.593

So far tried Type I and Type A, and also closing both streams did not make a difference.

Furthermore, when using your code example I still end up with a file size difference.
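One thing that may be worth ruling out, although nothing in this thread confirms it as the cause: both the test program and the FtpDownloader class above open the output file with FileMode.OpenOrCreate, which does not truncate a file that already exists. If a larger file from an earlier download is still on disk, writing a shorter download over it leaves the old tail in place. The sketch below only demonstrates that .NET behavior with a temporary file; the class and method names are made up.

```csharp
using System;
using System.IO;

static class FileModeDemo
{
    // Writes byteCount zero bytes to path using the given mode
    // and returns the resulting file length.
    public static long WriteAndMeasure(string path, FileMode mode, int byteCount)
    {
        using (FileStream fs = new FileStream(path, mode, FileAccess.Write))
        {
            fs.Write(new byte[byteCount], 0, byteCount);
        }
        return new FileInfo(path).Length;
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Simulate a stale, larger file left over from an earlier run.
            WriteAndMeasure(path, FileMode.Create, 30000);

            // OpenOrCreate overwrites from offset 0 but keeps the old tail.
            Console.WriteLine(WriteAndMeasure(path, FileMode.OpenOrCreate, 20000)); // 30000

            // Create truncates first, so the length matches what was written.
            Console.WriteLine(WriteAndMeasure(path, FileMode.Create, 20000));       // 20000
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

A quick way to test this possibility is to delete the local file before downloading, or to switch the FileStream to FileMode.Create, and then compare sizes again.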

Feb 5, 2013 at 5:53 PM
Use HexCmp from Fairdell to examine the byte differences between the files and see if you can work out what the extra bytes are and where they are. It may give a clue as to how they are creeping in.
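If a hex comparison tool is not at hand, a rough equivalent of the first step (finding where the two files start to differ) can be sketched in C#. This is not how HexCmp works internally; the class and method names are illustrative, and the two file paths are expected as command-line arguments.

```csharp
using System;
using System.IO;

static class FileDiffDemo
{
    // Returns the offset of the first differing byte, the length of the
    // shorter stream if it is a prefix of the longer one, or -1 if the
    // two streams are identical.
    public static long FirstDifference(Stream a, Stream b)
    {
        long offset = 0;
        while (true)
        {
            int x = a.ReadByte();
            int y = b.ReadByte();
            if (x != y) return offset; // also covers one stream ending early
            if (x == -1) return -1;    // both ended at the same offset
            offset++;
        }
    }

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("usage: FileDiffDemo <fileA> <fileB>");
            return;
        }
        using (Stream a = new BufferedStream(File.OpenRead(args[0])))
        using (Stream b = new BufferedStream(File.OpenRead(args[1])))
        {
            long at = FirstDifference(a, b);
            Console.WriteLine(at < 0 ? "identical" : "first difference at byte " + at);
        }
    }
}
```

If the reported offset equals the length of the smaller file, the larger file is the smaller one plus appended bytes; an earlier offset would point to changes inside the data instead.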