sbv3000

Network speed plummets


Hi All

Some advice please on a symptom I've got: network speed drops between two bare-metal systems (system 1 = source, system 2 = target).

System 1 is a dual-core AMD E-350 with 8 GB RAM and 20 HDDs on Marvell controllers.

System 2 is a quad-core AMD APU with 8 GB RAM and 16 HDDs on Marvell controllers.

 

These are connected with Intel 1 Gbit NICs via a Netgear SNMP smart gigabit switch, monitored by PRTG.

 

I first noticed the problem doing a restore between 1 and 2. At the start I get around 300,000 kbit/s as reported by PRTG.

After about an hour the speed drops to about 9,000 kbit/s and stays there.

If I cancel the restore and try a simple mounted-folder copy, that's still the limit.

 

If I reboot system 1, the higher speed returns for an hour or so, then drops again.

If I reboot system 2, it makes no difference.

So the 'symptom' would seem to be on system 1, but I can't see what's causing it.

All network settings are at their defaults (MTU etc.) and both systems are running 'healthy'; nothing in the logs.

I'm not enough of a Linux specialist to know where to look, so any thoughts would be helpful.
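One generic first place to look on a Linux box, even without much Linux background, is the NIC's error and drop counters; counters that keep rising during a slow transfer point at cabling, duplex, or driver trouble rather than the application. A minimal sketch reading the standard sysfs counters (interface names vary per system, and nothing here is DSM-specific):

```shell
# Print error/drop counters for every network interface from sysfs.
for dev in /sys/class/net/*; do
    name=$(basename "$dev")
    echo "$name" \
        "rx_errors=$(cat "$dev/statistics/rx_errors")" \
        "tx_errors=$(cat "$dev/statistics/tx_errors")" \
        "rx_dropped=$(cat "$dev/statistics/rx_dropped")"
done
```

Run it before and after the slowdown; if the counters are flat, the physical layer is probably fine.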


Very interesting...

 

So, we suspect system 1 (the source).

Both machines are running Intel 1 Gbit NICs,

but in the middle is the Netgear switch, with SNMP enabled, and PRTG.

 

Do you happen to have a dumb gigabit switch around, so you can test the same restore between the two systems without the Netgear + SNMP + PRTG?

 

Just my hunch, but it might be your switch; sometimes just disabling SNMP improves performance. I've noticed that on some networks.

 

From personal experience, smart switches are not always the best when you want full speed. Many times when I had similar odd throughput issues, it ended up being the smart switch causing the problem, and I was using Cisco smart switches (the cheap business models, not the high-end server-class type).

 

As soon as I used a dumb switch (an unmanaged switch from any brand), full speed returned.

 

Now, as for why most retail managed switches act so poorly, I'd attribute it to the small amount of RAM, CPU power and cheap chipsets used to handle the traffic.

The reason unmanaged switches work without a problem is precisely that: they have nothing to process, no middleware to slow the traffic down after the queue fills up.

 

Yet another reference point: I've built my own custom switch/router with a pfSense firewall, plenty of RAM (4 GB or 8 GB, I forget), a dual-core Intel CPU and all-Intel gigabit NICs, and it works like a charm even with all the fancy network filter/tracing/logging/management features enabled, because it has plenty of power and RAM to handle it all.

 

So that's yet another thing you can try: if you have a spare computer and three spare NICs, you can build a home-made switch to verify whether it's actually something in system 1 causing the issue, or just the Netgear underperforming.


Hi AllGamer

Thanks for the advice, all interesting.

I hadn't considered the overhead of processing the SNMP traffic.

I'll try a dumb gigabit switch and also disable SNMP on both NASes.

I've got a quad-port Intel NIC, so I can have a play with pfSense - more anon.


OK, so I:

Disabled SNMP on both boxes.

Plugged into a dumb gigabit switch (both the Intel and embedded NICs).

Started a restore and a CIFS file copy.

Results, watching through Resource Manager: on the source, LAN1 (Intel) ran at about 60 Mbit (rsync restore) and LAN2 (Realtek) at about 40 Mbit (CIFS transfer).

On the target, both sets of traffic ran through LAN1, which was running at 90-100.

 

Interestingly, after about 6-7 hours the speed dropped as seen before, down to about 5 on the Intel and 10 on the Realtek. I suspect some sort of memory leak in the drivers or OS that accumulates after a certain amount of traffic or system activity; with SNMP off there was less of it, hence the longer time before the drop.
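If a slow leak is the suspect, one way to catch it is to log the kernel's memory counters for the whole life of a transfer, then compare entries from before and after the slowdown. A minimal sketch assuming a standard Linux /proc/meminfo; the log path and interval are arbitrary choices:

```shell
# Append a timestamped snapshot of key memory counters to a log file,
# COUNT times, INTERVAL seconds apart. A Slab value creeping steadily
# upward over hours would support the driver/kernel leak theory.
log_meminfo() {  # usage: log_meminfo COUNT INTERVAL LOGFILE
    n=0
    while [ "$n" -lt "$1" ]; do
        date >> "$3"
        grep -E '^(MemFree|Buffers|Cached|Slab):' /proc/meminfo >> "$3"
        sleep "$2"
        n=$((n + 1))
    done
}

# e.g. one snapshot every 5 minutes for roughly 12 hours:
# log_meminfo 144 300 /tmp/memlog.txt
```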


Good info from this round of diagnostics.

With these results you can figure things out, or at least plan around the bug if there are no better drivers available.

 

One thing I forgot to ask before.

 

Did you enable Jumbo Frames?

I noticed things can get a little weird at times when Jumbo frames are enabled.

 

Overall it's great when used with NICs and switches that all support the same frame size, but with a mix of Intel and Realtek, as in your example, jumbo frames usually become a mess.

 

Just something to keep in mind, in case you do have jumbo frames enabled.
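A quick way to rule jumbo frames in or out on the Linux side is to list each interface's MTU and check that everything on the transfer path agrees. A generic sketch via sysfs; interface names differ per box:

```shell
# List the MTU of every interface; anything other than the usual 1500
# on the transfer path means jumbo frames (or another tweak) is active.
for dev in /sys/class/net/*; do
    echo "$(basename "$dev") mtu=$(cat "$dev/mtu")"
done
```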


OK, so the latest testing and playing around with this:

Created an SMB/CIFS mount from 2 to 1 and tried some file transfers/copies: 80-120 meg speeds, uninterrupted for hours/days.

Ran the same copy while doing a restore: again 80-120 for 'x' hours, then the restore speed dropped to 10-15 meg, but the copy carried on at 80-100 meg.

The suspicion, then, is that the rsync process is involved in some way.


 

Very interesting results.

 

I've seen similar problems with Windows, like, a lot! And yes, it was Windows' fault in my tests, in other cases not related to yours.

 

But the description is very similar. In my case, Windows Explorer simply grinds to almost a halt if I try to copy a huge batch of files, especially when each file is larger than a few GB and you're copying several GB or TB of data from one server to another.

 

Yet if I try to copy the files one at a time, or even in smaller batches of 5 or 10 files, then it copies them over really fast, at max network speed.

 

 

Now, that being said, the issue with rsync seems very similar. It might simply be how these old apps were designed; without looking into the source code I wouldn't know, but perhaps Explorer/rsync were never updated to handle present-day data loads.

 

Most files are larger than whatever queue buffer they were designed for.

It could be the read-ahead failing after a long run, not freeing memory to take in more file names/file listings.

It might maintain a list of files copied, and/or to be copied, in memory, and after a while the list just gets so long it takes too long to process (this is most likely the issue)... which reminds me of a very good test. :grin:

 

Have you tried transferring files from one server to the other locally using FTP? :wink:

 

Set up either side as the server, and use the other as the client.

 

Then repeat the same test you did with rsync.

 

My theory: since FTP doesn't keep stuff in memory (it reads from the list the FTP client creates when you select the files to upload), it will have a lot less overhead, or memory "leak" trouble, than Windows Explorer/rsync.

 

On a semi-related note, Dropbox on Linux has a super horrible memory leak; it's been there forever and they haven't fixed it yet. My workaround is to kill the process every time its memory usage goes beyond X GB of RAM.

 

So that's probably something else you can look into: see whether rsync has any sort of memory leak after many hours/days of transferring a giant batch of files.
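That leak theory is directly testable: sample the rsync process's resident memory while the restore runs and see whether it climbs in step with the slowdown. A sketch using plain ps; the pgrep lookup and the interval are assumptions to adapt:

```shell
# Print PID's resident set size (in kB) COUNT times, INTERVAL seconds
# apart; steadily growing numbers ahead of the speed drop would point
# at a leak in the process being watched.
watch_rss() {  # usage: watch_rss PID COUNT INTERVAL
    n=0
    while [ "$n" -lt "$2" ]; do
        ps -o rss= -p "$1" || return 1   # stop if the process exits
        sleep "$3"
        n=$((n + 1))
    done
}

# e.g. sample the oldest running rsync once a minute for 10 hours:
# watch_rss "$(pgrep -o rsync)" 600 60
```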


This may not apply to the OP's situation, but I saw more predictable behavior with rsync between two instances of Xpenology when I turned off compression and encryption.
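For reference, those two knobs map onto how rsync is invoked: dropping -z disables compression, and talking to an rsync daemon (rsync:// URL) instead of going over ssh removes the encryption layer. The paths, address, and module name below are hypothetical; this is a sketch of the comparison, not a DSM-specific recipe:

```shell
# Over ssh with compression: everything is encrypted and compressed.
rsync -avz /volume1/data/ admin@192.168.1.2:/volume1/restore/

# Daemon transport without -z: no ssh encryption, no compression.
# Assumes an rsync daemon on the target exporting a module named "restore".
rsync -av /volume1/data/ rsync://192.168.1.2/restore/
```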


No compression or encryption here, but in other tests from a 'real' Syno box running DSM 6 (Hyper Backup) using rsync, the speeds are slow as well.


I'm sure you've already checked this, but 9,000 kbit/s is pretty damn close to 10 Mbit. Did you check that your Ethernet adapter isn't auto-negotiating down to 10 Mbit? Have you swapped out the cables?
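Checking that is easy from a shell: the negotiated link speed is exposed in sysfs, no ethtool needed. A generic sketch; interface names are system-specific, and virtual or downed links may not report a speed:

```shell
# Print each interface's negotiated link speed in Mbit/s; a NIC that
# renegotiated down will show 10 here instead of 1000.
for dev in /sys/class/net/*; do
    name=$(basename "$dev")
    [ "$name" = "lo" ] && continue              # loopback has no link speed
    speed=$(cat "$dev/speed" 2>/dev/null || true)
    echo "$name: ${speed:-unknown} Mbit/s"
done
```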


Resurrecting an old-ish thread, but yeah, I'm seeing slow rsync copies between two machines.

One is a custom server with DSM 6.0.2 (i3-4330, 16 GB RAM, 4x 3 TB WD Red with SSD cache, 2x NICs in LACP, confirmed working); the other is a Synology DS216j 2-bay NAS (1x NIC).

Basically, I don't see it top 40 MB/s. Not massively slow, but nowhere near the maximum (I'd expect 90-100 MB/s tops).

I've seen quite a few posts suggesting rsync is at fault, but wondered if anyone had any suggestions.

 

Interesting article: https://lwn.net/Articles/400489/

 

Any thoughts on how to 'fix' it?

 

Cheers,

 

#H
