Ok, elephant in the room, wtf is going on with HC forums and lag?

Kramerguy · October 17, 2012

I've been posting here since 1996. These forums have always been plagued with latency and failed requests. The last few months have been utterly atrocious, especially the last few days. Half the posts I make end up with the circle of death (timeout), and in the end will either never appear or double-post.

I would think it's my browser or internet connection, but this has been pretty consistent through about 8 different jobs, 4 different residences, and no less than 3 different browsers and over a decade of different PC's.

Now sometimes, we get a month or two where the forums respond normally, but that's seriously the exception and not the rule.

I can only assume I'm not the only one, right?

wades_keys · October 17, 2012

MySQL has the {censored}tiest locking engine of any database I've ever worked with. The engine does not scale out well in scenarios requiring lots of inserts/updates.

Also check your traceroute. This is the day after the debate and there is a lot of social media chatter going on now. You're probably getting a really crappy route through a congested router. I know MAE-East down in Virginia was always a bottleneck for those on the east trying to route traffic westbound; likely still the case (though I believe that node has been renamed and upgraded).

From windows cmd line type "tracert acapella.harmony-central.com" to get your route to the server. Acceptable per node latency is

Kramerguy · October 17, 2012

Yeah, MAE-East was fixed years ago. But to reiterate: this has been a problem on many different networks, and fairly consistently. I've seen HC's connection issues mentioned in other threads before, but nobody seems to be talking about it directly. It's been really bad the last few days, yet I'm on Comcast business class clocking 50 Mbps consistently, and HC is the ONLY site where I experience any timeouts and extreme lag.

wades_keys · October 17, 2012

Originally Posted by Kramerguy

Yeah, MAE-East was fixed years ago. But to reiterate: this has been a problem on many different networks, and fairly consistently. I've seen HC's connection issues mentioned in other threads before, but nobody seems to be talking about it directly. It's been really bad the last few days, yet I'm on Comcast business class clocking 50 Mbps consistently, and HC is the ONLY site where I experience any timeouts and extreme lag.

Post your traceroute and I'll post mine. I'm getting no latency at all today. IT net guys don't like talking about these scenarios because it's like pissing up a rope lol. "works fine for me."

BATCAT · October 17, 2012

Originally Posted by Kramerguy

I've been posting here since 1996. These forums have always been plagued with latency and failed requests. The last few months have been utterly atrocious, especially the last few days. Half the posts I make end up with the circle of death (timeout), and in the end will either never appear or double-post.

I would think it's my browser or internet connection, but this has been pretty consistent through about 8 different jobs, 4 different residences, and no less than 3 different browsers and over a decade of different PC's.

Now sometimes, we get a month or two where the forums respond normally, but that's seriously the exception and not the rule.

I can only assume I'm not the only one, right?

Yes, the problems you mention are intermittent, persistent, and forum-wide. They occur for all users AFAIK.

I have nothing to do with the technical side of the forums; I'm strictly a moderator. To be blunt, I can't offer much to satisfy you guys except to say everyone in Admin and IT is aware of the problems. Apparently it has to do with the specific way vBulletin is being run. Beyond that, all I can say is that people really are working to fix the problems.

Kramerguy · October 17, 2012

well thanks for that AS. At least I know it's not just me

wades_keys · October 17, 2012

You'll notice the latencies occurring in direct relation to posting. This is a database locking issue almost certainly.

That's why search gets disabled sometimes. The admins are dropping and rebuilding indexes on tables. Every DBA knows that to speed up OLTP operations you reduce index coverage or optimize existing indexes. Every index has to be updated on each insert, and that slows the insert down. MySQL does have a delayed insert feature (INSERT DELAYED) that can mitigate this at the expense of data integrity.

I'm sure the admin know this but I would advise them to look into horizontal table partitioning strategies.
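
For anyone wondering what horizontal partitioning looks like in practice, here is a rough sketch. It uses Python with SQLite standing in for MySQL, and the table names (post_p0 through post_p3), the columns, and the split-by-thread-id rule are all made up for illustration; none of it reflects how HC's vBulletin schema is actually laid out. MySQL also has native partitioning support, whereas this sketch does the routing by hand in application code. The point is only that rows are spread across several smaller tables by key, so inserts into different threads touch different tables and indexes.

```python
import sqlite3

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(thread_id: int) -> str:
    """Pick a partition table from the thread id (simple modulo rule)."""
    return f"post_p{thread_id % NUM_PARTITIONS}"


def create_partitions(conn: sqlite3.Connection) -> None:
    """Create one small table, with its own index, per partition."""
    for i in range(NUM_PARTITIONS):
        conn.execute(
            f"CREATE TABLE post_p{i} ("
            "post_id INTEGER PRIMARY KEY, thread_id INTEGER, body TEXT)"
        )
        conn.execute(f"CREATE INDEX idx_post_p{i}_thread ON post_p{i}(thread_id)")


def insert_post(conn: sqlite3.Connection, post_id: int, thread_id: int, body: str) -> None:
    """Route the insert to the partition that owns this thread."""
    table = partition_for(thread_id)
    conn.execute(
        f"INSERT INTO {table} (post_id, thread_id, body) VALUES (?, ?, ?)",
        (post_id, thread_id, body),
    )


def posts_in_thread(conn: sqlite3.Connection, thread_id: int) -> list:
    """Read a thread back; only one partition has to be searched."""
    table = partition_for(thread_id)
    rows = conn.execute(
        f"SELECT body FROM {table} WHERE thread_id = ? ORDER BY post_id",
        (thread_id,),
    )
    return [row[0] for row in rows]


def row_counts(conn: sqlite3.Connection) -> list:
    """Number of rows currently stored in each partition."""
    return [
        conn.execute(f"SELECT COUNT(*) FROM post_p{i}").fetchone()[0]
        for i in range(NUM_PARTITIONS)
    ]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_partitions(conn)
    for post_id in range(1, 9):
        insert_post(conn, post_id, thread_id=post_id % 3, body=f"post {post_id}")
    print(posts_in_thread(conn, 1))
    print(row_counts(conn))
```

The trade-off is that any query spanning threads (search, "new posts") now has to hit every partition, so this only pays off if most traffic is per-thread.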

Potts · October 17, 2012

That's some pretty solid info Wade. Are you a programmer/IT guy? Just wondering...

Kramerguy · October 17, 2012

My knowledge of DB's is mostly transactional, I took the MS 2779 coursework/classes (implementing/managing server), but it was just to pad what I was doing at the time, which was mostly t-sql queries and direct data manipulations (via enterprise / server manager, t-sql). and most of the coursework was like Chinese to me lol. I did a lot of migrations and stuff but yeah, I know crap about indexing strategies lol.

I know OJ and the political forum see tons of traffic.. and wonder if maybe they should consider oracle? Certainly there are other forums online that run well and see more overall traffic than HC?

RoadRanger · October 17, 2012

I think we should blame it on the original music folks.

RoadRanger · October 17, 2012

Originally Posted by wades_keys

I'm sure the admin know this but I would advise them to look into horizontal table partitioning strategies.

Does that have something to do with laying the top of your music stand flat?

SpaceNorman · October 17, 2012

Originally Posted by wades_keys

MySQL database has the {censored}tiest locking engine of any database I've ever worked with. the engine does not scale out well in scenarios requiring lots of inserts/updates...

My bet is that it's a network / security related issue. An operation the size of HC is no doubt running on a sizeable server farm - and most likely using a multi-tiered design that separates the web server functions, the application server functions and the database server functions on different servers - each separated from one another for processing efficiencies as well as for data security purposes. A poorly designed DMZ infrastructure and/or poorly designed firewall rules controlling access between each function can often result in similar behavior. Without detailed knowledge of the platform's physical and logical topology - it's tough to say.

The fact that it seems to be intermittent and that it has been going on for as long as it has - suggests to me that it's likely they've got integration issues in play here as well. Were it purely a database issue (e.g., index optimization, contention, etc.) - that sort of thing should jump out at the database admins. If it's a capacity issue (e.g., processor utilization, memory, etc.) - that sort of thing would jump out at the server guys. A problem that's gone on for as long as this one obviously has - could well mean that the various disciplines have been over the components in their area of expertise with a fine-tooth comb and haven't found the smoking gun.

A complex environment with application functions split across multiple servers (possibly running different operating systems - e.g., web and app servers running Windows or Linux, database servers running Unix, AIX, etc.), segmented on various subnets (which introduces potential switch configuration issues) and running through firewalls (introducing conflicting firewall rule sets) ... along with whatever authentication approach they're using (e.g., Active Directory), certificates, etc. - means you've got a real mess to dissect. There are no doubt multiple vendors involved (each claiming that the problem is not their software). In the end, it's going to take somebody using some high end trace analysis tools to capture and analyze the traffic as it flows through the platform to identify just where the problem lies. That sort of analysis requires specialized resources and a lot of time. Being that the forums are likely pretty low on the totem pole (they're likely not big cash generators!) - I'm not the least bit surprised that we've been living with it for a while.

mstreck · October 17, 2012

The only lag I experienced today was when I tried to exit a thread by clicking on the "Backstage With the Band" link at the top of the page.

This happened at work. Now I'm home and there is no lag.

Nijyo · October 17, 2012

Originally Posted by wades_keys

You'll notice the latencies occurring in direct relation to posting. This is a database locking issue almost certainly.

That's why search gets disabled sometimes. The admins are dropping and rebuilding indexes on tables. Every DBA knows that to speed up OLTP operations you reduce index coverage or optimize existing indexes. Every index has to be updated on each insert, and that slows the insert down. MySQL does have a delayed insert feature (INSERT DELAYED) that can mitigate this at the expense of data integrity.

I'm sure the admin know this but I would advise them to look into horizontal table partitioning strategies.

Yeah, the errors I get from time to time are very obviously database transaction / connection errors.

wades_keys · October 17, 2012

Originally Posted by Potts

That's some pretty solid info Wade. Are you a programmer/IT guy? Just wondering...

Former corporate and independent IT guy. 14 years experience. These are just educated guesses but the symptoms match what I've seen before.

My specialty was programming (real programming, not just web scripting) but I have extensive database experience as well.

I'm guessing that this site doesn't really have all that much behind it. Hell I'd spec it out with two databases and perhaps 4 web server instances. App servers aren't really in vogue on Linux: CORBA came and went years back. My bet would be on a small MySQL cluster and a few virtualized apache web server instances.

And Patrick, you'd be surprised at the lack of DBA abilities in the MySQL space. Toolsets suck and skillsets aren't typically where they need to be: you just don't typically get guys that really understand locking and transactions in that space, due in my opinion to lack of real enterprise experience where data integrity is paramount.

And MySQL is unique in that you can mix different storage engines in one database. I would for example question the need for transactional tables in an app such as this - switching high traffic tables to the MyISAM table type can significantly boost performance in apps such as this.

wesg · October 18, 2012

Similar experience and background to Wade here, but I'm relatively new to Linux/MySQL having spent most of my career with Solaris/Oracle. I think Wade's on target and agree with his industry-practices observations.

Based on error messages, availability patterns, and resolution -- my gut says that something in their infrastructure is getting the crap pounded out of it, and that something is probably beating the {censored} out of the DB. Possibly password brute-forcing dudes looking to score mature accounts for spam posting.

It's certainly not related to posting/browsing volume from legitimate users; the lag and downtime patterns are just too wrong for that.

wades_keys · October 18, 2012

DB session pooling can be a real PITA when it comes to transactions. With PHP, it's REAL easy to get in a situation where the script never completes and leaves the connection open, and possibly leaves an un-committed transaction.

This is why IMO I'd dispense with locking altogether and go with MyISAM. Or go the Facebook route and don't even use an RDBMS LOL (probably not feasible).

Problem with that approach is you typically have a lot of engine-specific stuff in PHP with no middleware layer. If the dev used some middleware layer then all you'd have to do is comment out all of the code in the transactional methods. Otherwise it's a search and replace across the whole code base.
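
To make the "script dies and leaves an un-committed transaction" scenario concrete, here is a small sketch. It is written in Python with SQLite rather than PHP with MySQL, so treat it as an analogy only: SQLite locks the whole database file, which is coarser than InnoDB's row locks, and the post table and helper names are invented for the example. It shows one connection holding an open write transaction, a second writer failing while that lock is held, and the same write succeeding once the first connection rolls back. The transaction() wrapper is the sort of middleware habit that avoids the problem: commit on success, roll back on error, always close.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def transaction(path: str):
    """Open a connection, commit on success, roll back on error, always close."""
    conn = sqlite3.connect(path, timeout=0)  # timeout=0: fail at once if locked
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def try_insert(path: str, body: str) -> bool:
    """Return True if the insert went through, False if the database was locked."""
    try:
        with transaction(path) as conn:
            conn.execute("INSERT INTO post (body) VALUES (?)", (body,))
        return True
    except sqlite3.OperationalError:
        return False


def demo() -> tuple:
    """Return (write result while lock is held, write result after rollback)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        with transaction(path) as conn:
            conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)")

        # Simulate a script that stalled mid-request: it has written a row,
        # never committed, and its connection is still open.
        stuck = sqlite3.connect(path, timeout=0)
        stuck.execute("INSERT INTO post (body) VALUES ('half-finished')")

        while_locked = try_insert(path, "second poster")

        # Clean up the abandoned transaction, then retry the same write.
        stuck.rollback()
        stuck.close()
        after_rollback = try_insert(path, "second poster")
        return while_locked, after_rollback
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```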

nousername · October 19, 2012

This is when things are working well:

tracert acapella.harmony-central.com

Tracing route to acapella.harmony-central.com [74.63.163.229]
over a maximum of 30 hops:

1 2 8 ms 6 ms 7 ms 10.104.116.1
3 98 ms 84 ms 97 ms dtr03knwcwa-tge-0-0-0-1.knwc.wa.charter.com [96.34.105.36]
4 217 ms 167 ms 130 ms crr01yakmwa-tge-0-2-0-7.yakm.wa.charter.com [96.34.105.149]
5 124 ms 127 ms 119 ms bbr02snjsca-tge-0-2-0-10.snjs.ca.charter.com [96.34.2.90]
6 136 ms 115 ms 188 ms te0-2-0-4.ccr21.sjc03.atlas.cogentco.com [38.104.138.157]
7 36 ms 32 ms 36 ms te0-2-0-3.ccr21.sjc01.atlas.cogentco.com [66.28.4.73]
8 34 ms 33 ms 40 ms 154.54.86.177
9 422 ms 248 ms 227 ms te2-1.ccr01.slc01.atlas.cogentco.com [154.54.84.30]
10 273 ms 205 ms 208 ms te8-2.ccr01.den01.atlas.cogentco.com [154.54.82.222]
11 210 ms 175 ms 215 ms te4-2.ccr01.den03.atlas.cogentco.com [154.54.83.30]
12 81 ms 60 ms 58 ms 38.122.114.30
13 297 ms 250 ms 224 ms 66.51.1.253
14 222 ms 147 ms 78 ms teng-01-01.crsw02.den03.viawest.net [66.51.0.154]
15 81 ms 81 ms 81 ms 66.51.0.210
16 167 ms 169 ms 253 ms 74.63.139.36
17 164 ms 190 ms 83 ms 74.63.163.229

Trace complete.

wades_keys · October 20, 2012

Hey admin, your scripts are inserting several unprintable characters into the data stream and this is causing issues with the HTML rendering in Firefox. You might want to check that out. This just started happening today.

Anyway, wow: that's a pretty {censored}ty route you got there....

Here's mine, from Lou KY and in this case you can see that the Apache server itself was under load and took almost a half second to respond.

Tracing route to acapella.harmony-central.com [74.63.163.229]
over a maximum of 30 hops:

1 * * * Request timed out.
2 15 ms 12 ms 21 ms 74-128-22-165.dhcp.insightbb.com [74.128.22.165]
3 14 ms 14 ms 10 ms 74.128.9.225
4 24 ms 58 ms 27 ms xe-8-1-3.edge5.Atlanta2.Level3.net [4.59.12.49]
5 26 ms 49 ms 46 ms vlan52.ebr2.Atlanta2.Level3.net [4.69.150.126]
6 42 ms 26 ms 23 ms ae-73-73.ebr3.Atlanta2.Level3.net [4.69.148.253]
7 51 ms 61 ms 43 ms ae-7-7.ebr3.Dallas1.Level3.net [4.69.134.21]
8 * 298 ms 396 ms ae-93-93.csw4.Dallas1.Level3.net [4.69.151.169]
9 46 ms 42 ms 41 ms ae-92-92.ebr2.Dallas1.Level3.net [4.69.151.166]
10 242 ms 333 ms 273 ms ae-2-2.ebr1.Denver1.Level3.net [4.69.132.105]
11 58 ms 96 ms 55 ms ae-12-51.car2.Denver1.Level3.net [4.69.147.68]
12 159 ms 68 ms 64 ms VIAWEST-INT.car2.Denver1.Level3.net [4.53.14.222]
13 57 ms 53 ms 51 ms teng-02-02.crsw01.den03.viawest.net [66.51.0.233]
14 52 ms 56 ms 73 ms 66.51.0.198
15 57 ms 70 ms 57 ms 74.63.139.36
16 460 ms 396 ms 512 ms 74.63.163.229

Trace complete.
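
If you'd rather not eyeball a trace like the ones above, here is a small Python sketch that parses Windows tracert output and flags hops whose average latency is over a threshold. The 200 ms default is an arbitrary assumption (no acceptable per-node figure was ever settled in this thread), and the function names are just for illustration. Keep in mind that a slow intermediate hop followed by fast later hops often just means that router gives low priority to answering probes, so the final hop is usually the number that matters.

```python
import re
from statistics import mean

# One hop line: hop number, three probes (each "<n> ms" or "*"), then the host.
HOP_RE = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+\s*ms|\*)\s+){3})(.*)$")


def parse_hop(line: str):
    """Return (hop, [latencies in ms], host), or None if this is not a hop line."""
    match = HOP_RE.match(line)
    if match is None:
        return None
    hop = int(match.group(1))
    # Timed-out probes ("*") contribute no latency value.
    latencies = [int(ms) for ms in re.findall(r"(\d+)\s*ms", match.group(2))]
    return hop, latencies, match.group(3).strip()


def slow_hops(text: str, threshold_ms: float = 200.0) -> list:
    """List (hop, rounded average ms, host) for hops averaging above the threshold."""
    slow = []
    for line in text.splitlines():
        parsed = parse_hop(line)
        if parsed is None:
            continue
        hop, latencies, host = parsed
        if latencies and mean(latencies) > threshold_ms:
            slow.append((hop, round(mean(latencies)), host))
    return slow


# A few hops copied from the trace in this post.
SAMPLE = """\
Tracing route to acapella.harmony-central.com [74.63.163.229]
over a maximum of 30 hops:
1 * * * Request timed out.
8 * 298 ms 396 ms ae-93-93.csw4.Dallas1.Level3.net [4.69.151.169]
9 46 ms 42 ms 41 ms ae-92-92.ebr2.Dallas1.Level3.net [4.69.151.166]
10 242 ms 333 ms 273 ms ae-2-2.ebr1.Denver1.Level3.net [4.69.132.105]
15 57 ms 70 ms 57 ms 74.63.139.36
16 460 ms 396 ms 512 ms 74.63.163.229
Trace complete.
"""

if __name__ == "__main__":
    for hop, avg_ms, host in slow_hops(SAMPLE):
        print(f"hop {hop}: ~{avg_ms} ms  {host}")
```

On the sample lines this flags hops 8, 10 and 16; hops 8 and 10 are followed by fast hops, while hop 16 is the server itself.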
