{"id":6312,"date":"2013-03-12T19:57:53","date_gmt":"2013-03-12T19:57:53","guid":{"rendered":"https:\/\/noi3.org\/site\/?p=6312"},"modified":"2013-03-12T19:57:53","modified_gmt":"2013-03-12T19:57:53","slug":"first-5-minutes-troubleshooting-a-server","status":"publish","type":"post","link":"https:\/\/site.noi3.org\/?p=6312","title":{"rendered":"First 5 Minutes Troubleshooting A Server"},"content":{"rendered":"<section>\n<p style=\"text-align: justify;\"> \t\t<img loading=\"lazy\" decoding=\"async\" alt=\"\" class=\"aligncenter  wp-image-5798\" height=\"381\" src=\"http:\/\/slashdot.org\/topic\/wp-content\/uploads\/2012\/11\/shutterstock_84500002.jpg\" title=\"Recycle Symbol\" width=\"400\" \/><\/p>\n<p style=\"text-align: justify;\"> \t\tBack when our team was dealing with operations, optimization and scalability at <a href=\"http:\/\/wiredcraft.com\">our previous company<\/a>, we had our fair share of troubleshooting poorly performing applications and infrastructures of various sizes, often large (think CNN or the World Bank). Tight deadlines, \u201cexotic\u201d technical stacks and lack of information usually made for memorable experiences.<\/p>\n<p style=\"text-align: justify;\"> \t\tThe cause of the issues was rarely obvious: here are a few things we usually got started with.<\/p>\n<p> \t <!--more-->  \t<\/p>\n<h3 id=\"get_some_context\" style=\"text-align: justify;\"> \t\tGet some context<\/h3>\n<p style=\"text-align: justify;\"> \t\tDon\u2019t rush onto the servers just yet: first figure out how much is already known about the server and the specifics of the issue. You don\u2019t want to waste your time (trouble)shooting in the dark.<\/p>\n<p style=\"text-align: justify;\"> \t\tA few \u201cmust haves\u201d:<\/p>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tWhat exactly are the symptoms of the issue? Unresponsiveness? Errors?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhen was the problem first noticed?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs it reproducible?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tAny pattern (e.g. happens every hour)?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhat were the latest changes on the platform (code, servers, stack)?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tDoes it affect a specific user segment (logged in, logged out, geographically located\u2026)?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs there any documentation for the architecture (physical and logical)?<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<strong>Is there a monitoring platform?<\/strong> Munin, Zabbix, Nagios, <a href=\"http:\/\/newrelic.com\/\">New Relic<\/a>\u2026 Anything will do.<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<strong>Any (centralized) logs?<\/strong> Loggly, Airbrake, Graylog\u2026<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"> \t\tThe last two are the most convenient sources of information, but don\u2019t expect too much: they\u2019re also the ones most often painfully absent. 
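When there is no monitoring or centralized logging to fall back on, it can help to grab a quick baseline by hand before touching anything, so later observations have something to be compared against. A minimal sketch (the `/tmp` path and file name are only examples):

```shell
#!/bin/sh
# Save a timestamped snapshot of basic system state before changing
# anything. (The /tmp path is just an example; any scratch spot works.)
snapshot="/tmp/baseline-$(date +%Y%m%d-%H%M%S).txt"
{
  date      # when the snapshot was taken
  uptime    # load averages and time since boot
  who       # who is currently logged in
} > "$snapshot"
echo "baseline saved to $snapshot"
```

Ten seconds of work now, and a before/after comparison later when you want to know whether your fixes actually changed anything.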
Tough luck, make a note to get this corrected and move on.<\/p>\n<h3 id=\"whos_there\" style=\"text-align: justify;\"> \t\tWho\u2019s there?<\/h3>\n<pre style=\"text-align: justify;\"><code>$ w\n$ last<\/code><\/pre>\n<p style=\"text-align: justify;\"> \t\tNot critical, but you\u2019d rather not be troubleshooting a platform others are playing with. One cook in the kitchen is enough.<\/p>\n<h3 id=\"what_was_previously_done\" style=\"text-align: justify;\"> \t\tWhat was previously done?<\/h3>\n<pre style=\"text-align: justify;\"><code>$ history<\/code><\/pre>\n<p style=\"text-align: justify;\"> \t\tAlways a good thing to look at, combined with the knowledge of who was on the box earlier on. Be responsible by all means: being admin shouldn\u2019t allow you to break anyone\u2019s privacy.<\/p>\n<p style=\"text-align: justify;\"> \t\tA quick mental note for later: you may want to update the environment variable <code>HISTTIMEFORMAT<\/code> to keep track of the time those commands were run. Nothing is more frustrating than investigating an outdated list of commands\u2026<\/p>\n<h3 id=\"what_is_running\" style=\"text-align: justify;\"> \t\tWhat is running?<\/h3>\n<pre style=\"text-align: justify;\"><code>$ pstree -a\n$ ps aux<\/code><\/pre>\n<p style=\"text-align: justify;\"> \t\tWhile <code>ps aux<\/code> tends to be pretty verbose, <code>pstree -a<\/code> gives you a nice condensed view of what is running and who called what.<\/p>\n<h3 id=\"listening_services\" style=\"text-align: justify;\"> \t\tListening services<\/h3>\n<pre style=\"text-align: justify;\"><code>$ netstat -ntlp\n$ netstat -nulp\n$ netstat -nxlp<\/code><\/pre>\n<p style=\"text-align: justify;\"> \t\tI tend to prefer running them separately, mainly because I don\u2019t like looking at all the services at the same time. 
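Circling back to the <code>HISTTIMEFORMAT</code> note above, the tweak is a one-liner in the shell profile; a minimal sketch (the exact format string is just one sensible choice):

```shell
# Add to ~/.bashrc so that 'history' records when each command was run.
# %F expands to YYYY-MM-DD and %T to HH:MM:SS (strftime format).
export HISTTIMEFORMAT='%F %T  '
# From now on, 'history' prefixes each entry with its date and time.
```

Two trailing spaces in the format string keep the timestamp visually separated from the command itself.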
<code>netstat -nalp<\/code> will do too, though. Even then, I\u2019d omit the <code>numeric<\/code> option (hostnames are more readable, IMHO).<\/p>\n<p style=\"text-align: justify;\"> \t\tIdentify the running services and whether they\u2019re expected to be running or not. Look for the various listening ports. You can always match the PID of the process with the output of <code>ps aux<\/code>; this can be quite useful, especially when you end up with 2 or 3 Java or Erlang processes running concurrently.<\/p>\n<p style=\"text-align: justify;\"> \t\tWe usually prefer to have more or less specialized boxes, with a low number of services running on each one of them. If you see three dozen listening ports, you should probably make a mental note to investigate this further and see what can be cleaned up or reorganized.<\/p>\n<h3 id=\"cpu_and_ram\" style=\"text-align: justify;\"> \t\tCPU and RAM<\/h3>\n<pre style=\"text-align: justify;\"><code>$ free -m\n$ uptime\n$ top\n$ htop<\/code><\/pre>\n<p style=\"text-align: justify;\"> \t\tThis should answer a few questions:<\/p>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tAny free RAM? Is it swapping?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs there still some CPU left? How many CPU cores are available on the server? Is one of them overloaded?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhat is causing the most load on the box? What is the load average?<\/li>\n<\/ul>\n<h3 id=\"hardware\" style=\"text-align: justify;\"> \t\tHardware<\/h3>\n<pre style=\"text-align: justify;\"><code>$ lspci\n$ dmidecode\n$ ethtool<\/code><\/pre>\n<p style=\"text-align: justify;\"> \t\tThere are still a lot of bare-metal servers out there; this should help with:<\/p>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tIdentifying the RAID card (with BBU?), the CPU, the available memory slots. 
This may give you some hints on potential issues and\/or performance improvements.<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs your NIC properly set up? Are you running half-duplex? At 10Mbps? Any TX\/RX errors?<\/li>\n<\/ul>\n<h3 id=\"io_performances\" style=\"text-align: justify;\"> \t\tIO Performances<\/h3>\n<pre style=\"text-align: justify;\"><code>$ iostat -kx 2\n$ vmstat 2 10\n$ mpstat 2 10\n$ dstat --top-io --top-bio<\/code><\/pre>\n<p style=\"text-align: justify;\"> \t\tVery useful commands to analyze the overall performance of your backend:<\/p>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tCheck the disk usage: does the box have a filesystem\/disk at 100% usage?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs the swap currently in use (si\/so)?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhat is using the CPU: system? User? Stolen (VM)?<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<code>dstat<\/code> is my all-time favorite. What is using the IO? Is MySQL sucking up the resources? Is it your PHP processes?<\/li>\n<\/ul>\n<h3 id=\"mount_points_and_filesystems\" style=\"text-align: justify;\"> \t\tMount points and filesystems<\/h3>\n<pre style=\"text-align: justify;\"><code>$ mount\n$ cat \/etc\/fstab\n$ vgs\n$ pvs\n$ lvs\n$ df -h\n$ lsof +D \/ \/* beware not to kill your box *\/<\/code><\/pre>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tHow many filesystems are mounted?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs there a dedicated filesystem for some of the services? (MySQL by any chance..?)<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhat are the filesystem mount options: noatime? default? 
Have some filesystems been re-mounted read-only?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tDo you have any disk space left?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tAre there any big (deleted) files that haven\u2019t been flushed yet?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tDo you have room to extend a partition if disk space is an issue?<\/li>\n<\/ul>\n<h3 id=\"kernel_interrupts_and_network_usage\" style=\"text-align: justify;\"> \t\tKernel, interrupts and network usage<\/h3>\n<pre style=\"text-align: justify;\"><code>$ sysctl -a | grep ...\n$ cat \/proc\/interrupts\n$ cat \/proc\/net\/ip_conntrack \/* may take some time on busy servers *\/\n$ netstat\n$ ss -s<\/code><\/pre>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tAre your IRQs properly balanced across the CPUs? Or is one of the cores overloaded because of network interrupts, the RAID card, \u2026?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhat is swappiness set to? 60 is good enough for workstations, but when it comes to servers this is generally a bad idea: you do not want your server to swap\u2026 ever. 
Otherwise the swapping process will be locked while data is read from\/written to the disk.<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs <code>conntrack_max<\/code> set to a high enough number to handle your traffic?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tHow long do you maintain TCP connections in the various states (<code>TIME_WAIT<\/code>, \u2026)?<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<code>netstat<\/code> can be a bit slow to display all the existing connections; you may want to use <code>ss<\/code> instead to get a summary.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"> \t\tHave a look at <a href=\"http:\/\/www.lognormal.com\/blog\/2012\/09\/27\/linux-tcpip-tuning\/\">Linux TCP tuning<\/a> for some more pointers on how to tune your network stack.<\/p>\n<h3 id=\"system_logs_and_kernel_messages\" style=\"text-align: justify;\"> \t\tSystem logs and kernel messages<\/h3>\n<pre style=\"text-align: justify;\"><code>$ dmesg\n$ less \/var\/log\/messages\n$ less \/var\/log\/secure\n$ less \/var\/log\/auth<\/code><\/pre>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tLook for any error or warning messages; is it complaining about the number of connections in your conntrack being too high?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tDo you see any hardware or filesystem errors?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tCan you correlate the time of those events with the information provided beforehand?<\/li>\n<\/ul>\n<h3 id=\"cronjobs\" style=\"text-align: justify;\"> \t\tCronjobs<\/h3>\n<pre style=\"text-align: justify;\"><code>$ ls \/etc\/cron* + cat\n$ for user in $(cut -f1 -d: \/etc\/passwd); do crontab -l -u $user; done<\/code><\/pre>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tAre there any cron jobs running too often?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tIs there 
some user\u2019s cron that is \u201chidden\u201d from common eyes?<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWas there a backup of some sort running at the time of the issue?<\/li>\n<\/ul>\n<h3 id=\"application_logs\" style=\"text-align: justify;\"> \t\tApplication logs<\/h3>\n<p style=\"text-align: justify;\"> \t\tThere is a lot to analyze here, but it\u2019s unlikely you\u2019ll have time to be exhaustive at first. Focus on the obvious ones, for example in the case of a LAMP stack:<\/p>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\t<strong>Apache &amp; Nginx<\/strong>; chase down access and error logs, look for <code>5xx<\/code> errors, look for possible <code>limit_zone<\/code> errors.<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<strong>MySQL<\/strong>; look for errors in the <code>mysql.log<\/code>, traces of corrupted tables, or an InnoDB repair process in progress. Look at the slow logs and determine whether there are disk\/index\/query issues.<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<strong>PHP-FPM<\/strong>; if you have php-slow logs on, dig in and try to find errors (php, mysql, memcache, \u2026). If not, turn them on.<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<strong>Varnish<\/strong>; in <code>varnishlog<\/code> and <code>varnishstat<\/code>, check your hit\/miss ratio. Are you missing some rules in your config that let end-users hit your backend instead?<\/li>\n<li style=\"text-align: justify;\"> \t\t\t<strong>HA-Proxy<\/strong>; what is your backend status? Are your health-checks successful? 
Do you hit your max queue size on the frontend or your backends?<\/li>\n<\/ul>\n<h3 id=\"conclusion\" style=\"text-align: justify;\"> \t\tConclusion<\/h3>\n<p style=\"text-align: justify;\"> \t\tAfter these first 5 minutes (give or take 10 minutes) you should have a better understanding of:<\/p>\n<ul>\n<li style=\"text-align: justify;\"> \t\t\tWhat is running.<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhether the issue seems to be related to IO\/hardware\/networking or configuration (bad code, kernel tuning, \u2026).<\/li>\n<li style=\"text-align: justify;\"> \t\t\tWhether there\u2019s a pattern you recognize: for example, bad use of DB indexes, or too many Apache workers.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"> \t\tYou may even have found the actual root cause. If not, you should be in a good place to start digging further, with the knowledge that you\u2019ve covered the obvious.<\/p>\n<hr \/>\n<p> \t\t<a href=\"http:\/\/devo.ps\/blog\/2013\/03\/06\/troubleshooting-5minutes-on-a-yet-unknown-box.html\">Original article<\/a><\/p>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>Back when our team was dealing with operations, optimization and scalability at our previous company, we had our fair share of troubleshooting poorly performing 
applications&hellip;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30],"tags":[969,1220,206],"class_list":["post-6312","post","type-post","status-publish","format-standard","hentry","category-informatica","tag-comanda","tag-linie","tag-server"],"_links":{"self":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/posts\/6312","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6312"}],"version-history":[{"count":0,"href":"https:\/\/site.noi3.org\/index.php?rest_route=\/wp\/v2\/posts\/6312\/revisions"}],"wp:attachment":[{"href":"https:\/\/site.noi3.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6312"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6312"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/site.noi3.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6312"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}