Many years ago I took a secure coding class. I mainly remember one thing from the course: “Assume all user input is evil.” This is fine because the instructors did say “If you only remember one thing from the course, remember this!”
Sometimes developers in an overzealous commitment to security decide to html encode all input, or strip all special characters such as “<” and “>”. There are worse things to be overzealously committed to, but we can do better.
Instead of sanitizing every field every time, we can say that the check depends on what the content is supposed to represent and how it will be displayed. For example, in a user display name, a link tag is probably not valid to have as part of the name, but a link tag or style tag may be very helpful and relevant in a displayed product description. For this reason, most HTML sanitizers are very flexible about how they can be configured, so that we can easily allow different kinds of HTML in different places.
Now that we’ve agreed we need to sanitize user-supplied text, the next question is when to do the sanitizing. You could make an argument to sanitize user-supplied text on input (when the text is first persisted) or on output (just before it is rendered on subsequent pages). A reason for sanitizing input is that philosophically it makes sense to catch potential security issues as early as possible, and you would avoid storing malicious input on your system at all. This takes malicious input and turns it into a validation issue, just like storing numbers for a zip code is a validation issue. A reason for sanitizing output is that the site then has the option to change sanitization policy dynamically. At some point html may be considered unsafe (say, allowing links in a product description) and later it could be considered safe (due to a business decision to allow it, or due to discovery of something you missed in your sanitization policy). If you sanitize only output, you are free to change or fix the policy whenever you want.
Some libraries are designed to operate on input. Lacking a strong driver to be able to change the sanitization policy dynamically, I would favor sanitizing on input as well. Stay tuned for Part II where we’ll look at a slick way to sanitize input as an input validation problem!