Cal Henderson’s talk at FOWA: lessons learned at flickr

Notes from Cal Henderson’s talk at the Future of Web Apps conference in San Francisco:

(note, a PDF of his slides can also be found here):

I’m not going to talk much about scaling; you can buy my book or this other book.

Flickr’s come a long way

Diagram: Things we already know, things we needed to know (there’s a small intersection point and it’s labeled as “HTML”)

What we learned wasn’t unique (we weren’t unique and beautiful snowflakes)

Advance notice of outages (if we’re going to be down, we let people know 2 hours ahead of time and we put it across the entire site)

Disable stuff by component — put architecture in place so you can take one thing out of service and keep everything else running. Obviously, this can’t be done with everything.
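
One way to picture disabling by component is a small switch that each request handler checks before doing its work. The sketch below is only an illustration: the component names (`uploads`, `browsing`), the `DISABLED_COMPONENTS` set and the `handle_request` function are all hypothetical and are not taken from the talk or from Flickr's code.

```python
# Hypothetical component switchboard; names are invented for illustration.
DISABLED_COMPONENTS = {"uploads"}


class ComponentDisabled(Exception):
    """Raised when a request needs a component that is switched off."""


def require_component(name, disabled=DISABLED_COMPONENTS):
    """Fail fast if the named component is currently out of service."""
    if name in disabled:
        raise ComponentDisabled(f"{name} is temporarily unavailable")


def handle_request(path, disabled=DISABLED_COMPONENTS):
    """Route a request, checking only the component that this path needs."""
    component = "uploads" if path.startswith("/upload") else "browsing"
    try:
        require_component(component, disabled)
    except ComponentDisabled as exc:
        # Only this feature is down; the rest of the site keeps serving.
        return 503, str(exc)
    return 200, f"served {path} via {component}"
```

With uploads switched off, `handle_request("/upload/new")` returns a 503 with a short explanation, while `handle_request("/photos/cal/")` is still served normally.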

Tell your users
– communicate what’s going wrong, be completely open
– if you are down for xyz reason, explain that xyz reason (obviously, if you have a crash, don’t go post your stack trace)

Clear escalation paths
– large applications are going to break
– the most important thing you can do is have escalation paths: what happens when things break?
– who are people supposed to call when there is something wrong? If Stewart Butterfield is surfing at 3am and photos aren’t uploading correctly, who should he call: Cal? The DBA? Or our 24×7 team? (ha)

In-process alerts

Communication! Creative ways to handle this… put a coloring page up during an outage, clear communication is cheap!

Stats tracking is hard (and important!)
– if you want to know what’s going to happen in the future, you have to look at the past
– More graphs, many more graphs; you can fit huge, dense sets of information on graphs
– tens of thousands of real-time graphs of things

Good tools for graphing stuff
– Cacti (ajax zoom stuff)
– Ganglia (massive stat tracking, friendster using)
— very good for tracking stats for network usage, memory, disk, etc.

Web stats
– usually bad, all of the free ones are awful
– measuremap is good for certain things (blogs)
– webtrends is pretty good, but it gets pretty expensive

Create dashboards!
– Konfabulator, sidebar-like stuff

Visual Complexity
http://www.visualcomplexity.com/vc/
how to visualize large amounts of data

APIs = cool
Who knew?

APIs force clean interfaces– UK power socket photograph
APIs allow for easy regression testing
Automated regression testing
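
A rough sketch of why an API makes regression testing easy: record the responses from a known-good release, replay the same calls against the new build, and report anything that changed. Everything here is assumed for illustration, including the `get_photo_info` stand-in method and the recorded data; a real check would call the actual service.

```python
def get_photo_info(photo_id):
    """Stand-in for a hypothetical API method; a real check would call the service."""
    return {"id": photo_id, "title": f"Photo {photo_id}", "tags": ["fowa"]}


# Responses captured from a known-good release (invented here for illustration).
RECORDED = [
    ({"photo_id": 1}, {"id": 1, "title": "Photo 1", "tags": ["fowa"]}),
    ({"photo_id": 2}, {"id": 2, "title": "Photo 2", "tags": ["fowa"]}),
]


def run_regression(api_method, recorded):
    """Replay each recorded call and collect the ones whose response changed."""
    failures = []
    for params, expected in recorded:
        actual = api_method(**params)
        if actual != expected:
            failures.append((params, expected, actual))
    return failures
```

An empty list from `run_regression(get_photo_info, RECORDED)` means the interface still behaves as recorded; any entry in the list points at the exact call that regressed.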

Beware of abuse of APIs: not a whole lot harder to abuse than just web-based scraping, but… be wary of people who are bad at programming (“I’m making 20 calls per second? I thought I was making one per hour… Ah, decimal place” (ha))

Track usage of APIs carefully
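
As a minimal sketch of tracking usage per API key, the class below keeps recent call timestamps for each key in a sliding window and flags a key that exceeds a limit. The class name `UsageTracker`, the limit and the window size are assumptions made for this example, not anything described in the talk.

```python
from collections import defaultdict, deque


class UsageTracker:
    """Sliding-window call counter per API key (illustrative only)."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.calls = defaultdict(deque)  # api_key -> recent call timestamps

    def record(self, api_key, now):
        """Record a call at time `now` (seconds); return True if within the limit."""
        recent = self.calls[api_key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        return len(recent) <= self.max_calls

    def rate(self, api_key):
        """Calls per second for this key, over the timestamps currently retained."""
        return len(self.calls[api_key]) / self.window_seconds
```

A client that accidentally makes 20 calls per second against a limit of 5 per second gets flagged on its sixth call, which is also a good moment to log the key and contact its owner.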

URLs

  • I heart (clean) URLs I heart (clean) URLs I heart (clean) URLs I heart (clean) URLs I heart (clean) URLs — I’m obsessive about them, understandable by humans (maybe not all humans) but at least all humans in this room
  • (under 60 characters means things will behave better in emails)
  • (put more important stuff on the left and less important stuff on the right so that if something gets cut off, it might still work)
  • Never break URLs: December 20th: amazon.com changed the format for wishlist URLs and broke all wishlists, don’t break things, don’t break things!
  • Careful of middle tiers: that is, what if I remove the 1234 from http://www.amazon.com/products/books/1234? What if I remove /books/? If your URLs make sense, what are the bits that fit between one end and the other end?
  • Don’t navigate by URLs: when developing we’ve released features that aren’t linked to from anywhere because we refer to stuff by URLs– ack!
  • Don’t expose auto-incremented variables: makes scraping every page from the system super easy, maybe this isn’t so important
  • /noun/verb/: URLs should go from least specific to most specific from left to right; if you have an action as a part of your URL, that action (verb) should be at the far right.
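
Two of the URL points above lend themselves to a quick check. The sketch below lists the "middle tier" URLs you get by chopping segments off the right-hand end, so each one can be checked for a sensible page, and tests a URL against the 60-character guideline. The function names and the trailing-slash convention are assumptions made for this example.

```python
from urllib.parse import urlsplit


def middle_tiers(url):
    """Return every shorter URL obtained by chopping path segments off the right."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    base = f"{parts.scheme}://{parts.netloc}"
    tiers = []
    for depth in range(len(segments) - 1, -1, -1):
        path = "/".join(segments[:depth])
        # Trailing slash on each tier is an assumed convention.
        tiers.append(f"{base}/{path}/" if path else f"{base}/")
    return tiers


def fits_in_email(url, limit=60):
    """Check the URL against the under-60-characters guideline from the talk."""
    return len(url) < limit
```

For the Amazon-style example, `middle_tiers("http://www.amazon.com/products/books/1234")` yields the `/products/books/`, `/products/` and `/` URLs in that order; if any of them would return an error page, the URL scheme has a hole in the middle.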

Hiring people, developers, is really tough
– Good people have jobs
– maybe you can poach people away from something they are already doing…
– Read Joel’s recent article on this
Giving notice/moving house

The older the product, the longer the induction (induction: taking a new employee and turning them into an awesome engineer)

Documents saved my life
– one way to reduce the time required for induction is to document!

Release early, release often
– new features and new work, not new products (so what Carl from Google Calendar said yesterday is logical and makes sense too)

Old days of the Internet– Under construction sign everywhere…–> it’s been replaced by an endless beta, “nothing is finished on the Internet” (Cal didn’t say this as a bad thing but as a new application/service software paradigm)

Small increments, visible progress: release in small bits; every time you release, there are fewer moving parts

Lightweight QA, no safety net (we don’t test a whole lot of stuff and we don’t have a QA department… flickr doesn’t have one! We have a bit of an odd process: someone will build it, test it themselves and then be responsible for releasing it). No back and forth of building it, then testing, then building it again, then testing again, and then on to a release manager… too SLOW!!! That is the correct model for really large teams, but if you have three developers, that’s not really large. Without QA, there’s no safety net… true, but we’re safeguarded against this by releasing early and releasing often.

At Flickr: developers own processes and not the features: developers own stuff

Avoid branches

Shared development environment –> all three or four or five developers work on it at the same time, so you find conflicts much quicker and you know you won’t have to spend a day, after working for three, merging your code in with the trunk.

No developer is an island: everyone works together

One touch deployment

Automating everything — army of robots, robots and scripts very important

Many tools –> componentize (army of robots!)

Always deployable –> agile, always keep the trunk deployable (being able to release once a month != agile, at least not for web apps)

Pragmatic –> make it work. if it’s maintainable, that’s good too.

Beautiful code –> not a priority, ideological purity is not a priority, I’d rather get something that works
————————–

This stuff doesn’t work everywhere
Takes the right people and process

Like extreme programming, it doesn’t start working until you do it all… but then it pays off