Facebook's Engineering Culture

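Someone published "How Facebook Ships Code", and I found the part about Facebook's engineer-driven culture especially interesting, so I translated it. (I just googled it and other translations are already online. Those people are fast.)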

* as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago. Nearly doubling staff in under a year!

As of June 2010, FB had about 2,000 employees, nearly 1,000 more than ten months earlier.

* the two largest teams are Engineering and Ops, with roughly 400-500 team members each. Between the two they make up about 50% of the company.

The two largest teams are Engineering and Ops, each with 400-500 members. Together they make up about half the company.

* product manager to engineer ratio is roughly 1-to-7 or 1-to-10

The ratio of product managers to engineers is roughly 1:7 to 1:10.

* all engineers go through 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers. estimate 10% of each boot camp’s trainee class don’t make it and are counseled out of the organization.

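New engineers go through a 4-to-6-week Boot Camp, where they learn the FB system by fixing bugs and attending lectures from senior engineers. About 10% of each class does not make it through and is counseled out.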

* after boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offenses”, e.g., sharing private user data)

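After Boot Camp, every engineer gets access to the live production DB. This comes with the standard "with great power comes great responsibility" lecture and a clear list of fireable offenses, such as sharing private user data.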

* [EDIT thx fryfrog] “There are also very good safe guards in place to prevent anyone at the company from doing the horrible sorts of things you can imagine people have the power to do being on the inside. If you have to “become” someone who is asking for support, this is logged along with a reason and closely reviewed. Straying here is not tolerated, period.”

* any engineer can modify any part of FB’s code base and check-in at-will

Any engineer can modify any part of FB's code base and check in at will.

* very engineering driven culture. ”product managers are essentially useless here.” is a quote from an engineer. engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime.

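An extremely engineering-driven culture. One engineer put it this way: "product managers are essentially useless here." Engineers can change specs mid-process, reorder projects, and inject new feature ideas at any time.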

* during monthly cross-team meetings, the engineers are the ones who present progress reports. product marketing and product management attend these meetings, but if they are particularly outspoken, there is actually feedback to the leadership that “product spoke too much at the last meeting.” they really want engineers to publicly own products and be the main point of contact for the things they built.

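At the monthly cross-team meetings, engineers are the ones who present progress reports. Product marketing and product managers attend too, but if they are particularly outspoken, leadership hears feedback like "product spoke too much at the last meeting." Engineers are expected to publicly own their products and be the main point of contact for what they built.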

* resourcing for projects is purely voluntary.
o a PM lobbies group of engineers, tries to get them excited about their ideas.
o Engineers decide which ones sound interesting to work on.
o Engineer talks to their manager, says “I’d like to work on these 5 things this week.”
o Engineering Manager mostly leaves engineers’ preferences alone, may sometimes ask that certain tasks get done first.
o Engineers handle entire feature themselves — front end javascript, backend database code, and everything in between. If they want help from a Designer (there are a limited staff of dedicated designers available), they need to get a Designer interested enough in their project to take it on. Same for Architect help. But in general, expectation is that engineers will handle everything they need themselves.
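Staffing for projects is purely voluntary on the engineers' part:
  • A PM lobbies engineers, trying to get them excited about an idea

  • Engineers decide for themselves which PM's work sounds interesting

  • Each engineer then tells their manager: "I'd like to work on these 5 things this week"

  • The engineering manager mostly leaves those preferences alone, occasionally asking that certain tasks get done first

  • Engineers handle the entire feature themselves, from JavaScript to the database and everything in between. If they want help from a designer (FB has very few dedicated designers), they have to get a designer interested enough to join the project. The same goes for help from an architect. In general, though, engineers are expected to do everything themselves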

* arguments about whether or not a feature idea is worth doing or not generally get resolved by just spending a week implementing it and then testing it on a sample of users, e.g., 1% of Nevada users.

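Arguments over whether a feature is worth building are not dragged out. Someone spends a week implementing it, and then it is tested on a small sample of users (for example, 1% of Nevada users) to decide.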
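The article does not say how the "1% of Nevada users" sample is chosen. A minimal sketch of one common approach, stable hash bucketing, might look like the following. The function name `in_experiment`, the `"NV"` region code, and the use of SHA-256 are assumptions made for illustration, not Facebook's actual mechanism.

```python
import hashlib


def in_experiment(user_id: str, region: str, feature: str,
                  target_region: str = "NV", percent: float = 1.0) -> bool:
    """Return True if this user falls in the sampled slice for `feature`.

    Hypothetical helper: region codes and the hashing scheme are assumptions.
    """
    if region != target_region:
        return False
    # Hash the feature name together with the user id so that every feature
    # gets its own independent sample, and a user's answer never changes.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # bucket in 0..9999
    # percent=1.0 selects buckets 0..99, i.e. about 1% of users.
    return bucket < percent * 100
```

Because the bucket depends only on the feature name and the user id, the same user sees the same variant on every request, and raising `percent` later only adds users to the sample without removing anyone.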

* engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is. can be hard to get engineers excited about working on front-end projects and user interfaces. this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.” At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.

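Engineers generally prefer to work on infrastructure, scalability, and so-called "hard problems": the things that earn prestige. It is hard to get an engineer fired up about tinkering with user interfaces. Some consumer businesses are the opposite: everyone wants to work on what customers touch, so they can point at part of a page and say "I built that." At FB, back-end work such as news feed algorithms, ad-targeting algorithms, and memcached optimizations is what engineers most want to do. (qyb: this passage is hard to translate. Can anyone tell me what a "juicy project" is??)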

* commits that affect certain high-priority features (e.g., news feed) get code reviewed before merge. News Feed is important enough that Zuckerberg reviews any changes to it, but that’s an exceptional case.
Commits that touch certain high-priority features must be code reviewed before merge. News Feed is important enough that Zuckerberg personally reviews every change to it.
* [CORRECTION -- thx epriest] “There is mandatory code review for all changes (i.e., by one or more engineers). I think the article is just saying that Zuck doesn’t look at every change personally.”

* [CORRECTION thx fryfrog] “All changes are reviewed by at least one person, and the system is easy for anyone else to look at and review your code even if you don’t invite them to. It would take intentionally malicious behavior to get un-reviewed code in.”

* no QA at all, zero. engineers responsible for testing, bug fixes, and post-launch maintenance of their own work. there are some unit-testing and integration-testing frameworks available, but only sporadically used.
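FB has no QA, literally zero. Engineers are responsible for testing, bug fixes, and post-launch maintenance of their own work. Unit-testing and integration-testing frameworks do exist, but they are rarely used.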
* [CORRECTION thx fryfrog] “I would also add that we do have QA, just not an official QA group. Every employee at an office or connected via VPN is using a version of the site that includes all the changes that are next in line to go out. This version is updated frequently and is usually 1-12 hours ahead of what the world sees. All employees are strongly encouraged to report any bugs they see and these are very quickly actioned upon.”

"I should add that FB does have QA, just not an official QA team. Every employee in an office or on VPN uses a version of the site that includes the changes next in line to ship. That version is updated frequently and is usually 1-12 hours ahead of production. All employees are strongly encouraged to report any bug they run into, and those reports are acted on very quickly."

* re: surprise at lack of QA or automated unit tests — “most engineers are capable of writing bug-free code. it’s just that they don’t have an incentive to do so at most companies. when there’s a QA department, it’s easy to just throw it over to them to find the errors.” [EDIT: please note that this was subjective opinion, I chose to include it in this post because of the stark contrast that this draws with standard development practice at other companies]
* [CORRECTION thx epriest] “We have automated testing, including “push-blocking” tests which must pass before the release goes out. We absolutely do not believe “most engineers are capable of writing bug-free code”, much less that this is a reasonable notion to base a business upon.”

"FB has automated testing, including a set of tests that block the release if they fail. We absolutely do not accept the claim that 'most engineers are capable of writing bug-free code', and we certainly would not be arrogant enough to stake a business on it."

* re: surprise at lack of PM influence/control — product managers have a lot of independence and freedom. The key to being influential is to have really good relationships with engineering managers. Need to be technical enough not to suggest stupid ideas. Aside from that, there’s no need to ask for any permission or pass any reviews when establishing roadmaps/backlogs. ”My product director doesn’t even really know all the things I have on my roadmap.” There are relatively few PMs, but they all feel like they have responsibility for a really important and personally-interesting area of the company.

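re: surprise at how little PMs influence or control projects. Product managers in fact have a great deal of independence and freedom. The key to being influential as a PM is a good relationship with the engineering managers. PMs need to be technical enough not to suggest stupid ideas. Beyond that, they need no permission or review when setting their roadmaps. "My product director doesn't even really know all the things I have on my roadmap." There are relatively few PMs, but each of them feels responsible for a really important and personally interesting area of the company.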

* by default all code commits get packaged into weekly releases (tuesdays)

By default, all code commits are packaged into a weekly release (Tuesdays).

* with extra effort, changes can go out same day

With extra effort, a change can go out the same day.

* tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site

The Tuesday release requires every engineer who committed code to that week's release candidate to be on-site and on call.

* engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”

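Before the release begins, engineers must be present in a specific internal IRC channel for roll call, or face public shaming.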

* ops team runs code releases by gradually rolling code out
o facebook has around 60,000 servers
o there are 9 concentric levels for rolling out new code
o [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
o the smallest level is only 6 servers
o e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and make sure that they are behaving correctly before rolling forward to the next level.
o if a release is causing any issues (e.g., throwing errors, etc.) then push is halted. the engineer who committed the offending changeset is paged to fix the problem. and then the release starts over again at level 1.
o so a release may go thru levels repeatedly: 1-2-3-fix. back to 1. 1-2-3-4-5-fix. back to 1. 1-2-3-4-5-6-7-8-9.
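The ops team runs code releases, rolling code out gradually to everyone:
  • FB has about 60,000 servers

  • A release goes through 3 phases: p1 = internal release, p2 = small external release, p3 = full external release. Auxiliary systems such as video upload are assigned to the other 6 phases, for a total of p1 through p9

  • The smallest release level touches only 6 servers (qyb: I take this to mean that FB can run all of its services on as few as 6 servers)

  • The Tuesday release starts at p1. The ops team watches how those 6 servers behave before rolling forward to the next level

  • If a release causes errors, the whole push is halted. The engineer who committed the offending code is paged to fix it, and the release then starts over from p1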
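The halt, fix, and restart loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not Facebook's release tooling: `staged_rollout`, `deploy_and_check`, `fix`, and `max_restarts` are hypothetical names, and the phases are treated as a simple ordered list.

```python
from typing import Callable, List


def staged_rollout(phases: List[str],
                   deploy_and_check: Callable[[str], bool],
                   fix: Callable[[str], None],
                   max_restarts: int = 10) -> List[str]:
    """Push through phases in order; on a failed check, fix and restart.

    Returns a log of the phases attempted, with "fix" entries where the
    push was halted. All callbacks are hypothetical stand-ins.
    """
    log: List[str] = []
    restarts = 0
    i = 0
    while i < len(phases):
        phase = phases[i]
        log.append(phase)
        if deploy_and_check(phase):
            i += 1  # phase looks healthy, roll forward
            continue
        # Halt the push, have the offending change fixed, start over.
        fix(phase)
        log.append("fix")
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError("too many restarts, giving up on this push")
        i = 0
    return log


# Usage: a push where the third phase fails once, then succeeds after a fix.
pending_failures = {"p3"}
log = staged_rollout(
    ["p1", "p2", "p3", "p4"],
    deploy_and_check=lambda phase: phase not in pending_failures,
    fix=lambda phase: pending_failures.discard(phase),
)
print(log)  # ['p1', 'p2', 'p3', 'fix', 'p1', 'p2', 'p3', 'p4']
```

Restarting from the first phase after every fix costs time, but it means the fixed build passes through the same small, low-risk tiers before it reaches everyone.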

* ops team is really well-trained, well-respected, and very business-aware. their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior. E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.

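The ops team is... seriously impressive. Their dashboards show more than error logs, system load, and memory usage; they also track user behavior. If a release shifts the percentage of users who engage with some FB feature, it shows up in their metrics, and they may halt the release to investigate.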
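As a rough sketch of the kind of check this implies, a push could be halted when an engagement rate moves too far from its pre-release baseline. The function name `should_halt` and the 5% default threshold are assumptions chosen for illustration; the article gives no actual thresholds or metric definitions.

```python
def should_halt(baseline_rate: float, current_rate: float,
                max_relative_change: float = 0.05) -> bool:
    """Return True if engagement moved more than the allowed relative change.

    Hypothetical check: the threshold and the choice of a relative (rather
    than absolute) comparison are assumptions, not taken from the article.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    change = abs(current_rate - baseline_rate) / baseline_rate
    return change > max_relative_change
```

The check is symmetric on purpose: an unexpected jump in engagement is as worth investigating as a drop, since either one may point to a behavior change nobody intended.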

* during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention. not responding to ops team results in public shaming.

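During a release, the ops team can page engineers at any time through an IRC-based system. Developers who do not respond to the ops team promptly are publicly shamed.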

* once code has rolled out to level 9 and is stable, then done with weekly push.

Once the release has reached p9 and is stable, the weekly push is done.

* if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.

* getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired. ”it’s a very high performance culture”. people that aren’t productive or aren’t super talented really stick out. Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”. this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.

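Getting svn-blamed, being publicly shamed, or slipping projects too often will get an engineer fired. (qyb: my guess is that "svn-blamed" means someone committed a particularly dumb bug, the svn blame command was used to find the author of that commit, and the name was posted to an internal mailing list... perhaps FB periodically publishes a list of such engineers and calls them "svn-blamed".) "It's a very high performance culture": people who are not productive or not highly talented stand out and get weeded out. Within six months of hiring, a manager will tell a poor performer "this isn't working out for you." Even C-level and VP-level hires are quickly dismissed if they fall short of the higher expectations. (qyb: it looks like there are only four levels below Mark: A, B, C, and VP.)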

* [CORRECTION, thx epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”

"Engineers are not called out just for writing a bug. But if they asked for a change to ship with the release and then are not around to support it when something goes wrong, and have not found someone to cover for them, they will be called out."

* [CORRECTION, thx epriest] “Getting blamed will NOT get you fired. We are extremely forgiving in this respect, and most of the senior engineers have pushed at least one horrible thing, myself included. As far as I know, no one has ever been fired for making mistakes of this nature.”

"Getting svn-blamed will not get you fired. We are very forgiving about this. Most of the senior engineers have pushed at least one horrible thing, myself included. As far as I know, nobody has been fired for a mistake of this kind."

* [CORRECTION, thx fryfrog] “I also don’t know of anyone who has been fired for making mistakes like are mentioned in the article. I know of people who have inadvertently taken down the site. They work hard to fix what ever caused the problem and everyone learns from it. The public shaming is far more effective than fear of being fired, in my opinion.”

Topic: Technology

Comments

Well worth learning from.

A very good article, worth reading carefully.

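One question about “even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.”

C-level and VP-level here should refer to senior management, such as CxOs and the VPs of various areas.

I suspect it is as the OP said: there are only four levels below Mark, and C is probably director level. I have heard that FB has a flat structure, with small teams doing big jobs.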

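juicy project: translated literally, a "juicy" project. The meat Americans like to eat is usually juicy, so "juicy" tends to be associated with tasty things... a "tasty project"? It really is hard to translate.

Juicy means fun, satisfying, exciting.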

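juicy project

--- 肥差 or 美差 (a "plum job" or a "choice assignment"), etc.

Re: "Can anyone tell me what a juicy project is": juicy means full of juice, tasty, the good project that everyone wants to work on. Maybe it could be translated as 热门项目 ("hot project")?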

Good article. Bookmarked to 20ju.com