Went to PyData NYC a couple weeks ago, and figured I ought to write up my thoughts for the benefits of the others on my extended team. Why not publish as a blog post while I’m at it?
This is actually the first conference I’d been to in my capacity as a “data engineer” at Mozilla, a team I joined about a year and a half ago after specializing in the same area on the (now-defunct) a-team. I’ve felt a special affinity for the Python community, particularly its data science offshoots (pandas, numpy, and jupyter notebooks) so it was great to finally go to a conference that specializes in these topics.
Overall, the conference was a bit of a mix between people talking about the status of their projects, theoretical talks on specific statistical approaches to data, general talks on how people are doing “data science” (I would say the largest majority of attendees at the conference were users of python data science tools, rather than developers), and case studies of how people are using python data science tools in their research or work. This being New York, many (probably the majority) were using data science tools in fields like quantitative finance, sales, marketing, and health care.
As a side note, it was really satisfying to be able to tell Mozilla’s story about how we collect and use data without violating the privacy of our users. This is becoming more and more of an issue (especailly in Europe with the GPDR) and it really makes me happy that we have a really positive story to tell, not a bunch of dirty secrets that we need to hide.
In general I found the last two types of talks the most rewarding to go to: most of the work I do at Mozilla currently involves larger-scale data where, I’m sad to say, Python is usually not (currently) an applicable tool, at least not by itself (though maybe iodide will help change that! see below). And I don’t usually find a 60 minute talk really enough time for me to be able to properly absorb new mathematical or statistical concepts, though I can sometimes get little tidbits of information from them that come in handy later.
Some talks that made an impression on me:
- Open source and quantitative finance: Keynote talk, was a great introduction to the paranoia of the world of quantitative finance. I think the main message was that things are gradually moving to a (slightly less) paranoid model where generally-useful modifications done to numerical/ml software as part of a trading platform may now be upstreamed… but my main takeaway is that I’m really glad I’m not working in that industry.
- Words in Space: Introduced an interesting-soundingl library called Yellow Brick for visualizing the results of machine learning models.
- Creating a data-driven product culture: General talk on how to create a positive and useful data science culture at a company. I think Mozilla already checks most of the boxes outlined in the talk.
- What Data Scientists Really Do: Quite entertaining talk on the future of “data science”, by Hugo Bowne-Anderson (who also has a podcast which sounds cool). The most interesting takeaway from the talk was the speculation that within 10 years the term “data scientist” might have the same meaning as the word “webmaster” now. It’s a hyper-generalist job description which will almost inevitably be split into a number of other more specialized roles.
- Master Class: Bayesian Statistics: This falls under the “technical talk which I couldn’t grasp in 60 minutes” category, but I think I finally do understand a little bit more of what people mean when they say “Bayesian Statistics” now. It actually doesn’t have much to do with Baye’s Theorem, rather it seems to be more of a philosophical approach to data analysis which acknowledges the limitations of human capacity to understand the world and asks us to more explicitly state our assumptions when developing models (probably over-simplifying here). I think I can get behind that — want to learn more. They provided a bunch of material to work through, which I’ve been meaning to take a look at.
- Data Science in Health Care: Beyond the Hype: Great presentations in how data science can be used to improve health care outcomes. Lots of relevant insights that I think are also applicable to “product health” here at Mozilla. I particularly liked the way the presenter framed requirements when deciding whether or not to do a type of analysis: “if i knew [information], i would do [intervention], which would have [measurable outcome]”
Of course, this post wouldn’t be complete without a mention of Mike Droettboom’s talk on iodide, a project I’ve been spending some considerable cycles helping with over the last couple of quarters. I need to write some longer thoughts on iodide at some point in the near future, but in a nutshell it’s a scientific notebook environment where the computational kernel lives entirely inside the browser. It was well received and we had a great followup session afterwards with people interested in using it for various things. Being able to show a python environment in the browser which “just works”, with no installation or other steps makes a great tech demo. I’m really excited about the public launch of our server-based environment, which will hopefully be coming in the next couple of months.
Just a quick announcement that the first “production-ready” version of Mission Control just went live yesterday, at this easy-to-remember URL:
For those not yet familiar with the project, Mission Control aims to track release stability and quality across Firefox releases. It is similar in spirit to arewestableyet and other crash dashboards, with the following new and exciting properties:
- Uses the full set of crash counts gathered via telemetry, rather than the arbitrary sample that users decide to submit to crash-stats
- Results are available within minutes of ingestion by telemetry (although be warned initial results for a release always look bad)
- The denominator in our crash rate is usage hours, rather than the probably-incorrect calculation of active-daily-installs used by arewestableyet (not a knock on the people who wrote that tool, there was nothing better available at the time)
- We have a detailed breakdown of the results by platform (rather than letting Windows results dominate the overall rates due to its high volume of usage)
In general, my hope is that this tool will provide a more scientific and accurate idea of release stability and quality over time. There’s lots more to do, but I think this is a promising start. Much gratitude to kairo, calixte, chutten and others who helped build my understanding of this area.
The dashboard itself an easier thing to show than talk about, so I recorded a quick demonstration of some of the dashboard’s capabilities and published it on air mozilla:
Like many, I’ve been a bit worried about the Ontario election, and have been rather obsessively checking a site called the Ontario Poll Tracker.
It has nice and shiny graphs and uses authoritative language and purports to provide a scientific analysis which predicts the election. Despite this, it’s my assertion that this kind of predictive modelling is nothing more than snake oil. I keep on reminding myself that I shouldn’t take it too seriously, but haven’t been too successful so far. This blog post is a reminder to myself on why I should stop reloading that site so much, but maybe it will be helpful to others as well. As a warning, it’s not going to say anything particularly novel. If you have any kind of background in statistics at all, this is probably going to be quite boring.
First, a story. Way back when I had just graduated from university in 2003, I worked briefly at an “opinion research company”, telephoning people for various opinion surveys. It was easily the worst job I ever had, horrible for both the people doing the calling and those who were being called.
The work was mind-numbingly repetitive. Get assigned a poll. Telephone people using an autodialer, work through the script using the DOS-based software the call center was using where they would answer multiple-choice questions. Repeat as many times as you can over the course of an hour. The topics were varied, but roughly 50/50 political parties doing private polling and businesses trying to get marketing data. In either case, the questions were definitely of the “lowest common denominator” type question (i.e. “Which products are you likely to buy in the next 12 months”, “If an election were held today, would you vote for party A, B, or C?”)
One of the few benefits of tedious jobs is that they give you time to think about things. In this case, one of my distinct experiential take aways as that the results that we were getting were incredibly unrepresentative.
For a poll to be valid it is supposed to be “reasonably” reflective of the general population. Over the quantities that we’re talking about, that means anywhere from hundreds of thousands to millions of people. If we were able to truly randomly sample a small number from this group, the results are likely to be “representative of the whole” (within some confidence interval). Let’s write up a small python script to confirm this intuition:
# 100,000 random numbers between 0 and 1
>>> full_population_size = 100000
>>> full_sample = [random.random() for i in range(full_population_size)]
# average over entire result
>>> sum(full_sample) / full_population_size
# pull out 100 randomly selected values from the full sample and
# get their average
>>> random_subset_size = 100
>>> random_subset = [full_sample[i] for i in [int(random.random()*100000) for j in
>>> sum(random_subset) / random_subset_size
Only a small fraction of the total population, but a result within 1% of the true value. Expressing it this way makes random sampling almost like a tautology. You probably learned this in high school. Great right?
Unfortunately, real life always comes in to disturb these assumptions, as it always does. You see, there were all sorts of factors that would affect who we would be talking to and thus get datapoints from. At that time, most of the population still had a land-line telephone but there were a wealth of other factors that meant that we weren’t getting a truly randoms sample of data. Men (at least men under 60 or so) were much less likely to answer a telephone survey than women. For general opinion surveys, we were calling at a specific time of day when most people were likely to be available — but that certainly wouldn’t apply to everyone. Some people would work night shifts, etc., etc. In our example above, this would be like taking out half the results over (say) 0.75 from our sample — the end result would tend to skew much lower than the true value.
Just for fun, let’s try doing that and see how it affects the results:
# if we take away approximately half the results with a value of
# >0.75, the population we are sampling from is reduced proportionally
>>> full_sample_with_half75_removed = [v for v in full_sample if v <= 0.75 or random.random() < 0.5]
# the sampled value is then proportionally skewed downwards (because
# a large percentage of the high values are no longer available)
>>> random_subset = [full_sample_with_half75_removed[i] for i in
[int(random.random()*len(full_sample_with_half75_removed)) for j in
To try and get around this problem, the opinion polling company would try to demographically restrict who we were surveying past a certain point, so that the overall sample of the poll would reasonably reflect the characteristics of the population. This probably helped, but there’s only so much you can do here. For example, if you correct for the fact that men aged 20 to 60 are less likely to answer an opinion survey, your sample is going to now consist of those weird men who do answer opinion surveys. Who knows what effect that’s going to have on your results?
I want to be clear here: this is a methodological problem. Running more opinion polls doesn’t help. Probably some samples will be more affected by errors than others, but the problem remains regardless. Actually, let’s show this trivially for our small example:
>>> skewed_averages = 
>>> for i in range(10):
... full_sample_with_half75_removed = [v for v in full_sample if v <= 0.75 or
random.random() < 0.5]
... random_subset = [full_sample_with_half75_removed[i] for i in
[int(random.random()*len(full_sample_with_half75_removed)) for j in
... skewed_averages += [sum(random_subset)/len(random_subset)]
[0.4585241853943395, 0.4271412530288919, 0.46414511969024697, 0.4360740890986547,
0.4779021127791633, 0.38419133106708714, 0.48688298744651576, 0.41076028280889915,
Each time we resampled from the main population and got a different result, but the end result was still one which was far off from what we know in this case was the true value (0.5). Sampling from bad data doesn’t make up for these problems, it just gives you more bad results.
Now, flash forward to 2018. Almost no one under 50 has a land-line anymore, people who have cell phones most often don’t answer to unknown callers. And don’t even get me started on how to find a representative set of people to participate in an “online panel”. What validity do polls have under these circumstances? I would say very little and probably more importantly we don’t even have a clear idea of how our modern polls are skewed.
There has been no shortage of thinking on how to correct for these problems but in my opinion it’s all just speculative and largely invalid. You can’t definitively solve the kind of uncertainty we’re talking about here by coming up with “just so” stories about how you’ve corrected for it. We might have some ideas about how our data is biased, but short of sampling the entire population and then seeing how our sampling method falls into that superset (which is impossible) there is no way of confirming that our efforts to correct for that bias were effective.
With respect to the Ontario election which I alluded to above, the one thing that I am getting from the data is that support for the NDP (across the highly unrepresentative sample used in the polls) is increasing precipitously and that for the PC’s is decreasing almost as sharply. That seems to be a real phenomenon. We don’t know whether that crosses over to the general population but it doesn’t seem unreasonable to think it does. Exactly how is another question, and I make no assertions there.
tl;dr If you don’t like the idea of Doug Ford in power, there is no reason to panic based on sites like the Ontario Poll Tracker. Spend your time doing something more productive, like convincing your friends and relatives to vote for someone who is not Conservative.
Just wanted to give a quick update on some things that have been happening with the metrics graphics since I stepped up to help with maintainership a few months ago:
- Hamilton’s back as co-maintainer! This has been especially helpful as he understands much of the historical context of metricsgraphics better than I do.
- We’ve merged in a large number of small fixes and improvements into the codebase, thanks to myself and a number of other contributors. Special shout-out to Thomas Champagne, who has contributed a large number of nifty new features.
- We moved the project from Mozilla to its own organization on github. This feels like a much better way forward for a project which is supposed to be useful far outside the bounds of Mozilla, and hopeful makes contributors feel more like the first-class citizens of the project that they actually are.
- We have a GSOC intern! As part of the Mozilla GSOC, Yunhao Zheng is going to be working on adding rich brushing/zooming support to Metrics Graphics, which should be quite useful for visualizing complex data in projects like Perfherder (Project outline, Yunhao’s proposal)
Yep, still working on this project. We’ve shifted gears somewhat from trying to identify problems in a time series of error aggregates to tracking somewhat longer term trends release over release, to fill the needs of the release management team at Mozilla. It’s been a good change, I think. A bit of a tighter focus.
The main motivator for this work is that the ADI (active daily install) numbers that crash stats used to provide as input to a similar service, AreWeStableYet (link requires Mozilla credentials), are going away and we need some kind of replacement. I’ve been learning about this older system worked (this blog post from KaiRo was helpful) and trying to develop a replacement which reproduces some of its useful characteristics while also taking advantage of some of the new features that are provided by the error_aggregates dataset and the mission control user interface.
Some preliminary screenshots of what I’ve been able to come up with:
One of the key things to keep in mind with this dashboard is that by default it shows an adjusted set of rates (defined as total number of events divided by total usage khours), which means we compare the latest release to the previous one within the same time interval.
So if, say, the latest release is “59” and it’s been out for two weeks, we will compare it against the previous release (“58”) in its first two weeks. As I’ve said here before, things are always crashier when they first go out, and comparing a new release to one that has been out in the field for some time is not a fair comparison at all.
This adjusted view of things is still not apples-to-apples: the causality of crashes and errors is so complex that there will always be differences between releases which are beyond our control or even understanding. Many crash reports, for example, have nothing to do with our product but with third party software and web sites beyond our control. That said, I feel like this adjusted rate is still good enough to tell us (broadly speaking) (1) whether our latest release / beta / nightly is ok (i.e. there is no major showstopper issue) and (2) whether our overall error rate is going up or down over several versions (if there is a continual increase in our crash rate, it might point to a problem in our release/qa process).
Interestingly, the first things that we’ve found with this system are not real problems with the product but data collection issues:
Data issues aside, the indications are that there’s been a steady increase in the quality of Firefox over the last few releases based on the main user facing error metric we’ve cared about in the past (main crashes), so that’s good. :)
If you want to play with the system yourself, the development instance is still up. We will probably look at making this thing “official” next quarter.
To attempt to make complex phenomena more understandable, we often use derived measures when representing Telemetry data at Mozilla. For error rates for example, we often measure things in terms of “X per khours of use” (where X might be “main crashes”, “appearance of the slow script dialogue”). I.e. instead of showing a raw count of errors we show a rate. Normally this is a good thing: it allows the user to easily compare two things which might have different raw numbers for whatever reason but where you’d normally expect the ratio to be similar. For example, we see that although the uptake of the newly-released Firefox 58.0.2 is a bit slower than 58.0.1, the overall crash rate (as sampled every 5 minutes) is more or less the same after about a day has rolled around:
On the other hand, looking at raw counts doesn’t really give you much of a hint on how to interpret the results. Depending on the scale of the graph, the actual rates could actually resolve to being vastly different:
Ok, so this simple tool (using a ratio) is useful. Yay! Unfortunately, there is one case where using this technique can lead to a very deceptive visualization: when the number of samples is really small, a few outliers can give a really false impression of what’s really happening. Take this graph of what the crash rate looked like just after Firefox 58.0 was released:
10 to 100 errors per 1000 hours, say it isn’t so? But wait, how many errors do we have absolutely? Hovering over a representative point in the graph with the normalization (use of a ratio) turned off:
We’re really only talking about something between 1 to 40 crashes events over a relatively small number of usage hours. This is clearly so little data that we can’t (and shouldn’t) draw any kind of conclusion whatsoever.
Ok, so that’s just science 101: don’t jump to conclusions based on small, vastly unrepresentative samples. Unfortunately due to human psychology people tend to assume that charts like this are authoritative and represent something real, absent an explanation otherwise — and the use of a ratio obscured the one fact (extreme lack of data) that would have given the user a hint on how to correctly interpret the results. Something to keep in mind as we build our tools.
This is going to sound corny, but helping people really is one of my favorite things at Mozilla, even with projects I have mostly moved on from. As someone who primarily works on internal tools, I love hearing about bugs in the software I maintain or questions on how to use it best.
Given this, you might think that getting in touch with me via irc or slack is the fastest and best way to get your issue addressed. We certainly have a culture of using these instant-messaging applications at Mozilla for everything and anything. Unfortunately, I have found that being “always on” to respond to everything hasn’t been positive for either my productivity or mental health. My personal situation aside, getting pinged on irc while I’m out of the office often results in stuff getting lost — the person who asked me the question is often gone by the time I return and am able to answer.
With that in mind, here’s some notes on my preferred conversation style when making initial contact about an issue:
- Please don’t send context-free pings on irc. It has been explained elsewhere why this doesn’t work that well, so I won’t repeat the argument here.
- If you are at all suspicious that your issue might be a bug in some software maintain, just file a bug and needinfo me. That puts us right on the path to documenting the problem and getting to a resolution — even if something turns out to not be a bug, if you’re seeing an unexpected error it points to a usability issue.
- For everything else, email is best. I do check it quite frequently between bursts of work (i.e. many times a day). I promise I won’t leave you hanging for days on end as long as I’m not on vacation.
These aren’t ironclad rules. If your question pertains to a project I’m actively working on, it might make sense to ping me on irc first (preferably on a channel where other people are around who might also be able to help). If it’s an actual emergency, then of course talk to me on irc straight away (or even call me on my phone) — if I don’t respond, then fall back to filing bug or sending email. Use common sense.
One of my new years resolutions is to also apply these rules to my communications with others at Mozilla as well, so if you see my violating it feel free to point me back at this post. Or just use this handy meme I created:
(this post probably won’t make much sense unless you have been a user of Twitter since/during 2016–2018)
Decided to kill my Twitter account today. Probably most people won’t miss my presence there as I haven’t actively posting on the site in quite some time, but I did check it quite frequently. I suspect it consumed (and I’m ashamed to admit this), at least an hour of my day on average. Twitter when I wake up, Twitter on the streetcar, Twitter every god damn empty moment.
Of all the large social media propertes, I find Twitter by far leads in the amount of negative sentiment it displays, especially in the last year or so. Donald Trump, #GamerGate, Brexit, Nazis, the list goes on. That’s probably why it is so uniquely capable of grabbing my attention in particular, there’s some dark part of me which gravitates towards exactly this type of content.
Now, due to the filter bubble effect, 99% of what I see on that site is a negative response to those things, but so what? The overwhelming of effect of my interaction with the site is simply to make me more more fearful and angry, nothing more. The very format of Twitter (brief bursts of performative text) doesn’t encourage any kind of intelligent or compassionate reaction to what you are seeing or reading. It would be one thing if the negative things I saw on Twitter motivated me to take a positive action of any kind, but if the net effect of hearing bad news is just to make you feel bad, then what’s the point?
I’m not much of a fan of Facebook either, but it has (less often than I would like, but sometimes) been a platform for facilitating worthwhile social and cultural interactions both on the site and off. I could probably count the number of similar exchanges I have had on Twitter on one hand — overwhelmingly my experience with twitter is that engagement (posting) leads only to a regrettable salvo of petty bickering and well actuallys.
So, let’s see what life on the other side of this site is like. The demons which pulled me to Twitter every day are no doubt still there… but hopefully they’ll be easier to handle without the influence of a technology which amplifies their effect.
Just a quick announcement that I’ve taken it upon myself to assume some maintership duties of the popular MetricsGraphics library and have released a new version with some bug fixes (2.12.0). We use this package pretty extensively at Mozilla for visualizing telemetry and other time series data, but its original authors (Hamilton Ulmer and Ali Almossawi) have mostly moved on to other things so there was a bit of a gap in getting fixes and improvements in that I hope to fill.
I don’t yet claim to be an expert in this library (which is quite rich and complex), but I’m sure I’ll learn more as I go along. At least initially, I expect that the changes I make will be small and primarily targetted to filling the needs of the Mission Control project.
Note that this emphatically does not mean I am promising to respond to every issue/question/pull request made against the project. Like my work with mozregression and perfherder, my maintenance work is being done on a best-effort basis to support Mozilla and the larger open source community. I’ll help people out where I can, but there are only so many working hours in a day and I need to spend most of those pushing my team’s immediate projects and deliverables forward! In particular, when it comes to getting pull requests merged, small, self-contained and logical changes with good commit messages will take priority.
Ok, after a series of posts extolling the virtues of my current project, it’s time to take a more critical look at some of its current limitations, and what we might do about them. In my introductory post, I talked about how Mission Control can let us know how “crashy” a new release is, within a short interval of it being released. I also alluded to the fact that things appear considerably worse when something first goes out, though I didn’t go into a lot of detail about how and why that happens.
It just so happens that a new point release (56.0.2) just went out, so it’s a perfect opportunity to revisit this issue. Let’s take a look at what the graphs are saying (each of the images is also a link to the dashboard where they were generated):
ZOMG! It looks like 56.0.2 is off the charts relative to the two previous releases (56.0 and 56.0.1). Is it time to sound the alarm? Mission control abort? Well, let’s see what happens the last time we rolled something new out, say 56.0.1:
We see the exact same pattern. Hmm. How about 56.0?
Yep, same pattern here too (actually slightly worse).
What could be going on? Let’s start by reviewing what these time series graphs are based on. Each point on the graph represents the number of crashes reported by telemetry “main” pings corresponding to that channel/version/platform within a five minute interval, divided by the number of usage hours (how long users have had Firefox open) also reported in that interval. A main ping is submitted under a few circumstances:
- The user shuts down Firefox
- It’s been about 24 hours since the last time we sent a main ping.
- The user starts Firefox after Firefox failed to start properly
- The user changes something about Firefox’s environment (adds an addon, flips a user preference)
A high crash rate either means a larger number of crashes over the same number of usage hours, or a lower number of usage hours over the same number of crashes. There are several likely explanations for why we might see this type of crashy behaviour immediately after a new release:
- A Firefox update is applied after the user restarts their browser for any reason, including their browser crash. Thus a user whose browser crashes a lot (for any reason), is more prone to update to the latest version sooner than a user that doesn’t crash as much.
- Inherently, any crash data submitted to telemetry after a new version is released will have a low number of usage hours attached, because the client would not have had a chance to use it much (because it’s so new).
Assuming that we’re reasonably satisfied with the above explanation, there’s a few things we could try to do to correct for this situation when implementing an “alerting” system for mission control (the next item on my todo list for this project):
- Set “error” thresholds for each crash measure sufficiently high that we don’t consider these high initial values an error (i.e. only alert if there is are 500 crashes per 1k hours).
- Only trigger an error threshold when some kind of minimum quantity of usage hours has been observed (this has the disadvantage of potentially obscuring a serious problem until a large percentage of the user population is affected by it).
- Come up with some expected range of what we expect a value to be for when a new version of firefox is first released and ratchet that down as time goes on (according to some kind of model of our previous expectations).
The initial specification for this project called for just using raw thresholds for these measures (discounting usage hours), but I’m becoming increasingly convinced that won’t cut it. I’m not a quality control expert, but 500 crashes for 1k hours of use sounds completely unacceptable if we’re measuring things at all accurately (which I believe we are given a sufficient period of time). At the same time, generating 20–30 “alerts” every time a new release went out wouldn’t particularly helpful either. Once again, we’re going to have to do this the hard way…
If this sounds interesting and you have some react/d3/data visualization skills (or would like to gain some), learn about contributing to mission control.
Shout out to chutten for reviewing this post and providing feedback and additions.