Scaling to 100 million monthly events: Matomo to Snowplow
At HowdyGo, we have a strong background in building and scaling startups. In a previous EdTech venture, we managed an analytics stack processing over 100 million monthly events from 100,000+ active users. We initially used Matomo, but as the business grew, we switched to using Snowplow to tackle some serious scalability challenges.
Now at HowdyGo, where we help SaaS companies convert prospects through interactive product demos, we needed a robust, enterprise-level analytics solution. Based on our past experience, we chose to use Snowplow for the task.
This article revisits our use of Matomo, the challenges we faced, and why we ultimately adopted Snowplow for both ventures.
Setting the Scene: Matomo vs. Snowplow
Matomo
Matomo is an open-source analytics solution that you can host on your own LAMP (Linux/Apache/MySQL/PHP) stack and set up quickly. It provides a built-in GUI for viewing reports and is a great fit for small companies that want full control of their own data.
Snowplow
Snowplow is an enterprise-grade analytics solution. Unlike Matomo, it is made up of several components and can be trickier to configure and get up and running. However, Snowplow loads data directly into a data warehouse, giving you full control over your data and more flexibility when looking to scale.
Matomo: The path of least resistance
In 2016, during the EdTech company's early stages, we wanted an analytics solution capable of tracking user interactions such as page views, feature usage, and session durations. Our search resulted in the adoption of Matomo, an open-source platform.
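For reference, the client-side instrumentation itself was the easy part. Here is a minimal sketch (in TypeScript, with a placeholder tracker URL, site ID, and event names) of the kind of Matomo tracker calls used for page views and feature usage:

```typescript
// Matomo's JS tracker is driven by pushing commands onto a global _paq array;
// the URL, site ID and event names below are placeholders.
declare global {
  interface Window { _paq: unknown[][]; }
}

window._paq = window._paq || [];
window._paq.push(['setTrackerUrl', 'https://analytics.example.com/matomo.php']);
window._paq.push(['setSiteId', '1']);

// Record a page view for the current route.
window._paq.push(['trackPageView']);

// Record feature usage as a custom event: category, action, name.
window._paq.push(['trackEvent', 'Lesson', 'open', 'lesson-42']);

export {};
```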
Our initial hosting infrastructure, following the guide provided by Matomo, consisted of an AWS EC2 instance paired with an AWS RDS MySQL instance as the database. Both were eligible for the AWS Free Tier, making it a cost-effective hosting solution.
As the application's popularity surged over time, we integrated a Redis queue to manage the increased load efficiently, combined with MaxMind's GeoIP2 database to geolocate our users.
This configuration was effective at recording analytics as we grew to around 1,000 monthly users, all within the AWS Free Tier.
Our early architecture is shown in the diagram below.
Problem #1: An Inflexible, Monolithic Architecture
For a couple of years, our data collection proceeded well, until one day we received an alert that the Matomo instance had crashed. An investigation revealed that a query over a large date range in the built-in reporting UI had overloaded the EC2 instance's CPU.
We prioritised moving our data to Snowflake, an enterprise data warehouse, and shifted to using a separate business intelligence (BI) tool to deliver analytics both internally and to our customers.
Unfortunately, as Matomo only supports a MySQL backend, there was no way to load event data directly into Snowflake, so we had to use an Extract, Transform and Load (ETL) tool called Stitch to move data from the MySQL instance to Snowflake. Once the data was in Snowflake, we could use DBT to transform it into something a BI tool could display.
Transferring small amounts of data with Stitch was relatively cheap and reliable, but as the business scaled it became problematic.
In addition, our data team spent large amounts of time building the DBT models to convert the Matomo event data into something that could be used for reporting.
Lessons learned: Data extraction time and cost
Stitch doesn't perform an instantaneous export to Snowflake; it operates on a schedule, and we found a 1-day refresh cycle was appropriate. Unfortunately, due to the amount of data stored in MySQL, each export could take many hours and would often time out. This was a significant issue: re-running exports was extremely time-consuming and drastically slowed down development.
To alleviate these issues, we enabled binlog replication (as recommended by Stitch), which improved the situation, and we started performing exports more frequently so that re-runs after a failed transfer took less time.
Despite this, exporting all of the data could still take over 3 hours in peak periods, and we were clearly hitting the limits of ETL tooling. Due to the volume of data we were exporting, Stitch costs exceeded $1,000/month, a significant portion of our app hosting costs.
Why Snowplow: Built-in data warehouse loaders
Since Matomo only supports a MySQL backend, if you want to move the data into an industry-standard data warehouse solution you are left looking for a 3rd party ETL tool (e.g. Stitch).
Thankfully, Snowplow, as part of its deployed architecture, supports loading into a variety of warehousing options, including:
- Snowflake
- Redshift
- BigQuery
- Databricks
- Postgres
Each of these is relatively easy to configure and deploy compared to paying for a 3rd party ETL tool. Removing the data transfer costs, together with the reduced complexity, is a significant benefit over Matomo.
Why Snowplow: Ready-made DBT models
With Matomo there aren't any open-source DBT models to get you started, so our team spent hours building, debugging, testing and validating models. This may change in the future, however, as Airbyte is considering creating a connector, which should come with DBT models.
Regardless, the ecosystem of DBT users working with Matomo data is quite small, and you will have to work everything out yourself.
Snowplow, on the other hand, has plenty of DBT packages available to get you up and running and quite a large ecosystem of people using them.
Why Snowplow: Discrete services make for a robust stack
With Matomo, data collection could be disrupted simply by someone trying to display a graph covering a large date range.
With Snowplow, by default, each service runs on a separate instance with queues buffering requests between them. This makes it far less likely that one part of the system will impact another.
Problem #2: Horizontal Scaling
As usage grew, we repeatedly increased our EC2 instance size so it would continue to handle the load. Unfortunately, upgrading the instance required us to replace it, stopping data collection until it had been manually reconfigured.
We determined the best way forward was to scale the service horizontally, running multiple instances of Matomo in parallel behind a load balancer that we could scale up and down depending on usage.
There are some useful guides available on how to scale Matomo which we used as a basis for this approach, but Matomo really doesn’t support this configuration out of the box.
To solve the configuration challenges, we created a custom Docker image based on a php-fpm + nginx base image. It installed Matomo and all of the plugins we were using, including the GeoIP2 database, and tuned the system PHP variables to improve performance. We also modified it to accept runtime environment variables for those plugins, so that no manual configuration was required when the container started.
At the same time, we also configured CloudFront to cache our matomo.js client script to reduce the number of requests hitting the instance.
Unfortunately, AWS load balancers replace the end user's source IP address and pass the original in a different header (X-Forwarded-For), so we needed to adjust Matomo's configuration to pick up the correct user IP addresses.
Why Snowplow: Architected for Horizontal Scalability
With Matomo we spent a lot of time building custom docker images to allow us to horizontally scale the service. Snowplow is designed to be able to do this out of the box using configuration variables.
Conveniently, many enrichments are built in and just need to be configured and/or extended with your own logic in SQL or JS, entirely removing the need to build custom Docker images.
For example, to perform IP address enrichment, it's as simple as putting the IP lookup database into an S3 bucket and passing a configuration variable. This is much easier to update than before, when the database had to be installed directly inside the Docker image, which was time-consuming to build and required a new deployment for every upgrade.
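For illustration, here is roughly what the ip_lookups enrichment configuration looks like, shown as a TypeScript object literal for readability; in a real pipeline it is a JSON file supplied to the enrich component, and the bucket path and filename below are placeholders:

```typescript
// Sketch of Snowplow's ip_lookups enrichment configuration; in practice this
// lives in a JSON file the enrich component loads at startup.
const ipLookupsEnrichment = {
  schema: 'iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0',
  data: {
    name: 'ip_lookups',
    vendor: 'com.snowplowanalytics.snowplow',
    enabled: true,
    parameters: {
      geo: {
        // MaxMind database uploaded to our own S3 bucket (placeholder path).
        database: 'GeoLite2-City.mmdb',
        uri: 's3://acme-snowplow-assets/maxmind',
      },
    },
  },
};

export { ipLookupsEnrichment };
```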
Another bonus of using Snowplow is that putting the collector behind a load balancer does not require changes in its configuration to support the different headers used.
Problem #3: Ad Blocking
Customers were billed for our product based on the quantity of content consumed by their students, which was determined by a variety of factors including time on page and how much content was read. This meant that tracking usage purely through backend events was not accurate enough.
Unfortunately, this also meant that ad blockers, which were often used by the student user base, had significant implications for the business model and our ability to accurately bill our customers.
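For context, this kind of engagement measurement depends entirely on the tracker script running in the browser. A rough sketch (assuming Matomo's standard JS API, with placeholder event names) of the calls involved, all of which silently disappear when matomo.js is blocked:

```typescript
// Engagement signals come from the Matomo JS tracker in the browser; if an ad
// blocker stops matomo.js from loading, none of these events reach the server.
declare global {
  interface Window { _paq: unknown[][]; }
}

window._paq = window._paq || [];

// Enable Matomo's heartbeat so time on page reflects actual time spent,
// not just the gap between page views.
window._paq.push(['enableHeartBeatTimer', 15]);

// Content-consumption milestones pushed as custom events (placeholder names).
window._paq.push(['trackEvent', 'Content', 'read-progress', 'chapter-3', 75]);

export {};
```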
Modifying Matomo to track users running an ad blocker was no small feat; here's a short summary of the solution:
- Adjust the import file to remove any reference to ‘Matomo’ or ‘Piwik’ in the name
- Modify the order of the tracked parameters that were sent over to the client using regex in the Docker build container.
- Implement a manual testing procedure to detect when ad blockers had been updated to bypass our workarounds.
Modifying compiled JavaScript files as part of our build process was risky, not only because it relied on regex (see comic below) but also because a change in an ad blocker's rules could break it at any moment. It isn't a robust solution.
Why Snowplow: 1st party mode provides a built-in solution
In Snowplow, you can configure your collector to track in 1st party mode, letting you keep collecting data from users running ad blockers. Compared to the process I described for Matomo above, it requires only three straightforward steps:
- Make your collector endpoint a subdomain of your site domain, e.g. spc.acme.com. This is exactly the same as what you have to do with Matomo.
- Update the paths configuration value for the collector
- Update the paths configuration value in the tracker
Steps 2 and 3, compared to writing custom regex in a Docker build, are a much cleaner and simpler solution.
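As a rough illustration of the tracker side (steps 1 and 3), assuming the @snowplow/browser-tracker package and a placeholder domain, app ID and path:

```typescript
import { newTracker, trackPageView } from '@snowplow/browser-tracker';

// The collector endpoint is a first-party subdomain, and the POST path is a
// non-default value that must match the collector's paths configuration.
// Domain, appId and path here are placeholders.
newTracker('sp', 'https://spc.acme.com', {
  appId: 'acme-web',
  postPath: '/com.acme/t',
});

trackPageView();
```

The collector-side change (step 2) is the mirror image: its paths setting has to accept the same custom path.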
Problem #4: Debugging and Managing Invalid Events
As we scaled both the number of events and team size, we repeatedly had issues with invalid data getting passed into our Snowflake raw tables. Matomo doesn’t have the concept of event validation so it’s quite easy for invalid data to end up in the data warehouse.
This issue was the result of:
- A communication gap between two key teams: one tasked with event logging on the client side, and another responsible for maintaining the Data Warehouse and DBT models.
- The lack of a robust validation system: Even with a perfectly functioning team, it was always going to be possible for invalid data to reach the data warehouse.
The downstream impact of these issues was enormous: teams regularly lost significant time shifting their focus to resolving data quality problems and finding ways to clean up invalid data that could otherwise result in missed revenue and billing issues.
Why Snowplow: Event Schemas
Snowplow's architecture includes built-in support for event schemas using a registry system called Iglu - and yes, the number of winter-themed components in the data analytics world made my head hurt the first time too.
If an event does not match its event schema, it is sent into the “bad” event stream and ends up being directed to the “bad” loader. We have found it useful to load these events into the data warehouse, but you could write them to a different endpoint.
This is really helpful compared to Matomo, where the invalid event would most likely have ended up in your data warehouse and had to be caught downstream, with the potential to break builds and cause data quality issues.
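To make this concrete, here is a hedged sketch of what tracking a schema'd event looks like from the browser, assuming the @snowplow/browser-tracker package; the vendor, event name and fields are hypothetical, and the schema URI must correspond to a JSON Schema registered in Iglu:

```typescript
import { newTracker, trackSelfDescribingEvent } from '@snowplow/browser-tracker';

// Placeholder collector endpoint and app ID.
newTracker('sp', 'https://spc.acme.com', { appId: 'acme-web' });

// A self-describing event: the payload is validated against the JSON Schema
// referenced by the iglu: URI. If it doesn't validate, the event is routed to
// the "bad" stream instead of landing in the warehouse.
trackSelfDescribingEvent({
  event: {
    schema: 'iglu:com.acme/demo_viewed/jsonschema/1-0-0', // hypothetical schema
    data: {
      demoId: 'demo-123',
      stepIndex: 4,
      completed: false,
    },
  },
});
```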
Problem #5: The Costs and Limits of Scaling Up
After 5 years of consistent scaling we ended up with an architecture that looked something like the following:
Eventually, the “tipping point” for us to move from Matomo to Snowplow was actually the upgrade process for new versions of Matomo.
Security and performance updates were released often, and due to the complexity of our horizontally scaled, queue-buffered system, the update process looked something like this:
- Update the docker image to bump the version of Matomo, build it and push.
- Turn off the processing of queued tracking to stop writing data to the database and allow the migration to take place.
- Deploy the new version to Kubernetes and trigger the DB migration if required, which, depending on the number of schema changes, could take as long as an hour.
- Once the DB migration is complete, enable queued tracking again and start processing the backlog of events stored in Redis.
Unfortunately, even with a large RDS instance, it would still take multiple hours to get through the queue. Needless to say, a Redis instance sized to hold an hour or two of tracking events was vastly over-provisioned for typical day-to-day usage, which drove up costs.
Another significant issue was that this entire process was extremely risky. While the database was updating, the Redis cache was rapidly filling with tracked events; if the Redis instance rebooted, all of that data would be lost. This was a serious business risk, given that billing depended on this data.
Why Snowplow: Queueing by default
With Snowplow, you have Kinesis streams between each component of the pipeline. This is starkly different from Matomo, where a custom plugin is required to use Redis as a queue, and that plugin needs to be installed separately or added to a custom Docker image.
With Kinesis, unlike Redis where you need to manage a specific instance, the chance of losing all of your data because an instance resets is greatly reduced.
On top of Kinesis, there are a few other options available:
- Google PubSub
- AWS SQS
- RabbitMQ
- Apache Kafka
- NSQ
- Stdout (if you want to do something custom)
This means that regardless of which cloud provider you are using, or whether you are self-hosting in your own data centre, it's relatively easy to get started with Snowplow and much simpler and cheaper to scale it for large volumes of event data.
Why Snowplow: Lower cost at Scale
Initiating a setup with Matomo typically involves deploying an EC2 instance along with an RDS instance. In contrast, starting with Snowplow requires a more extensive setup. New users might find the initial complexity and the estimated monthly cost of around $200, as suggested by Snowplow's documentation, somewhat daunting.
However, by optimizing the initial deployment size at HowdyGo, we were able to reduce our hosting costs to approximately $60-$70 per month. This setup also offers the flexibility to scale up efficiently as needed by simply adjusting configuration parameters.
The real advantage of Snowplow becomes apparent at scale. The default configuration, priced at about $200 per month, is capable of handling over 200 million events monthly. This capacity is achieved at a fraction of the cost compared to Matomo, where similar volumes in our experience incurred expenses upwards of $2000 per month.
Key Takeaways: Applying Our Lessons at HowdyGo
We built HowdyGo because of the challenges we experienced demonstrating enterprise analytics dashboards to our prospects at the EdTech company referred to in this article.
A critical component of our service is providing analytics to measure the effectiveness of these product demos. As this is core to our offering, we wanted to manage our own analytics stack, and we also wanted the flexibility to evolve our product offering in the future.
To do this, we deployed a Snowplow pipeline; an indicative architecture is shown below.
Conclusion
In hindsight, open-source Matomo offers a relatively straightforward initial deployment for early-stage companies. Its simplicity allows a single-instance launch in an hour or two, providing a basic yet functional Google Analytics alternative. For smaller-scale operations, we found that Matomo can run within the AWS Free Tier.
However, as we learned, Matomo's simplicity comes with a variety of challenges, particularly when it comes to scaling. As user traffic grew, Matomo's infrastructure quickly became a bottleneck and required unnecessary complexity to maintain stability.
Conversely, Snowplow, with its modular design, presents a more scalable and maintenance-friendly alternative. Its architecture not only adapts to increasing demands but also proves more cost-effective in the long run, especially at larger scales.
Additionally, Snowplow's robust event validation capabilities are a must for growing teams that are delving into detailed product analytics. By ensuring data accuracy and integrity, Snowplow can save countless hours and prevent the all-too-common late-night debugging sessions.