After rolling out PaperCut Views on the Google Cloud Platform, we realised that the operating cost was driven more by instance running time than by the number of IO operations, storage, cache, etc. In this article, I will show the problems we found in our original architecture and the changes we made to reduce operating cost by reducing instance running time, hoping that our experience can help other teams building similar products on GCP.
What is PaperCut Views?
PaperCut Views is our free printing analytics, insights and supply forecast product targeted at small and home offices.
It is hosted on the Google Cloud Platform (GCP), and specifically runs on App Engine, Google's platform as a service. One of its big benefits for IoT applications such as PaperCut Views is that it can auto scale horizontally, spinning up new instances as needed according to the load on the application.
Why did we want to refactor?
IoT applications generally have to deal with a lot of events. At the time of writing, PaperCut Views was receiving 1.7 million events per day from 75k registered printers. These numbers are growing by around 15k new printers per month, representing an extra 300,000 events per day. Given that this is a free product, we wanted to reduce the operating cost so that we could keep the same level of service for our growing customer base.
Our approach to cost reduction
The first natural approach to reducing costs was to reduce the number of events per printer (which we did), but we soon realised that our architecture also needed to be improved. In the following sections, I will give an overview of our initial architecture, the problems we found in it and the changes we made to reduce costs.
Our initial architecture
The following is a simplified version of the moving parts of the architecture we had at launch time:
In the previous diagram, the clients are installed in our customers' organisations to capture information from the printers and send events such as "job printed" and "toner level changed" to the cloud. These events are processed by our cloud application, which updates various metrics such as total pages printed per month and toner and paper forecasts. The metrics are calculated organisation-wide as well as per printer. Users can then see the metrics via our web application.
When we went live, our design was oriented towards limiting the datastore operations per event, hoping that the reduced number of IO actions would keep costs low. Since we needed real-time metrics in our dashboard, we decided to calculate them every time we received an event and store them in big datastore documents containing all of the metrics: one document for each organisation and one for each printer. Each document was fetched, updated and saved once per event.
Soon after we went live, we realised that the cost was more affected by App Engine instance running time than by IO or storage, so we started to plan the refactor.
Problem 1: High contention spots
As you can see, every time some action happened on any printer in an organisation, the two big documents were updated. This resulted in high-contention spots in our storage which, despite our sharding of the documents, caused a lot of retries due to optimistic locking. In other words, if multiple events modify the same documents inside transactions at the same time, the first one to finish commits and the others need to retry. This kept instances busy for longer, since the whole fetch-update-save cycle had to be repeated until all events were processed.
So, our first refactor goal: keep the high-contention spots in storage under control, even at the expense of duplicated data and more IO.
The approach we took was to store some metrics in separate documents. This implied more fetches per event, but it also meant that events that didn't affect a particular metric wouldn't modify its document, resulting in less contention and fewer retries caused by optimistic-locking errors.
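To illustrate the sharding idea mentioned above, here is a minimal in-memory sketch of a sharded counter. The class and field names are hypothetical; in the real system each shard would be its own Datastore entity, so concurrent events usually hit different entities instead of one hot document.

```python
import random

NUM_SHARDS = 10  # more shards means less chance that two events hit the same entity

class ShardedCounter:
    """In-memory stand-in for a sharded Datastore counter.

    In Datastore, each shard would be a separate entity, so concurrent
    transactions usually touch different entities and avoid the
    optimistic-locking retries caused by a single hot document.
    """

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [0] * num_shards

    def increment(self, amount=1):
        # Spread writes by picking a random shard for each update.
        idx = random.randrange(len(self.shards))
        self.shards[idx] += amount

    def total(self):
        # Reads aggregate all shards (a single query in Datastore).
        return sum(self.shards)

# Hypothetical usage: a "total pages printed" metric for one organisation.
counter = ShardedCounter()
for _ in range(250):
    counter.increment()
```

The trade-off is exactly the one described above: reads become slightly more expensive (all shards must be summed), but writes are spread out and contention drops.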
After doing this, we had fewer instances attending to the same number of events. However, there were still some cases in which the processing time of an event was long.
Problem 2: Cross-group transactions
It was common for a single event to modify multiple metric documents. To keep the modifications consistent, we wrapped them in a transaction (see the paper forecast metrics in the previous diagram). Modifying multiple documents in GCP Datastore generally implies a cross-group transaction, which takes longer to commit, increasing the chances of optimistic-locking retries and, again, instance uptime.
Second refactor goal: Avoid modifying more than one document per transaction.
For this, we decided to adopt eventual consistency using domain events, following DDD principles. The idea is that the documents can be updated independently and asynchronously, achieving consistency after a short period of time. We rely on Google Pub/Sub for this, since it guarantees delivery of the events, among other things.
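The pattern can be sketched with a toy in-memory event bus standing in for Pub/Sub. The metric names and handler functions are hypothetical, and the bus here is synchronous for simplicity, but the key property is the same: each handler touches exactly one document, so no transaction spans multiple entity groups.

```python
from collections import defaultdict

class InMemoryBus:
    """Toy stand-in for Google Pub/Sub. Synchronous here, but each
    subscriber still updates only one document per event, as in the
    eventually consistent design described above."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

# Hypothetical metric "documents", one per metric, updated independently.
pages_per_month = {"count": 0}
toner_forecast = {"remaining_pct": 100.0}

def update_pages(event):
    # Transaction 1: only the pages document is touched.
    pages_per_month["count"] += event["pages"]

def update_toner(event):
    # Transaction 2: only the toner document is touched.
    toner_forecast["remaining_pct"] -= event["pages"] * 0.01

bus = InMemoryBus()
bus.subscribe("job_printed", update_pages)
bus.subscribe("job_printed", update_toner)
bus.publish("job_printed", {"pages": 10})
```

In production the handlers would run asynchronously from Pub/Sub push subscriptions, so the two documents converge shortly after the event rather than in one long cross-group transaction.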
With this approach, the total instance time was reduced because the transactions were much faster, and also because modifying one document at a time reduced the chances of contention. So far, this has been the most effective change for us when it comes to cost reduction. As a side effect, we ended up with smaller documents and better-segregated logic.
Problem 3: Big nested documents
At the beginning, when we decided to store all metrics related to an organisation in one document, we ended up with big nested structures containing lots of data. Fetching this type of document takes more time than fetching smaller documents, such as a document storing just a few user details.
Third refactor goal: Keep document size small.
This was a side effect of the previous two changes, and while it might not be critical by itself, when combined with reduced contention spots and eventual consistency it can produce a significant reduction in instance time, which, again, is the most expensive item in our monthly invoice from Google.
Is that all?
No, there are other areas that are worth checking. In our team, we are currently working on:
Reducing unnecessary liveliness: Do you really need all events to be processed in real time? For PaperCut Views, the answer is no. Some metrics can be recalculated daily or monthly. This means events won't be processed as they arrive but in batch at the end of the day.
We are currently streaming the events directly into BigQuery, GCP's data warehouse, and we are working on calculating the non-real-time metrics directly from there.
Splitting into multiple services: At this moment, Views is mostly a monolith. Having multiple services will allow us to tune each instance type according to the kind of load it handles, so we can assign less powerful, cheaper instances to the services that are not critical and bigger ones to those that process real-time data.
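The batch recalculation idea mentioned above is essentially a daily rollup over the raw events. Here is a minimal Python sketch of that aggregation; field names are hypothetical, and in our case the equivalent GROUP BY would run as a query over the events streamed into BigQuery rather than in application code.

```python
from collections import defaultdict
from datetime import date

def daily_page_totals(events):
    """Batch rollup of raw print events into per-day page totals.

    Conceptually the same as a daily GROUP BY query over an events
    table in BigQuery: no per-event processing, just one pass at the
    end of the day.
    """
    totals = defaultdict(int)
    for event in events:
        totals[event["day"]] += event["pages"]
    return dict(totals)

# Hypothetical sample events for two days.
events = [
    {"day": date(2017, 5, 1), "pages": 3},
    {"day": date(2017, 5, 1), "pages": 7},
    {"day": date(2017, 5, 2), "pages": 5},
]
```

The cost benefit is that no instance time is spent per event; the metric is computed once per period, when liveliness genuinely doesn't matter.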
Wrapping it up
After analysing the operating cost breakdown, we realised that IO operations are not as critical as instance running time in GCP, particularly on App Engine. We aimed our optimisations at reducing the processing time.
We focused our refactor on:
- Reducing the datastore contention spots by sharding, splitting the documents and/or duplicating data.
- Adopting eventual consistency to be able to store one kind of document per transaction.
- Keeping a small document size.
The second has been the most effective so far in our case.
We are still working on improving PaperCut Views as well as designing some other exciting products that would take advantage of all of our learnings on the Google Cloud Platform. Our team has grown in size and diversity, so I am sure we will have a whole lot of stories and learnings to share.
Stay tuned and thank you for reading!
About the author:
Andres Castano is a senior developer and the team lead of PaperCut Views. He joined PaperCut two years ago and has been actively involved in the architecture and technical direction of the product. He comes from Colombia, and in his free time he likes to go out for a run, play soccer and attend the various technical meetups happening in Melbourne.
Check out his personal blog at https://afcastano.github.io or follow him on Twitter @afcastano
Posted in technology |
As network techies or SysAdmins, we usually have hundreds of balls in the air at any one time, and often end up postponing the planned work when putting out relentless spot fires. Print servers are usually at the end of the line and become one of the easiest systems to leave for another day, that is until something goes bang!
I know this all too well, with a long history in IT: starting as a school admin, then into corporate as an email and systems engineer, then Regional Team Lead, and then on the other side, helping to sell copier solutions in pre-sales.
Now that I support PaperCut and listen to your stories, I wanted to share my top 5 tips for a smooth running print system, combined with my own network admin experience of course.
Tip 1: Print 101 – Don’t forget the basics
Tip 2: Manage print driver deployment
Tip 3: Beware the "untrusted" printer
Tip 4: Backup your print server
Tip 5: Monitor print system health
Picture source: 180 Printer and Toner Management
Tip 1: Print 101 – Don’t forget the basics
The support team are usually immersed in the 1’s and 0’s of programs and systems, and can sometimes forget the basics.
As pointed out by a customer on Facebook, it’s often the simple things that get in the way of users.
- Make sure you don’t run out of paper (embarrassing)
- Make sure a few sensible people know where the toner is kept!
- Make sure your users have a basic understanding of how the printer works.
A tale as old as time:
A teacher kept complaining that the printer had no paper in the tray and was adamant that it had enough in it. I went to the printer room and found 200 sheets of 'paper' sitting in the output tray on top of the MFP, while the A4 tray was in fact empty, to which the teacher remarked "I told you it has plenty of paper". 🙁 And breathe… Scott Hudson, a tolerant admin in the UK.
Tip 2: Manage print driver deployment
A print driver is just like any other piece of infrastructure software and should be managed with the same degree of importance. In fact, as it’s an OS level program, it’s probably even more important. Consider testing thoroughly before rollout.
A corporate retailer story:
A large retail company (name withheld to protect the implementation team) deployed a new print driver as part of a rollout. Users reported that print jobs were 'sent' but never came out of the machine, to the point where the printer would start to make the right noises and then just…stop. However, when IT installed the print drivers directly on a user's desktop, the problem vanished. It turned out that deploying the driver through Group Policy caused a few DLLs not to be overwritten correctly, and yes, they wound up in DLL hell.
Tip 3: Beware the “untrusted” printer
Is there a problem printer sitting somewhere in your organization that users have just decided to avoid and not tell you about? Maybe it’s been quite literally dropped off the back of a truck during delivery and is now out of alignment, you know…the one where paper jams all the time and frustrates users, or people print to it and nothing comes out.
Picture source: Gudim
Queues aren’t just for school cafeterias:
If students printed to a particular copier at our school from a Mac workstation, the job wouldn't be released: even though they could see it on the release screen, nothing came out of the copier. Windows users could print fine. Hmm, let's eliminate the possibilities:
- We have a Windows and Mac print server, and the issue only occurred from the Mac server.
- All students print to the virtual queue and then log into the copier to release their job. A Mac user would print to the virtual and walk up to any of the copiers to release their print job, except this one.
- Looking at PaperCut NG, the virtual queue and device are configured correctly to release jobs.
- This worked last year; however, only a few users would print from a Mac, and then it stopped working for this one copier.
- The copier hadn't been touched until now, and the school wants to implement BYOD, which means more Macs in the environment.
After logging into the CUPS interface, we noticed there was a job in the queue from 2015 that was in an error state! Once this job was deleted from the queue, the problem was resolved and Mac users could print to this copier again. Jason, Victorian school tech.
Tip 4: Backup your print server
That initial print server setup may only take minutes using a virtual machine, but if something goes wrong, restoring a print server that has accumulated constant new drivers and hundreds of IT hours of re-configuration could be a disaster.
The reality is that print servers are always changing, so make sure your print server is backed up!
Disaster strikes an Ivy League university server room:
A large university customer (who asked to remain nameless) had a power supply catch fire in their server room and lost their print server. You'd be correct in thinking that hardware would become the focus of this recovery, but no: a mere three days later, the servers were up and running. It did, however, take them days to find the right print driver version combinations that actually worked as they did before! Make sure that your print system is part of your disaster recovery plan.
Tip 5: Monitor print system health
You probably already monitor your network, uptime and other app servers, so why not monitor your print server and print management infrastructure? Print System Health Monitoring, a feature standard in PaperCut NG and PaperCut MF, allows IT departments to monitor their entire print environment using industry-standard monitoring tools such as PRTG and Zabbix.
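The shape of such a health check can be sketched as a small poller that a tool like PRTG or Zabbix would call. Note that the URL, response format and function names below are illustrative assumptions, not PaperCut's exact API; the fetch function is injected so the sketch stays self-contained.

```python
def check_print_health(fetch, url="http://print-server:9191/api/health"):
    """Poll a health endpoint and report up/down for a monitoring tool.

    `fetch` is any callable taking a URL and returning (status_code,
    parsed_body); in production it could wrap urllib.request.urlopen or
    an HTTP sensor in PRTG/Zabbix. The URL and response shape here are
    hypothetical, not PaperCut's documented API.
    """
    try:
        status, body = fetch(url)
    except OSError:
        # Network failure counts as an outage.
        return "DOWN"
    if status == 200 and body.get("status") == "OK":
        return "UP"
    return "DOWN"

def fake_healthy_fetch(url):
    # Stand-in for a real HTTP call: server responding normally.
    return 200, {"status": "OK"}

def failing_fetch(url):
    # Stand-in for a real HTTP call: server unreachable.
    raise OSError("connection refused")
```

A monitoring tool would run a check like this on an interval and alert when the result flips to DOWN, which is exactly the kind of trend-watching the quote below describes.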
Large print environments can be complex; Chris Maclachlan from Melbourne University talks about their challenges:
It is important to monitor the health and trends of your print environment, especially in a large, complex one. Do not underestimate the impact of small changes: keep a close eye on your environment over the next 24 hours, especially disk I/O and sudden changes in printing and error trends. We have seen large spikes after seemingly small changes, and if we had not been monitoring our environment, they would have led to outages and long wait times for our users.
We also had an issue once, where we had a high number of errors at exactly 24 hour intervals. We found that an expensive job was running every 24 hours, which happened during peak times in our environment. After changing the interval time of the job, we resolved the errors.
Monitoring our environment has led to a vastly improved experience for our students and staff during these times.
These are just a couple of my own experiences: battle scars I wear, some proudly. I'd love to hear your stories around printing; feel free to add yours below, as we can only learn from our experiences!
And, if you’re keen to get monitoring to make sure these stories don’t happen to you? Try PaperCut NG free for 40-days, or upgrade to 16.1 through the Administration page of your PaperCut NG or MF instance. For local configuration and support contact a PaperCut Authorized Solution Center in your region.
Posted in General |
Tagged engineering, product |