Aspera
I heard about Aspera for the first time today. It is not file storage but a file transfer technology from IBM. But what is the benefit? And why do you need yet another technology? Let’s dig deeper.
Say you are a production house that produces movies or videos. The raw content is generally very large: hundreds of GB, and sometimes several TB. Transferring this much content over a network is slow, so slow that physically shipping a hard disk is often faster. Aspera shines in such cases: regardless of the size of the content, you get consistent speed. This is what happens internally.
Apache Bench
I wanted to replace the python manage.py runserver command with uwsgi for production. But before taking any decision, I wanted to back it with some data. Even if some blog says uwsgi handles concurrent load, memory management and caching better, why should I believe it? How can I test it myself?
For starters, I could write a Python script that sends n requests and makes them concurrent using async or threads. But is there a readily available option? When I explored, I found one: Apache Bench (ab).
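Before reaching for ab, the hand-rolled script mentioned above could look roughly like this. It is a minimal sketch using threads and only the standard library; the names `fetch` and `load_test`, the returned fields and the URL in the usage comment are my own assumptions, not part of any tool.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10.0):
    """Send one GET request and return (succeeded, latency_in_seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            ok = 200 <= response.status < 300
    except Exception:
        # Connection errors, timeouts and HTTP error statuses all count as failures.
        ok = False
    return ok, time.perf_counter() - start


def load_test(url, total, concurrency, request_fn=fetch):
    """Send `total` requests using at most `concurrency` worker threads."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(request_fn, [url] * total))
    elapsed = time.perf_counter() - started

    latencies = [latency for _, latency in results]
    succeeded = sum(1 for ok, _ in results if ok)
    return {
        "requests": total,
        "failed": total - succeeded,
        "requests_per_second": total / elapsed if elapsed > 0 else float("inf"),
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
    }


# Hypothetical usage against a locally running server:
# print(load_test("http://127.0.0.1:8000/", total=1000, concurrency=50))
```

The roughly equivalent ab invocation would be ab -n 1000 -c 50 http://127.0.0.1:8000/, where -n is the total number of requests and -c is the concurrency level.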
Building Jenkins
I have used Jenkins to deploy jobs, but that’s the extent of my usage. I was going to deep-dive into Jenkins, but before that a thought struck me: if I had to build Jenkins myself, how would I do it?
For me, Jenkins is an automated deployment tool. Before automating something, how would I do the deployment manually? Say I want to deploy a Django project from GitHub to production. I would follow these steps:
Sentry - For The Black
“Where can I find the logs?” was one of the first questions I asked while trying to understand our system. I came from a background where you need to read logs to find issues. I was taken aback when I heard that they don’t capture logs. I kept wondering how they resolve issues in production.
It’s been 5+ years now that I have been working on this system, and I have rarely used logs. It brings a smile to my face when a new joinee asks for logs. With Sentry, catching errors has become quite easy. Enable Sentry on one of the servers, and if anything throws an error, an alert is raised immediately. Whoever shipped the buggy code is responsible for fixing it immediately. A Sentry alert gives you all the details you would ever need to fix that bug. Which API? Which machine? Which line? What error? What were the local variables? … Thus, we have been able to scale to millions of users without ever depending on logs. Any bug that creeps in gets killed at a nascent stage.
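To get a feel for what such an alert carries, here is a small standard-library sketch that collects the same kind of details from a Python exception. This is only an illustration of the idea and not how Sentry itself is implemented; `describe_exception` and `divide` are hypothetical names.

```python
import platform


def describe_exception(exc):
    """Collect error details from the innermost frame of an exception's traceback."""
    tb = exc.__traceback__
    if tb is None:
        raise ValueError("exception has no traceback; it was never raised")
    # Walk to the last traceback entry, which is where the error actually occurred.
    while tb.tb_next is not None:
        tb = tb.tb_next
    frame = tb.tb_frame
    return {
        "error": f"{type(exc).__name__}: {exc}",
        "machine": platform.node(),
        "file": frame.f_code.co_filename,
        "function": frame.f_code.co_name,
        "line": tb.tb_lineno,
        # repr() keeps the report serializable even for arbitrary objects.
        "locals": {name: repr(value) for name, value in frame.f_locals.items()},
    }


def divide(total, count):
    return total / count


try:
    divide(10, 0)
except ZeroDivisionError as exc:
    report = describe_exception(exc)

for key, value in report.items():
    print(f"{key}: {value}")
```

The report answers the same questions as above: which machine, which line, what error, and what the local variables were at the moment of failure.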
Maintaining A Database
We use the Django framework in our system. Django comes with its own ORM for relational databases. A few of our tables have been in use for 7+ years, so we decided to look into them.
To check how much space a table is using, we can run the following Postgres query:
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
    pg_size_pretty(pg_indexes_size(schemaname||'.'||tablename)) AS indexes_size
FROM pg_tables
WHERE tablename = '{table_name}';
It turned out that the table uses 1.5GB of space and its indexes use 6.5GB. This came as a surprise to me, as I was always under the impression that indexes are smaller than the table. Individually they are indeed smaller, but how many indexes you create also matters. It turned out that we had 50+ indexes on our table. Obviously, not all of them are being used. So I checked the indexes and their usage with this query:
The Conundrum: Sync Or Async
Suppose I migrate my code from synchronous to asynchronous. What changes? Let’s try to think it through.
Synchronous code execution is very predictable and linear. If it makes a network call, it waits there until the data comes back. Asynchronous code, on the contrary, is quite busy. The poor guy keeps running in an (event) loop, juggling multiple tasks concurrently. When a task makes a network call, it registers an event for it and pauses execution there. Meanwhile the event loop picks up and processes some other event that is ready.
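The difference can be seen in a small experiment. This is a minimal sketch using Python's asyncio, where sleeping stands in for a network call; `fake_network_call`, the delays and the log format are assumptions made up for the illustration.

```python
import asyncio
import time


async def fake_network_call(name, delay, log):
    """Pretend to call a remote service; awaiting hands control back to the event loop."""
    log.append(f"{name}: request sent")
    await asyncio.sleep(delay)  # the task pauses here and the loop runs other tasks
    log.append(f"{name}: response received")
    return name


def blocking_calls(delays):
    """Synchronous version: each call waits for the previous one to finish."""
    started = time.perf_counter()
    for delay in delays:
        time.sleep(delay)  # nothing else can run while we wait
    return time.perf_counter() - started


async def concurrent_calls(delays, log):
    """Asynchronous version: all calls are in flight at the same time."""
    started = time.perf_counter()
    await asyncio.gather(
        *(fake_network_call(f"call-{i}", delay, log) for i, delay in enumerate(delays))
    )
    return time.perf_counter() - started


delays = [0.2, 0.2, 0.2]
log = []
sync_elapsed = blocking_calls(delays)
async_elapsed = asyncio.run(concurrent_calls(delays, log))

print(f"sync:  {sync_elapsed:.2f}s")
print(f"async: {async_elapsed:.2f}s")
print("\n".join(log))
```

In the log, all three requests are sent before any response is received. That is the event loop picking up the next ready task each time one pauses on its network call.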
Understanding Video
Say I have a video.mp4 file on my system. What exactly does it contain? The outermost layer is called the container. There are various types of containers: mp4, mkv, mov, flv, … The container holds metadata about the file, like its title, duration, creation date, … It also holds streams of data: video streams, audio streams, subtitle streams, … This is pictorially explained in the following image:
The Swiss Army Knife
FFmpeg is called the Swiss Army Knife of video and audio. Whatever you want to do with media, such as compressing, scaling, editing, encoding, decoding, transcoding, streaming, … FFmpeg can do it. If it’s so special, let’s start using it. You can install FFmpeg on your system.
Let’s download some video, video_1.mp4. I want to get details about this video. I can do so using:
ffprobe -v quiet -print_format json -show_format -show_streams video_1.mp4
Here we used ffprobe (which usually comes with ffmpeg).
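Since the output is JSON, it is easy to consume from a script. The sketch below runs the same command from Python and pulls out the container and its streams. It assumes ffprobe is on the PATH; `run_ffprobe` and `summarize` are hypothetical helper names, and the summary reads only a handful of the fields ffprobe reports.

```python
import json
import subprocess


def run_ffprobe(path):
    """Run the ffprobe command shown above and return its parsed JSON output."""
    completed = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ffprobe exits with an error
    )
    return json.loads(completed.stdout)


def summarize(probe):
    """Reduce the ffprobe JSON to the container name, duration and stream list."""
    fmt = probe.get("format", {})
    return {
        "container": fmt.get("format_name"),
        # ffprobe reports the duration as a string of seconds.
        "duration_s": float(fmt["duration"]) if "duration" in fmt else None,
        "streams": [
            (stream.get("codec_type"), stream.get("codec_name"))
            for stream in probe.get("streams", [])
        ],
    }


# Hypothetical usage, assuming video_1.mp4 is in the current directory:
# print(summarize(run_ffprobe("video_1.mp4")))
```

Keeping the subprocess call separate from the parsing means the summary logic can be tried on saved JSON without having FFmpeg installed.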
DRM
Production companies lose billions of dollars annually to piracy. Someone can download your content and distribute it on a free channel. Thus, you lose the revenue that users might have paid.
How do you solve this problem? Can we even solve it? If someone plays a movie on a TV and records it again with a high-quality camera (thus recreating it), you can’t block this. This is called the Analog Hole. Thus, you can’t protect against piracy 100%.
Vector Graphics
If you zoom in on any image on your phone, it starts breaking up after 2x - 3x zoom. You can see square pixel blocks. But what happens when you zoom in on Google Maps? No matter how much you zoom (until you hit the last block), it doesn’t seem to break. I observed a similar thing with Figma.
What is the magic here? Why doesn’t it break? Let’s try to reverse engineer the problem. How would I achieve it if I wanted to build Google Maps-like functionality?
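As a first experiment, here is a small sketch of the two ways to zoom. It assumes the shape is stored as geometry (a circle given by its centre and radius) rather than as pixels, which is only my guess at one possible approach and not a description of how Google Maps or Figma work. All function names are made up.

```python
def rasterize_circle(size):
    """Draw a circle (centre 0.5, 0.5 and radius 0.4 in unit space) on a size x size grid."""
    grid = []
    for y in range(size):
        row = []
        for x in range(size):
            # Map the pixel centre into unit space and test it against the circle equation.
            px = (x + 0.5) / size
            py = (y + 0.5) / size
            inside = (px - 0.5) ** 2 + (py - 0.5) ** 2 <= 0.4 ** 2
            row.append(1 if inside else 0)
        grid.append(row)
    return grid


def upscale_nearest(pixels, factor):
    """Zoom a raster image by repeating every pixel factor x factor times."""
    return [
        [value for value in row for _ in range(factor)]
        for row in pixels
        for _ in range(factor)
    ]


def show(grid):
    return "\n".join("".join("#" if value else "." for value in row) for row in grid)


small = rasterize_circle(8)          # the stored image, 8 x 8 pixels
blocky = upscale_nearest(small, 4)   # zooming the pixels: square blocks appear
smooth = rasterize_circle(32)        # redrawing from the geometry at the new size

print(show(blocky))
print()
print(show(smooth))
```

Both outputs are 32 x 32, but the first only enlarges the pixels that were already stored, while the second recomputes the edge of the circle at the new resolution.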