What you NEED to Know Before Touching a Video File

Hanging out in subtitling and video re-editing communities, I see my fair share of novice video editors and video encoders, and see plenty of them make the classic beginner mistakes when it comes to working with videos. A man can only read “Use Handbrake to convert your mkv to an mp4 :)” so many times before losing it, so I am writing this article to channel the resulting psychic damage into something productive.

If you are new to working with videos (or, let’s face it, even if you aren’t), please read through this guide to avoid making mistakes that can cost you lots of computing power, storage space, or video quality.

The Anatomy of a Video File and Remuxing vs. Reencoding

Let’s start out with the most important thing: The mistake I see the most and that causes experienced users the most pain to see.

To efficiently work with video files, you need to know the (extreme) basics of how video files are stored: When you download video files or copy them somewhere, you may come across various types of videos. You’ll probably see file extensions like .mp4 or .mkv (or many others like .webm, .mov, .avi, .m2ts, and so on). As a newcomer to video you might be tempted to think that this file extension is what determines the video format. You might have found an mkv file somewhere and noticed that Vegas or Premiere cannot open it, so you searched for ways to convert your mkv file to an mp4 file.

While this is technically not wrong, it’s far from the full story and can cause lots of misconceptions. In reality, all these formats are so-called container formats. The job of an mkv or mp4 file is not to compress and encode the video, but to take an already compressed video stream and package it in a way that makes it easier for video players to play it. Container formats are responsible for tasks like storing multiple audio or subtitle tracks (or even video tracks!) in the same file, storing metadata like chapters or which tracks have which languages, and various other technical things. However, while they store the video (and audio), they’re not the formats that actually encode it.

Actual video coding formats are formats like H.264 (also known as AVC) or H.265 (also known as HEVC). Sometimes they’re also called codecs, short for “coder/decoder”.1 H.264 and H.265 are the most common coding formats, but you may also run into some others like VP9 and AV1 (e.g. in YouTube rips) or Apple ProRes. These are the formats that handle the actual encoding of the video, which is the much, much, much harder part. A raw video file is massive, so these formats use lots of very clever and complicated tricks to store the video as efficiently as possible while losing as little quality as possible. In particular, this means that these formats are usually lossy, i.e. that video encoding programs will cause slight changes in the video in order to be able to compress it more efficiently. However, figuring out how to make a video as small as possible while sacrificing as little quality as possible is very hard, which is why encoding a video takes a lot of time and computing power. This is why rendering a video takes as long as it does.

Note that H.264 is different from x264, which you may also have heard of. H.264 is the coding format itself, while x264 is a specific program that can encode to H.264. The same is true for H.265 and x265. You will see later on in this article why this distinction matters a lot.

So, to summarize: A video file is actually composed of a container format (like mkv or mp4), which itself contains an actual video stream. Changing the container format is simple: You just rip out the video stream and stick it into another container. (Well, it’s a little more complicated than that. But the point is: The container format is not the one that encodes the actual video, so you can switch container formats without encoding the video from scratch.) Changing the underlying coding format, however, or recompressing the video to change the file size, is harder and will a) take time and computing power, and b) lose video quality.

The process of decoding a video stream and encoding it again using the same or a different coding format is called reencoding. Changing the surrounding container format, on the other hand, is called remuxing (derived from “multiplexing”, which refers to sticking multiple audio or video streams into the same file).

This is extremely important to know when working with videos! If you try to convert your mkv file to an mp4 to open it in Premiere by sticking it into a converter like Handbrake (or, worse, some online conversion tool) without knowing what you’re doing, you may end up reencoding your video instead, which will not only take much, much longer, but also greatly hurt your video’s quality.

Instead, chances are that you can just remux your video to an mp4 instead, leaving the underlying encoded video stream untouched. Now, granted, there are some subtleties here, in particular to do with frame rates (more on this later), but the point is: lots of simple-looking “conversion” methods (like Handbrake, random converter websites, etc.) will actually reencode the video, which you want to avoid as much as possible. Knowing how a video file is structured, and what tools you can use to work with them (again, more on this later) will help you avoid many of these mistakes.
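To make the difference concrete, here is a minimal sketch using ffmpeg (one of the tools covered later; the filenames are placeholders). The first command remuxes: it copies every stream untouched into a new container, which takes seconds. The second command reencodes, because no codec was specified and ffmpeg falls back to encoding the streams again with its default settings:

ffmpeg -i input.mkv -c copy output.mp4

ffmpeg -i input.mkv output.mp4

If a “conversion” takes more than a few seconds, that’s a strong hint that you are reencoding rather than remuxing.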

Video Quality

Next, let’s talk about the concept of “video quality”, which I myself already invoked above. I don’t think there is any other concept in video with as many misconceptions about it as video quality, and once again misunderstanding it can cause you to make many avoidable mistakes. This is important for both encoding your own videos and for selecting which source footage you want to work with.

Here is a list of things that people commonly associate with a video’s quality:

- Its resolution (1080p/720p/4k/etc.)
- Its frame rate (24fps / 60fps / 144fps / etc.)
- Its bit depth (8bit / 10bit / etc.)
- Its file size or its bitrate (i.e. file size divided by duration)
- Its file format (.mkv / .mp4 / etc.)
- Its video coding format (H.264 / H.265 / etc.)
- The program used to encode the video (x264 / x265 / NVENC / etc.)
- The settings used to encode the video
- The video’s source (Blu-ray / Web Stream / etc.)
- The video’s colors (brightness / contrast / saturation / etc.)
- The video’s color space and range (i.e. whether it’s in HDR)
- How sharp or blurry the video is

If you’ve paid attention to the previous section, you should know that at least some of these, like the file format, cannot possibly determine quality (but it’s still a misconception I sometimes see!). But, in fact, the truth is that none of these things are necessarily related to a video’s quality! The program used to encode the video combined with the settings used in it gets the closest, but only in specific scenarios.

Why is this? Well, let’s go through them one by one (but in a slightly different order to make things easier to present).

The Encoding Program and its Settings

Like I said, these two combined are what gets closest to being directly related to the video’s “quality”. Why they matter is probably obvious once one mentions them as a variable: Of course different encoding programs can encode a video in different ways, and different settings will make them do it differently. But the real lesson to learn here is that these are even parameters in the first place! This is something that even semi-experienced users sometimes miss (for example, I did so when I was starting out!): It’s easy to think that ffmpeg -i myvideo.mp4 myencodedvideo.mp4 is the only way to reencode a video (maybe sprinkle in -preset slow if you’re feeling like an expert), without realizing that this will use a fixed (low) quality setting that could be adjusted with further settings.
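For illustration (assuming a typical ffmpeg build where libx264 is the default H.264 encoder, with preset medium and CRF 23 as its defaults at the time of writing), the bare command above is roughly equivalent, as far as the video stream is concerned, to spelling the knobs out explicitly:

ffmpeg -i myvideo.mp4 -c:v libx264 -preset medium -crf 23 myencodedvideo.mp4

Once the settings are written out like this, it becomes obvious that they are parameters you can, and should, choose deliberately.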

So, I really cannot stress enough that the encoding settings (including the tool used) matter the most when it comes to a video’s quality. This mainly manifests itself in two ways:

1. The tool used. When it comes to encoding H.264 or H.265, the best encoders without any competition2 are x264 and x265. When you are in any situation where you can afford it, you should be using one of these encoders. Most video editing programs allow you to select them (and programs like ffmpeg or Handbrake (though ideally you shouldn’t use the latter) use them internally). Most importantly, hardware encoders like NVENC are not useful when targeting quality and efficiency: they aren’t as sophisticated as x264/x265 and are geared more towards low latency and high throughput. Again, this is very important to realize, and it’s the main reason why I am stressing this so much. Hardware encoding certainly has its place in scenarios like streaming, where latency matters much more than efficiency or quality, but when your goal is to output a high-quality encode, you shouldn’t ever use it.

2. The quality setting. In x264/x265, the main knob to fiddle with to control quality is the setting called CRF (short for Constant Rate Factor). Lower CRF means higher quality (i.e. less quality loss when encoding) at the cost of higher file size.

My main point here is not really how to use the CRF setting, but mainly that it exists in the first place, and that it above everything else controls the output quality of your video.3

There are lots of other settings in x264/x265 that experts can use to precisely tweak their encodes, but if you don’t know what you’re doing I’d recommend not touching them at all. Once again, my main point here is really just that encoding settings affect output quality.

Now, I said above that these parameters are what gets closest to the video quality, but only in specific scenarios. Why is this? Well, what I mean is that all the encoding settings can affect is how closely the encoded video resembles the input video, i.e. how much quality is lost at the encoding step. If your input video is already bad, then reencoding it with perfect settings will not fix it. This may seem obvious, but it highlights how video quality has multiple different facets. Say you are choosing what footage to use as a base for your encode or edit, and have the choice between two sources, where one has a much higher bitrate than the other. Usually, you would choose the source with the higher bitrate, but this only makes sense if the two sources were encoded from the same underlying source (or at least similar ones)! It’s very possible that the higher-bitrate source had some other destructive processing applied to it (say, sharpening, a bad upscale, lowpassing, etc. - more on these later). In cases like these, you may want to choose the lower-bitrate source instead, if it’s at least encoded from a clean base.

So, as a summary, the quality loss of an encode is controlled by the encoding tool and settings, but the quality of an existing video is affected by every single step that happened between it being first recorded or rendered and it arriving on your hard drive.

Interlude: So Then, What is Quality Actually?

I’ve now spent a long time talking about what quality isn’t, as well as what quality is affected by, so it might be time to try to formulate an actual definition of quality.

We already got pretty close with our discussion of encoding some video from a given source, with the goal of getting an output that differs from the input as little as possible. That is what quality is: The quality of an encoded and processed video is a measure of how closely it resembles the source it was created from.

Again, this sounds extremely obvious once you spell it out, but it has huge consequences that may not be clear to everyone! Most importantly, quality can only ever be measured relative to some reference, some kind of ground truth. Without a ground truth, everything becomes subjective.

Secondly, this now says something about the “quality” of videos you may come across in the wild (i.e. ones that weren’t encoded by you): When you have two or more possible sources for the same footage (say, a movie or a show) available, and want to evaluate their quality, what matters is which of them is closer to the original footage they were both created from. In the case of a movie, this would be the original master. Once again, this may sound obvious, but we will see soon how many misconceptions are formed from not understanding this principle.

Finally, I need to talk about the word “closer” in this new definition, which is actually doing a lot of heavy lifting. What “closer” really means here is very complicated, which is why I left it somewhat vague on purpose. There are lots of ways to compare how close two videos are (and that’s assuming that they have the same resolution, frame rate, colors, etc), but none of them are perfect. In particular, there are many automated “objective” metrics (you may have heard of PSNR, SSIM, VMAF, etc.). These are very important for encoding programs to function at all, but it’s important to realize that no automated metric is perfect, and they all have their own strengths and weaknesses.

Because of this, video “quality” will always entail some degree of subjectivity. Still, there are some things that are almost certainly wrong, and you’ll see some of them in the following sections.

Back to Mythbusting

You should now have a decent idea of what quality actually means, and what it’s determined by. Still, I want to spell out explicitly why various other parameters do not directly correlate to quality, and clear up associated misconceptions. So, let’s go through them one by one.

File Size or Bitrate

This should hopefully be clear from the section on encoding settings. Yes, more bits usually means better encode quality if everything else stays the same, but ultimately the full package of encoding settings (which bitrate can be one of) is what matters. Different encoders or settings will result in different efficiency levels, so you can have two encodes of the same quality and different file sizes or vice-versa. For example, NVENC allows very fast encoding at the expense of larger file sizes, so an x264 encode (with decent settings) will get you much smaller files of the same quality (but will of course take much longer).

Video Coding Format (H.264 / H.265 / AV1 / etc.)

Again, hopefully this should mostly be clear now: What matters is the tool used to encode the video (and its settings), not the format it encodes to. A more advanced format will allow for more techniques to efficiently encode a video, but that only matters if the encoding program properly makes use of them.

In particular, there is an often quoted factoid that “HEVC is 50% more efficient than AVC”, which in reality is just plain wrong. H.265 (that is, current standard H.265 encoders) does usually provide an efficiency gain over H.264 (that is, current standard H.264 encoders), but if it does it’s by far less than 50%. And, as always, the format is just one facet of the full “Encoder and settings used” package. On pirate sites I sometimes see comments like “I want to download this, but it’s AVC. Is there an HEVC version somewhere?”, and I hope that I don’t have to explain anything further about why that makes no sense.

Another important point is that the strengths and weaknesses of encoding tooling can greatly differ based on the level of quality you’re targeting. AV1 is the current new and fancy coding format, and modern AV1 encoders (when used correctly) can yield incredible efficiency gains over x264/5 on low-fidelity encodes. However, for high-quality encodes (i.e. targeting visual transparency), x264 and x265 are still far ahead. It’s for reasons like these that it’s very hard to make blanket statements on the efficiencies of different encoders.

One final thing to mention here is that the coding format will affect how difficult it is to decode your video. Older or smaller devices may struggle to decode more advanced formats like AV1 or even H.265 (or specific profiles of formats like 10bit H.264). This doesn’t directly affect quality, but it may be important to mention for people that plan on making their own encodes: If you’re targeting high player compatibility, you may need to keep this in mind and (for example) release an 8bit H.264 version alongside your main release.

The File Format (.mkv / .mp4)

Hopefully I don’t have to say anything more here. Read the first section again if this is not yet clear to you. But I have seen “This is an mp4 file, can someone upload an mkv file instead?” more than once, which is why I need to spell this out here. If this was you, look through the later sections to see how to fix these things for yourself.

Resolution

This may be the biggest misconception of them all: Many people effectively think that resolution is the only thing that controls a video’s quality. Maybe it’s because of how YouTube and many other streaming platforms expose resolution as the only setting to change “quality”. Either way, this is not the case. We’ve already seen why in general, but let’s go over some specific cases:

So, as a summary, keep in mind that resolution is not the same as quality. A higher resolution may not mean better quality, and lowering the resolution may not be the best way to save file size.

Frame Rate

This is fairly similar to the resolution story, so there’s not much more to say here. Like AI upscaling just for the sake of upscaling, frame interpolation is bad. There’s not even any nuance here this time, just don’t do it. (Do I need to spell out what “quality” means again?) Movies and TV shows are usually 24fps (well, often they’re actually 23.976fps4, but you get the idea), so if you find a source somewhere that has some different frame rate, double-check whether it is the correct one.

Bit Depth (8bit / 10bit / etc.)

This is a tricky one, and I am mainly mentioning it to talk about a very specific technique in encoding.

Bit depth is a slightly more niche concept, so I’ll explain it just in case: Bit depth refers to how many color values are possible for each pixel. Almost all images and videos you’ll come across are 8bit. For RGB colors, this would mean 256 red/green/blue color values per pixel, which results in 256 * 256 * 256 = 16777216 total possible RGB color values. In reality, video colors are not actually stored in RGB, and usually do not exhaust their full available range of values, but for getting a basic intuition this is not too important.

However, it’s also possible for videos to have a higher bit depth like 10bit or 12bit. Apart from masters, this is common for HDR video.

In principle, the same rules as for resolution and frame rate apply: Don’t change any aspects of your video without a good reason, so don’t change the bit depth either if you can avoid it. That said, it is common in video encoding to actually encode footage at a bit depth higher than the source’s. This is due to intricacies of video encoding that are too complicated to explain here, but the upshot is that encoding at a higher bit depth can actually result in an increase in efficiency. This is why you may see 10bit encodes of 8bit footage: These do not mean that there was a 10bit source somewhere, they’re just encoded in this way because it was more efficient.

This doesn’t contradict our philosophy of not changing anything without good reason, it just means that there is a “good reason” in this case. In particular, this is feasible here because, unlike with resolution or frame rate, increasing bit depth is not a destructive process (when done correctly)5.

(If you’re interested in why encoding at a higher bit depth is more efficient, here’s an attempt at a basic explanation: Intuitively, you might be confused about this, since adding more bits ought to correspond to more bits to store, which results in more required file size. But the important thing to realize is that the “bit depth” in modern video coding formats is not actually what controls the level of precision with which pixel values (or, in reality, DCT coefficients) are stored. That level of precision is controlled by the quantization level, which is a different parameter. (And that is in fact the main knob that encoders turn to regulate bit rate and quality.) Instead, the actual bit depth controls the level of precision at which all mathematical operations (like motion prediction and DCTs) are performed, as well as the allowable scale for the quantization level. Encoding at a higher bit depth means that operations are performed with more precision, which makes certain encoding techniques more precise and hence more efficient, which in turn saves space. However, raising the bit depth also means that slightly more bits need to be spent to encode the actual quantization factor (and other elements), so at some point you do get diminishing returns. Empirically it turns out that encoding at 10bit works pretty well for 8bit content, but that encoding at 12bit is not worth it.)
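If you want to try this yourself, here is a hedged sketch of what a 10bit encode of 8bit footage could look like with ffmpeg and libx265 (assuming a build with 10bit support; the filenames and the CRF value are placeholders to adjust):

ffmpeg -i input_8bit.mkv -c copy -c:v libx265 -preset slow -crf 18 -pix_fmt yuv420p10le output_10bit.mkv

The -pix_fmt yuv420p10le flag is what requests the 10bit output; everything else follows the same logic as the encoding section further below.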

The Video’s Source (Blu-ray / Web Stream / etc.)

This is another slightly tricky one. Usually, a Blu-ray release of some footage will be better than a web version from the same source, on account of having a much higher bit rate. However, this doesn’t always need to be the case: The fact that various post-processing operations can affect the quality of the video also applies to the authoring stage (that is, the process of taking a show or movie’s master, and putting it onto a Blu-ray, performing all the necessary conversion and compression that this entails), and it is very much possible for a Blu-ray release to have some destructive filtering applied to it that the web releases do not (or for the Blu-ray release to just have terrible encoding settings). Different web streams from different sites, or different Blu-rays from different authoring companies can be different too.

Again, this is especially relevant in anime, where some Blu-ray authoring companies apply a blur to the video before encoding it, which hurts quality6.

If you’re just starting out in working with video, it may be hard to judge for yourself which source is better, but the main thing I want to convey here is that “Blu-ray” does not automatically have to mean “better quality”. Always try to manually evaluate sources using your eyes, or ask someone more experienced for advice on which source to pick (see below for some resources on this).

HDR vs. SDR

HDR (High Dynamic Range) is another complicated topic. What I mainly want to convey here is that, once again, HDR does not automatically mean “better than SDR”. If there are HDR and SDR sources of some footage available, it all depends on how they were created, and from what kind of common source (if there is one). It’s possible for the SDR version to be a direct tonemap of the HDR one (in which case the HDR version is the objectively better source) or for the HDR version to have been inverse tonemapped from the SDR one (in which case it’s the other way around), or for them to have both been created from some base source (in which case it depends on how). For example, it is not uncommon for official HDR releases of some footage to never actually reach a brightness above 100 nits, and hence be no better than the SDR version.

In particular, you should be very suspicious of any HDR (or Dolby Vision) source you may find for a video that wasn’t officially released in HDR anywhere. It’s very much possible that this “HDR” version was created artificially from the SDR version by whoever released it, in which case (just like an AI upscale) there’s no reason to use it over the base SDR version.

Again, HDR is a very complex topic and these things can be very hard to evaluate as a newcomer, but the important thing is to know that this subtlety exists in the first place. If the SDR version looks decent, you may just want to save yourself (and your viewers, if there are any) the trouble of dealing with HDR and work with the SDR version.

Colors

As I have already repeated ad nauseam, the goal of video encoding is to change the source as little as possible. Just like you shouldn’t change the resolution or frame rate without a good reason, the same applies to colors. I sometimes see releases where people “improved the colors :)”, and it turns out that what they really did was fiddle with the brightness and saturation sliders until it looked “better” (read: brighter and more vibrant).7 But doing this is the opposite of staying true to the source. Color grading is very important for editing photos or raw footage, but when you’re working with footage that was already edited and mastered by the artists, any further “color corrections” go against the artistic intent.

In short, remember that “brighter and more saturated” does not mean “better”.

Finally, while we’re on the topic of colors: When you run an encode, especially from some kind of video editing software, make sure to make a direct comparison of some output frames to the corresponding input frames using good viewing software (i.e. mpv or vs-preview, see below). If you see a noticeable color mismatch, this may be due to some misconfiguration in your editing software or project (like the color matrix or color range) that you will need to look into.
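One quick way to grab matching frames for such a comparison (a sketch, assuming both files have the same frame count and order; the frame index 1000 is arbitrary) is to dump them with ffmpeg and then flip back and forth between the two images:

ffmpeg -i source.mkv -vf "select=eq(n\,1000)" -frames:v 1 source_frame1000.png

ffmpeg -i encode.mkv -vf "select=eq(n\,1000)" -frames:v 1 encode_frame1000.png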

Sharpness

Last but definitely not least, we have another one of the bigger misconceptions. Many people think that “sharp” means “higher quality” and, in particular, that “blurry” means “lower quality”. While it’s true that a lower quality encode can manifest itself in more noise around lines, and that reducing the resolution (which we’ve already established you probably shouldn’t do) will automatically mean that lines can no longer be as sharp, this is far from a one-to-one correspondence.

In reality, the exact same thing as for resolutions, frame rates, or colors applies. You want to stay as close to your original video as possible. If some elements of the original video are comparatively blurry, chances are that they’re meant to be blurry. (Or, at the very least, any kind of sharpening process will not be able to distinguish between elements that are meant to be blurry and ones that aren’t.)

Hence, just like you shouldn’t fiddle with color sliders just to “improve the colors”, you shouldn’t slap a sharpening filter on top of your video just to “make it sharper :)”. This will only take your video further away from the source, not closer.8

It’s true that to the layman viewer’s eye, sharper content will look more appealing. But once you know what to look for, you will see that sharpening creates a lot of ugly artifacts like line warping or haloing. Like with upscaling, please just take my word for it when I tell you that prioritizing sharpness above all else is not a good idea.

Summary

Now, that was a lot of text, but unfortunately it was needed. Video is very, very complicated, and this was just the tip of the tip of the iceberg. In case that was too much information to dump on you all at once, let me summarize the most important takeaways:

Learning to Spot Quality Loss

As a novice video encoder, it may be hard to see quality loss in the beginning. You may come across images or comparisons where some experienced encoder says “Oh my god this looks terrible!!” while you’re thinking “Are those the same picture?”.

But don’t worry, this is normal. You have to know what to look for in an image, and you have to train your eyes to look for it. (But know that this is cursed knowledge. Once you learn how to spot artifacts, you can never look at video the same again.) A full guide on how to spot video artifacts would take up an entire second article with many example images, but as a short summary, here is a list of areas you should focus on most:

Keep in mind that what constitutes acceptable quality loss is always in the eye of the beholder, and that that is a two-way street. If you are creating encodes mainly for yourself, and you yourself cannot see any quality loss, then there’s no reason to worry about it even if someone else tells you it’s visible. However, on the other hand, you also shouldn’t criticize anyone for releasing high file size encodes to prevent quality loss just because you can’t see the artifacts they would prevent.

Subtitles

When you’re working on an anime or some other media that is not in your target audience’s language, you will need to add subtitles, in which case there are a couple of things you should know.

The most powerful format for subtitles is Advanced SubStation Alpha, or ASS for short9. ASS subtitles allow not only showing subtitles for spoken dialogue but also creating translations for on-screen text that blend in seamlessly with the original video. Even if you do not plan to make subtitles like these yourself, you will probably want to ship subtitles you downloaded from somewhere, which will most likely be in the ASS format.

One important thing to know is that the only container format that really supports ASS subtitles is mkv. If, for some reason (probably because you’re targeting some kind of streaming), you do not want to release an mkv file in the end, you will need to hardsub. See below for the best way to do this.

Secondly, if your goal is to edit your video, you will have to think about how to match your subtitles to your edit. There is no good automated solution here. Your options are basically:

1. Manually retime the subtitles in a program like Aegisub, or
2. Hardsub the subtitles and edit the hardsubbed video.

In general, you should avoid hardsubbing when possible, since it

- involves reencoding, and hence introduces quality loss,
- takes time (which may not be a problem when you are only editing your video once, but becomes increasingly annoying if you want to make incremental fixes later on),
- makes it much harder for anyone, including yourself, to change some aspect of the subtitles later on.

However, retiming all subtitles yourself for a quick edit is also a lot of effort. In the end, the choice is yours. If you do end up hardsubbing, make sure you do it correctly. Read the later sections for how.

I’ve now talked a lot about what you shouldn’t do, so what should you do instead? This section contains some useful tools, as well as workflows to do certain things the right way.

Tools You Should Not Use

Workflows

Finally, let me explain a few things you should do.

If you’ve read the previous sections, you’ll know that reencoding a video will hurt its quality (and reencoding it over and over will hurt its quality even more, since later encodes will spend bits to reproduce the artifacts introduced in the previous encodes). Hence, you should make sure that you only reencode when absolutely necessary, and do all other necessary conversions through remuxing. Ideally, that would mean (lossily) reencoding only once, at the very end of your workflow. If your editing software does not allow encoding using x264/x265, you can export a lossless render from it and then encode that lossless render with ffmpeg.

Sometimes, you cannot easily avoid reencoding an additional time at some other step in your workflow. If this happens, at least make sure that your intermediary encodes are either lossless, or as close to lossless as possible. Unfortunately, I have not yet found a reliable way to encode a lossless file that common editing programs can open (if you know one, let me know!), but at the very least you can make an x264 encode with -crf 1.

To make an actually lossless encode with x264 you can add -qp 0 instead of a -crf argument, but be aware that not all programs will be able to open such a file.
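For example, something along these lines (a sketch; the fast preset is a judgment call for throwaway intermediates, and copying the audio assumes your audio codec fits the target container, otherwise handle the audio separately):

ffmpeg -i render.mov -c:v libx264 -preset veryfast -crf 1 -c:a copy intermediate.mp4

ffmpeg -i render.mov -c:v libx264 -preset veryfast -qp 0 -c:a copy intermediate_lossless.mkv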

Encoding a Video

The simplest way to encode a video is using ffmpeg. More advanced users will encode using x264 or x265 directly, but ffmpeg is fine for beginners.

A basic template command to reencode a video is simply10

ffmpeg -i yourinput.mkv -c copy -c:v libx264 -preset slower -crf 20 youroutput.mkv

As explained in the first section, adjust the CRF to control the quality at the expense of file size. If you’re encoding anime or animation, you may want to bump up the bframes by adding -x264-params bframes=8 (which will save a bit of file size but take longer to encode). Other than that, do not touch any other settings you do not understand. In particular, do not use -tune animation for anime; that tune is targeted towards extremely flat animation, so it will be counterproductive on anime, which usually has a fair amount of grain and texture.
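For example, the anime variant of the template above would then look like this (same placeholders as before):

ffmpeg -i yourinput.mkv -c copy -c:v libx264 -preset slower -crf 20 -x264-params bframes=8 youroutput.mkv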

A good way to think of video encoding is as a three-way tradeoff between file size, quality, and encoding speed. You can decrease the file size, but only at the expense of quality or encoding speed, and similarly for the other two factors. The CRF setting is used to regulate between quality and (decrease in) file size. The preset setting controls the encoding speed, and hence the efficiency. A faster preset will mean a faster encode, but also a larger and lower-quality one.

Be aware that the visual quality of a given CRF value will depend on the resolution you’re encoding at. CRF 18 at 1080p behaves differently from CRF 18 at 480p. The best way to pick a CRF for your encode is just to run a few sample encodes and compare the results.
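One way to do that (a sketch for a Unix-like shell; the 10-minute start offset and 60-second duration are arbitrary, pick a representative scene from your own video):

for crf in 16 18 20 22; do
  ffmpeg -ss 600 -i yourinput.mkv -t 60 -an -c:v libx264 -preset slower -crf "$crf" "sample_crf$crf.mkv"
done

Then compare the samples against the same section of the source in mpv and pick the highest CRF that still looks good to you.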

Muxing an MKV

Hopefully you can figure this out with the tools linked above (MKVToolNix being the easiest way). All I really want to say here is that if you are muxing in ASS subtitles, you need to add all the fonts used in the subtitles as attachments. Aegisub has a font collector that can collect all the fonts used in a file. If you don’t want to install the fonts, you can use a font manager like FontBase (add the folder with all fonts as a “watched folder”) to temporarily activate them without installing them.
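If you prefer the command line over the MKVToolNix GUI, the equivalent mkvmerge call looks roughly like this (a sketch; the filenames are placeholders, and you repeat the attachment options for every font the subtitles use):

mkvmerge -o output.mkv video.mkv subtitles.ass --attachment-mime-type font/ttf --attach-file SomeFont.ttf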

Remuxing to MP4

This is more tricky and the main reason why this section exists. In principle, muxing an mp4 file is easy: Just run ffmpeg -i yourinput.mkv -c copy youroutput.mp4. However, chances are that the reason you are remuxing to an mp4 file is so that you can import your video into your favorite video editing program. In that case, remuxing using ffmpeg can cause some problems with the frame rate.

Most videos you’ll come across have a constant fractional framerate of 24000/1001 (which is approximately 23.976) frames per second. But this is actually a bit of a lie: A lot of times the frame rates aren’t truly constant (and, in fact, in mkv files they often cannot be). For certain technical reasons, frame timestamps often need to be rounded, which causes ever-so-slight deviations from the constant 24000/1001 frames per second. Video players handle this completely fine, so that you’d never even notice it as a normal (or even experienced) user. However, some video editing programs can be extremely picky about these frame rates, and introduce stuttering when the frame rate is not truly constant.

Since mkv files fundamentally cannot have a true constant frame rate of 24000/1001, remuxing to mp4 using ffmpeg will also result in a frame rate that is not truly constant. You can see this in MediaInfo in the Frame rate mode entry.11

There exist a couple of ways to fix this:

1. MkvToMp4 is a GUI application that can remux an mkv to an mp4 file and force a constant frame rate if applicable. While I haven’t audited it in detail myself, I know video editors who have used it for a long time and haven’t had issues with it.

2. With the right incantation, you can also force a constant frame rate in ffmpeg. The best one I could come up with needs two invocations, though:

   ffmpeg -i yourinput.mkv -c copy -video_track_timescale 24000 intermediary.mp4
   ffmpeg -i intermediary.mp4 -c copy -bsf:v "setts=dts=1001*round(DTS/1001):pts=1001*round(PTS/1001)" out.mp4

   If your source video is, say, 30000/1001 fps instead of 24000/1001, replace the 24000 in the first call with the appropriate numerator.

There also exists a tool called mp4fpsmod that can force mp4 frame rates, but I found the ffmpeg call to be more reliable when the first frame does not start at timestamp 0.

Hardsubbing

Use mpv to hardsub:

mpv --no-config yourinput.mkv -o youroutput.mkv --audio=no --ovc=libx264 --ovcopts=preset=slower,crf=20,bframes=8

Adjust the encoding settings accordingly, of course.

This will hardsub the track marked as the default; add e.g. --sid=0 or --slang=eng to select a different track. Hardsubbing is an extra encoding step, and as explained above you want to reencode as few times as possible. Hence, either make sure that hardsubbing happens at the end of the workflow from a lossless source, or output a (near) lossless encode when hardsubbing (e.g. by setting the CRF to 1).
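As a concrete variant of the command above, a near-lossless hardsub of the English subtitle track might look like this (same caveats as before; --slang=eng is just an example selector, and the fast preset is fine for an intermediate):

mpv --no-config yourinput.mkv -o youroutput.mkv --audio=no --slang=eng --ovc=libx264 --ovcopts=preset=veryfast,crf=1,bframes=8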

Bonus: Interlacing

This is a bonus section meant to prevent some slightly more advanced misconceptions. If you don’t know what the term “interlacing” means, you can safely skip this section.

If you do know what interlacing means, the main thing I want to get across here is that not all interlacing is the same, and in particular that the answer to seeing footage that looks “interlaced” is not always to run a deinterlacer.

When working with movies and TV shows, it is actually much more likely for interlaced-looking footage to really be telecined.12 What this means exactly is outside the scope of this article, but you can read fieldbased.media or the Wobbly guide for more information. The important takeaway is that telecining can (almost) be losslessly reversed (though it may need manual processing), and that running a deinterlacer on telecined footage will throw away half the vertical resolution while still keeping the frame rate stutters. When you see footage that shows combing, please consult some more experienced person before blindly running a deinterlacer on it.
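For the curious, here is a very rough sketch of the difference in ffmpeg terms (hedged heavily: fieldmatch plus decimate is only an automated approximation of inverse telecine, and tricky sources need manual tools like Wobbly instead). Plain deinterlacing, which halves the detail of telecined film, would be

ffmpeg -i input.mkv -vf yadif -c:v libx264 -preset slower -crf 20 deinterlaced.mkv

while a rough automated inverse telecine, which matches fields back into full frames and then drops the duplicate frames, would be

ffmpeg -i input.mkv -vf fieldmatch,yadif=deint=interlaced,decimate -c:v libx264 -preset slower -crf 20 ivtc.mkv

But again, please ask someone experienced before running either on footage you care about.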

The Rabbit Hole

The above should cover everything you need to know as a beginner. If you like to suffer and are interested in learning more about multimedia and encoding, the JET Guide can be a good place to start. In particular, it contains a big list of resources that link to other good guides.


  1. Technically the term codec refers to a specific program that can encode and decode a certain format, not the format itself, but almost nobody makes that distinction in practice.↩︎

  2. When targeting quality↩︎

  3. You may be wondering why I am not mentioning bitrate, which is also a setting in x264/x265. This is because setting x264/x265 to some bitrate will make them force the video to that bitrate (when possible), even if it may not be necessary. This will make it waste bits on simple scenes that could be spent on more complex scenes instead. When you are not encoding for live streaming, CRF is the better setting to use, since it will automatically allocate the bits where they’re needed most.↩︎

  4. And that is actually 24000/1001 fps and also constant frame rates are usually a lie anyway but you get the idea.↩︎

  5. Scaling or changing the frame rate can also be nondestructive when done correctly, but they’re much easier to get wrong than the bit depth.↩︎

  6. Why this happens is complicated (and we don’t even fully know ourselves). The technical term is lowpassing, with the idea being to remove high frequencies in advance in order to improve compressibility, but in practice this is just counterproductive. We suspect that certain proprietary authoring software suites have this lowpassing enabled by default, and that authoring studios aren’t aware of it or its negative consequences.↩︎

  7. There are some actual types of errors in encoding that affect colors and can be objectively fixed, like double range compression or mistagged color matrices, but those are not the same thing as fiddling with some sliders, and they once again require you to know exactly what you’re doing.↩︎

  8. Once again, some caveats apply here in specific cases. For example, if you absolutely cannot avoid upscaling your video, you might as well find a “good” way (whatever that means) to upscale it, and try to add as little blurring as possible. But sharpening just for the sake of sharpening is not a good idea.↩︎

  9. Yeah, the jokes never get old.↩︎

  10. This is a starting point for the target audience of this article. Experienced encoders targeting transparency will use very different settings.↩︎

  11. Though I’m not fully sure if MediaInfo is completely reliable here. The best way to know for sure is to use MP4 Inspector and check the moov > trak > mdia > minf > stbl > stts box. If it is truly CFR, there should only be a single entry.↩︎

  12. This is a fairly established term in the encoding community, but it’s actually somewhat incorrect. Outside of the encoding community, you’ll usually see this being referred to as 3:2 pulldown instead.↩︎