Furkan Gözükara

Kohya FLUX Fine Tuning (Full Checkpoints) / DreamBooth Training Full Tutorial For Local Windows and Cloud RunPod and Massed Compute

Added 2025-10-28 23:11:00 +0000 UTC

The very best complete workflow and configurations for full Fine Tuning of FLUX models on GPUs with as little as 6 GB VRAM, on Windows and Cloud

Patreon exclusive posts index to find our scripts easily, Patreon scripts updates history to see which updates arrived for which scripts, and the amazing Patreon special generative scripts list that you can use in any of your tasks.

Join our Discord to get help, chat and discuss, and tell me your Discord username to get your special rank : SECourses Discord

Please also Star, Watch and Fork our Stable Diffusion & Generative AI GitHub repository and join our Reddit subreddit and follow me on LinkedIn (my real profile)

=======

Full main tutorial : https://youtu.be/FvpWy1x5etM

Latest zip file : Kohya_FLUX_DreamBooth_LoRA_v32.zip

Quick new Massed Compute install (Oct 2025) : https://www.youtube.com/watch?v=Ym9rdfy2VZ0

Multi-GPU DreamBooth training requires 80 GB GPUs (A100 or above); 48 GB is sadly not sufficient, although I didn't test recently. LoRA training works great on 48 GB GPUs
Put the generated checkpoint files into \SwarmUI\Models\diffusion_models
LoRA training post : https://www.patreon.com/posts/110879657
- LoRA checkpoints into \SwarmUI\Models\Lora
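The two target folders above can be handled with a small helper. This is only a sketch: the function name `install_model`, the `kind` labels and the example paths are hypothetical, and only the two folder names come from this post.

```python
import shutil
from pathlib import Path

# Folder names are taken from the post; everything else here is illustrative.
SWARMUI_SUBFOLDERS = {
    "checkpoint": Path("Models") / "diffusion_models",  # full Fine Tuning / DreamBooth checkpoints
    "lora": Path("Models") / "Lora",                    # LoRA files
}


def install_model(model_file, swarmui_root, kind):
    """Copy a trained model file into the matching SwarmUI models folder."""
    target_dir = Path(swarmui_root) / SWARMUI_SUBFOLDERS[kind]
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps timestamps and returns the path of the new file
    return Path(shutil.copy2(model_file, target_dir))


# Example call (hypothetical paths):
# install_model(r"C:\training\outputs\my_model.safetensors", r"C:\SwarmUI", "checkpoint")
```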
Suggested Auxiliary Tools:
- Detailed LoRA extraction guide and tests from FLUX fine-tuned models : https://www.patreon.com/posts/112335162
- Ultimate-Batch image pre-processing : https://www.patreon.com/posts/112126955
- Ultra fast upload and download notebook for backing up models to Hugging Face : https://www.patreon.com/posts/104672510
- If you want to convert FP16 checkpoints into FP8 with no visible quality loss and save 12 GB disk space per checkpoint, follow this public tutorial : https://www.patreon.com/posts/how-to-convert-114003125
- Style training full details : https://huggingface.co/MonsterMMORPG/3D-Cartoon-Style-FLUX
- Triton packages : https://github.com/woct0rdho/triton/releases
- RunPod CTL : Click to download
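As a rough check of the FP16 to FP8 disk saving mentioned in the list above, the arithmetic can be sketched as below. It assumes every weight goes from 2 bytes to 1 byte, which is a simplification (a real converter may keep some tensors at higher precision), and uses the 23.8 GB checkpoint size reported later in this post. The function name is hypothetical.

```python
FP16_BYTES_PER_WEIGHT = 2
FP8_BYTES_PER_WEIGHT = 1


def converted_size_gb(fp16_size_gb, bytes_per_weight):
    """Estimate checkpoint size after storing each weight in `bytes_per_weight` bytes."""
    return fp16_size_gb * bytes_per_weight / FP16_BYTES_PER_WEIGHT


fp16_size = 23.8  # GB, checkpoint size reported in this post
fp8_size = converted_size_gb(fp16_size, FP8_BYTES_PER_WEIGHT)
print(f"FP8 estimate: {fp8_size:.1f} GB, saved: {fp16_size - fp8_size:.1f} GB")
```

Under this assumption the estimate is about 11.9 GB per checkpoint, which is in line with the roughly 12 GB saving stated above.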
Mandatory Tutorials:
- Windows requirements video (Python, CUDA, Git, cuDNN, C++ Tools) : https://youtu.be/DrhUHnYfwC0
- Windows LoRA tutorial (mandatory to learn Kohya) : https://youtu.be/nySGu12Y05k
- Cloud LoRA tutorial (mandatory to learn cloud) : https://youtu.be/-uhL2nW7Ddw
Suggested Tutorials:
- Learn SwarmUI : https://youtu.be/HKX8_F1Er_w
  - Installers : https://www.patreon.com/posts/106135985
  - Rename T5 XXL into : t5xxl_enconly.safetensors
- Learn SwarmUI on Cloud : https://youtu.be/XFUZof6Skkw
- Learn SwarmUI with FLUX : https://youtu.be/bupRePUOA18
- SUPIR Upscaler to Upscale : https://youtu.be/OYxVEvDf284
  - Download SUPIR Installer: https://www.patreon.com/posts/99176057
Generating samples during training increases VRAM usage, so Batch Size 7 will give an error on RTX A6000 and you may also get OOM (out of memory) errors - thus I don't recommend it

29 October 2025 Update v32

I have added an amazing new tool called Image Preprocessing
This tool is extremely important and useful when you train with bucketing enabled
I recommend using this tool to preprocess your training images and check how your images are actually used during training
Just run Windows_Install_or_Update_Kohya.bat to update
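To give an idea of why bucketing changes how your images are used, here is a simplified sketch of aspect ratio bucketing. It is not the exact algorithm of Kohya SD Scripts or of the Image Preprocessing tool: the function names and the values (64 pixel steps, 1024x1024 area budget, 256 to 2048 side limits) are assumptions chosen for illustration.

```python
def make_buckets(max_area=1024 * 1024, step=64, min_side=256, max_side=2048):
    """List (width, height) buckets whose sides are multiples of `step`."""
    buckets = set()
    for width in range(min_side, max_side + 1, step):
        # Tallest height that is a multiple of `step` and keeps the area within budget
        height = min(max_side, (max_area // width) // step * step)
        if height >= min_side:
            buckets.add((width, height))
            buckets.add((height, width))  # same bucket in the other orientation
    return sorted(buckets)


def pick_bucket(image_width, image_height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = image_width / image_height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


def resized_size(image_width, image_height, bucket):
    """Size after scaling the image so it fully covers the bucket (before cropping)."""
    scale = max(bucket[0] / image_width, bucket[1] / image_height)
    return round(image_width * scale), round(image_height * scale)


buckets = make_buckets()
bucket = pick_bucket(3000, 2000, buckets)
print(bucket, resized_size(3000, 2000, bucket))
```

In this sketch a 3000x2000 photo lands in the 1216x832 bucket and is scaled to 1248x832, so 32 pixels of width are cropped away. Checking this kind of crop on your own images before training is the point of preprocessing.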

2 October 2025 Update

Sadly Bmaltais stopped developing Kohya GUI, therefore I forked his repo and from now on we are going to use my own maintained fork
- One advantage of this is that we are now always going to use the latest version of Kohya's SD Scripts
I have heavily optimized and significantly improved the installation, so it is now way faster and installs more reliably on Windows, RunPod and Massed Compute
I have updated libraries to Torch 2.8, CUDA 12.9, Accelerate 0.48, xFormers 0.33, Flash Attention 2.8.3, Sage Attention 2.2 and Triton 3.4 on all platforms
- Now it supports all of the GPUs starting from RTX 1000 series to 5000 series + cloud GPUs like RTX A6000, A100, H200, B200 etc
Moreover I made the app auto recognize FLUX Krea Dev and FLUX SRPO models as FLUX.1 - remember that you still have to enable that checkbox
- All you need to do after loading the config is select the downloaded FLUX SRPO as the base model in the model path instead of the FLUX Dev model
I have trained the new FLUX SRPO model with our existing DreamBooth configs and compared it to FLUX Krea and FLUX Dev base model
I can confidently say that the FLUX SRPO model is perfectly trainable with our config and it is a little bit more realistic than FLUX Dev
- So for realism from now on I recommend FLUX SRPO
Below are base 1024x1024 results with no face restoration or upscale
- Remember that our upscale preset in SwarmUI improves quality by 100%+, as in this post of mine : https://www.patreon.com/posts/133166462 (this was on FLUX Dev, not on SRPO)
- FLUX Dev vs FLUX Krea vs FLUX SRPO full comparison : FLUX_Dev_vs_Krea_vs_SRPO_DreamBooth.jpg
- FLUX Dev vs FLUX SRPO : FLUX_Dev_vs_SRPO_DreamBooth.jpg
FLUX SRPO is an extremely realistic base model compared to FLUX Dev - it is a special fine tune : https://github.com/Tencent-Hunyuan/SRPO
I recommend getting the latest zip file and making a fresh install into a new folder if you want to upgrade to the latest version, since a lot of the installation process has changed
The model downloader script was upgraded to our special ultra FAST and robust model downloader - like uGet with 16 connections + SHA 256 verification
The Windows_Download_Training_Model_Files.bat will ask you which model you want to download
On RunPod and Massed Compute read the instruction txt files and you will see commands to download any of the models directly
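If you want to double check a downloaded model file yourself, SHA 256 verification can be sketched with the Python standard library as below. This is not the code of our downloader: the function names are hypothetical, and the expected hash must come from a source you trust (for example the model's Hugging Face file page).

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-GB checkpoints never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True when the file on disk matches the published SHA 256 hash."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Example call (hypothetical file name and hash placeholder):
# verify_download("flux1-dev.safetensors", "<sha256 from the model page>")
```

A mismatch usually means a corrupted or incomplete download, so the file should be downloaded again before training.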

Windows Requirements

Python 3.10.11, FFmpeg, CUDA 12.9, cuDNN 9.12, C++ Tools, MSVC and Git
- Only Python and Git should be sufficient since I precompile libraries, but to be sure I still recommend installing all of them
If you get any errors follow the video below and its source link
https://youtu.be/DrhUHnYfwC0
https://www.patreon.com/posts/click-to-open-post-used-in-tutorial-111553210

Massed Compute (Recommended Cloud) :

Please register via this link : https://vm.massedcompute.com/signup?linkId=lp_034338&sourceId=secourses&tenantId=massed-compute
- Use our coupon SECourses
- Our coupon works on all GPUs now
  - H100 has amazing price and speed but you can use others such as RTX A6000 ADA as well
  - Full details here : https://www.patreon.com/posts/26671823
- Then select our image SECourses from Creator dropdown
- Then follow Massed_Compute_Instructions_READ.txt
- Same as any of my other Massed Compute installer scripts
- Example tutorial to learn how to install and use Massed Compute
  - (Starts at 12:58) : https://youtu.be/KW-MHmoNcqo?si=G1WbG-Qw4ujWvOtG&t=778

RunPod (Cloud):

Please register via this link : https://runpod.io?ref=1aka98lq
- Then follow Runpod_Instructions_READ.txt
- Same as any of my other RunPod installer scripts
- Use the template written in Runpod_Instructions_READ.txt file
- Example tutorial to learn how to install and use RunPod
  - (starts at 22:03) : https://youtu.be/KW-MHmoNcqo?si=QN8X8Sjn13ZYu-EU&t=1323

13 August 2025 Update

I have trained FLUX Krea Dev model with our FLUX Dev DreamBooth configs and compared the results - inside DreamBooth_Tab_Fine_Tuning_Best_FLUX_Configs folder
- Our model downloader in zip file now auto downloads FLUX Krea Dev too
- So after loading your config just change base model to FLUX Krea Dev
- FLUX Krea Dev Tutorial here
  - https://youtu.be/8MvvuX4YPeo
    - 15:31 FLUX Krea Dev vs FLUX Dev: A Detailed Side-by-Side Image Comparison
    - 16:26 How to Easily Train Your Own LoRAs on the New FLUX Krea Dev Model
    - 17:02 Complete Workflow for Generating High-Quality Images with FLUX Krea Dev
    - 18:20 The Final Verdict: Side-by-Side Result of FLUX Krea Dev vs FLUX Dev
I feel like FLUX Krea Dev needs a little bit higher learning rate or longer training
- I recommend longer training
You can see full size grid comparisons below - trained on 28_imgs_dataset.png
- 100 Epoch Grid
- 125 Epoch Grid
- 150 Epoch Grid
- 175 Epoch Grid
- I also recommend as usual doing 2x latent upscale
- Our SwarmUI FLUX Dev Official 2x Latent Upscale preset works right away
  - Tutorial for 2x latent upscale here : https://youtu.be/Xbn93GRQKsQ
    - 1:07 Achieving Hyper-Realism with the FLUX 2x Latent Upscale Preset
Our default learning rate right now is 4e-06 (still up-to-date) but I trained FLUX Krea Dev with 2e-06, 4e-06, 6e-06 to compare so you will see how it behaved in all cases
I may also research the Chroma model and publish presets for it; currently my focus is Qwen Image, which I believe will be better than FLUX Dev in every aspect
Hopefully a full tutorial and very easy to use workflows and presets are coming soon for Qwen Image model training - I am working on the Gradio App and presets

13 July 2025 Update

Gradio was broken, thus a temporary fix was added : Temp_Fix_Gradio_Error.bat
On RunPod and Massed Compute the fix is applied automatically

29 May 2025 Update

32 GB RAM configs added - this is system RAM, not VRAM
- They are inside the 32 GB RAM Configs - Not VRAM - RAM folder, which is inside LoRA_Tab_LoRA_Training_Best_FLUX_Configs
- The difference is that you have to use flux1-dev-fp8.safetensors and the config now has fp8 base unet enabled
Windows_Download_Training_Model_Files.bat updated to prevent possible errors during download of models

13 May 2025 Update

Now on RunPod and Massed Compute our installer supports RTX 5000 series as well as older GPUs like RTX 3090, RTX 4090 etc
Upgraded to Torch 2.7 and CUDA 12.8
- I tested on RunPod and it runs at 3 seconds / it with RTX 5090 and 48GB_GPU_28200MB_6.3_second_it_Tier_1.json - 0.907 USD per hour with 100 GB
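A seconds per iteration figure like the one above can be turned into a rough time and cost estimate. The sketch below is only an illustration with hypothetical function names: the 2400 step count is borrowed from the 15 images, 160 epochs, Batch Size 1 case later in this post, and real runs add model loading and checkpoint saving time on top.

```python
def training_hours(steps, seconds_per_step):
    """Pure training time in hours, ignoring model loading and checkpoint saving."""
    return steps * seconds_per_step / 3600


def training_cost_usd(steps, seconds_per_step, hourly_rate_usd):
    """Rental cost for the pure training time at a fixed hourly rate."""
    return training_hours(steps, seconds_per_step) * hourly_rate_usd


steps = 2400          # example: 15 images x 160 epochs at Batch Size 1
seconds_per_step = 3  # RTX 5090 speed reported above
hourly_rate = 0.907   # USD per hour, RunPod price reported above

print(f"{training_hours(steps, seconds_per_step):.1f} hours")
print(f"{training_cost_usd(steps, seconds_per_step, hourly_rate):.2f} USD")
```

With these example numbers the estimate is about 2 hours and about 1.81 USD.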

4 May 2025 Update

First run the installer and then run Windows_RTX5000_Series_Upgrade_Run_After_Install_Finished.bat
- Now it uses official Torch 2.7, CUDA 12.8, and xFormers that I compiled myself
- This is required for all GPUs
Training models are uploaded to my self-hosted, XET enabled repo for even faster and more stable downloads : https://huggingface.co/MonsterMMORPG/Kohya_Train/tree/main
All configs are up-to-date with best settings
22 amazing special prompts for testing trainings of women added into the Test_Prompts folder

20 November 2024 Update

An important bug with Torch 2.5.1 was discovered, therefore a new .bat file was added
Use Windows_Downgrade_To_Torch_2.5.0.bat to downgrade to Torch 2.5.0 - this will speed up training hugely - the bug is Windows only
Currently, if you are doing a fresh installation, you only need to run the Windows_Install_Step_1.bat file and nothing else

17 November 2024 Update

Huge improvements arrived with newest block swapping feature of Kohya
Model downloaders are updated and made super fast compared to before on all platforms like Windows, RunPod and Massed Compute - up to 10 times faster
- On Massed Compute downloading all training models only took 1 minute (over 30 GB)
All configs are updated and please look at the DreamBooth_Tab_Fine_Tuning_Best_FLUX_Configs folder
Pick the config depending on your GPU, the quality you target and the speed you need
Please watch the tutorials listed above to fully learn how to use them
Update your Kohya to the latest version via Windows_Install_Torch_2_5_Dev_Huge_Speed_Up.bat, or better, reinstall Kohya with a fresh install

31 October 2024 Update

xFormers and Torch 2.5.1 fully officially published
Thus use Windows_Install_Torch_2_5_Dev_Huge_Speed_Up.bat file
Massed Compute and RunPod installers also updated for Torch 2.5.1 and xFormers 0.0.28.post3
All configs both Fine-Tuning / DreamBooth and LoRA updated to xFormers instead of SDPA
- I find that xFormers yields slightly better results
Recommended RunPod template changed to the one below
- RunPod Pytorch 2.2.0
  - runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04

14 October 2024 Update

Huge comparison images posted here : https://www.patreon.com/posts/113970485
Kohya installer files for Windows, RunPod and Massed Compute included in the zip file
Newest prompts added into latest zip file, make sure to download and set yolov9 face detector by following How_To_Download_Yolo_Face.txt which is inside Test_Prompts folder
Latest best Fine Tuning and LoRA training configs added to the zip file with respective folders
- DreamBooth / Fine Tuning : DreamBooth_Tab_Fine_Tuning_Best_FLUX_Configs folder
- LoRA : LoRA_Tab_LoRA_Training_Best_FLUX_Configs folder
- Never load LoRA into DreamBooth tab or Dreambooth into LoRA tab
- Don't use Fine Tuning tab, use DreamBooth tab in Kohya
All of the Fine Tuning / DreamBooth experiments have been completed and checkpoints listed here : https://huggingface.co/MonsterMMORPG/Best_FLUX_Fine_Tunings_Comparisons/tree/main
I have done the following trainings and compared them all
Click the links below and download from the opened page to see the full original sizes
Training used 15 images dataset : 15_Images_Dataset.png
Training used 256 images dataset : 256_Images_Dataset.png
- 15 Images Dataset, Batch Size 1 Fine Tuning Training : 15_imgs_BS_1_Realism_Epoch_Test.jpg , 15_imgs_BS_1_Style_Epoch_Test.jpg
- 15 Images Dataset, Batch Size 7 Fine Tuning Training : 15_imgs_BS_7_Realism_Epoch_Test.jpg , 15_imgs_BS_7_Style_Epoch_Test.jpg
- 256 Images Dataset, Batch Size 1 Fine Tuning Training : 256_imgs_BS_1_Realism_Epoch_Test.jpg , 256_imgs_BS_1_Stylized_Epoch_Test.jpg
- 256 Images Dataset, Batch Size 7 Fine Tuning Training : 256_imgs_BS_7_Realism_Epoch_Test.jpg , 256_imgs_BS_7_Style_Epoch_Test.jpg
- 15 Images Dataset, Batch Size 1 LoRA Training : 15_imgs_LORA_BS_1_Realism_Epoch_Test.jpg , 15_imgs_LORA_BS_1_Style_Epoch_Test.jpg
- 15 Images Dataset, Batch Size 7 LoRA Training : 15_imgs_LORA_BS_7_Realism_Epoch_Test.jpg , 15_imgs_LORA_BS_7_Style_Epoch_Test.jpg
- 256 Images Dataset, Batch Size 1 LoRA Training : 256_imgs_LORA_BS_1_Realism_Epoch_Test.jpg , 256_imgs_LORA_BS_1_Style_Epoch_Test.jpg
- 256 Images Dataset, Batch Size 7 LoRA Training : 256_imgs_LORA_BS_7_Realism_Epoch_Test.jpg , 256_imgs_LORA_BS_7_Style_Epoch_Test.jpg

Hopefully a new tutorial covering this research and Fine Tuning / DreamBooth is coming soon
Current tutorials are as below:
- Windows requirements CUDA, Python, cuDNN, and such : https://youtu.be/DrhUHnYfwC0
- How to use SwarmUI : https://youtu.be/HKX8_F1Er_w
- How to use FLUX on SwarmUI : https://youtu.be/bupRePUOA18
- How to use Kohya GUI for FLUX training : https://youtu.be/nySGu12Y05k
- How to use Kohya GUI for FLUX training on Cloud (RunPod and Massed Compute) : https://youtu.be/-uhL2nW7Ddw

Comparisons

Fine Tuning / DreamBooth 15 vs 256 images and Batch Size 1 vs 7 for Realism : Fine_Tuning_15_vs_256_imgs_BS1_vs_BS7.jpg
Fine Tuning / DreamBooth 15 vs 256 images and Batch Size 1 vs 7 for Style : 15_vs_256_imgs_BS1_vs_BS7_Fine_Tuning_Style_Comparison.jpg
LoRA Training 15 vs 256 images vs Batch Size 1 vs 7 for Realism : LoRA_15_vs_256_imgs_BS1_vs_BS7.jpg
LoRA Training 15 vs 256 images vs Batch Size 1 vs 7 for Style : 15_vs_256_imgs_BS1_vs_BS7_LoRA_Style_Comparison.jpg
Testing smiling expression for LoRA Trainings : LoRA_Expression_Test_Grid.jpg
Testing smiling expression for Fine Tuning / DreamBooth Trainings : Fine_Tuning_Expression_Test_Grid.jpg

Fine Tuning / DreamBooth vs LoRA Comparisons

15 Images Fine Tuning vs LoRA at Batch Size 1 : 15_imgs_BS1_LoRA_vs_Fine_Tuning.jpg
15 Images Fine Tuning vs LoRA at Batch Size 7 : 15_imgs_BS7_LoRA_vs_Fine_Tuning.jpg
256 Images Fine Tuning vs LoRA at Batch Size 1 : 256_imgs_BS1_LoRA_vs_Fine_Tuning.jpg
256 Images Fine Tuning vs LoRA at Batch Size 7 : 256_imgs_BS7_LoRA_vs_Fine_Tuning.jpg
15 vs 256 Images vs Batch Size 1 vs 7 vs LoRA vs Fine Tuning : 15_vs_256_imgs_BS1_vs_BS7_LoRA_vs_Fine_Tuning_Style_Comparison.jpg

Best Found Epochs and Step Counts and Durations

In the zip file check out the Different_Config_Training_Logs_Durations folder to see full training logs for 15 vs 256 images vs Batch size 1 vs 7 and Fine Tuning / DreamBooth vs LoRA - total 8 different trainings
All trainings are done on a single RTX A6000 GPU on Massed Compute, thus 31 cents per hour with SECourses coupon code
But of course you can train all Fine Tunings or LoRAs locally. For example, a single RTX 4090 is almost the same speed as an RTX A6000 for Fine Tuning (normally it is 2x faster, but VRAM optimization slows it down), while for LoRA training it is about 2x faster
The best checkpoints I have found are as below
- Fine Tuning / DreamBooth: 15 Training Images & Batch Size is 1 : Best Epoch is 160 = 15 x 160 = 2400 steps : Duration is 4 hours 18 minutes = around 1.5 USD cost
- Fine Tuning / DreamBooth: 15 Training Images & Batch Size is 7 : Best Epoch is 140 = ceil(15 / 7) x 140 = 3 x 140 = 420 steps : Duration is 2 hours 35 minutes = around 1 USD cost
- Fine Tuning / DreamBooth: 256 Training Images & Batch Size is 1 : Best Epoch is 70 = 256 x 70 = 17920 steps : Duration is 30 hours 57 minutes = around 10 USD cost
- Fine Tuning / DreamBooth: 256 Training Images & Batch Size is 7 : Best Epoch is 40 = ceil(256 / 7) x 40 = 37 x 40 = 1480 steps : Duration is 11 hours 56 minutes = around 4 USD cost
- LoRA : 15 Training Images & Batch Size is 1 : Best Epoch is 160 = 15 x 160 = 2400 steps : Duration is 5 hours 57 minutes = around 2 USD cost
- LoRA : 15 Training Images & Batch Size is 7* : Best Epoch is 140 = ceil(15 / 7) x 140 = 3 x 140 = 420 steps : Duration is 4 hours 53 minutes = around 1.5 USD cost
- LoRA : 256 Training Images & Batch Size is 1 : Best Epoch is 50 = 256 x 50 = 12800 steps : Duration is 31 hours 25 minutes = around 10 USD cost
- LoRA : 256 Training Images & Batch Size is 7* : Best Epoch is 50 = ceil(256 / 7) x 50 = 37 x 50 = 1850 steps : Duration is 29 hours 25 minutes = around 9 USD cost
- *For LoRA, Gradient Accumulation steps were used instead of Batch Size since batch size 7 was not fitting into 48 GB GPUs; the quality is exactly the same but there is no speed gain - thus for LoRAs, rent multiple GPUs instead and get an almost linear speed up
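The step counts and costs in the list above can be reproduced with a few lines. This is a sketch with hypothetical function names: it assumes the last partial batch of each epoch counts as a full step (the Batch Size 7 rows only match when rounding up), and it uses the 31 cents per hour RTX A6000 price mentioned above. With bucketing enabled the real number of batches per epoch can be higher, because each bucket is batched separately.

```python
import math

HOURLY_RATE_USD = 0.31  # RTX A6000 on Massed Compute with the coupon, per this post


def total_steps(num_images, epochs, batch_size):
    """Steps = batches per epoch (rounded up) x epochs."""
    return math.ceil(num_images / batch_size) * epochs


def rental_cost_usd(hours, minutes, hourly_rate=HOURLY_RATE_USD):
    """Cost of a logged training duration at a fixed hourly rate."""
    return (hours + minutes / 60) * hourly_rate


print(total_steps(15, 160, 1))   # 15 images, Batch Size 1
print(total_steps(15, 140, 7))   # 15 images, Batch Size 7
print(total_steps(256, 40, 7))   # 256 images, Batch Size 7
print(round(rental_cost_usd(4, 18), 2))
```

For example, 15 images at Batch Size 7 give 3 batches per epoch, so 140 epochs are 420 steps, and the 4 hours 18 minutes run costs about 1.33 USD, which the list rounds to around 1.5 USD.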

Conclusions

When the result grids are carefully analyzed, we see that Batch Size 7 yields slightly worse results than Batch Size 1 for realism in both Fine Tuning and LoRA training
However, when it comes to stylized images, Batch Size 7 yields slightly better results than Batch Size 1
Therefore, even when realism is desired, Batch Size 7 can be used for speed on RTX A6000 when doing Fine Tuning, or you can use multiple GPUs to get an almost linear speed up for LoRA training
Moreover, Fine Tuning quality is always better than LoRA, and especially in stylized outputs it is many times better than LoRA
Furthermore, the 256 images dataset always yields better results: more realism, more details, less overfit, better stylized images and of course better emotions and expressions
Even though the 15 images dataset has 0 emotions, Fine Tuning is able to generate some emotions but LoRA fails to do so
Finally, to obtain the best LoRA results you need a minimum 48 GB GPU to train in 16-bit (24 GB 8-bit training is also good enough), but for Fine Tuning a minimum 6 GB GPU is enough, and the quality on a 6 GB GPU is equal to that on a 48 GB GPU, only the speed differs
- So Fine Tuning is the way to go
- The only downside of Fine Tuning is that each checkpoint is 24 GB
- However, you can extract LoRA : https://www.patreon.com/posts/112335162
If you want speed, of course you can always do LoRA training, but results will be inferior

16 September 2024 Update

Today I have got way better results and configs
These configs are to be loaded into the DreamBooth tab of Kohya, not the LoRA tab
If you have loaded a config into the LoRA tab even once, discard it and get a new one from the attached zip file
The usage is exactly the same as in the tutorials below
Windows main tutorial - https://youtu.be/nySGu12Y05k
Cloud tutorial - https://youtu.be/-uhL2nW7Ddw
The saved checkpoints size will be exactly 23.8 GB - no way to reduce at the moment
I have selected 4e-06 as best learning rate at the moment but you can go with 5e-06 and 3e-06 as well
Multi GPU training not investigated yet
All ranks are currently equal, only the speed differs

Why Fine Tuning Is Better Than LoRA

Fine-tuning fully trains the model, not just certain layers
Also we are using a 12.5 times lower learning rate, so the model is able to learn more details with less overfitting
The way better 256 images dataset itself and its LoRA results are here :
- https://www.patreon.com/posts/trained-myself-112073170

What Is Not Useful

Additive timestep, Block_Wise_Fused_Optimizer and Sigma all yield bad results
FLUX_Shift is worse
Apply T5 Attention mask seems to bring very little improvement - so tiny
Uniform is inferior to the others