Fine-tuning an image generation model

Fine-tuning explained by retraining a GenAI diffusion model on a bunch of images of myself

# artificial-intelligence

Project overview

Technologies

Flux.1 - Open weights GenAI diffusion model
Fine-tuning

Tools

Replicate servers
HuggingFace Transformers

Key features

Fine-tuning a model with my own data

Contributors

Tom Holmes

A practical example of fine-tuning

Show the versatility and customisability of augmenting GenAI models with your own data
Give a visual example of what fine-tuning does to a model with explainers

Methodology

Prepping your data

The model needs structured data to be retrained on. For a diffusion model like Flux.1, this is a collection of several images, each with a caption describing the image itself, so the model can run inference on any prompt you give it after fine-tuning.
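As a rough sketch of what "images with captions" can look like on disk, the snippet below writes one caption text file next to each image and zips the lot. The sidecar `.txt` convention, the file names, the captions and the helper names `write_captions`, `bundle` and `demo` are all assumptions for illustration, not the exact format any particular trainer requires.

```python
import tempfile
import zipfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_captions(image_dir: Path, captions: dict[str, str], trigger_word: str) -> list[Path]:
    """Write one .txt caption per image, prefixing the trigger word if it is missing."""
    written = []
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        caption = captions.get(image.name)
        if caption is None:
            raise ValueError(f"no caption for {image.name}")
        # Simple substring check: good enough for a sketch.
        if trigger_word not in caption:
            caption = f"{trigger_word}, {caption}"
        caption_path = image.with_suffix(".txt")
        caption_path.write_text(caption, encoding="utf-8")
        written.append(caption_path)
    return written


def bundle(image_dir: Path, archive: Path) -> list[str]:
    """Zip every image and caption file and return the archived names."""
    with zipfile.ZipFile(archive, "w") as zf:
        for path in sorted(image_dir.iterdir()):
            if path.suffix.lower() in IMAGE_SUFFIXES | {".txt"}:
                zf.write(path, arcname=path.name)
        return zf.namelist()


def demo() -> list[str]:
    """Build a throwaway dataset from empty placeholder files."""
    with tempfile.TemporaryDirectory() as tmp:
        image_dir = Path(tmp) / "images"
        image_dir.mkdir()
        for name in ("tom_01.jpg", "tom_02.png"):
            (image_dir / name).write_bytes(b"")  # stand-ins for real photos
        captions = {  # hypothetical captions
            "tom_01.jpg": "a man in a grey hoodie smiling at the camera",
            "tom_02.png": "Tom sitting at a table outdoors",
        }
        write_captions(image_dir, captions, trigger_word="Tom")
        return bundle(image_dir, Path(tmp) / "training_data.zip")
```

Calling `demo()` returns the names inside the archive, so you can check that every image has a matching caption before uploading anything.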

There is also a "trigger word" that will cause the model to then reproduce what it has seen in the training data, in this case that word is my name: Tom.

I later discovered, though, that this is not the best choice of trigger word. You actually want a string of letters and numbers that does not exist in the English language, like THGH96 or HAZZA473, so it doesn't clash with anything the base model has already learned for that word.
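A minimal sketch of how you might pick such a trigger word: generate a random mix of letters and digits, and reject anything that matches a known word. The tiny `COMMON_WORDS` set and the helper names `is_distinct` and `make_trigger_word` are hypothetical; a real check would use a proper word list.

```python
import random
import string
from typing import Optional

# Tiny stand-in vocabulary; swap in a real word list for anything serious.
COMMON_WORDS = {"tom", "harry", "poker", "vegas"}


def is_distinct(candidate: str, vocabulary: set[str] = COMMON_WORDS) -> bool:
    """True if the candidate is not a known word and mixes letters with digits."""
    lowered = candidate.lower()
    has_letter = any(ch.isalpha() for ch in lowered)
    has_digit = any(ch.isdigit() for ch in lowered)
    return lowered not in vocabulary and has_letter and has_digit


def make_trigger_word(letters: int = 4, digits: int = 2,
                      rng: Optional[random.Random] = None) -> str:
    """Return random uppercase letters followed by random digits."""
    rng = rng or random.Random()
    head = "".join(rng.choices(string.ascii_uppercase, k=letters))
    tail = "".join(rng.choices(string.digits, k=digits))
    return head + tail
```

With this check, `is_distinct("Tom")` is false because it matches a known word, while `is_distinct("THGH96")` is true.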

For a text-to-text LLM like Llama or Phi, you would do pretty much the same thing with textual content, such as important docs containing static data like HR policy or accounting laws. I go into more detail in my insights article.
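For the text case, the training data often ends up as one JSON object per line (JSONL). The sketch below assumes a simple `prompt`/`completion` layout, which is a common convention but not the only one; the field names, the helper names and the placeholder text are all illustrative.

```python
import json
from pathlib import Path


def write_jsonl(pairs: list[tuple[str, str]], path: Path) -> int:
    """Write (question, answer) pairs as one JSON object per line; return the count."""
    with path.open("w", encoding="utf-8") as fh:
        for question, answer in pairs:
            record = {"prompt": question, "completion": answer}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(pairs)


def read_jsonl(path: Path) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Placeholder pairs; real ones would come from your own documents.
EXAMPLE_PAIRS = [
    ("<question about an HR policy>", "<answer taken from the policy document>"),
    ("<question about an accounting rule>", "<answer taken from the relevant rule>"),
]
```

Reading the file back with `read_jsonl` is a cheap sanity check that every line is valid JSON before you pay for a training run.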

Running the fine-tuning

I then rented some Nvidia GPUs in the cloud to train on my dataset. The process is fairly quick. I fine-tuned the mid-sized Flux.1 dev model, as this would hopefully lead to better results than the smaller Schnell model.

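This process creates a LoRA. LoRA stands for low-rank adaptation, a technique originally proposed for LLMs: rather than retraining every weight in the model, it trains a small set of extra low-rank weights that adjust the base model's behaviour. In practice, "a LoRA" has become shorthand for a model fine-tuned this way.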
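To make the "low-rank" part more concrete, here is a small pure-Python sketch. It assumes the usual LoRA formulation, where a frozen weight matrix W is adjusted by the product of two thin matrices B and A, and compares how many numbers need training in each case. The 4096 x 4096 layer size and rank 16 are illustrative, not Flux.1's actual dimensions, and `scale` stands in for the scaling factor real implementations apply.

```python
Matrix = list[list[float]]


def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a full d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of weights in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, fine for tiny examples."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w: Matrix, b: Matrix, a: Matrix, scale: float = 1.0) -> Matrix:
    """Return W + scale * (B @ A); W itself is left untouched."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Illustrative sizes only.
print(full_params(4096, 4096))      # 16777216 weights in the full matrix
print(lora_params(4096, 4096, 16))  # 131072 weights in the two thin matrices

# Tiny worked example: a rank-1 update to a 2 x 2 identity matrix.
base = [[1.0, 0.0], [0.0, 1.0]]
print(apply_lora(base, b=[[1.0], [2.0]], a=[[3.0, 4.0]]))  # [[4.0, 4.0], [6.0, 9.0]]
```

For these illustrative sizes, the adapter has 128 times fewer weights to train than the full matrix, which is a big part of why the training run is quick.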

Once the LoRA has been created, I can invoke the fine-tuned model.
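Invoking it might look something like the sketch below, which assumes the fine-tuned model is hosted on Replicate and called through the `replicate` Python client. The model reference is a hypothetical placeholder, the `prompt` input name is an assumption about the model's schema, and the client needs a `REPLICATE_API_TOKEN` in the environment.

```python
def build_prompt(trigger_word: str, scene: str) -> str:
    """Put the trigger word in the prompt so the fine-tuned subject is used."""
    return f"a photo of {trigger_word} {scene}"


def generate(model_ref: str, prompt: str):
    """Run the fine-tuned model on Replicate and return whatever it outputs."""
    # Imported here so the prompt helper works without the client installed.
    import replicate  # third-party: pip install replicate

    # The "prompt" key is an assumption; check the model's input schema.
    return replicate.run(model_ref, input={"prompt": prompt})


# Hypothetical reference; substitute your own "owner/model:version".
MODEL_REF = "your-username/your-flux-lora:version-id"

prompt = build_prompt("Tom", "playing poker in Las Vegas")
# output = generate(MODEL_REF, prompt)  # uncomment once the token and model exist
```

The call is left commented out because it needs network access and a real model reference.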

The result: Me and my doppelgängers playing poker in Las Vegas

I’d trained the model mainly on pictures of me, but some were of my pals too, so this was meant to be me and my mates. Instead, the model seemed to generate all four of us as me (the two on the right look most like me IRL)… This is likely a side effect of the fine-tuning method I applied: I'd used a quick out-of-the-box solution for this LAB post.

As mentioned, this technique can also be applied to more "serious" use cases: you could fine-tune an SLM on your internal processes, or a diffusion model like this one to quickly test out new assets in your brand's style. See the insights post I wrote for more detail.

More valuable uses for fine-tuning

Internal chatbot

As an organisation scales and its processes become more complex, it's harder for an individual employee to keep track of all the moving parts, especially if they find themselves dealing with a department they don't usually work with.

An internal chatbot fine-tuned on company processes could be incredibly valuable here, and a massive time saver. All an employee would need to do is ask in plain English for the information they want, and the bot returns it instantly. This saves them from pinging the relevant people, using up multiple people's time, and context switching.

Design asset generator

Similar to this experiment, you could fine-tune a diffusion model on your brand's design system, so whenever you wanted to quickly mock up a concept before approaching a designer, you could just prompt the model.

This gives you a quick indication of what could work and what doesn't, and helps you communicate what you want to a designer more effectively, as you can instantly see what your description produces.

If it's not what you want, you can adjust the prompt until it is, then take that to the designer. If it is what you want, you can just show the designer. Either way, it's far more effective than just using words.

Contributor notes

I am unfortunately not a famous person in real life (at least at the time of writing!), so the base model obviously has no idea who I am or what I look like. Fine-tuning it on captioned images of me adjusts the model weights so that suddenly it does know what “Tom” looks like.

Hopefully this gives a clear picture (pun not intended) of what fine-tuning is and does to a base model, as well as how this can be applied to other use cases that are more business-orientated as detailed above.

Tom Holmes, Software Engineer