Thanks for reading my article.
The purpose of this article is to show how to fine-tune an LLM on any custom dataset. I only showed one way to measure how much the fine-tuned LLM improves, and I measured it on some sample datasets, not on the entire datasets. The improvement metrics are not the focus of this article.
One possible explanation is that, in your case, the sample datasets you selected differ from the ones used in this article.
Also, check the output of the fine-tuned model for a few sample inputs. During my experiments, I observed that the output format of the fine-tuned LLM is not consistent, so you might need to adjust the parsing logic accordingly.
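
To make the parsing point concrete, here is a minimal sketch of a more tolerant parser. It is not the parsing code from the article. It assumes the task is classification with a known set of labels, and the function name extract_label and the example labels are hypothetical. It looks for the earliest allowed label anywhere in the raw model output instead of expecting an exact string.

```python
import re
from typing import Optional, Sequence


def extract_label(raw_output: str, allowed_labels: Sequence[str]) -> Optional[str]:
    """Return the allowed label that appears earliest in raw_output, or None.

    Hypothetical helper: assumes the model is supposed to answer with one
    label from allowed_labels, possibly wrapped in extra text or tokens.
    """
    text = raw_output.strip().lower()
    best = None  # (position, label) of the earliest match found so far
    for label in allowed_labels:
        # Match the label as a whole word, ignoring case.
        match = re.search(r"\b" + re.escape(label.lower()) + r"\b", text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), label)
    return best[1] if best else None


# Example labels are assumptions for illustration only.
labels = ["positive", "negative", "neutral"]
print(extract_label("Sentiment: Positive.", labels))
print(extract_label("### Response:\nnegative</s>", labels))
print(extract_label("I am not sure.", labels))
```

Outputs that return None are worth inspecting by hand, since they show which formats the parser does not cover yet.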