Some ideas for practice and self-study:
Work through this thread on PyTorch optimisation by Nouamane Tazi (currently an ML Research Intern at Hugging Face), which ends with links to PyTorch's performance tuning guide and NVIDIA's Best Practices guide, then take an example network and optimise it!
Follow Andrej Karpathy's video lectures on micrograd and makemore, then dig into, reproduce, or make a variation on a particular part.
Compare an earlier and a later version of an architecture, e.g. LayoutLM vs. LayoutLMv3; the differences could indicate a path forward for other architectures such as XDoc.
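
As a possible starting point for the micrograd idea, here is a minimal sketch of a scalar reverse-mode autograd engine in plain Python. It is not Karpathy's code: the class layout, the `_parents` attribute and the single-neuron example are illustrative assumptions, and only addition, multiplication and `tanh` are covered. A natural "variation" exercise would be to add another operation (say, `exp` or subtraction) and check its gradient by hand.

```python
import math


class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # illustrative name, an assumption
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d tanh(a)/da = 1 - tanh(a)^2
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# One neuron: y = tanh(w * x + b)
x, w, b = Value(2.0), Value(-0.5), Value(0.25)
y = (w * x + b).tanh()
y.backward()
```

After `y.backward()`, `w.grad`, `x.grad` and `b.grad` hold the partial derivatives of `y`, which you can compare against the analytic values `(1 - y**2) * x`, `(1 - y**2) * w` and `1 - y**2`.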