2022-06-11 (Sat)
_ PyTorch on ROCm
I had been building everything from source, step by step, but PyTorch wouldn't build, so I gave up.
Let's just use the container image instead.
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/pytorch
Then, inside the container:
git clone https://github.com/pytorch/examples.git
cd examples/mnist
pip3 install -r requirements.txt
HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py
root@luna:/var/lib/jenkins/examples/mnist# HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py
/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py:138: UserWarning: An output with one or more elements was resized since it had shape [50176], which does not match the required output shape [64, 1, 28, 28].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/native/Resize.cpp:17.)
return torch.stack(batch, 0, out=out)
Train Epoch: 1 [0/60000 (0%)] Loss: 2.306316
Train Epoch: 1 [640/60000 (1%)] Loss: 1.604445
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.955038
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.632662
Train Epoch: 1 [2560/60000 (4%)] Loss: 0.476444
Train Epoch: 1 [3200/60000 (5%)] Loss: 0.513411
Train Epoch: 1 [3840/60000 (6%)] Loss: 0.262893
Oh, it's working, it's working. Awesome.
...or so I thought, but then:
/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py:138: UserWarning: An output with one or more elements was resized since it had shape [25088], which does not match the required output shape [32, 1, 28, 28].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/native/Resize.cpp:17.)
return torch.stack(batch, 0, out=out)
/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py:138: UserWarning: An output with one or more elements was resized since it had shape [784000], which does not match the required output shape [1000, 1, 28, 28].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /var/lib/jenkins/pytorch/aten/src/ATen/native/Resize.cpp:17.)
return torch.stack(batch, 0, out=out)
Traceback (most recent call last):
File "main.py", line 137, in <module>
main()
File "main.py", line 129, in main
test(model, device, test_loader)
File "main.py", line 61, in test
output = model(data)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "main.py", line 24, in forward
x = self.conv2(x)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 447, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 444, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: HIP out of memory. Tried to allocate 142.00 MiB (GPU 0; 512.00 MiB total capacity; 103.83 MiB already allocated; 258.00 MiB free; 126.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
root@luna:/var/lib/jenkins/examples/mnist#
Looks like it's a GPU out-of-memory error. I guess "GPU 0" is the GPU's ID.
With only 512.00 MiB, no wonder it can't cope 😔
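Before changing anything, the numbers in that message can be cross-checked from PyTorch itself. A minimal sketch, assuming the same container (ROCm builds of PyTorch drive the HIP device through the torch.cuda API):

import torch

# On ROCm builds the HIP device shows up under the torch.cuda namespace.
props = torch.cuda.get_device_properties(0)
print(props.name, round(props.total_memory / 2**20), "MiB total")
print("allocated:", round(torch.cuda.memory_allocated(0) / 2**20, 2), "MiB")
print("reserved: ", round(torch.cuda.memory_reserved(0) / 2**20, 2), "MiB")

The error message also points at PYTORCH_HIP_ALLOC_CONF / max_split_size_mb for fragmentation, but with only 512 MiB in total, a smaller batch looks like the simpler fix.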
Oh! Adding --test-batch-size=100 made it work!!
(The example's default test batch size is 1000, which is where the [784000] = 1000 x 1 x 28 x 28 in the warning above comes from, so smaller test batches fit within 512 MiB.)
Computing the test-set accuracy still takes a while, though.
OK, this looks workable.
time HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py --test-batch-size=100
time HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py --test-batch-size=100 --no-cuda
↑ Measured the times with these.
The former (the GPU run):
Test set: Average loss: 0.0263, Accuracy: 9925/10000 (99%)
real 16m32.497s
user 21m0.749s
sys 0m21.085s
The latter (the --no-cuda run):
Test set: Average loss: 0.0252, Accuracy: 9918/10000 (99%)
real 13m43.941s
user 100m27.014s
sys 3m51.267s
Ugh. The GPU loses lol
Probably because the time goes into computing accuracy on the test set... during training the GPU is clearly faster.
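If I wanted to confirm that, one rough way is to time the two phases separately in the example's epoch loop. A sketch only, assuming the train()/test() functions and the main() loop of examples/mnist/main.py as they were when I ran this:

import time

for epoch in range(1, args.epochs + 1):
    t0 = time.perf_counter()
    train(args, model, device, train_loader, optimizer, epoch)
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    t1 = time.perf_counter()
    test(model, device, test_loader)
    if use_cuda:
        torch.cuda.synchronize()
    t2 = time.perf_counter()
    scheduler.step()
    print(f"epoch {epoch}: train {t1 - t0:.1f}s  test {t2 - t1:.1f}s")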
↓ Output of rocm-smi:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
ERROR: GPU[0] : sclk clock is unsupported
================================================================================
ERROR: 2 GPU[0]:RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 63.0c 0.002W None 800Mhz 0% auto Unsupported 67% 89%
================================================================================
============================= End of ROCm SMI Log ==============================
Yep, it's definitely being used.
luna:~ % sudo journalctl -b |grep ENVY
6月 11 20:08:42 luna kernel: DMI: HP HP ENVY x360 Convertible 13-ay0xxx/876E, BIOS F.20 07/30/2021
luna:~ %
root@luna:/var/lib/jenkins# rocminfo | grep gfx
Name: gfx90c
Name: amdgcn-amd-amdhsa--gfx90c:xnack-
root@luna:/var/lib/jenkins#
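So the chip reports itself as gfx90c, the APU's integrated Vega graphics, which is not on the officially supported list; as far as I understand, HSA_OVERRIDE_GFX_VERSION=9.0.0 makes the ROCm runtime treat it as gfx900 (Vega), which is why the override is needed in the commands above. A quick sketch for checking that PyTorch actually picks the device up (run it with the same override set):

import torch

print(torch.__version__)              # PyTorch build version
print(torch.version.hip)              # HIP runtime version; None on CUDA builds
print(torch.cuda.is_available())      # True if the (overridden) GPU was detected
print(torch.cuda.get_device_name(0))  # the device name PyTorch sees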
Being able to provide a trial environment like this as a container image is really nice. A belated realization, I know.
I uninstalled everything (probably) that I had painstakingly installed by hand.
_ YouTube Premium
Picture-in-Picture is really great. Wonderful.
It does get in the way sometimes, but you can move it around.
Background playback (or rather, playback even with the screen off) is also nice.