Many Python beginners debug their programs with print statements or logging, but that is only convenient for small programs. A better way to debug is to inspect variables and methods while the program is running.
PyCharm's debug mode is also worth a look; it is powerful and meets most everyday needs, so we will not dwell on it here. Instead, this section introduces a flexible interactive debugging tool built on pdb that is better suited to PyTorch.
pdb is an interactive debugger that ships with the Python standard library. It lets you jump to any breakpoint in your Python code, inspect any variable, step through the code, and even modify variable values, all without restarting the program.
ipdb is an enhanced pdb: it adds tab completion in debug mode, better syntax highlighting and tracebacks, and richer introspection. Most importantly, its interface is fully compatible with pdb. It can be installed with pip install ipdb.
Let's start with an example. To use ipdb, simply insert ipdb.set_trace() wherever you want to debug; when execution reaches that point, the program automatically drops into interactive debug mode.
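Here is a minimal, self-contained sketch of where the call goes (the buggy_sum function is purely illustrative; in the session below, set_trace() sits inside the train() loop of main.py):

import ipdb

def buggy_sum(items):
    total = 0
    for x in items:
        ipdb.set_trace()  # execution pauses here and an ipdb> prompt appears
        total += x
    return total

buggy_sum([1, 2, 3])

Instrumenting a training script this way produces a session like the following: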
99it [00:17, 6.07it/s]loss: 0.22854854568839075
119it [00:21, 5.79it/s]loss: 0.21267264398435753
139it [00:24, 5.99it/s]loss: 0.19839374726372108
> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py(80)train()
79 loss_meter.reset()
---> 80 confusion_matrix.reset()
81 for ii, (data, label) in tqdm(enumerate(train_dataloader)):
ipdb> break 88 # set a breakpoint at line 88; the program will enter debug mode when it reaches it
Breakpoint 1 at e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py:88
ipdb> # print the standard deviation of every parameter and of its gradient
ipdb> for (name, p) in model.named_parameters(): \
          print(name, p.data.std(), p.grad.data.std())
model.features.0.weight tensor(0.2615, device='cuda:0') tensor(0.3769, device='cuda:0')
model.features.0.bias tensor(0.4862, device='cuda:0') tensor(0.3368, device='cuda:0')
model.features.3.squeeze.weight tensor(0.2738, device='cuda:0') tensor(0.3023, device='cuda:0')
model.features.3.squeeze.bias tensor(0.5867, device='cuda:0') tensor(0.3753, device='cuda:0')
model.features.3.expand1x1.weight tensor(0.2168, device='cuda:0') tensor(0.2883, device='cuda:0')
model.features.3.expand1x1.bias tensor(0.2256, device='cuda:0') tensor(0.1147, device='cuda:0')
model.features.3.expand3x3.weight tensor(0.0935, device='cuda:0') tensor(0.1605, device='cuda:0')
model.features.3.expand3x3.bias tensor(0.1421, device='cuda:0') tensor(0.0583, device='cuda:0')
model.features.4.squeeze.weight tensor(0.1976, device='cuda:0') tensor(0.2137, device='cuda:0')
model.features.4.squeeze.bias tensor(0.4058, device='cuda:0') tensor(0.1798, device='cuda:0')
model.features.4.expand1x1.weight tensor(0.2144, device='cuda:0') tensor(0.4214, device='cuda:0')
model.features.4.expand1x1.bias tensor(0.4994, device='cuda:0') tensor(0.0958, device='cuda:0')
model.features.4.expand3x3.weight tensor(0.1063, device='cuda:0') tensor(0.2963, device='cuda:0')
model.features.4.expand3x3.bias tensor(0.0489, device='cuda:0') tensor(0.0719, device='cuda:0')
model.features.6.squeeze.weight tensor(0.1736, device='cuda:0') tensor(0.3544, device='cuda:0')
model.features.6.squeeze.bias tensor(0.2420, device='cuda:0') tensor(0.0896, device='cuda:0')
model.features.6.expand1x1.weight tensor(0.1211, device='cuda:0') tensor(0.2428, device='cuda:0')
model.features.6.expand1x1.bias tensor(0.0670, device='cuda:0') tensor(0.0162, device='cuda:0')
model.features.6.expand3x3.weight tensor(0.0593, device='cuda:0') tensor(0.1917, device='cuda:0')
model.features.6.expand3x3.bias tensor(0.0227, device='cuda:0') tensor(0.0160, device='cuda:0')
model.features.7.squeeze.weight tensor(0.1207, device='cuda:0') tensor(0.2179, device='cuda:0')
model.features.7.squeeze.bias tensor(0.1484, device='cuda:0') tensor(0.0381, device='cuda:0')
model.features.7.expand1x1.weight tensor(0.1235, device='cuda:0') tensor(0.2279, device='cuda:0')
model.features.7.expand1x1.bias tensor(0.0450, device='cuda:0') tensor(0.0100, device='cuda:0')
model.features.7.expand3x3.weight tensor(0.0609, device='cuda:0') tensor(0.1628, device='cuda:0')
model.features.7.expand3x3.bias tensor(0.0132, device='cuda:0') tensor(0.0079, device='cuda:0')
model.features.9.squeeze.weight tensor(0.1093, device='cuda:0') tensor(0.2459, device='cuda:0')
model.features.9.squeeze.bias tensor(0.0646, device='cuda:0') tensor(0.0135, device='cuda:0')
model.features.9.expand1x1.weight tensor(0.0840, device='cuda:0') tensor(0.1860, device='cuda:0')
model.features.9.expand1x1.bias tensor(0.0177, device='cuda:0') tensor(0.0033, device='cuda:0')
model.features.9.expand3x3.weight tensor(0.0476, device='cuda:0') tensor(0.1393, device='cuda:0')
model.features.9.expand3x3.bias tensor(0.0058, device='cuda:0') tensor(0.0030, device='cuda:0')
model.features.10.squeeze.weight tensor(0.0872, device='cuda:0') tensor(0.1676, device='cuda:0')
model.features.10.squeeze.bias tensor(0.0484, device='cuda:0') tensor(0.0088, device='cuda:0')
model.features.10.expand1x1.weight tensor(0.0859, device='cuda:0') tensor(0.2145, device='cuda:0')
model.features.10.expand1x1.bias tensor(0.0160, device='cuda:0') tensor(0.0025, device='cuda:0')
model.features.10.expand3x3.weight tensor(0.0456, device='cuda:0') tensor(0.1429, device='cuda:0')
model.features.10.expand3x3.bias tensor(0.0070, device='cuda:0') tensor(0.0021, device='cuda:0')
model.features.11.squeeze.weight tensor(0.0786, device='cuda:0') tensor(0.2003, device='cuda:0')
model.features.11.squeeze.bias tensor(0.0422, device='cuda:0') tensor(0.0069, device='cuda:0')
model.features.11.expand1x1.weight tensor(0.0690, device='cuda:0') tensor(0.1400, device='cuda:0')
model.features.11.expand1x1.bias tensor(0.0138, device='cuda:0') tensor(0.0022, device='cuda:0')
model.features.11.expand3x3.weight tensor(0.0366, device='cuda:0') tensor(0.1517, device='cuda:0')
model.features.11.expand3x3.bias tensor(0.0109, device='cuda:0') tensor(0.0023, device='cuda:0')
model.features.12.squeeze.weight tensor(0.0729, device='cuda:0') tensor(0.1736, device='cuda:0')
model.features.12.squeeze.bias tensor(0.0814, device='cuda:0') tensor(0.0084, device='cuda:0')
model.features.12.expand1x1.weight tensor(0.0977, device='cuda:0') tensor(0.1385, device='cuda:0')
model.features.12.expand1x1.bias tensor(0.0102, device='cuda:0') tensor(0.0032, device='cuda:0')
model.features.12.expand3x3.weight tensor(0.0365, device='cuda:0') tensor(0.1312, device='cuda:0')
model.features.12.expand3x3.bias tensor(0.0038, device='cuda:0') tensor(0.0026, device='cuda:0')
model.classifier.1.weight tensor(0.0285, device='cuda:0') tensor(0.0865, device='cuda:0')
model.classifier.1.bias tensor(0.0362, device='cuda:0') tensor(0.0192, device='cuda:0')
ipdb> opt.lr # inspect the current learning rate
0.001
ipdb> opt.lr = 0.002 # change the learning rate
ipdb> for p in optimizer.param_groups: \
p['lr'] = opt.lr
ipdb> model.save() # save the model
'checkpoints/squeezenet_20191004212249.pth'
ipdb> c # continue running until the breakpoint at line 88
222it [16:38, 35.62s/it]> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py(88)train()
87 optimizer.zero_grad()
1--> 88 score = model(input)
89 loss = criterion(score, target)
ipdb> s # step into model(input), i.e. model.__call__(input)
--Call--
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(537)__call__()
536
--> 537 def __call__(self, *input, **kwargs):
538 for hook in self._forward_pre_hooks.values():
ipdb> n # next line
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(538)__call__()
537 def __call__(self, *input, **kwargs):
--> 538 for hook in self._forward_pre_hooks.values():
539 result = hook(self, input)
ipdb> n # next line
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(544)__call__()
543 input = result
--> 544 if torch._C._get_tracing_state():
545 result = self._slow_forward(*input, **kwargs)
ipdb> n # next line
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(547)__call__()
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
ipdb> s # step into the body of the forward function
--Call--
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\loss.py(914)forward()
913
--> 914 def forward(self, input, target):
915 return F.cross_entropy(input, target, weight=self.weight,
ipdb> input # inspect the value of input
tensor([[4.5005, 2.0725],
[3.5933, 7.8643],
[2.9086, 3.4209],
[2.7740, 4.4332],
[6.0164, 2.3033],
[5.2261, 3.2189],
[2.6529, 2.0749],
[6.3259, 2.2383],
[3.0629, 3.4832],
[2.7008, 8.2818],
[5.5684, 2.1567],
[3.0689, 6.1022],
[3.4848, 5.3831],
[1.7920, 5.7709],
[6.5032, 2.8080],
[2.3071, 5.2417],
[3.7474, 5.0263],
[4.3682, 3.6707],
[2.2196, 6.9298],
[5.2201, 2.3034],
[6.4315, 1.4970],
[3.4684, 4.0371],
[3.9620, 1.7629],
[1.7069, 7.8898],
[3.0462, 1.6505],
[2.4081, 6.4456],
[2.1932, 7.4614],
[2.3405, 2.7603],
[1.9478, 8.4156],
[2.7935, 7.8331],
[1.8898, 3.8836],
[3.3008, 1.6832]], device='cuda:0', grad_fn=&lt;AsStridedBackward&gt;)
ipdb> input.data.mean() # inspect the mean and standard deviation of input
tensor(3.9630, device='cuda:0')
ipdb> input.data.std()
tensor(1.9513, device='cuda:0')
ipdb> u # move up one frame in the call stack
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(547)__call__()
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
ipdb> u # move up one frame in the call stack
> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py(88)train()
87 optimizer.zero_grad()
1--> 88 score = model(input)
89 loss = criterion(score, target)
ipdb> clear # clear all breakpoints
Clear all breaks? y
Deleted breakpoint 1 at e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py:88
ipdb> c # continue; delete "debug/debug.txt" first, otherwise the program will soon drop back into debug mode
59it [06:21, 5.75it/s]loss: 0.24856307208538073
76it [06:24, 5.91it/s]
When we want to enter debug mode to tweak some parameter values or analyze the program, we simply create the debug flag file; the program then drops into debug mode, and once we are done we delete the file and type c at the ipdb prompt to resume execution. The same mechanism can be used to exit the program: create the debug flag file, then type quit to leave the debugger and terminate the program at the same time. Quitting this way is safer than pressing Ctrl + C, because it guarantees that the multi-process data loaders also exit correctly and release their memory, GPU memory, and other resources.
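A minimal sketch of this flag-file mechanism (the file path matches the transcript above; the train_loop function is only illustrative):

import os
import ipdb

DEBUG_FILE = 'debug/debug.txt'  # creating this file drops the loop into ipdb

def train_loop(dataloader, max_epoch=1):
    for epoch in range(max_epoch):
        for ii, batch in enumerate(dataloader):
            # ... forward pass, backward pass and optimizer step go here ...
            if os.path.exists(DEBUG_FILE):
                # while the flag file exists, every iteration pauses here;
                # delete the file and type c at the ipdb> prompt to resume
                ipdb.set_trace()

train_loop([1, 2, 3])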
When PyTorch calls cuDNN and it fails, the error messages (such as CUDNN_STATUS_BAD_PARAM) are of little help. In that case, run the code on the CPU first: you will usually get a much friendlier error message. For example, executing model.cpu()(input.cpu()) in ipdb makes PyTorch's underlying TH library report relatively detailed information.
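For instance, at the ipdb> prompt where the cuDNN error occurred, the failing call can be replayed on the CPU (a sketch reusing the model and input names from the transcript above):

# replay the failing forward pass on the CPU; the CPU kernels usually
# raise a far more descriptive error than CUDNN_STATUS_BAD_PARAM
model.cpu()
score = model(input.cpu())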
You may also often encounter programs that run without any error yet whose model fails to converge. For a binary classification problem, for example, the cross-entropy loss may hover around 0.69 (ln 2), or values may overflow. In such cases, enter debug mode and step through the code, checking the mean and standard deviation of each layer's output to find the layer where the values first become abnormal; also inspect the mean and standard deviation of each parameter's gradient to see whether gradients are vanishing or exploding. As a rule of thumb, adding a BatchNorm layer before the activation function, initializing the parameters sensibly, and using the Adam optimizer with a learning rate of 0.001 is usually enough to get the model to converge to some extent.
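Instead of stepping through every layer by hand, the per-layer check can be automated with forward hooks; below is a minimal sketch (the small Sequential model merely stands in for your own network):

import torch
import torch.nn as nn

def add_stat_hooks(model):
    # attach a forward hook to every leaf module that prints the mean and
    # std of its output, to spot where the values first become abnormal
    def hook(module, inputs, output):
        if torch.is_tensor(output):
            print(f'{module.__class__.__name__}: '
                  f'mean={output.mean().item():.4f} std={output.std().item():.4f}')
    return [m.register_forward_hook(hook)
            for m in model.modules() if len(list(m.children())) == 0]

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
handles = add_stat_hooks(model)

out = model(torch.randn(4, 8))
out.sum().backward()

# per-parameter gradient statistics, to spot vanishing or exploding gradients
for name, p in model.named_parameters():
    print(name, p.grad.mean().item(), p.grad.std().item())

for h in handles:
    h.remove()  # detach the hooks when done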
In this chapter we implemented a classic Kaggle competition from scratch, focusing on how to organize a program sensibly, and introduced some debugging techniques for PyTorch. The next chapter begins the hands-on programming journey in earnest; some details will no longer be covered in this much depth, so be prepared.