PDNS-Recursor源码分析之dns server的选择原理
Contact me
或者用邮件交流 jacky.wucheng@foxmail.com
author: 燃烧 jacky.wucheng@gmail.com
本代码分析基于pdns-recursor-3.5.3,编写于2014-1-5
如何挑选最快的Authoritative DNS Server?
```
SyncRes::doResolve
↘︎
SyncRes::doResolveAt
↘︎
doCNAMECacheCheck
doCacheCheck
shuffleInSpeedOrder
↘︎
random_shuffle(rnameservers.begin(),rnameservers.end(), dns_random);
speedOrder so(speeds);
stable_sort(rnameservers.begin(),rnameservers.end(), so);
```
解释
在shuffleInSpeedOrder中,利用stable_sort函数对authoritative dns server按照so(speeds) 的规则进行排序。so(speeds)实现的就是根据dns查询速度进行比较。
代码见
```
syncres.cc
758:
struct speedOrder
{
speedOrder(map<string,double> &speeds) : d_speeds(speeds) {}
bool operator()(const string &a, const string &b) const
{
return d_speeds[a] < d_speeds[b];
}
map<string, double>& d_speeds;
};
```
```
syncres.cc
769:
inline vector<string> SyncRes::shuffleInSpeedOrder(set<string, CIStringCompare> &tnameservers, const string &prefix)
{
vector<string> rnameservers;
rnameservers.reserve(tnameservers.size());
BOOST_FOREACH(const string& str, tnameservers) {
rnameservers.push_back(str);
}
map<string, double> speeds;
BOOST_FOREACH(const string& val, rnameservers) {
double speed;
speed=t_sstorage->nsSpeeds[val].get(&d_now);
speeds[val]=speed;
}
random_shuffle(rnameservers.begin(),rnameservers.end(), dns_random);
speedOrder so(speeds);
stable_sort(rnameservers.begin(),rnameservers.end(), so);
if(doLog()) {
LOG(prefix<<"Nameservers: ");
for(vector<string>::const_iterator i=rnameservers.begin();i!=rnameservers.end();++i) {
if(i!=rnameservers.begin()) {
LOG(", ");
if(!((i-rnameservers.begin())%3)) {
LOG(endl<<prefix<<" ");
}
}
LOG(*i<<"(" << (boost::format("%0.2f") % (speeds[*i]/1000.0)).str() <<"ms)");
}
LOG(endl);
}
return rnameservers;
}
```
SyncRes::doResolveAt中如下代码就按照排好序的rnameservers进行查询。
```
syncres.cc
838:
vector<string > rnameservers = shuffleInSpeedOrder(nameservers, doLog() ? (prefix+qname+": ") : string() );
for(vector<string >::const_iterator tns=rnameservers.begin();;++tns) {
...
}
```
查询结束后将本次的查询时间更新到全局速度表中
```
syncres.cc
1001: t_sstorage->nsSpeeds[*tns].submit(*remoteIP, lwr.d_usec, &d_now);
```
如果Authoritative DNS Server查询失败,怎么处理?
pdns recursor中有一个很重要的数据结构,记录了一些全局信息。
```
syncres.hh
365:
struct StaticStorage {
negcache_t negcache;
nsspeeds_t nsSpeeds; # 速度列表
ednsstatus_t ednsstatus;
throttle_t throttle; # 请求控制
domainmap_t* domainmap;
};
```
在错误处理方面,有一个重要的技术就是 throttle
,作用是来控制发往某些特定的 authoritative dns server 的请求,也即错误处理,或者叫请求控制。
每次开始dns查询的时候都会进行一次判断, 目标authoritative dns server是否是throttle的,如果是直接continue到下一个authoritative dns server进行查询,按照pdns官方的说法,这样可以节省时间和流量。代码如下:
```
if(t_sstorage->throttle.shouldThrottle(d_now.tv_sec, make_tuple(*remoteIP, qname,qtype.getCode()))) {
LOG(prefix<<qname<<": query throttled "<<endl);
s_throttledqueries++; d_throttledqueries++;
continue;
}
```
关于throttle的解释可以参考 what-s-in-the-throttle-map1 , pdns recursor design2。
pdns recursor错误处理的逻辑都在如下代码中。
从中可知,resolveret=1
时表示query成功,其他情况均为失败,失败后都能continue到下一个authoritative dns server进行查询,同时利用t_sstorage->throttle.throttle()
方法更新throttle中的信息。
```
syncres.cc
926:
s_outqueries++; d_outqueries++;
TryTCP:
if(doTCP) {
LOG(prefix<<qname<<": using TCP with "<< remoteIP->toStringWithPort() <<endl);
s_tcpoutqueries++; d_tcpoutqueries++;
}
resolveret=asyncresolveWrapper(*remoteIP, qname, qtype.getCode(),
doTCP, sendRDQuery, &d_now, &lwr); // <- we go out on the wire!
if(resolveret != 1) {
if(resolveret==0) {
LOG(prefix<<qname<<": timeout resolving "<< (doTCP ? "over TCP" : "")<<endl);
d_timeouts++;
s_outgoingtimeouts++;
}
else if(resolveret==-2) {
LOG(prefix<<qname<<": hit a local resource limit resolving"<< (doTCP ? " over TCP" : "")<<", probable error: "<<stringerror()<<endl);
g_stats.resourceLimits++;
}
else {
s_unreachables++; d_unreachables++;
LOG(prefix<<qname<<": error resolving"<< (doTCP ? " over TCP" : "") <<", possible error: "<<strerror(errno)<< endl);
}
//! returns -2 for OS limits error, -1 for permanent error that has to do with remote **transport**, 0 for timeout, 1 for success
if(resolveret!=-2) { // don't account for resource limits, they are our own fault
{
LOG(prefix<< "<<<2 " << remoteIP->toString());
LOG(prefix<< " <<3 " << *tns);
t_sstorage->nsSpeeds[*tns].submit(*remoteIP, 1000000, &d_now); // 1 sec
}
LOG(prefix<< "<<< " << remoteIP->toString());
if(resolveret==-1)
t_sstorage->throttle.throttle(d_now.tv_sec, make_tuple(*remoteIP, qname, qtype.getCode()), 60, 100); // unreachable, 1 minute or 100 queries
else
t_sstorage->throttle.throttle(d_now.tv_sec, make_tuple(*remoteIP, qname, qtype.getCode()), 10, 5); // timeout
}
continue;
}
if(lwr.d_rcode==RCode::ServFail || lwr.d_rcode==RCode::Refused) {
LOG(prefix<<qname<<": "<<*tns<<" returned a "<< (lwr.d_rcode==RCode::ServFail ? "ServFail" : "Refused") << ", trying sibling IP or NS"<<endl);
t_sstorage->throttle.throttle(d_now.tv_sec,make_tuple(*remoteIP, qname, qtype.getCode()),60,3); // servfail or refused
continue;
}
break; // this IP address worked!
wasLame:; // well, it didn't
LOG(prefix<<qname<<": status=NS "<<*tns<<" ("<< remoteIP->toString() <<") is lame for '"<<auth<<"', trying sibling IP or NS"<<endl);
t_sstorage->throttle.throttle(d_now.tv_sec, make_tuple(*remoteIP, qname, qtype.getCode()), 60, 100); // lame
}
```
throttle的类定义如下, 每次shouldThrottle方法被调用都会清理掉5分钟之前的记录信息,然后查询make_tuple(remoteIP, qname, qtype.getCode()))这个key是否存在,存在则不向此dns server发送query请求, continue到下一个,如果不存在,则进行查询流程。如果查询失败,则会将此make_tuple(remoteIP, qname, qtype.getCode()))信息新加到throttle中去,
```
syncres.hh
39:
template<class Thing> class Throttle : public boost::noncopyable
{
public:
Throttle()
{
d_limit=3;
d_ttl=60;
d_last_clean=time(0);
}
bool shouldThrottle(time_t now, const Thing& t)
{
if(now > d_last_clean + 300 ) {
d_last_clean=now;
for(typename cont_t::iterator i=d_cont.begin();i!=d_cont.end();) {
//std::cout << "===shouldThrottle " << *i << std::endl;
if( i->second.ttd < now) { //删除时间早于now的元素
d_cont.erase(i++);
}
else
++i;
}
}
typename cont_t::iterator i=d_cont.find(t);
if(i==d_cont.end())
return false;
if(now > i->second.ttd || i->second.count-- < 0) { //throttle中的ttl小于now,表示过期的,删除之。或者被throttle block过的次数用完了,也删除之。表示本次请求可以不被block了。
d_cont.erase(i);
return false;
}
return true; // still listed, still blocked
}
// typedef Throttle<tuple<ComboAddress,string,uint16_t> > throttle_t;
void throttle(time_t now, const Thing& t, unsigned int ttl=0, unsigned int tries=0)
{
typename cont_t::iterator i=d_cont.find(t);
entry e={ now+(ttl ? ttl : d_ttl), tries ? tries : d_limit}; //从now开始计时的ttl秒内,被block tries次内,再来新的查询请求,会被直接block掉。
if(i==d_cont.end()) {
d_cont[t]=e;
}
else if(i->second.ttd > e.ttd || (i->second.count) < e.count) //使用较小的ttl,或者使用较大的被block次数
d_cont[t]=e;
}
unsigned int size()
{
return (unsigned int)d_cont.size();
}
private:
int d_limit;
int d_ttl;
time_t d_last_clean;
struct entry
{
time_t ttd;
int count;
};
typedef map<Thing,entry> cont_t;
cont_t d_cont;
};
```
如下的代码表示的含义
- 如果目标dns server unreachable,那么接下来的查询请求,在60秒内或者100次尝试内,会直接continue到下一个dns server。
- 如果目标dns server 超时,那么接下来的查询请求在10秒内或者5次尝试内,会直接continue到下一个dns server。
- 如果目标dns server servfail或者refused,那么接下来的查询请求在60秒内或者3次尝试内,会直接continue到下一个dns server。
- 其他的含义以此类推。
```
if(resolveret==-1)
t_sstorage->throttle.throttle(d_now.tv_sec, make_tuple(*remoteIP, qname, qtype.getCode()), 60, 100); // unreachable, 1 minute or 100 queries
else
t_sstorage->throttle.throttle(d_now.tv_sec, make_tuple(*remoteIP, qname, qtype.getCode()), 10, 5); // timeout
}
continue;
}
t_sstorage->throttle.throttle(d_now.tv_sec,make_tuple(*remoteIP, qname, qtype.getCode()),60,3); // servfail or refused
```
Authoritative DNS Server速度变慢之后又变快,速度列表如何更新?
dns server 速度变慢之后,在shuffleInSpeedOrder排序的时候就会被排到后面,这样被query到的几率就会变小。当此dns server速度又变快之后,什么时候会被重新查询到呢?
nsspeeds是一个这样的map
```
typedef map<string, DecayingEwmaCollection, CIStringCompare> nsspeeds_t;
```
```
double DecayingEwmaCollection::get(struct timeval* now)
{
if(d_collection.empty())
return 0;
double ret=std::numeric_limits<double>::max();
double tmp;
for(collection_t::iterator pos=d_collection.begin(); pos != d_collection.end(); ++pos) {
// pos : pair<ComboAddress, DecayingEwma>
if((tmp=pos->second.get(now)) < ret) {
ret=tmp;
d_best=pos->first;
}
}
return ret;
}
typedef vector<pair<ComboAddress, DecayingEwma> > collection_t;
collection_t d_collection;
```
其中最重要的是
```
syncres.hh: 109
/** Class that implements a decaying EWMA指数加权移动平均.
This class keeps an exponentially weighted moving average which, additionally, decays over time.
The decaying is only done on get.
*/
class DecayingEwma {
void submit(int val, struct timeval* tv)
{
struct timeval now=getOrMakeTime(tv);
if(d_needinit) {
d_last=now;
d_lastget=now;
d_needinit=false;
d_val = val;
}
else {
float diff= makeFloat(d_last - now);
d_last=now;
double factor=exp(diff)/2.0; // might be '0.5', or 0.0001
d_val=(float)((1-factor)*val+ (float)factor*d_val);
// 越是最近submit的,factor越接近0.5,越是很久没有submit过的,越小于0.5, 那么最近一次的val的值的权重就越大。
}
}
double get(struct timeval* tv)
{
struct timeval now=getOrMakeTime(tv);
float diff=makeFloat(d_lastget-now);
d_lastget=now;
float factor=exp(diff/60.0f); // is 1.0 or less
return d_val*=factor; //如果频繁对每个dns的DecayingEwma进行get函数调用,那么factor基本接近于1. 越是长时间没有被get,越是小于1。
}
double peek(void)
{
return d_val;
}
bool stale(time_t limit) const
{
return limit > d_lastget.tv_sec;
}
private:
struct timeval d_last; // stores time
struct timeval d_lastget; // stores time
float d_val;
bool d_needinit;
};
}
```
根据官方文档的描述,如果一个dns server被排到很后面而长时间没有被query到,那么随着时间的推移,该函数将此dns server的响应时间会变小(速度变快),使得它有机会被重新query到。参考 [pdns recursor design]
原因是,根据SyncRes::doResolve函数的查询逻辑,会先查cache,miss之后再去查询dns server。所以,只要cache命中,那么shuffleInSpeedOrder函数就不会被调用,等到ttl过期之后cache失效,shuffleInSpeedOrder再次被调用的时候,根据get函数的逻辑,factor必然小于1,那么d_val的值必然会比上一次小,随着时间的推移,除非前面的dns server速度加快的速度一直能够保持领先于d_val变小的速度,否则,后面的dns server肯定会有机会被query到。这也符合pdns设计的初衷,使得最快的dns server总是排在最前面。
对DecayingEwma算法的总结:
- submit方法的作用是实现EWMA算法,使得对某个dns server速度的评价主要参考最近的速度值。
- get方法的的作用是使得dns server的响应时间随着时间的推移进行衰减,也就是速度变快。
另外,有一个情况,如果ttl的值设置的非常长的话,那么后面dns server很久都不会经历速度排序,也就没有d_val的衰减。所以,pdns还有一个houseKeeping机制,当整个dns接受的query超过2万次后(见下面代码中的注释),就启动houseKeeping线程去做速度列表的清理,默认清理最近5分钟内speed值没有被get过的dns server。那么当下次shuffleInSpeedOrder函数被调用的时候,get得到的speed值就是0(见DecayingEwmaCollection::get),也就是最快的值,那么这些dns server就有机会被query到了。
```
pdns_recursor.cc
1866:
void* recursorThread(void* ptr){
counter=0; // used to periodically execute certain tasks
for(;;) {
while(MT->schedule(&g_now)); // MTasker letting the mthreads do their thing
if(!(counter%500)) { //cout代表的是total dns查询数,当查询次数正好模以500,就启动houseKeeping线程
MT->makeThread(houseKeeping, 0);
}
if(!(counter%55)) {
typedef vector<pair<int, FDMultiplexer::funcparam_t> > expired_t;
expired_t expired=t_fdm->getTimeouts(g_now);
for(expired_t::iterator i=expired.begin() ; i != expired.end(); ++i) {
shared_ptr<TCPConnection> conn=any_cast<shared_ptr<TCPConnection> >(i->second);
if(g_logCommonErrors)
L<<Logger::Warning<<"Timeout from remote TCP client "<< conn->d_remote.toString() <<endl;
t_fdm->removeReadFD(i->first);
}
}
counter++;
}
1136:
static void houseKeeping(void *){
std::cout << "cleanCounter " << cleanCounter << std::endl;
if(!((cleanCounter++)%40)) { // this is a full scan! 当houseKeeping线程启动的次数正好可以模以40,就清理nsSpeeds列表。
time_t limit=now.tv_sec-300; //5分钟前
std::cout << " housekeep work " << std::endl;
for(SyncRes::nsspeeds_t::iterator i = t_sstorage->nsSpeeds.begin() ; i!= t_sstorage->nsSpeeds.end(); )
if(i->second.stale(limit)){
t_sstorage->nsSpeeds.erase(i++);
std::cout << " housekeep clean " << i->first << std::endl;
}
else
++i;
}
// L<<Logger::Warning<<"Spent "<<dt.udiff()/1000<<" msec cleaning"<<endl;
last_prune=time(0);
}
}
```