Constructing HTTP Request URLs in Go More Correctly and Safely

陪她去流浪桃子 · April 5, 2022

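While reviewing code that makes outbound HTTP requests, written by colleagues at every company I have worked at, large and small, I have rarely seen the request URL constructed in a standard (that is, correct and safe) way. If you ask me what exactly the standard way is, I may not be able to give you a precise answer. But I do have a few simple criteria of my own: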

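Scheme: does the request still work when the URL lacks http://?
Path: is the path joined correctly when the prefix ends with /?
Query: are the query parameters escaped correctly?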

Background

What is a URL? The following structure comes from the official Go documentation for the URL type:

// A URL represents a parsed URL (technically, a URI reference).
//
// The general form represented is:
//
//	[scheme:][//[userinfo@]host][/]path[?query][#fragment]
//
// URLs that do not start with a slash after the scheme are interpreted as:
//
//	scheme:opaque[?query][#fragment]
//
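
To see how a concrete URL maps onto that general form, here is a minimal sketch that parses one and prints each component. The helper name describe, the part type, and the sample URL are my own illustrative choices, not anything from the Go documentation.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// part pairs a component name from the general form with its parsed value.
type part struct{ name, value string }

// describe (a hypothetical helper) parses raw and returns its components
// in the order they appear in the general form quoted above.
func describe(raw string) ([]part, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return []part{
		{"scheme", u.Scheme},
		{"userinfo", u.User.String()}, // String is safe on a nil *Userinfo
		{"host", u.Host},
		{"path", u.Path},
		{"query", u.RawQuery},
		{"fragment", u.Fragment},
	}, nil
}

func main() {
	parts, err := describe(`http://user@example.com:8080/path/to/file.txt?page_no=1#top`)
	if err != nil {
		log.Fatalln(err)
	}
	for _, p := range parts {
		fmt.Printf("%-8s %q\n", p.name, p.value)
	}
}
```

Note that a bare example.com also parses without error: it simply ends up in the path, with an empty scheme and an empty host.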

Handling the scheme correctly

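Go's http package does not accept a URL without a scheme. The url package does parse such a string, but as a relative reference with an empty scheme and an empty host, and since http relies on the url package for parsing, the following request fails: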

package main

import (
	"log"
	"net/http"
)

func main() {
	u := `example.com`
	rsp, err := http.Get(u)
	if err != nil {
		log.Fatalln(err)
	}
	defer rsp.Body.Close()
}

It fails with:

$ go run main.go
2022/04/05 18:28:20 Get "example.com": unsupported protocol scheme ""
exit status 1

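I am not sure whether this counts as a bug in the url package, and I will not argue the point here, but out of habit I always preprocess the URL like this first: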

u := `example.com`
if !strings.Contains(u, `://`) {
	u = `http://` + u
}

Note that I check for ://, not for http:// or https://. There are a few reasons:

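It is simple. There is no need to check for both http:// and https://.
Case does not matter. The scheme part of a URL is case-insensitive: HTTP://example.com and http://example.com are equivalent. If you insist on checking the prefix, it should look like this:

if !strings.HasPrefix(strings.ToLower(u), `http://`) {
	u = `http://` + u
}

That is far too tedious, and it only covers http.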
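
One caveat with the plain :// check: it also matches a :// that appears later in the string, for example inside a query value. The following is a slightly stricter sketch under that assumption. The helper name ensureScheme and the sample inputs are hypothetical; it only treats :// as a scheme separator when nothing before it looks like a path, query, or fragment.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureScheme (a hypothetical helper) prepends http:// unless raw already
// starts with "<scheme>://". A "://" that shows up after a '/', '?' or '#'
// belongs to the path, query, or fragment and is not counted.
func ensureScheme(raw string) string {
	if i := strings.Index(raw, "://"); i > 0 && !strings.ContainsAny(raw[:i], "/?#") {
		return raw
	}
	return "http://" + raw
}

func main() {
	for _, raw := range []string{
		`example.com`,
		`HTTPS://example.com`,
		`example.com:8080`,
		`example.com/go?to=http://other.example`,
	} {
		fmt.Println(ensureScheme(raw))
	}
}
```

It stays case-insensitive for free, because it never compares against a specific scheme name.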

Handling the path correctly

In http://example.com/path/to/file.txt, the /path/to/file.txt part is called the path. Understanding and constructing the path correctly is where most mistakes happen.

Naming

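The first problem is naming. Many people assume that an API is always served from the root path /. Take an API endpoint such as http://example.com/v1/posts: the /v1/posts part is fixed, and the http://example.com in front of it comes from a configuration file. So they name that setting host (or, worse, host_port). At first glance I would assume it only accepts something like example.com, because that is what host, hostname, or host_port means.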

Then one day I need to test the API behind a proxy and the prefix changes, say to http://example.com/proxy/v1/posts. Now the configuration file has to contain http://example.com/proxy. Is that still a host?

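Why do I fuss over this, you ask? I would rather not. I never imagined something like this could become a point of contention; I assumed everyone followed the spec.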

Where does this scenario come up, you ask? Everywhere. One example is how Grafana requests a data source in Server mode (as opposed to Direct mode).

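So what is a good name? I have seen endpoint, prefix, url, address, and so on. HTML has a <base> tag whose sole purpose is to set the base address for relative links in a page, and that name is about as good as it gets.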

The trailing `/` problem

Because their code takes the configured value and appends the API path to it (yes, with +), the result is:

If the configured value is http://example.com, the result is http://example.com/v1/posts;
If the configured value is http://example.com/, the result is http://example.com//v1/posts.

See? They cannot handle the trailing / at all, and at one point even asked me verbally not to include it.

The root cause is that their URL is concatenated by hand:

prefix := `http://example.com/`
api := prefix + `/v1/posts`

Not every server collapses // into /, so errors are bound to happen.

So what should we do? Use the path package. (Its name clashes with path, a very common variable name, which is a little annoying.)

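There is also a related package called filepath. The main difference: the former is for forward-slash paths, and the latter is for operating-system paths, which are separated by / on Linux and by \ on Windows. The package documentation states this clearly: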

path

Package path implements utility routines for manipulating slash-separated paths.

The path package should only be used for paths separated by forward slashes, such as the paths in URLs. This package does not deal with Windows paths with drive letters or backslashes; to manipulate operating system paths, use the path/filepath package.

filepath

Package filepath implements utility routines for manipulating filename paths in a way compatible with the target operating system-defined file paths.

The filepath package uses either forward slashes or backslashes, depending on the operating system. To process paths such as URLs that always use forward slashes regardless of the operating system, see the path package.

How do you use the path package? A single function is enough, path.Join:

func Join(elem ...string) string

Join joins any number of path elements into a single path, separating them with slashes. Empty elements are ignored. The result is Cleaned. However, if the argument list is empty or all its elements are empty, Join returns an empty string.

Test code:

package main

import (
	"fmt"
	"log"
	"net/url"
	"path"
	"strings"
)

func api(prefix string) string {
	if !strings.Contains(prefix, `://`) {
		prefix = "http://" + prefix
	}
	u, err := url.Parse(prefix)
	if err != nil {
		log.Fatalln(err)
	}
	u.Path = path.Join(u.Path, `/v1/posts`)
	return u.String()
}

func main() {
	fmt.Println(api(`example.com`))
	fmt.Println(api(`http://example.com`))
	fmt.Println(api(`http://example.com/`))
	fmt.Println(api(`http://example.com/proxy`))
	fmt.Println(api(`http://example.com/proxy/`))
}

Output:

$ go run main.go
http://example.com/v1/posts
http://example.com/v1/posts
http://example.com/v1/posts
http://example.com/proxy/v1/posts
http://example.com/proxy/v1/posts

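If you are on Go 1.19 or later, the url package has this ability built in; see: net/url: add JoinPath, URL.JoinPath. But ⚠️ note that the two differ: path.Join removes a trailing /, while URL.JoinPath preserves it. My guess is that the original design reasoning was that path mainly targets operating systems and file systems, whereas net/url.URL, as its name suggests, mainly targets browsers.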
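
The trailing-slash difference is easy to see side by side. This is a minimal sketch that assumes Go 1.19 or later; the helper name joinBoth and the sample prefix are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"path"
)

// joinBoth (a hypothetical helper) appends elem to the path of prefix twice:
// once with path.Join and once with URL.JoinPath (Go 1.19+).
func joinBoth(prefix, elem string) (viaPath, viaJoinPath string, err error) {
	u, err := url.Parse(prefix)
	if err != nil {
		return "", "", err
	}
	// JoinPath returns a new URL and leaves u untouched.
	viaJoinPath = u.JoinPath(elem).String()
	// path.Join cleans the result, which drops any trailing slash.
	u.Path = path.Join(u.Path, elem)
	viaPath = u.String()
	return viaPath, viaJoinPath, nil
}

func main() {
	viaPath, viaJoinPath, err := joinBoth(`http://example.com/proxy/`, `v1/posts/`)
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println(viaPath)     // http://example.com/proxy/v1/posts
	fmt.Println(viaJoinPath) // http://example.com/proxy/v1/posts/
}
```

Which behavior you want depends on whether the server treats /v1/posts and /v1/posts/ as the same resource.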

Path encoding

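The approach above, assigning the path.Join result to u.Path and then calling u.String(), takes care of encoding automatically: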

package main

import (
	"fmt"
	"log"
	"net/url"
	"path"
	"strings"
)

func api(prefix string) string {
	if !strings.Contains(prefix, `://`) {
		prefix = "http://" + prefix
	}
	u, err := url.Parse(prefix)
	if err != nil {
		log.Fatalln(err)
	}
	u.Path = path.Join(u.Path, `/v1/posts/新文章`)
	return u.String()
}

func main() {
	prefix := `http://example.com`
	fmt.Println(prefix + `/v1/posts/新文章`)
	fmt.Println(api(prefix))
}

Output:

$ go run main.go
http://example.com/v1/posts/新文章
http://example.com/v1/posts/%E6%96%B0%E6%96%87%E7%AB%A0

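Many servers and reasonably modern backends can probably handle unencoded characters correctly, and characters other than letters and digits are rare in API paths, so the encoding problem is not especially serious.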

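Do you think I made a mistake by writing /v1/posts/新文章 without encoding it? Sorry, no mistake. path.Join joins the path segments in their unencoded form, while u.String() produces the final, encoded URL that goes over the wire. The two do not conflict at all.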
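
The flip side is that u.Path must hold the decoded path. If you encode it yourself first, the percent signs get encoded a second time. A small sketch of both cases; the helper name render and the sample paths are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// render (a hypothetical helper) puts p into URL.Path as-is and returns the
// final string, so we can see what String does to it.
func render(p string) string {
	u := url.URL{Scheme: "http", Host: "example.com", Path: p}
	return u.String()
}

func main() {
	// Decoded input: String encodes it exactly once.
	fmt.Println(render(`/v1/posts/新文章`))
	fmt.Println(render(`/v1/posts/hello world`))

	// Pre-encoded input: the '%' itself is escaped, giving %252F.
	fmt.Println(render(`/v1/posts/` + url.PathEscape(`a/b`)))
}
```

So the rule of thumb is: keep everything decoded until the very end, and let String do the encoding.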

To Clean or not to Clean

Web UI - apparent path traversal vulnerability #18618
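
One reason the question matters: path.Join always Cleans its result, so a .. element can climb out of the configured prefix. Below is a minimal sketch of that behavior together with one possible guard. The helper name joinUnder, the prefix /proxy, and the inputs are all hypothetical, and the check is only an illustration, not a complete defense.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// joinUnder (a hypothetical helper) joins elem onto prefix and reports
// whether the cleaned result still lies under prefix.
func joinUnder(prefix, elem string) (joined string, ok bool) {
	base := path.Clean("/" + prefix)
	joined = path.Join(base, elem) // Join cleans, resolving any ".." elements
	ok = joined == base || strings.HasPrefix(joined, strings.TrimSuffix(base, "/")+"/")
	return joined, ok
}

func main() {
	for _, elem := range []string{`v1/posts`, `../admin`, `v1/../../admin`} {
		joined, ok := joinUnder(`/proxy`, elem)
		fmt.Printf("%-16s -> %-18s under prefix: %v\n", elem, joined, ok)
	}
}
```

If any element of the path comes from user input, checking the joined result like this (or rejecting .. outright) seems safer to me than trusting Clean alone.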

Handling the query correctly

The query is, simply put, the part of the URL after the question mark. For example, in http://example.com/v1/posts?page_no=1&a=b, page_no=1&a=b is the query (query or query_string).

I am sure everyone has seen code that builds the query by hand, or has written it themselves (I am no exception).

package main

import "fmt"

func main() {
	api := `http://example.com/v1/posts`
	api += `?`
	api += `page_no=1`
	api += `&`
	api += `a=b`
	fmt.Println(api)

	api += `&`
	api += fmt.Sprintf("text=%s", `text with spaces`)
	fmt.Println(api)

	api += `&`
	api += fmt.Sprintf(`chinese=%s`, `桃子`)
	fmt.Println(api)
}

Here is the result:

$ go run main.go
http://example.com/v1/posts?page_no=1&a=b
http://example.com/v1/posts?page_no=1&a=b&text=text with spaces
http://example.com/v1/posts?page_no=1&a=b&text=text with spaces&chinese=桃子

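Isn't all this hardcoded junk ugly and painful to write? You have probably never seen a URL with spaces in it, have you? And one with raw Chinese characters is probably not quite standard either. It pains me.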

Here is what I consider a more standard and safe way to write it:

package main

import (
	"fmt"
	"log"
	"net/url"
)

func main() {
	u, err := url.Parse(`http://example.com/v1/posts?existed=query`)
	if err != nil {
		log.Fatalln(err)
	}
	q := u.Query()
	q.Set(`page_no`, `1`)
	q.Set(`a`, `b`)
	q.Set(`text`, `text with spaces`)
	q.Set(`chinese`, `桃子`)
	u.RawQuery = q.Encode()
	fmt.Println(u.String())
}

Output:

$ go run main.go
http://example.com/v1/posts?a=b&chinese=%E6%A1%83%E5%AD%90&existed=query&page_no=1&text=text+with+spaces
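
A few details of that output are worth spelling out: Encode sorts the parameters by key, writes a space as + (whereas a path segment would use %20), and escapes & and = inside values so they cannot break the query apart. Here is a small sketch of those points, plus Add for repeated keys. The helper names buildQuery and escapeBoth and the sample values are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQuery (a hypothetical helper) encodes key/value pairs. It uses Add
// rather than Set, so a repeated key keeps all of its values.
func buildQuery(pairs [][2]string) string {
	q := url.Values{}
	for _, p := range pairs {
		q.Add(p[0], p[1])
	}
	return q.Encode() // sorted by key
}

// escapeBoth (a hypothetical helper) shows how the same text is escaped
// for a query value and for a path segment.
func escapeBoth(s string) (query, pathSegment string) {
	return url.QueryEscape(s), url.PathEscape(s)
}

func main() {
	fmt.Println(buildQuery([][2]string{
		{"tag", "go"},
		{"tag", "url"},
		{"a", "x&y=z"},
	}))

	q, p := escapeBoth(`text with spaces`)
	fmt.Println(q) // text+with+spaces
	fmt.Println(p) // text%20with%20spaces
}
```

Because the order of the parameters changes, do not rely on it; a server that cares about parameter order is a different problem.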

Closing

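I do not know whether you have made similar mistakes. I certainly made them many times and fell into these pits repeatedly, which is why I wrote this summary.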

I also do not know whether other languages have similar problems in general; at least I ran into them when I wrote C, C++, Lua, PHP, JavaScript, and others.

If there are other common mistakes I have not mentioned, feel free to point them out.

Tags: HTTP · Go · URL
