New Release 1.2.0 (#26)

## Features
* Can now parse files line by line in a fixed-size reading mode
* Now has an option to ignore quotes when parsing

## Improvements
* Speed increases in vanilla Lua and LuaJIT (benchmarks updated!)
* Refactored code for easier maintenance

## Bugfixes
* Better handling of multiple escaped quotes in vanilla lua (thanks @fredrikj83 #25)
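
For readers who want a feel for what a fixed-size reading mode involves, here is a minimal standalone sketch. It is not ftcsv's implementation: `bufferedLines` is a hypothetical helper written for illustration, it only splits on `\n`, and it ignores quoted fields that contain line breaks. It reads `bufferSize` bytes at a time, hands back complete lines, and gives up when a full buffer's worth of data contains no line break.

```lua
-- Hypothetical helper (not part of ftcsv): iterate over the lines of a file
-- while reading it in chunks of `bufferSize` bytes.
local function bufferedLines(path, bufferSize)
    local file = assert(io.open(path, "rb"))
    local pending = ""   -- bytes read from the file but not yet returned
    local eof = false

    return function()
        while true do
            -- Return the next complete line if one is already buffered.
            local stop = pending:find("\n", 1, true)
            if stop then
                local line = pending:sub(1, stop - 1)
                pending = pending:sub(stop + 1)
                return line
            end
            if eof then
                -- Flush a final line that has no trailing newline.
                if pending == "" then return nil end
                local line = pending
                pending = ""
                return line
            end
            -- A full buffer without a line break means the row cannot fit.
            if #pending >= bufferSize then
                file:close()
                error("bufferedLines: bufferSize is too small for this file")
            end
            local chunk = file:read(bufferSize)
            if chunk then
                pending = pending .. chunk
            else
                eof = true
                file:close()
            end
        end
    end
end

-- Demo: write a tiny CSV to a temporary file and read it back 8 bytes at a time.
local path = os.tmpname()
local out = assert(io.open(path, "wb"))
out:write("a,b,c\n1,2,3\n")
out:close()

local lines = {}
for line in bufferedLines(path, 8) do
    lines[#lines + 1] = line
end
os.remove(path)
print(#lines, lines[1], lines[2])
```

The sketch never holds more than about two buffers' worth of unread data, which is the point of the approach: memory use depends on the chosen buffer size rather than on the size of the file.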
2025-01-20 09:54:23 +00:00 · 2020-04-04 13:47:24 -05:00 · 2020-04-04 13:47:24 -05:00 · 86686314e0
commit 86686314e0
parent 705d3a589a
10 changed files with 866 additions and 428 deletions
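
As a rough illustration of the new quote-ignoring behaviour, the sketch below splits a row on the delimiter and treats double quotes as ordinary characters. `splitIgnoringQuotes` is a hypothetical helper, not code from ftcsv; it assumes a single-character delimiter and a row with no embedded line breaks. The input mirrors the "should handle ignoring the single quote" spec in this commit, where `"apple,banana,carrot` is expected to keep its stray quote in the first field.

```lua
-- Hypothetical helper (not ftcsv's code): split one row on a single-character
-- delimiter, leaving any double quotes in the output untouched.
local function splitIgnoringQuotes(row, delimiter)
    local fields = {}
    local start = 1
    while true do
        -- Plain find (4th argument true) so the delimiter is never a pattern.
        local stop = row:find(delimiter, start, true)
        if not stop then
            -- Last field: everything after the final delimiter.
            fields[#fields + 1] = row:sub(start)
            return fields
        end
        fields[#fields + 1] = row:sub(start, stop - 1)
        start = stop + 1
    end
end

local fields = splitIgnoringQuotes('"apple,banana,carrot', ",")
print(fields[1], fields[2], fields[3])
```

Without such a mode, the unmatched opening quote would swallow the rest of the row, which is the case the new "can't find closing quote" error in `spec/error_spec.lua` covers.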
--- a/ERRORS.md
+++ b/ERRORS.md
@ -1,9 +1,9 @@
-#Error Handling
+# Error Handling
 Below you can find a more detailed explanation of some of the errors that can be encountered while using ftcsv. For parsing, examples of these files can be found in /spec/bad_csvs/



-##Parsing
+## Parsing
 Note: `[row_number]` indicates the row number of the parsed lua table. As such, it will be one off from the line number in the csv. However, for header-less files, the row returned *will* match the csv line number.

 | Error Message  | Detailed Explanation |
@ -12,4 +12,9 @@ Note: `[row_number]` indicates the row number of the parsed lua table. As such,
 | ftcsv: Cannot parse a file which contains empty headers | If a header field contains no information, then it can't be parsed <br> (ex: `Name,City,,Zipcode`) |
 | ftcsv: too few columns in row [row_number]    | The number of columns is less than the amount in the header after transformations (renaming, keeping certain fields, etc) |
 | ftcsv: too many columns in row [row_number]   | The number of columns is greater than the amount in the header after transformations. It can't map the field's count with an existing header. |
-| ftcsv: File not found at [path]           | When loading, lua can't open the file at [path] | 
+| ftcsv: File not found at [path]           | When loading, lua can't open the file at [path] | 
+| ftcsv: fieldsToKeep only works with header-less files when using the 'rename' functionality | When dealing with header-less files, you can only use `fieldsToKeep` if you also use `rename`. The fields are limited after the renaming happens |
+| ftcsv: bufferSize needs to be larger to parse this file  | The buffer size selected is too small to parse the file. It must be at least the length of the longest row (but, for performance, should probably be a bit larger). |
+| ftcsv: parseLine currently doesn't support loading from string | `parseLine` relies on reading a file a few bytes at a time and currently doesn't work on strings |
+| ftcsv: bufferSize can only be specified using 'parseLine'. When using 'parse', the entire file is read into memory | bufferSize can't be specified for parse, it can only be specified for parseLine |
+
--- a/README.md
+++ b/README.md
@ -1,15 +1,9 @@
 # ftcsv
 [![Build Status](https://travis-ci.org/FourierTransformer/ftcsv.svg?branch=master)](https://travis-ci.org/FourierTransformer/ftcsv) [![Coverage Status](https://coveralls.io/repos/github/FourierTransformer/ftcsv/badge.svg?branch=master)](https://coveralls.io/github/FourierTransformer/ftcsv?branch=master)

-ftcsv is a fast pure lua csv library.
-
-It works well for CSVs that can easily be fully loaded into memory (easily up to a hundred MB) and correctly handles `\n` (LF), `\r` (CR) and `\r\n` (CRLF) line endings. It has UTF-8 support, and will strip out the BOM if it exists. ftcsv can also parse headerless csv-like files and supports column remapping, file or string based loading, and more!
-
-Currently, there isn't a "large" file mode with proper readers for ingesting large CSVs using a fixed amount of memory, but that is in the works in [another branch!](https://github.com/FourierTransformer/ftcsv/tree/parseLineIterator)
-
-It's been tested with LuaJIT 2.0/2.1 and Lua 5.1, 5.2, and 5.3
-
+ftcsv is a fast csv library written in pure Lua. It's been tested with LuaJIT 2.0/2.1 and Lua 5.1, 5.2, and 5.3

+It features two parsing modes: one for CSVs that can easily be loaded into memory (up to a few hundred MBs depending on the system), and another for loading files using an iterator, which is useful for manipulating large files or processing during load. It correctly handles most csv (and csv-like) files found in the wild, including varying line endings (Windows, Linux, and OS9), UTF-8 BOMs, and odd delimiters. There are also various options that can tweak how a file is loaded: only grabbing a few fields, renaming fields, and parsing header-less files!

 ## Installing
 You can either grab `ftcsv.lua` from here or install via luarocks:
@ -20,9 +14,11 @@ luarocks install ftcsv


 ## Parsing
-### `ftcsv.parse(fileName, delimiter [, options])`
+There are two main parsing methods: `ftcsv.parse` and `ftcsv.parseLine`.
+`ftcsv.parse` loads the entire file and parses it, while `ftcsv.parseLine` is an iterator that parses one line at a time.

-ftcsv will load the entire csv file into memory, then parse it in one go, returning a lua table with the parsed data and a lua table containing the column headers. It has only two required parameters - a file name and delimiter (limited to one character). A few optional parameters can be passed in via a table (examples below).
+### `ftcsv.parse(fileName, delimiter [, options])`
+`ftcsv.parse` will load the entire csv file into memory, then parse it in one go, returning a lua table with the parsed data and a lua table containing the column headers. It has only two required parameters - a file name and delimiter (limited to one character). A few optional parameters can be passed in via a table (examples below).

 Just loading a csv file:
 ```lua
@ -30,11 +26,28 @@ local ftcsv = require('ftcsv')
 local zipcodes, headers = ftcsv.parse("free-zipcode-database.csv", ",")
 ```

-### Options
-The following are optional parameters passed in via the third argument as a table. For example if you wanted to `loadFromString` and not use `headers`, you could use the following:
+### `ftcsv.parseLine(fileName, delimiter [, options])`
+`ftcsv.parseLine` will open a file and read `options.bufferSize` bytes of the file. `bufferSize` defaults to 2^16 bytes (which provides the fastest parsing on most unix-based systems), or can be specified in the options. `ftcsv.parseLine` is an iterator and returns one line at a time. When all the lines in the buffer are read, it will read in another `bufferSize` bytes of a file and repeat the process until the entire file has been read.
+
+If specifying `bufferSize` there are a couple of things to remember:
+ * `bufferSize` must be at least the length of the longest row.
+ * If `bufferSize` is too small, an error is returned. 
+ * If `bufferSize` is the length of the entire file, all of it will be read and returned one line at a time (performance is roughly the same as `ftcsv.parse`).
+
+Parsing through a csv file:
 ```lua
-ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false})
+local ftcsv = require("ftcsv")
+for zipcode in ftcsv.parseLine("free-zipcode-database.csv", ",") do
+    print(zipcode.Zipcode)
+    print(zipcode.State)
+end
 ```
+
+
+### Options
+The options are the same for `parseLine` and `parse`, with the exception of `loadFromString` and `bufferSize`. `loadFromString` only works with `parse` and `bufferSize` can only be specified for `parseLine`.
+
+The following are optional parameters passed in via the third argument as a table.
 - `loadFromString`

 	If you want to load a csv from a string instead of a file, set `loadFromString` to `true` (default: `false`)
@ -64,6 +77,17 @@ ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false})
 	local actual = ftcsv.parse("a,b,c\r\napple,banana,carrot\r\n", ",", options)
 	```

+ 	Also Note: If you apply a function to the headers via headerFunc, and want to select fields from fieldsToKeep, you need to have what the post-modified header would be in fieldsToKeep.
+
+ - `ignoreQuotes`
+
+	If `ignoreQuotes` is `true`, it will leave all quotes in the final parsed output. This is useful in situations where the fields aren't quoted, but contain quotes, or if the CSV didn't handle quotes correctly and you're trying to parse it.
+	
+	```lua
+	local options = {loadFromString=true, ignoreQuotes=true}
+	local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", options)
+	```
+
 - `headerFunc`

 	Applies a function to every field in the header. If you are using `rename`, the function is applied after the rename.
@ -92,13 +116,17 @@ ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false})

 	In the above example, the first field becomes 'a', the second field becomes 'b' and so on.

-For all tested examples, take a look in /spec/feature_spec.lua and /spec/dynamic_features_spec.lua
+For all tested examples, take a look in /spec/feature_spec.lua

+The options can be strung together. For example, if you wanted to `loadFromString` and not use `headers`, you could use the following:
+```lua
+ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false})
+```

 ## Encoding
 ### `ftcsv.encode(inputTable, delimiter[, options])`

-ftcsv can also take a lua table and turn it into a text string to be written to a file. It has two required parameters, an inputTable and a delimiter. You can use it to write out a file like this:
+`ftcsv.encode` takes in a lua table and turns it into a text string that can be written to a file. It has two required parameters, an inputTable and a delimiter. You can use it to write out a file like this:
 ```lua
 local fileOutput = ftcsv.encode(users, ",")
 local file = assert(io.open("ALLUSERS.csv", "w"))
@ -116,54 +144,53 @@ file:close()
 	```


-## Error Handling
-ftcsv returns a bunch of errors when passed a bad csv file or incorrect parameters. You can find a more detailed explanation of the more cryptic errors in [ERRORS.md](ERRORS.md)

+## Error Handling
+ftcsv returns a litany of errors when passed a bad csv file or incorrect parameters. You can find a more detailed explanation of the more cryptic errors in [ERRORS.md](ERRORS.md)

 ## Benchmarks
 We ran ftcsv against a few different csv parsers ([PIL](http://www.lua.org/pil/20.4.html)/[csvutils](http://lua-users.org/wiki/CsvUtils), [lua_csv](https://github.com/geoffleyland/lua-csv), and [lpeg_josh](http://lua-users.org/lists/lua-l/2009-08/msg00020.html)) for lua and here is what we found:

-### 20 MB file, every field is double quoted (ftcsv optimal lua case\*)
+### 20 MB file, every field is double quoted

 | Parser    | Lua                | LuaJIT             |
 | --------- | ------------------ | ------------------ |
-| PIL/csvutils  | 3.939 +/- 0.565 SD | 1.429 +/- 0.175 SD |
-| lua_csv   | 8.487 +/- 0.156 SD | 3.095 +/- 0.206 SD |
-| lpeg_josh | **1.350 +/- 0.191 SD** | 0.826 +/- 0.176 SD |
-| ftcsv     | 3.101 +/- 0.152 SD | **0.499 +/- 0.133 SD** |
+| PIL/csvutils  | 1.754 +/- 0.136 SD | 1.012 +/- 0.112 SD |
+| lua_csv   | 4.191 +/- 0.128 SD | 2.382 +/- 0.133 SD |
+| lpeg_josh | **0.996 +/- 0.149 SD** | 0.725 +/- 0.083 SD |
+| ftcsv     | 1.342 +/- 0.130 SD | **0.301 +/- 0.099 SD** |

-\* see Performance section below for an explanation

 ### 12 MB file, some fields are double quoted

 | Parser    | Lua                | LuaJIT             |
 | --------- | ------------------ | ------------------ |
-| PIL/csvutils  | 2.868 +/- 0.101 SD | 1.244 +/- 0.129 SD |
-| lua_csv   | 7.773 +/- 0.083 SD | 3.495 +/- 0.172 SD |
-| lpeg_josh | **1.146 +/- 0.191 SD** | 0.564 +/- 0.121 SD |
-| ftcsv     | 3.401 +/- 0.109 SD | **0.441 +/- 0.124 SD** |
+| PIL/csvutils  | 1.456 +/- 0.083 SD | 0.691 +/- 0.071 SD |
+| lua_csv   | 3.738 +/- 0.072 SD | 1.997 +/- 0.075 SD |
+| lpeg_josh | **0.638 +/- 0.070 SD** | 0.475 +/- 0.042 SD |
+| ftcsv     | 1.307 +/- 0.071 SD | **0.213 +/- 0.062 SD** |

 [LuaCSV](http://lua-users.org/lists/lua-l/2009-08/msg00012.html) was also tried, but usually errored out at odd places during parsing.

 NOTE: times are measured using `os.clock()`, so they are in CPU seconds. Each test was run 30 times in a randomized order. The file was pre-loaded, and only the csv decoding time was measured.

-Benchmarks were run under ftcsv 1.1.6
+Benchmarks were run under ftcsv 1.2.0

 ## Performance
-We did some basic testing and found that in lua, if you want to iterate over a string character-by-character and look for single chars, `string.byte` performs faster than `string.sub`. This is especially true for LuaJIT. As such, in LuaJIT, ftcsv iterates over the whole file and does byte compares to find quotes and delimiters. However, for pure lua, `string.find` is used to find quotes but `string.byte` is used everywhere else as the CSV format in its proper form will have quotes around fields. If you have thoughts on how to improve performance (either big picture or specifically within the code), create a GitHub issue - I'd love to hear about it!
+I did some basic testing and found that in lua, if you want to iterate over a string character-by-character and compare chars, `string.byte` performs faster than `string.sub`. As such, ftcsv iterates over the whole file and does byte compares to find quotes and delimiters and then generates a table from it. When using vanilla lua, it proved faster to use `string.find` instead of iterating character by character (which is faster in LuaJIT), so ftcsv accounts for that and will use the fastest option that is available. If you have thoughts on how to improve performance (either big picture or specifically within the code), create a GitHub issue - I'd love to hear about it!


 ## Contributing
 Feel free to create a new issue for any bugs you've found or help you need. If you want to contribute back to the project please do the following:

- 0. If it's a major change (aka more than a quick bugfix), please create an issue so we can discuss it!
- 1. Fork the repo
- 2. Create a new branch
- 3. Push your changes to the branch
- 4. Run the test suite and make sure it still works
- 5. Submit a pull request
- 6. Wait for review
- 7. Enjoy the changes made!
+ 1. If it's a major change (aka more than a quick bugfix), please create an issue so we can discuss it!
+ 2. Fork the repo
+ 3. Create a new branch
+ 4. Push your changes to the branch
+ 5. Run the test suite and make sure it still works
+ 6. Submit a pull request
+ 7. Wait for review
+ 8. Enjoy the changes made!



--- a/ftcsv-1.1.6-1.rockspec
+++ b/ftcsv-1.1.6-1.rockspec
@ -1,30 +0,0 @@
-package = "ftcsv"
-version = "1.1.6-1"
-
-source = {
-	url = "git://github.com/FourierTransformer/ftcsv.git",
-	tag = "1.1.6"
-}
-
-description = {
-	summary = "A fast pure lua csv library (parser and encoder)",
-	detailed = [[
-    ftcsv works well for CSVs that can easily be fully loaded into memory (easily up to a hundred MB) and correctly handles `\n` (LF), `\r` (CR) and `\r\n` (CRLF) line endings. It has UTF-8 support, and will strip out the BOM if it exists. ftcsv can also parse headerless csv-like files and supports column remapping, file or string based loading, and more!
-
-    Note: Currently it cannot load CSV files where the file can't fit in memory.
-  ]],
-	homepage = "https://github.com/FourierTransformer/ftcsv",
-	maintainer = "Shakil Thakur <shakil.thakur@gmail.com>",
-	license = "MIT"
-}
-
-dependencies = {
-	"lua >= 5.1, <5.4",
-}
-
-build = {
-	type = "builtin",
-	modules = {
-		["ftcsv"] = "ftcsv.lua"
-	},
-}
--- a/ftcsv-1.2.0-1.rockspec
+++ b/ftcsv-1.2.0-1.rockspec
@ -0,0 +1,35 @@
+package = "ftcsv"
+version = "1.2.0-1"
+
+source = {
+	url = "git://github.com/FourierTransformer/ftcsv.git",
+	tag = "1.2.0"
+}
+
+description = {
+	summary = "A fast pure lua csv library (parser and encoder)",
+	detailed = [[
+   ftcsv is a fast and easy to use csv library for lua. It can read in CSV files,
+   do some basic transformations (rename fields) and can create the csv format.
+   It supports UTF-8, header-less CSVs, and maintaining correct line endings for
+   multi-line fields.
+
+   It supports loading an entire CSV file into memory and parsing it as well as
+   buffered reading of a CSV file.
+  ]],
+	homepage = "https://github.com/FourierTransformer/ftcsv",
+	maintainer = "Shakil Thakur <shakil.thakur@gmail.com>",
+	license = "MIT"
+}
+
+dependencies = {
+	"lua >= 5.1, <5.4",
+}
+
+build = {
+	type = "builtin",
+	modules = {
+		["ftcsv"] = "ftcsv.lua"
+	},
+}
+
--- a/ftcsv.lua
+++ b/ftcsv.lua
--- a/spec/dynamic_features_spec.lua
+++ b/spec/dynamic_features_spec.lua
@ -460,4 +460,32 @@ describe("csv features", function()
        end
    end

+    for bom, i in pairs(BOM) do
+        for newline, j in pairs(newlines) do
+            for _, endline in ipairs(endlines) do
+                local name = "should handle ignoring quotes (%s + %s) EOF: %s"
+                it(name:format(bom, newline, endline), function()
+                    local expectedHeaders = {"a", "b", "c"}
+                    local expected = {}
+                    expected[1] = {}
+                    expected[1].a = '"apple"'
+                    expected[1].b = '"banana"'
+                    expected[1].c = '"carrot"'
+
+                    local defaultString = '%sa,b,c%s"apple","banana","carrot"%s'
+
+                    if endline == "NONE" then
+                        defaultString = defaultString:format(i, j, "")
+                    else
+                        defaultString = defaultString:format(i, j, j)
+                    end
+
+                    local options = {loadFromString=true, ignoreQuotes=true}
+                    local actual, actualHeaders = ftcsv.parse(defaultString, ",", options)
+                    assert.are.same(expected, actual)
+                    assert.are.same(expectedHeaders, actualHeaders)
+                end)
+            end
+        end
+    end
 end)
--- a/spec/error_spec.lua
+++ b/spec/error_spec.lua
@ -41,4 +41,31 @@ it("should error out when you want to encode a table and specify a field that do
 	end

 	assert.has_error(test, "ftcsv: the field 'c' doesn't exist in the inputTable")
-end)
+end)
+
+describe("parseLine features small, nonworking buffer size", function()
+    it("should error out when trying to load from string", function()
+        local test = function()
+            local parse = {}
+            for i, line in ftcsv.parseLine("a,b,c\n1,2,3", ",", {loadFromString=true}) do
+                parse[i] = line
+            end
+            return parse
+        end
+        assert.has_error(test, "ftcsv: parseLine currently doesn't support loading from string")
+    end)
+end)
+
+it("should error when dealing with quotes", function()
+	local test = function()
+		local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", {loadFromString=true})
+	end
+	assert.has_error(test, "ftcsv: can't find closing quote in row 1. Try running with the option ignoreQuotes=true if the source incorrectly uses quotes.")
+end)
+
+it("should error if bufferSize is set when parsing entire files", function()
+	local test = function()
+		local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", {loadFromString=true, bufferSize=34})
+	end
+	assert.has_error(test, "ftcsv: bufferSize can only be specified using 'parseLine'. When using 'parse', the entire file is read into memory")
+end)
--- a/spec/feature_spec.lua
+++ b/spec/feature_spec.lua
@ -61,6 +61,16 @@ describe("csv features", function()
 		assert.are.same(expected, actual)
 	end)

+        it("should handle escaped doublequotes", function()
+                local expected = {}
+                expected[1] = {}
+                expected[1].a = 'A"B""C'
+                expected[1].b = 'A""B"C'
+                expected[1].c = 'A"""B""C'
+                local actual = ftcsv.parse('a;b;c\n"A""B""""C";"A""""B""C";"A""""""B""""C"', ";", {loadFromString=true})
+                assert.are.same(expected, actual)
+        end)
+
 	it("should handle renaming a field", function()
 		local expected = {}
 		expected[1] = {}
@ -308,4 +318,24 @@ describe("csv features", function()
 		assert.are.same(expected, actual)
 	end)

-end)
+	it("should handle headers attempting to escape", function()
+		local expected = {}
+		expected[1] = {}
+		expected[1]["]] print('hello')"] = "apple"
+		expected[1].b = "banana"
+		expected[1].c = "carrot"
+		local actual = ftcsv.parse("]] print('hello'),b,c\napple,banana,carrot", ",", {loadFromString=true})
+		assert.are.same(expected, actual)
+	end)
+
+	it("should handle ignoring the single quote", function()
+		local expected = {}
+		expected[1] = {}
+		expected[1].a = '"apple'
+		expected[1].b = "banana"
+		expected[1].c = "carrot"
+		local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", {loadFromString=true, ignoreQuotes=true})
+		assert.are.same(expected, actual)
+	end)
+
+end)
--- a/spec/parseLine_spec.lua
+++ b/spec/parseLine_spec.lua
@ -0,0 +1,76 @@
+local ftcsv = require('ftcsv')
+local cjson = require('cjson')
+
+local function loadFile(textFile)
+    local file = io.open(textFile, "r")
+    if not file then error("File not found at " .. textFile) end
+    local allLines = file:read("*all")
+    file:close()
+    return allLines
+end
+
+describe("parseLine features small, working buffer size", function()
+    it("should handle correctness", function()
+        local json = loadFile("spec/json/correctness.json")
+        json = cjson.decode(json)
+        local parse = {}
+        for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=52}) do
+            assert.are.same(json[i], line)
+            parse[i] = line
+        end
+        assert.are.same(#json, #parse)
+        assert.are.same(json, parse)
+    end)
+end)
+
+describe("parseLine features small, nonworking buffer size", function()
+    it("should handle correctness", function()
+        local test = function()
+            local parse = {}
+            for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=63}) do
+                parse[i] = line
+            end
+            return parse
+        end
+        assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file")
+    end)
+end)
+
+describe("parseLine features smaller, nonworking buffer size", function()
+    it("should handle correctness", function()
+        local test = function()
+            local parse = {}
+            for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=50}) do
+                parse[i] = line
+            end
+            return parse
+        end
+        assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file")
+    end)
+end)
+
+describe("smaller bufferSize than header and incorrect number of fields", function()
+    it("should handle correctness", function()
+        local test = function()
+            local parse = {}
+            for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=23}) do
+                parse[i] = line
+            end
+            return parse
+        end
+        assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file")
+    end)
+end)
+
+describe("smaller bufferSize than header, but with correct field numbers", function()
+    it("should handle correctness", function()
+        local test = function()
+            local parse = {}
+            for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=30}) do
+                parse[i] = line
+            end
+            return parse
+        end
+        assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file")
+    end)
+end)
--- a/spec/parse_encode_spec.lua
+++ b/spec/parse_encode_spec.lua
@ -43,6 +43,22 @@ describe("csv decode", function()
 	end
 end)

+describe("csv parseLine decode", function()
+	for _, value in ipairs(files) do
+		it("should handle " .. value, function()
+			local json = loadFile("spec/json/" .. value .. ".json")
+			json = cjson.decode(json)
+			local parse = {}
+			for i, v in ftcsv.parseLine("spec/csvs/" .. value .. ".csv", ",") do
+				parse[i] = v
+				assert.are.same(json[i], v)
+			end
+			assert.are.same(#json, #parse)
+			assert.are.same(json, parse)
+		end)
+	end
+end)
+
 describe("csv decode from string", function()
 	for _, value in ipairs(files) do
 		it("should handle " .. value, function()
@ -70,4 +86,4 @@ describe("csv encode", function()
 			assert.are.same(jsonDecode, reEncoded)
 		end)
 	end
-end)
+end)