2018-07-27 22:21:05 -07:00
|
|
|
#pragma once
|
|
|
|
|
|
2020-02-21 19:32:03 -08:00
|
|
|
#include <cerrno>
|
2018-07-27 22:21:05 -07:00
|
|
|
#include <cstdio>
|
|
|
|
|
#include <cstring>
|
2020-02-21 19:32:03 -08:00
|
|
|
#include <fstream>
|
2018-09-28 07:41:26 -07:00
|
|
|
#include <istream>
|
|
|
|
|
#include <ostream>
|
2018-07-27 22:21:05 -07:00
|
|
|
|
2018-11-27 12:43:24 -08:00
|
|
|
#include <c10/core/Allocator.h>
|
2018-12-03 21:48:46 -08:00
|
|
|
#include <c10/core/Backend.h>
|
2018-10-26 12:04:57 -07:00
|
|
|
|
2019-01-04 22:47:35 -08:00
|
|
|
#include "caffe2/serialize/istream_adapter.h"
|
|
|
|
|
#include "caffe2/serialize/read_adapter_interface.h"
|
2020-12-30 15:32:16 -08:00
|
|
|
#include "caffe2/serialize/versions.h"
|
2018-10-26 12:04:57 -07:00
|
|
|
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
extern "C" {
|
|
|
|
|
typedef struct mz_zip_archive mz_zip_archive;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// PyTorch containers are a special zip archive with the following layout
|
|
|
|
|
// archive_name.zip contains:
|
|
|
|
|
// archive_name/
|
|
|
|
|
// version # a file with a single decimal number written in ascii,
|
|
|
|
|
// # used to establish the version of the archive format
|
|
|
|
|
// model.json # overall model description, this is a json output of
|
|
|
|
|
// # ModelDef from torch.proto
|
|
|
|
|
// # the following names are by convention only, model.json will
|
|
|
|
|
// # refer to these files by full names
|
|
|
|
|
// tensors/
|
|
|
|
|
// 0 # flat storage for tensor data, meta-data about shapes, etc. is
|
|
|
|
|
// # in model.json
|
|
|
|
|
// 1
|
|
|
|
|
// ...
|
|
|
|
|
// # code entries will only exist for modules that have methods attached
|
|
|
|
|
// code/
|
2020-02-21 19:32:03 -08:00
|
|
|
// archive_name.py # serialized torch script code (python syntax, using
|
|
|
|
|
// PythonPrint) archive_name_my_submodule.py # submodules have separate
|
|
|
|
|
// files
|
2018-07-27 22:21:05 -07:00
|
|
|
//
|
2020-02-21 19:32:03 -08:00
|
|
|
// The PyTorchStreamWriter also ensures additional useful properties for these
|
|
|
|
|
// files
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
// 1. All files are stored uncompressed.
|
|
|
|
|
// 2. All files in the archive are aligned to 64 byte boundaries such that
|
|
|
|
|
// it is possible to mmap the entire file and get an aligned pointer to
|
|
|
|
|
// tensor data.
|
|
|
|
|
// 3. We universally write in ZIP64 format for consistency.
|
|
|
|
|
|
|
|
|
|
// The PyTorchStreamReader also provides additional properties:
|
|
|
|
|
// 1. It can read zip files that are created with common
|
|
|
|
|
// zip tools. This means that even though our writer doesn't compress files,
|
|
|
|
|
// the reader can still read files that were compressed.
|
|
|
|
|
// 2. It provides a getRecordOffset function which returns the offset into the
|
2020-02-21 19:32:03 -08:00
|
|
|
// raw file where file data lives. If the file was written with
|
|
|
|
|
// PyTorchStreamWriter it is guaranteed to be 64 byte aligned.
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
|
|
|
|
|
// PyTorchReader/Writer handle checking the version number on the archive format
|
|
|
|
|
// and ensure that all files are written to a archive_name directory so they
|
|
|
|
|
// unzip cleanly.
|
|
|
|
|
|
2018-07-27 22:21:05 -07:00
|
|
|
// When developing this format we want to pay particular attention to the
|
|
|
|
|
// following use cases:
|
|
|
|
|
//
|
|
|
|
|
// -- Reading --
|
|
|
|
|
// 1) Reading with full random access
|
|
|
|
|
// a) Reading with file api's such as fread()
|
|
|
|
|
// b) mmaping the file and jumping around the mapped region
|
|
|
|
|
// 2) Reading with 1-pass sequential access
|
|
|
|
|
// -> A reader will need to build up a data structure of parsed structures
|
|
|
|
|
// as it reads
|
|
|
|
|
//
|
|
|
|
|
// -- Writing --
|
|
|
|
|
// 1) Writing with full random access
|
|
|
|
|
// 2) Writing with 1-pass sequential access
|
|
|
|
|
// -> We must take care not to require updating values that have already
|
|
|
|
|
// been written. We place the variable-length index at the end and do
|
|
|
|
|
// not put any indicies into the header to fulfill this constraint.
|
|
|
|
|
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
// The model.json, which contains all the metadata information,
|
2020-02-21 19:32:03 -08:00
|
|
|
// should be written as the last file. One reason is that the size of tensor
|
|
|
|
|
// data is usually stable. As long as the shape and type of the tensor do not
|
|
|
|
|
// change, the size of the data won't change. On the other sied, the size of the
|
2018-10-26 12:04:57 -07:00
|
|
|
// serialized model is likely to change, so we store it as the last record, and
|
|
|
|
|
// we don't need to move previous records when updating the model data.
|
|
|
|
|
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
// The zip format is sufficiently flexible to handle the above use-case.
|
|
|
|
|
// it puts its central directory at the end of the archive and we write
|
|
|
|
|
// model.json as the last file when writing after we have accumulated all
|
|
|
|
|
// other information.
|
2018-07-27 22:21:05 -07:00
|
|
|
|
2019-01-04 22:47:35 -08:00
|
|
|
namespace caffe2 {
|
|
|
|
|
namespace serialize {
|
2018-07-27 22:21:05 -07:00
|
|
|
|
2020-12-18 10:53:11 -08:00
|
|
|
class TORCH_API PyTorchStreamReader final {
|
2018-07-27 22:21:05 -07:00
|
|
|
public:
|
2019-01-04 22:47:35 -08:00
|
|
|
explicit PyTorchStreamReader(const std::string& file_name);
|
|
|
|
|
explicit PyTorchStreamReader(std::istream* in);
|
2020-12-06 18:08:07 -08:00
|
|
|
explicit PyTorchStreamReader(std::shared_ptr<ReadAdapterInterface> in);
|
2018-10-26 12:04:57 -07:00
|
|
|
|
|
|
|
|
// return dataptr, size
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
std::tuple<at::DataPtr, size_t> getRecord(const std::string& name);
|
|
|
|
|
size_t getRecordOffset(const std::string& name);
|
2019-08-22 11:44:53 -07:00
|
|
|
bool hasRecord(const std::string& name);
|
2019-11-22 12:28:49 -08:00
|
|
|
std::vector<std::string> getAllRecords();
|
2018-10-26 12:04:57 -07:00
|
|
|
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
~PyTorchStreamReader();
|
2019-10-16 22:45:50 -07:00
|
|
|
uint64_t version() const {
|
|
|
|
|
return version_;
|
|
|
|
|
}
|
2020-02-21 19:32:03 -08:00
|
|
|
|
2018-07-27 22:21:05 -07:00
|
|
|
private:
|
2019-01-04 22:47:35 -08:00
|
|
|
void init();
|
|
|
|
|
size_t read(uint64_t pos, char* buf, size_t n);
|
2019-10-01 17:09:06 -07:00
|
|
|
void valid(const char* what, const char* info = "");
|
2019-08-22 11:44:53 -07:00
|
|
|
size_t getRecordID(const std::string& name);
|
2019-01-04 22:47:35 -08:00
|
|
|
|
|
|
|
|
friend size_t
|
|
|
|
|
istream_read_func(void* pOpaque, uint64_t file_ofs, void* pBuf, size_t n);
|
|
|
|
|
std::unique_ptr<mz_zip_archive> ar_;
|
|
|
|
|
std::string archive_name_;
|
2019-10-18 07:34:27 -07:00
|
|
|
std::string archive_name_plus_slash_;
|
2020-12-06 18:08:07 -08:00
|
|
|
std::shared_ptr<ReadAdapterInterface> in_;
|
2019-10-16 22:45:50 -07:00
|
|
|
int64_t version_;
|
2018-07-27 22:21:05 -07:00
|
|
|
};
|
|
|
|
|
|
2020-12-18 10:53:11 -08:00
|
|
|
class TORCH_API PyTorchStreamWriter final {
|
2018-07-27 22:21:05 -07:00
|
|
|
public:
|
Make JIT Serialization support arbitrary std::function<> IO (#28039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28039
Right now, torch::save() uses std::ostream, which results in unnecessary
data copies in practice. Similar for torch::load().
Adding a std::function<size_t(const void*, size_t)> as an output option,
parallel to the existing filename and std::ostream apis, gives users the
flexibility to emit directly to a backing store.
For a simple case of appending the output to a std::string, we observe
significant benchmark savings (on order of -50%), even with the
minor std::function<> dispatch overhead. The main reason is that
std::ostringstream effectively requires 2 extra copies of the data
beyond a simple string.append lambda.
We also provide a parallel api for the load(), though this one is
slightly more complex due to the need to do arbitrary position reads.
Test Plan:
buck test mode/dev-nosan caffe2/test/...
(Basic serialization test in caffe2/test/cpp/api/serialize.cpp)
Benchmark in experimental/jeremyl/c2/SerializationBench.cpp, with D17823443
(1M time goes from 90ms -> 40ms, albeit with crc patch applied)
Differential Revision: D17939034
fbshipit-source-id: 344cce46f74b6438cb638a8cfbeccf4e1aa882d7
2019-10-15 22:08:17 -07:00
|
|
|
explicit PyTorchStreamWriter(std::string archive_name);
|
|
|
|
|
explicit PyTorchStreamWriter(
|
|
|
|
|
const std::function<size_t(const void*, size_t)>& writer_func);
|
|
|
|
|
|
2020-06-24 12:39:42 -07:00
|
|
|
void setMinVersion(const uint64_t version);
|
|
|
|
|
|
Make JIT Serialization support arbitrary std::function<> IO (#28039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28039
Right now, torch::save() uses std::ostream, which results in unnecessary
data copies in practice. Similar for torch::load().
Adding a std::function<size_t(const void*, size_t)> as an output option,
parallel to the existing filename and std::ostream apis, gives users the
flexibility to emit directly to a backing store.
For a simple case of appending the output to a std::string, we observe
significant benchmark savings (on order of -50%), even with the
minor std::function<> dispatch overhead. The main reason is that
std::ostringstream effectively requires 2 extra copies of the data
beyond a simple string.append lambda.
We also provide a parallel api for the load(), though this one is
slightly more complex due to the need to do arbitrary position reads.
Test Plan:
buck test mode/dev-nosan caffe2/test/...
(Basic serialization test in caffe2/test/cpp/api/serialize.cpp)
Benchmark in experimental/jeremyl/c2/SerializationBench.cpp, with D17823443
(1M time goes from 90ms -> 40ms, albeit with crc patch applied)
Differential Revision: D17939034
fbshipit-source-id: 344cce46f74b6438cb638a8cfbeccf4e1aa882d7
2019-10-15 22:08:17 -07:00
|
|
|
void writeRecord(
|
|
|
|
|
const std::string& name,
|
|
|
|
|
const void* data,
|
|
|
|
|
size_t size,
|
|
|
|
|
bool compress = false);
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
void writeEndOfFile();
|
2018-10-26 12:04:57 -07:00
|
|
|
|
|
|
|
|
bool finalized() const {
|
|
|
|
|
return finalized_;
|
|
|
|
|
}
|
|
|
|
|
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
const std::string& archiveName() {
|
|
|
|
|
return archive_name_;
|
2018-07-27 22:21:05 -07:00
|
|
|
}
|
2018-10-26 12:04:57 -07:00
|
|
|
|
Use a zip archive as our container format (#14521)
Summary:
After consulting with Owen, who pointed out the existence of the miniz library, I decided to take one last shot at using zip as our container format.
miniz makes this surprisingly feasible and I think the benefits of using zip are large enough that we should do it.
This replaces our custom container format with a zip archive, preserving all of the
desirable features of our custom format, such as append-oriented writing, and
mmap'able tensor data while adding a bunch of debugging advantages:
1. You can unzip and explore the container to debug what is going on with a model.
2. You can edit the model using a text editor (e.g. change the definition of a method,
or editing the json-serialized meta-data), re-zip the file use OSX's native 'Compress'
option, and re-load the result into pytorch. Note: this enables you to, e.g., print-debug
serialized models.
3. We can easily enable features like compression in the future.
4. Stock python , without pytorch installed, and other programming languages
can reasonably consume this format,using json and zipfile packages, which enables
people to build tools like visualizers without those visualizers depending on pytorch.
This will be especially useful if you want to, for instance, write a visualizer in javascript.
Notes:
* This add miniz (https://github.com/richgel999/miniz) as a dependency. miniz is a self-contained
library for reading/writing zipfiles that unlike other zip libraries also includes libz
compatible compress/decompress support. It is a single header and a single C file without
any other dependencies. Note that the instructions for miniz explicitly state:
> Please use the files from the releases page in your projects. Do not use the git checkout directly!
So we have checked in the 'release' source. Miniz supports zip64, and its API is amenable
to doing zip-align style things to align data.
* Removes 'size' from RecordRef. This allows you to edit files in the zip archive without
editing the meta-data file. Very important if you want to print-debug serialized models.
* PyTorchStreamReader/PyTorchStreamWriter keep mostly the same API (though keys become strings)
However, their implementation is completely swapped out to use miniz.
* Code exists to check for the old magic number to give a decent warning to our preview users
after we change the format.
* Container version information is now put in a stand-alone 'version' file in the archive
and serves a similar purpose to the other container version info.
* All files in the zip archive start at 64-byte boundaries, using an approach similar to
zip-align. Tests check that this property remains true. While the writer does this,
the reader doesn't depend on it, allowing user-created archives that can use compression,
and do not have to align data.
* Added test to check for > 4GB files and archives. Disabled by default because it takes
almost 2 minutes to run.
* torchscript files are now optional: if a submodule does not have methods, it will
not be written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14521
Reviewed By: jamesr66a
Differential Revision: D13252945
Pulled By: zdevito
fbshipit-source-id: 01209294c0f6543d0fd716f85a38532249c52f8c
2018-11-30 19:15:09 -08:00
|
|
|
~PyTorchStreamWriter();
|
2018-09-28 07:41:26 -07:00
|
|
|
|
|
|
|
|
private:
|
2020-02-21 19:32:03 -08:00
|
|
|
void setup(const std::string& file_name);
|
2019-10-01 17:09:06 -07:00
|
|
|
void valid(const char* what, const char* info = "");
|
|
|
|
|
size_t current_pos_ = 0;
|
|
|
|
|
std::unique_ptr<mz_zip_archive> ar_;
|
|
|
|
|
std::string archive_name_;
|
Make JIT Serialization support arbitrary std::function<> IO (#28039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28039
Right now, torch::save() uses std::ostream, which results in unnecessary
data copies in practice. Similar for torch::load().
Adding a std::function<size_t(const void*, size_t)> as an output option,
parallel to the existing filename and std::ostream apis, gives users the
flexibility to emit directly to a backing store.
For a simple case of appending the output to a std::string, we observe
significant benchmark savings (on order of -50%), even with the
minor std::function<> dispatch overhead. The main reason is that
std::ostringstream effectively requires 2 extra copies of the data
beyond a simple string.append lambda.
We also provide a parallel api for the load(), though this one is
slightly more complex due to the need to do arbitrary position reads.
Test Plan:
buck test mode/dev-nosan caffe2/test/...
(Basic serialization test in caffe2/test/cpp/api/serialize.cpp)
Benchmark in experimental/jeremyl/c2/SerializationBench.cpp, with D17823443
(1M time goes from 90ms -> 40ms, albeit with crc patch applied)
Differential Revision: D17939034
fbshipit-source-id: 344cce46f74b6438cb638a8cfbeccf4e1aa882d7
2019-10-15 22:08:17 -07:00
|
|
|
std::string archive_name_plus_slash_;
|
2019-11-08 15:01:53 -08:00
|
|
|
std::string padding_;
|
2019-10-01 17:09:06 -07:00
|
|
|
std::ofstream file_stream_;
|
Make JIT Serialization support arbitrary std::function<> IO (#28039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28039
Right now, torch::save() uses std::ostream, which results in unnecessary
data copies in practice. Similar for torch::load().
Adding a std::function<size_t(const void*, size_t)> as an output option,
parallel to the existing filename and std::ostream apis, gives users the
flexibility to emit directly to a backing store.
For a simple case of appending the output to a std::string, we observe
significant benchmark savings (on order of -50%), even with the
minor std::function<> dispatch overhead. The main reason is that
std::ostringstream effectively requires 2 extra copies of the data
beyond a simple string.append lambda.
We also provide a parallel api for the load(), though this one is
slightly more complex due to the need to do arbitrary position reads.
Test Plan:
buck test mode/dev-nosan caffe2/test/...
(Basic serialization test in caffe2/test/cpp/api/serialize.cpp)
Benchmark in experimental/jeremyl/c2/SerializationBench.cpp, with D17823443
(1M time goes from 90ms -> 40ms, albeit with crc patch applied)
Differential Revision: D17939034
fbshipit-source-id: 344cce46f74b6438cb638a8cfbeccf4e1aa882d7
2019-10-15 22:08:17 -07:00
|
|
|
std::function<size_t(const void*, size_t)> writer_func_;
|
2020-06-24 12:39:42 -07:00
|
|
|
uint64_t version_ = kProducedFileFormatVersion;
|
2019-10-01 17:09:06 -07:00
|
|
|
bool finalized_ = false;
|
Make JIT Serialization support arbitrary std::function<> IO (#28039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28039
Right now, torch::save() uses std::ostream, which results in unnecessary
data copies in practice. Similar for torch::load().
Adding a std::function<size_t(const void*, size_t)> as an output option,
parallel to the existing filename and std::ostream apis, gives users the
flexibility to emit directly to a backing store.
For a simple case of appending the output to a std::string, we observe
significant benchmark savings (on order of -50%), even with the
minor std::function<> dispatch overhead. The main reason is that
std::ostringstream effectively requires 2 extra copies of the data
beyond a simple string.append lambda.
We also provide a parallel api for the load(), though this one is
slightly more complex due to the need to do arbitrary position reads.
Test Plan:
buck test mode/dev-nosan caffe2/test/...
(Basic serialization test in caffe2/test/cpp/api/serialize.cpp)
Benchmark in experimental/jeremyl/c2/SerializationBench.cpp, with D17823443
(1M time goes from 90ms -> 40ms, albeit with crc patch applied)
Differential Revision: D17939034
fbshipit-source-id: 344cce46f74b6438cb638a8cfbeccf4e1aa882d7
2019-10-15 22:08:17 -07:00
|
|
|
bool err_seen_ = false;
|
2019-10-01 17:09:06 -07:00
|
|
|
friend size_t ostream_write_func(
|
|
|
|
|
void* pOpaque,
|
|
|
|
|
uint64_t file_ofs,
|
|
|
|
|
const void* pBuf,
|
|
|
|
|
size_t n);
|
2018-09-28 07:41:26 -07:00
|
|
|
};
|
|
|
|
|
|
2020-08-25 02:31:02 -07:00
|
|
|
namespace detail {
|
|
|
|
|
// Writer-specific constants
|
|
|
|
|
constexpr uint64_t kFieldAlignment = 64;
|
|
|
|
|
|
|
|
|
|
// Returns a record to be appended to the local user extra data entry in order
|
|
|
|
|
// to make data beginning aligned at kFieldAlignment bytes boundary.
|
|
|
|
|
size_t getPadding(
|
|
|
|
|
size_t cursor,
|
|
|
|
|
size_t filename_size,
|
|
|
|
|
size_t size,
|
|
|
|
|
std::string& padding_buf);
|
|
|
|
|
}
|
|
|
|
|
|
2019-01-04 22:47:35 -08:00
|
|
|
} // namespace serialize
|
|
|
|
|
} // namespace caffe2
|